Design and development of an information processing system?


Image processing

Dehumanizer, 2012-01-11 01:07:44

Hello everyone,
I need to write a fairly complex information processing application. In short, there is a set of documents from some domain, say, medical books, documents, etc. The application should “understand” the entire set of documents as well as possible, so that users can search the database quickly and efficiently without having any knowledge of that domain. It sounds like a search engine, and maybe it is one, but this is a desktop application and the main problem is the database design. Architectural advice is welcome.
The fact is that I myself am a programmer, but I feel insecure in designing large applications and / or intelligent systems. I apologize if the question is inappropriate.
Any comments, advice and "solutions" are welcome. Please do not just give links to books on design patterns or to Fowler, and please do not tell me to go search for it myself. That said, if there is a book or resource that would really help, I will be very glad to hear about it.
I am not asking how to write the application, that is still my job :), but I have no direct access to specialists and cannot put questions to seniors or other programmers. I think Habr is well suited for this, since there are a lot of professional developers here.
Thank you very much in advance.

3 answer(s)

egorinsk, 2012-01-12
@Dehumanizer

Why design your own, bad search (bad because it is not easy to find specialists who can build a good one; they have all long been working at Google), when you might be able to, for example, buy a Yandex server for a local network? Or does that not suit you?
Look, say you are writing a search for medical books. There, the same term can often be written in different ways: in Latin or in Russian, with the letter е or ё, and so on. On top of that, some authors call a thing by one word and others by another. It is unlikely that you will build a decent search with all of this taken into account.
Or take word breaking. Many indexing systems split the text into words. Should the text be split on the minus sign? Will a hyphenated word such as "ordinary" be split? Will the word RMT-2600 be broken into pieces, and will a search for it still work? Will misspelled words be found, "ordinary" and "ordinary"? And what about searching for numbers (codes), if the text has the form 3-123-124 and the user types it without hyphens?
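One possible way to handle the hyphen questions above is to index several variants of each hyphenated token. The following is only a sketch under that assumption; the function name `index_terms` and the choice of variants are illustrative and not part of any particular search engine.

```python
import re

# A token is a run of word characters, optionally joined by single hyphens.
TOKEN_RE = re.compile(r"\w+(?:-\w+)*")

def index_terms(text):
    """Return the set of terms to put into the index for a piece of text."""
    terms = set()
    for match in TOKEN_RE.finditer(text.lower()):
        token = match.group()
        terms.add(token)                        # as written: "rmt-2600"
        if "-" in token:
            terms.add(token.replace("-", ""))   # hyphens dropped: "rmt2600"
            terms.update(token.split("-"))      # the parts: "rmt", "2600"
    return terms

if __name__ == "__main__":
    print(sorted(index_terms("Code 3-123-124 for RMT-2600")))
```

With this, a user who types `3123124` or `rmt2600` still hits the document, at the cost of a larger index and some extra false matches on the short parts.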
Searching for surnames is a story of its own: Johansonn, Johanson, Johanson, Johansonn. And what about handling Unicode characters such as á (a letter with a combining accent)?
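A common first step for the spelling-variant and accent problems is to normalize both the indexed text and the query to the same folded form. The sketch below uses Python's standard `unicodedata` module; the helper names `fold` and `name_key` are made up for illustration, and collapsing doubled letters is just one crude heuristic, not a real phonetic algorithm.

```python
import re
import unicodedata

def fold(text):
    """Casefold and strip combining marks, so 'á' becomes 'a' and 'ё' becomes 'е'.

    Note: this also folds 'й' to 'и', which may or may not be wanted.
    """
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def name_key(name):
    """Crude surname key: folded form with repeated letters collapsed."""
    return re.sub(r"(.)\1+", r"\1", fold(name))

if __name__ == "__main__":
    for variant in ("Johansonn", "Johanson", "Johansson"):
        print(variant, "->", name_key(variant))
```

Both the index and the query would have to go through the same function, otherwise the keys never meet.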
Will you be able to index, for example, DOC, PDF and the other formats that are in use?
As for a search that looks for exact occurrences by splitting the text on spaces, lowercasing it and doing primitive stemming: anyone can build that, but I am not sure it will give accurate results.
There is also the ranking problem. The search word is found in 300 documents. Naturally, nobody will look at all 300; people will look at the first few dozen. The question is how to rank these 300 documents so that the more relevant ones come first.
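To make the ranking question concrete, here is a minimal sketch of TF-IDF scoring, one classic baseline among many. It is not claimed to be what any particular engine does; the function `rank` and the input shape (a dict of already tokenized documents) are assumptions made for the example.

```python
import math
from collections import Counter

def rank(query_terms, docs):
    """docs maps doc_id -> list of tokens. Returns matching doc ids, best first."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term]:
                idf = math.log(1 + n_docs / df[term])      # rarer terms weigh more
                score += (tf[term] / len(tokens)) * idf    # length-normalized frequency
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    docs = {
        "a": ["aspirin", "dose", "aspirin"],
        "b": ["aspirin", "history", "of", "medicine", "in", "europe"],
        "c": ["surgery"],
    }
    print(rank(["aspirin"], docs))
```

A real system would tune this against user feedback, which is exactly the experimentation the answer argues for.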
If there are several words in the query, should each word be searched for separately? Or must all the words be present within a certain distance of each other?
Look how many hard-to-solve problems pop up after 10 minutes of thinking.
I would understand if your task were simply to burn through some budget, but you write “so that users can quickly and efficiently search the database without having knowledge in the specified area”. A search like that cannot simply be built in one go. You need to try different options, study feedback, save unsuccessful searches, run experiments and tests, etc.
And what has been written here about LIKE '%' and Windows Search causes nothing but disappointment.
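Both options for multi-word queries can be supported if the index stores word positions. The sketch below is a toy positional index under that assumption; the names `build_positional_index`, `search_all` and `search_near` are invented for the example.

```python
from collections import defaultdict

def build_positional_index(docs):
    """docs maps doc_id -> list of tokens. Returns term -> doc_id -> [positions]."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for pos, token in enumerate(tokens):
            index[token][doc_id].append(pos)
    return index

def search_all(index, terms):
    """Documents that contain every term (AND semantics)."""
    postings = [set(index[t]) if t in index else set() for t in terms]
    return set.intersection(*postings) if postings else set()

def search_near(index, a, b, max_distance):
    """Documents where terms a and b occur within max_distance words of each other."""
    hits = set()
    for doc_id in search_all(index, [a, b]):
        if any(abs(pa - pb) <= max_distance
               for pa in index[a][doc_id]
               for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits

if __name__ == "__main__":
    docs = {1: "heart attack symptoms".split(),
            2: "attack on the weak heart".split()}
    idx = build_positional_index(docs)
    print(search_all(idx, ["heart", "attack"]), search_near(idx, "heart", "attack", 1))
```

Which behavior is the default, and whether proximity only boosts the rank instead of filtering, is a product decision that the structure leaves open.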
And you do not need any knowledge of architecture and databases at this stage. Obviously, there must be a component that extracts structured (split into fields) text from the source documents, a component that indexes it, a component that searches the resulting index, and a component that displays the found documents in a convenient form, highlighting the matched words in them (if possible).
The fact that your question does not mention the words "index", "ranking" and so on alarms me.
By the way, you can look at how the Sphinx search system works, but I doubt that Sphinx by itself is capable of solving the problems I have described. More likely it can fill, at an average level, the role of the indexing and searching components.
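The four components listed above could be wired together roughly as follows. This is a deliberately tiny sketch: it assumes plain-text input whose first line is the title, and all names (`Document`, `extract`, `Index`, `highlight`) are hypothetical. Real extraction from DOC or PDF would need dedicated libraries.

```python
import re
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    title: str
    body: str

def extract(doc_id, raw_text):
    """Extraction component: split raw text into fields (title = first line)."""
    title, _, body = raw_text.partition("\n")
    return Document(doc_id, title.strip(), body.strip())

def tokenize(text):
    return re.findall(r"\w+", text.lower())

class Index:
    """Indexing and searching components: a simple inverted index."""
    def __init__(self):
        self.postings = defaultdict(set)   # term -> set of doc ids
        self.docs = {}

    def add(self, doc):
        self.docs[doc.doc_id] = doc
        for term in tokenize(doc.title + " " + doc.body):
            self.postings[term].add(doc.doc_id)

    def search(self, query):
        terms = tokenize(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return [self.docs[i] for i in sorted(ids)]

def highlight(text, query):
    """Display component: wrap matched query words in square brackets."""
    terms = set(tokenize(query))
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(r"[\1]", text)

if __name__ == "__main__":
    index = Index()
    index.add(extract("d1", "Aspirin\nAspirin reduces fever."))
    for doc in index.search("aspirin fever"):
        print(doc.title, "|", highlight(doc.body, "aspirin fever"))
```

Keeping the components behind separate functions makes it possible to swap the toy index for something like Sphinx later without touching extraction or display.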

Solver, 2012-01-11
@solver

Weird. You need information on designing an information system, but ask not to be pointed to Fowler...
He is exactly the one who would help you design the system.
In general, you should read about knowledge bases. Most likely that is what you need.

PavelN, 2012-01-11
@PavelN

From what immediately comes to mind:
0. Decide which properties (metadata) a document will have. I.e., in addition to the title and author there should be, for example: type (article, book...), level of difficulty, headings or topics it belongs to, language, year of publication, current / no longer relevant, who added it to the system and when, content, short description
1. Think through the interface, because it seems to me that in this task everything will hinge on it. In general, I advise you to go to, say, Google, try to find a document in some specific area yourself, and ask users to do the same. Then "interrogate" them: what worked, what did not, what they would like to see, and how they would like to search.
2. Provide user roles/groups: user, administrator, expert,…
3. For indexing you can/should use not only queries like LIKE '%', but also standard components such as Windows Search (there is an API, but I have not used it) or Full Text Search from SQL Server (if the database lives there)
4. I think it is worth providing for a few things in the architecture/interface: a tag cloud, the most frequently used queries, best links (edited by the user), "like" marks for documents, favorites. You also need a "where to start" list (by subject area), plus search tips and documentation
7. Word substitution mechanisms
8. You need admin tools, for example to see what people search for, how, how many results come up, and which documents get viewed
9. A feedback mechanism is needed. For example, a very important document ends up somewhere near the end of the results, and the user (expert) prompts the administrator to "raise" the document
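A few of the points in this list (document metadata, word substitution, and admin statistics on queries) can be sketched in a few lines. Everything below is illustrative: the field names, the `SYNONYMS` table and the `QueryLog` class are assumptions, and a real substitution table would have to come from domain experts.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class DocumentMeta:
    """Metadata roughly following item 0; the field names are made up."""
    title: str
    author: str
    doc_type: str                 # "article", "book", ...
    language: str
    year: int
    topics: list = field(default_factory=list)
    is_current: bool = True
    description: str = ""

# Hypothetical substitution table (item 7); real entries need expert review.
SYNONYMS = {"heart attack": ["myocardial infarction"]}

def expand_query(query, synonyms=SYNONYMS):
    """Return the normalized query followed by its known substitutes."""
    q = query.strip().lower()
    return [q] + synonyms.get(q, [])

class QueryLog:
    """Admin statistics (item 8): what is searched and what finds nothing."""
    def __init__(self):
        self.counts = Counter()
        self.failed = Counter()

    def record(self, query, n_results):
        q = query.strip().lower()
        self.counts[q] += 1
        if n_results == 0:
            self.failed[q] += 1

    def top_queries(self, n=10):
        return [q for q, _ in self.counts.most_common(n)]

if __name__ == "__main__":
    log = QueryLog()
    log.record("Heart attack", 12)
    log.record("hart atack", 0)
    print(expand_query("Heart Attack"), log.top_queries(), dict(log.failed))
```

The list of failed queries is a natural input for growing the substitution table and for the feedback loop described in item 9.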