Java
Arthur, 2016-01-28 15:31:03

How do I build an index over a given text, and how do I then search that index?

Greetings.
Please help me solve a problem: point me in the right direction and share some advice.
The task is as follows:
- there is a list of correct program names (a template, sample, or standard). Each name consists of 1-6 words
- there is a list of program names written by employees. The employees did not follow the template: they misspelled words or called the programs something else. Even so, the essence of each name was preserved.
For example:
the standard is "Programmule plus"
an employee wrote "Program for Documents Programmule"
As you can see, there is a lot of extra text, but the essence is preserved.
I want to build an index from the directory of reference names and then look up a user query in that index (either there is a match, or the user's program is not in the dictionary).
I assume we search the reference strings word by word and then intersect the matches. Or not?
Please explain how such an index is built and how it is searched (the algorithm).
I don't know much about this subject area, so please explain in simple words.
I will be writing in Java.
Thanks for the answers :)
UPD:
I want to share with you how I made the index.
Your answers and comments helped me.

  1. Took a line. Each line had the form original_Name : alternate_name_1, alternate_name_2
  2. Stored the original name in a separate variable, origName.
  3. Cleaned the line of rubbish. This meant removing punctuation marks, except that where words in the line were joined I kept the hyphen "-". I also removed "stop" words, which I chose by looking through the dictionary I had. Besides prepositions and conjunctions, I threw out every word that carried no semantic load. For me, these were words like "portal" and "online", and common abbreviations such as "ac" and "abs".
  4. Split the line into keywords. The entire line, including the original name.
  5. Saved the keyWord : origName binding in a HashMap. Somewhere I saw this called the backlink method.
  6. If a keyword occurred again, appended the other origName to the string already stored for that keyword.
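
The steps above can be sketched roughly as follows. This is only an illustration, not the code from the post: the class name NameIndexBuilder, the stop-word list, and the cleanup regex are assumptions, and the index stores a Set of origName values per keyword instead of a concatenated string. A keyword-to-names map of this kind is commonly called an inverted index.

```java
import java.util.*;

final class NameIndexBuilder {
    // Hypothetical stop-word list; the post only names examples such as "portal" and "online".
    static final Set<String> STOP_WORDS = Set.of("portal", "online", "for", "and", "the");

    // Lower-cases the text, replaces everything except letters, digits and "-" with spaces,
    // then keeps the words that contain a letter or digit and are not stop words.
    static List<String> keywords(String text) {
        List<String> result = new ArrayList<>();
        String cleaned = text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{Nd}-]+", " ");
        for (String word : cleaned.trim().split("\\s+")) {
            boolean hasContent = word.chars().anyMatch(Character::isLetterOrDigit);
            if (hasContent && !STOP_WORDS.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    // Adds one dictionary line of the form "origName : alt1, alt2" to the index.
    // Every keyword of the whole line (original name included) points back to origName.
    static void addLine(Map<String, Set<String>> index, String line) {
        String origName = line.split(":", 2)[0].trim();
        for (String keyword : keywords(line)) {
            index.computeIfAbsent(keyword, k -> new LinkedHashSet<>()).add(origName);
        }
    }
}
```

Calling NameIndexBuilder.addLine once per dictionary line fills the map; using a Set as the value means a repeated keyword simply collects every origName it belongs to.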

In the resulting "directory" I looked up each keyword from the query. As output, I got a list of every origName found.
For ranking, I counted the matches per origName and took the names with the highest count as the result. Sometimes several names tied with the same number of matches. These were similar in meaning, but they caused some confusion. (I solved this with a dialog box that asks the program's "operator" to choose.)
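
A minimal sketch of this lookup and ranking, assuming the index is a Map from keyword to a Set of origName values and that the query has already been split into cleaned keywords. The class and method names are hypothetical. Ties are returned together so that the caller can ask the operator to pick one.

```java
import java.util.*;

final class NameSearch {
    // Returns every origName that shares the highest number of keywords with the query.
    // An empty list means that no query keyword was found in the index.
    static List<String> bestMatches(Map<String, Set<String>> index, List<String> queryKeywords) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        // De-duplicate the query so a repeated keyword is counted only once.
        for (String keyword : new LinkedHashSet<>(queryKeywords)) {
            for (String origName : index.getOrDefault(keyword, Set.of())) {
                hits.merge(origName, 1, Integer::sum);
            }
        }
        int best = hits.values().stream().mapToInt(Integer::intValue).max().orElse(0);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : hits.entrySet()) {
            if (entry.getValue() == best) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Counting matches this way ignores word order and spelling mistakes; a misspelled keyword simply contributes nothing to the score.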
Overall, the search turned out to be acceptable.
I was not writing for the general public but for one specific organization, so I assumed that search queries would contain few errors and that the remaining, correctly spelled words would still give the correct result.
Even so, some errors remain; the search needs refining and better algorithms.
But my reinvented wheel works (which I am very happy about).
I will keep improving the algorithms. One of these days I expect a batch of data from users, so I can try the program in practice :)
I want to say thank you to those who responded to my questions and gave comments.
They are:
  • sirs @sirs
  • xmoonlight @xmoonlight
  • Walt Disney @ruFelix


3 answer(s)
sirs, 2016-01-28
@antoart

For a quick start, then, I suggest using a HashMap. Use the keywords from the program names as keys, for example:

    import java.util.*;

    // Software is assumed to be a simple class that wraps a program name.
    Map<String, List<Software>> dictionary = new HashMap<>();
    List<Software> programs = new ArrayList<>();
    programs.add(new Software("Programmule plus"));
    programs.add(new Software("Programmule for children"));
    programs.add(new Software("Buses. Programmule"));
    // Every program whose name contains the keyword goes under that key.
    dictionary.put("Programmule", programs);

and so on.
Split all your reference program names into separate words and treat each word as a keyword. You will need some restrictions on these words, for example at least 3 characters and letters only; that is for you to decide. Next, for each keyword, build the list of programs whose names contain it, checking for the substring with str1.toLowerCase().contains(str2.toLowerCase()).
Searching the resulting dictionary is then just a lookup by key: dictionary.get("Programmule") returns the list of programs in which the searched substring occurs. Here too you can normalize the query before the lookup, for example convert it to a single case and trim leading and trailing spaces.
This is the simplest implementation. It does not take "similar" words into account, only an exact substring match within the string.
Code everything against interfaces. Along the way, follow the links given in this thread, learn more, and build a better implementation that keeps the old interface and replaces only the implementation.
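
To illustrate the advice about interfaces, here is a small sketch. ProgramFinder and SubstringProgramFinder are hypothetical names, and reference names are kept as plain strings rather than Software objects to keep the example short. The implementation is the simple case-insensitive substring check described above; a fuzzy implementation could later replace it without touching the calling code.

```java
import java.util.*;

// Hypothetical interface: callers depend only on this contract,
// so the matching strategy behind it can be swapped later.
interface ProgramFinder {
    List<String> find(String query);
}

// Simplest implementation: case-insensitive substring match against each reference name.
final class SubstringProgramFinder implements ProgramFinder {
    private final List<String> referenceNames;

    SubstringProgramFinder(List<String> referenceNames) {
        this.referenceNames = List.copyOf(referenceNames);
    }

    @Override
    public List<String> find(String query) {
        // Normalize the query: trim surrounding spaces and convert to lower case.
        String needle = query.trim().toLowerCase(Locale.ROOT);
        List<String> found = new ArrayList<>();
        if (needle.isEmpty()) {
            return found; // an empty query would otherwise match every name
        }
        for (String name : referenceNames) {
            if (name.toLowerCase(Locale.ROOT).contains(needle)) {
                found.add(name);
            }
        }
        return found;
    }
}
```

Code that holds a ProgramFinder reference does not need to change when SubstringProgramFinder is replaced by a smarter class.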

xmoonlight, 2016-02-06
@xmoonlight

How to determine the similarity of two strings?

Roman_Kh, 2016-01-28
@Roman_Kh

You need fuzzy search. A first introduction to the topic can be found here and here.
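
As one common building block for fuzzy search (not necessarily what the linked articles describe), here is a sketch of the Levenshtein edit distance between two words. The class name EditDistance and the threshold rule in similar() are assumptions made for illustration; the right threshold depends on the data.

```java
import java.util.Locale;

final class EditDistance {
    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions and substitutions needed to turn a into b.
    // Uses the standard dynamic-programming recurrence with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // distance to the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Hypothetical similarity rule: two words count as similar when their
    // case-insensitive distance does not exceed maxEdits.
    static boolean similar(String a, String b, int maxEdits) {
        return levenshtein(a.toLowerCase(Locale.ROOT), b.toLowerCase(Locale.ROOT)) <= maxEdits;
    }
}
```

A query keyword that is not found in the index exactly could be compared against the index keys with similar(), at the cost of scanning all keys.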
