Searching for and extracting words in a text (algorithm)
There is a database of people; each record can have an unlimited number of synonyms (names written in different languages, nicknames, etc.). All entries (both primary names and synonyms) are unique. From this data an index table (let's call it source) is built, which can be represented as follows:

CREATE TABLE source (
    name VARCHAR(128),
    person_id INT(10),
    UNIQUE KEY name (name)
);
To begin with, you need a separate function or utility stem() that, for any word, returns its stem (this is exactly what search engines do, including Sphinx and Lucene, from which you can borrow an implementation for your own purposes). A tokenize() function that breaks the text into individual words will also come in handy.
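As an illustration, a minimal tokenize() and a deliberately naive stem() might look like this (the suffix rule below is a made-up placeholder; in practice you would plug in a real Porter/Snowball stemmer):

```python
import re

def tokenize(text):
    """Split text into word tokens; punctuation is discarded."""
    return re.findall(r"\w+", text)

def stem(word):
    """Toy stemmer: lowercase and drop a trailing 'a' on longer words.
    A placeholder only; use a real Porter/Snowball stemmer in practice."""
    w = word.lower()
    return w[:-1] if w.endswith("a") and len(w) > 4 else w

print(tokenize("Anna Mikhailovna Ivanova!"))  # → ['Anna', 'Mikhailovna', 'Ivanova']
print(stem("Ivanova"))  # → ivanov
```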
Assuming such a function exists: we go through all the words from the dictionary (source) and build a trie from the resulting stems, with the corresponding person_id values as the leaves. Names of two or more words are turned into n-grams of the corresponding stems separated by spaces; i.e. the name "Anna Mikhailovna Ivanova" becomes something like "Anna Mikhailovna Ivanov". (Whether to apply a lowercase filter depends on the application and the text: if the text is well-edited it is not necessary; if it is "from the Internet", it is better to use one.)
Next, we tokenize the text, stem each word, and look the words up sequentially in our trie. You can search for whole trigrams at once, or pull in the second and third words as needed (when there is a partial match in the trie).
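Here is a sketch of building such a trie and scanning a text against it; the stem() placeholder and the "$id" leaf marker are illustrative assumptions, not part of any library:

```python
import re

def stem(word):
    # Placeholder stemmer (assumption); swap in Porter/Snowball in practice.
    w = word.lower()
    return w[:-1] if w.endswith("a") and len(w) > 4 else w

def build_trie(source):
    """source: iterable of (name, person_id) rows from the index table.
    Multi-word names are stored as their stems joined by a single space."""
    root = {}
    for name, person_id in source:
        key = " ".join(stem(w) for w in name.split())
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node["$id"] = person_id  # leaf marker (an invented convention)
    return root

def scan(text, trie):
    """For each token position, walk the trie character by character,
    pulling in following tokens while the prefix still matches."""
    tokens = [stem(t) for t in re.findall(r"\w+", text)]
    found = []
    for i in range(len(tokens)):
        node, matched = trie, None
        for j in range(i, len(tokens)):
            chunk = tokens[j] if j == i else " " + tokens[j]
            for ch in chunk:
                node = node.get(ch)
                if node is None:
                    break
            if node is None:
                break
            if "$id" in node:
                matched = node["$id"]  # longest match so far wins
        if matched is not None:
            found.append(matched)
    return found

trie = build_trie([("James Bond", 7)])
print(scan("Agent James Bond arrived", trie))  # → [7]
```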
If you are too lazy to implement a trie, you can replace it with hash maps; in that case it is better to store multi-word names as nested hash maps: the first word, in the top-level hash map, points to another hash map that stores all possible second words for that first word, and so on. The result is a kind of tree of hash maps.
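The nested-hash-map layout can be sketched like this (the names, ids, and the "$id" sentinel marking a complete name are invented for illustration):

```python
# Nested hash maps replacing the trie: the first word keys a map of possible
# second words, and so on; a sentinel key marks where a complete name ends.
index = {
    "anna": {
        "$id": 3,                   # "Anna" alone is also a valid name here
        "mikhailovn": {
            "ivanov": {"$id": 42},  # stems of "Anna Mikhailovna Ivanova"
        },
    },
    "bond": {"$id": 7},
}

# Lookup: follow stemmed tokens word by word while nested maps exist.
print(index["bond"]["$id"])                          # → 7
print(index["anna"]["mikhailovn"]["ivanov"]["$id"])  # → 42
```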
If we assume that a trie/hash-map lookup takes O(1), the whole algorithm runs in O(n), where n is the number of words in the text. You also do not have to index the entire text, only the structure holding the name stems (indexing in Lucene/Sphinx is, generally speaking, not the fastest operation, and the index takes about 20-30% of the text size, so it is not guaranteed to fit in memory; naturally, I assume the number of names is smaller than the size of the text).
Summary:
1. Apply stem() to all names.
2. Save the words in a structure with a quick lookup (trie/hash map).
3. Tokenize the text, apply stem() to all received words.
4. Walk through the resulting words, looking each one up in the structure holding the names.
5. Profit.
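Putting the five steps together, a minimal end-to-end sketch using the nested-hash-map variant (the stem() rule and the "$id" sentinel are illustrative assumptions) might be:

```python
import re

def stem(word):
    # Placeholder stemmer (assumption); a real pipeline would use Porter/Snowball.
    w = word.lower()
    return w[:-1] if w.endswith("a") and len(w) > 4 else w

def build_index(source):
    # Steps 1-2: stem every name, store it in nested hash maps keyed by word.
    root = {}
    for name, person_id in source:
        node = root
        for w in name.split():
            node = node.setdefault(stem(w), {})
        node["$id"] = person_id
    return root

def find(text, index):
    # Steps 3-4: tokenize and stem the text, then look tokens up in the index,
    # extending the match word by word while a nested map exists.
    tokens = [stem(t) for t in re.findall(r"\w+", text)]
    hits = []
    for i in range(len(tokens)):
        node, matched = index, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            matched = node.get("$id", matched)
        if matched is not None:
            hits.append(matched)
    return hits

index = build_index([("Anna Mikhailovna Ivanova", 42)])
print(find("We met Anna Mikhailovna Ivanova yesterday", index))  # → [42]
```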
Keywords to read up on: word form, normalization, computational linguistics, Porter stemmer, aot, mystem. For starters, I hope that's enough.
And by the way: enough of these contortions, stop reinventing the wheel. Have a look at Apache Solr (built on Lucene).