Searching for and extracting words in a text (algorithm)
There is a database of people; each record can have an unlimited number of synonyms (names written in different languages, nicknames, etc.). All entries (both primary names and synonyms) are unique. From this data an index table (let's call it source) is built, which can be represented as follows:

CREATE TABLE source (
    name VARCHAR(128),
    person_id INT(10),
    UNIQUE KEY name (name)
);
To begin with, you need a separate function or utility stem() that, for any word, returns its stem (this is exactly what search engines do, including Sphinx and Lucene, from which you can borrow an implementation for your own purposes). A tokenize() function that breaks the text into individual words will also come in handy.
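As an illustration, a minimal tokenize() and a deliberately naive stem() might look like this (the suffix rule below is a made-up placeholder; in practice you would plug in a real Porter/Snowball stemmer):

```python
import re

def tokenize(text):
    """Split text into word tokens; punctuation is discarded."""
    return re.findall(r"\w+", text)

def stem(word):
    """Toy stemmer: lowercase and drop a trailing 'a' on longer words.
    A placeholder only; use a real Porter/Snowball stemmer in practice."""
    w = word.lower()
    return w[:-1] if w.endswith("a") and len(w) > 4 else w

print(tokenize("Anna Mikhailovna Ivanova!"))  # → ['Anna', 'Mikhailovna', 'Ivanova']
print(stem("Ivanova"))  # → ivanov
```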
Assuming such a function exists: we go through all the words from the dictionary (source) and build a trie from the resulting stems, with the corresponding person_id values as the leaves. Names of two or more words are turned into n-grams of the corresponding stems separated by spaces; i.e. the name "Anna Mikhailovna Ivanova" becomes something like "Anna Mikhailovna Ivanov". (Whether to apply a lowercase filter depends on the application and the text: if the text is well-edited it is not necessary; if it is "from the Internet", it is better to use one.)
Next, we tokenize the text, stem each word, and look the words up sequentially in our trie. You can search for whole trigrams at once, or pull in the second and third words as needed (when there is a partial match in the trie).
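Here is a sketch of building such a trie and scanning a text against it; the stem() placeholder and the "$id" leaf marker are illustrative assumptions, not part of any library:

```python
import re

def stem(word):
    # Placeholder stemmer (assumption); swap in Porter/Snowball in practice.
    w = word.lower()
    return w[:-1] if w.endswith("a") and len(w) > 4 else w

def build_trie(source):
    """source: iterable of (name, person_id) rows from the index table.
    Multi-word names are stored as their stems joined by a single space."""
    root = {}
    for name, person_id in source:
        key = " ".join(stem(w) for w in name.split())
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node["$id"] = person_id  # leaf marker (an invented convention)
    return root

def scan(text, trie):
    """For each token position, walk the trie character by character,
    pulling in following tokens while the prefix still matches."""
    tokens = [stem(t) for t in re.findall(r"\w+", text)]
    found = []
    for i in range(len(tokens)):
        node, matched = trie, None
        for j in range(i, len(tokens)):
            chunk = tokens[j] if j == i else " " + tokens[j]
            for ch in chunk:
                node = node.get(ch)
                if node is None:
                    break
            if node is None:
                break
            if "$id" in node:
                matched = node["$id"]  # longest match so far wins
        if matched is not None:
            found.append(matched)
    return found

trie = build_trie([("James Bond", 7)])
print(scan("Agent James Bond arrived", trie))  # → [7]
```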
If you are too lazy to implement a trie, you can replace it with hash maps; in that case it is better to store multi-word names as nested hash maps: the first word, in the top-level hash map, points to another hash map that stores all possible second words for that first word, and so on. The result is a kind of tree of hash maps.
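The nested-hash-map layout can be sketched like this (the names, ids, and the "$id" sentinel marking a complete name are invented for illustration):

```python
# Nested hash maps replacing the trie: the first word keys a map of possible
# second words, and so on; a sentinel key marks where a complete name ends.
index = {
    "anna": {
        "$id": 3,                   # "Anna" alone is also a valid name here
        "mikhailovn": {
            "ivanov": {"$id": 42},  # stems of "Anna Mikhailovna Ivanova"
        },
    },
    "bond": {"$id": 7},
}

# Lookup: follow stemmed tokens word by word while nested maps exist.
print(index["bond"]["$id"])                          # → 7
print(index["anna"]["mikhailovn"]["ivanov"]["$id"])  # → 42
```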
If we assume that a trie/hash-map lookup takes O(1), the whole algorithm runs in O(n), where n is the number of words in the text. You also do not have to index the entire text, only the structure holding the name stems (indexing in Lucene/Sphinx is, generally speaking, not the fastest operation, and the index takes about 20-30% of the text size, so it is not guaranteed to fit in memory; naturally, I assume the number of names is smaller than the size of the text).
Summary:
1. Apply stem() to all names.
2. Save the words in a structure with a quick lookup (trie/hash map).
3. Tokenize the text, apply stem() to all received words.
4. Walk through the resulting words, looking each one up in the structure holding the names.
5. Profit.
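Putting the five steps together, a minimal end-to-end sketch using the nested-hash-map variant (the stem() rule and the "$id" sentinel are illustrative assumptions) might be:

```python
import re

def stem(word):
    # Placeholder stemmer (assumption); a real pipeline would use Porter/Snowball.
    w = word.lower()
    return w[:-1] if w.endswith("a") and len(w) > 4 else w

def build_index(source):
    # Steps 1-2: stem every name, store it in nested hash maps keyed by word.
    root = {}
    for name, person_id in source:
        node = root
        for w in name.split():
            node = node.setdefault(stem(w), {})
        node["$id"] = person_id
    return root

def find(text, index):
    # Steps 3-4: tokenize and stem the text, then look tokens up in the index,
    # extending the match word by word while a nested map exists.
    tokens = [stem(t) for t in re.findall(r"\w+", text)]
    hits = []
    for i in range(len(tokens)):
        node, matched = index, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j])
            if node is None:
                break
            matched = node.get("$id", matched)
        if matched is not None:
            hits.append(matched)
    return hits

index = build_index([("Anna Mikhailovna Ivanova", 42)])
print(find("We met Anna Mikhailovna Ivanova yesterday", index))  # → [42]
```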
Keywords to read up on: word form, normalization, computational linguistics, Porter stemmer, aot, mystem. For starters, I hope that's enough.
And by the way: enough of these contortions, stop reinventing the wheel. Have a look at Apache Solr (built on Lucene).