Z
Z
Zhandos M2018-12-19 12:47:51
Algorithms
Zhandos M, 2018-12-19 12:47:51

How can you implement fuzzy search in a string?

Hello!
What do you think, is it possible to use suffix trees to find, let's say, a combination of characters in a string. For example, in the abcdef line, the cab substring must be found and the abc def substring will correspond to it, or fed should be found and the abc def substring will correspond to it . That is, the search for a substring is not in a clear order of characters. When as in a suffix tree, the order of characters is one of the key points of its high search performance.
Suffix trees in short: a tree where there are branches for each substring suffix, for example for abcdef there will be branches a, b, ab, c, abc, bc, d, abcd, bcd, cd, etc. Accordingly, if we are looking for any substring, we simply go through the characters from the root, if there is such a branch, then the substring exists, that is, in fact, one search pass.
I also thought about splitting the original string into substrings of a certain length and calculating their hashes, then when searching, look to see if there is such a hash, but there will obviously be a lot of such substrings and their hashes, because for example, the cbd substring should also be found in the abcdef line , its position in the initial string at the first index
Or is there some other way? The most stupid and direct of course is a linear search.
Real task case
I am working on this algorithm to implement the search for phrases in the text, in the example above, for simplicity, the words are indicated by separate characters. Of course, all words, both in phrases and in the text, are normalized. It is necessary to find the occurrence of phrases (there are thousands of them) in the text, while the positions of words in the phrase and in the text may not coincide. For example: the phrase "good morning" should be found in the text "this morning is good".
I also thought about sphinx and others, but because there are thousands of searched phrases, it will be costly to search for them for each phrase.
Thank you!

Answer the question

In order to leave comments, you need to log in

2 answer(s)
E
Evgeny Kozlov, 2018-12-19
@lebron32rus

Levenshtein distance

E
Eugene, 2018-12-19
@EpOsS

The n-gramm algorithm, the lvenshtein distance, and in general there is a very good article on this topic on Habré https://habr.com/post/114997/

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question