How to count the number of repeated letters (segments) in a set of words?


PHP

Alexey Nikolaev, 2014-11-01 19:22:48

Good evening, everyone.
I have a non-trivial task: from an array of words, select those that share a common beginning, i.e. count the words most likely to be semantically related. For example, "автострада" (freeway) and "автомобиль" (car) share the prefix "авто", so both should appear in the final result. This can be done with several nested loops (loops can stand in for almost any algorithm), but the elegance and speed of such a solution are doubtful.
How would you approach implementing something like this? Can it be done at all, and are there known algorithms for it? I would be grateful for any advice, thanks.
P.S. Libraries (such as phpMorphy) are acceptable but not preferred.


3 answer(s)

Zaur Ashurbekov, 2014-11-01
@Heian

If the array of words is large, I suggest building directed trees (prefix trees, also known as tries) in which each node is a letter: the root is the first letter of a word, the second level holds the second letters, and so on to the end of each word. The number of matches can then be determined from the number of nodes, and the degree of similarity from the level (depth) of the node. Example:
Words: автострада (freeway), автомойка (car wash), автомобиль (car), авиация (aviation)
Tree:

А - В - Т - О - С - Т - Р - А - Д - А
    |       |
    И       М - О - Й - К - А
    |           |
    А           Б
    |           |
    Ц           И
    |           |
    И           Л
    |           |
    Я           Ь

A tree like this should be built for each letter that begins a word in the dictionary.
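
To make the idea concrete, here is a minimal sketch of such a prefix tree. It is written in Python rather than PHP purely for brevity; the same structure maps onto nested PHP arrays. The names `TrieNode`, `build_trie` and `shared_prefixes`, as well as the `min_len` threshold, are my own assumptions and not part of any library. Instead of one tree per first letter, it uses a single empty root whose children are the first letters, which amounts to the same thing.

```python
class TrieNode:
    """One letter in the prefix tree (hypothetical helper class)."""

    def __init__(self):
        self.children = {}  # letter -> TrieNode
        self.count = 0      # number of words whose path passes through this node


def build_trie(words):
    """Insert every word letter by letter, counting the words that share each node."""
    root = TrieNode()
    for word in words:
        node = root
        for letter in word.lower():
            node = node.children.setdefault(letter, TrieNode())
            node.count += 1
    return root


def shared_prefixes(root, min_len=3):
    """Return {prefix: number_of_words} for prefixes shared by at least two words.

    min_len is an assumed cut-off: shorter prefixes are treated as too weak
    a signal of similarity and are skipped.
    """
    result = {}

    def walk(node, prefix):
        for letter, child in node.children.items():
            walk(child, prefix + letter)
        # Report a prefix only if no single child carries all of its words
        # further, i.e. the prefix cannot be extended without losing a word.
        extendable = any(c.count == node.count for c in node.children.values())
        if node.count >= 2 and len(prefix) >= min_len and not extendable:
            result[prefix] = node.count

    walk(root, "")
    return result


words = ["автострада", "автомойка", "автомобиль", "авиация"]
print(shared_prefixes(build_trie(words)))
```

For the four example words this reports "авто" as shared by three words and "автомо" by two; "авиация" only shares "ав" with the rest, which falls below the assumed `min_len` of 3. Building the tree takes time proportional to the total number of letters, so it avoids comparing every word with every other word in nested loops.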

Ilya Plotnikov, 2014-11-02
@ilyaplot

Perhaps Sphinx should be used?

Mnobody, 2014-11-06
@Mnobody

A slightly different problem is described here, but it may give you some ideas (if you want to dig into the topic):
habrahabr.ru/post/190694
You can also search for stemmers and lemmatizers.