Library for morphological parsing of phrases in Russian?
What library can do morphological parsing of phrases in Russian?
I need something that can be used from PHP.
In more detail: there are two lists of phrases as input, a main one and an extended one. Each phrase from the main list needs to be matched against all possible phrases from the extended list, taking the morphology of the Russian language into account.
For example, the main list (one-dimensional array):
1. buy medical scales
2. orthopedic mattress
Extended list (one-dimensional array):
1. buy medical scales in Moscow
2. buy medical scales in Perm
3. buy medical scales
4. sell medical scales
5. buy medical scales
6. orthopedic mattresses
7. orthopedic mattress
8. sale of orthopedic mattresses
9. one-and-a-half mattress
As output, I need to know which phrases from the extended list include any phrase from the main list, so that I end up with the following list (two-dimensional array):
1. buy medical scales:
1.1. buy medical scales in Moscow
1.2. buy medical scales in Perm
1.3. purchase of medical scales
2. orthopedic mattress:
2.1. orthopedic mattresses
2.2. orthopedic mattress
2.3. sale of orthopedic mattresses
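In PHP terms, the input and the expected result might be represented roughly like this (the array layout is only an illustration, not a requirement):
<?php
// Main list (one-dimensional array)
$main = [
    'buy medical scales',
    'orthopedic mattress',
];
// Extended list (one-dimensional array)
$extended = [
    'buy medical scales in Moscow',
    'buy medical scales in Perm',
    'buy medical scales',
    'sell medical scales',
    'buy medical scales',
    'orthopedic mattresses',
    'orthopedic mattress',
    'sale of orthopedic mattresses',
    'one-and-a-half mattress',
];
// Expected result (two-dimensional array): main phrase => extended phrases that include it
$expected = [
    'buy medical scales' => [
        'buy medical scales in Moscow',
        'buy medical scales in Perm',
        'purchase of medical scales',
    ],
    'orthopedic mattress' => [
        'orthopedic mattresses',
        'orthopedic mattress',
        'sale of orthopedic mattresses',
    ],
];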
I understand that morphological analysis of phrases may produce errors and that not all phrase variants will end up linked. I'm even ready to put up with a mistake where phrases like "whaling" and "Chinese industry" get linked.
See the similar_text() function:
$sovpalo=similar_text($stroka1,$stroka2,$prc);
Return value: the number of matching characters (the third argument receives the match percentage by reference).
Check:
if ($prc > 10 && $sovpalo >= mb_strlen($stroka1) / 2) {
    /*
    (if the match percentage is greater than 10 and the number of matching
    characters is at least half the length of the first string)
    put the phrase into the sub-item...
    */
}
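Put together, a rough grouping based on this check could look like the sketch below. The thresholds come from the snippet above and would need tuning; also note that similar_text() works on bytes while mb_strlen() counts characters, so for Cyrillic UTF-8 text the two measures are not directly comparable.
<?php
// Rough grouping sketch: attach extended phrases to main phrases via similar_text().
// Thresholds (10% and half the length) are only a starting point.
function group_by_similarity(array $main, array $extended) {
    $result = [];
    foreach ($main as $phrase) {
        $result[$phrase] = [];
        foreach ($extended as $candidate) {
            $matched = similar_text($phrase, $candidate, $percent);
            if ($percent > 10 && $matched >= mb_strlen($phrase) / 2) {
                $result[$phrase][] = $candidate;
            }
        }
    }
    return $result;
}

$main     = ['buy medical scales', 'orthopedic mattress'];
$extended = ['buy medical scales in Moscow', 'sale of orthopedic mattresses'];
print_r(group_by_similarity($main, $extended));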
For lemmatization and declension of individual words there is phpmorphy. You will probably have to handle the phrase-level work yourself.
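One possible way to do that phrase-level work on top of a lemmatizer: lemmatize every word and treat an extended phrase as a match when it contains all the lemmas of a main phrase. In the sketch below, lemmatize_word() is only a placeholder for whatever phpmorphy (or another lemmatizer) returns; it is not phpmorphy's actual API.
<?php
// Sketch of lemma-based matching. lemmatize_word() is a stub that should return
// the normal form of a Russian word (e.g. via phpmorphy); here it only lowercases
// the word so the sketch runs on its own.
function lemmatize_word($word) {
    return mb_strtolower($word, 'UTF-8'); // replace with a real lemmatizer
}

function phrase_lemmas($phrase) {
    $words = preg_split('/\s+/u', trim($phrase));
    return array_map('lemmatize_word', $words);
}

// An extended phrase matches a main phrase if it contains every one of its lemmas.
function phrase_matches($mainPhrase, $extendedPhrase) {
    $needed = phrase_lemmas($mainPhrase);
    $have   = phrase_lemmas($extendedPhrase);
    return count(array_diff($needed, $have)) === 0;
}

$main     = ['buy medical scales', 'orthopedic mattress'];
$extended = ['buy medical scales in Moscow', 'sale of orthopedic mattresses'];

$result = [];
foreach ($main as $m) {
    foreach ($extended as $e) {
        if (phrase_matches($m, $e)) {
            $result[$m][] = $e;
        }
    }
}
print_r($result);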
In a narrow subject area, finite automata work well.
In a broad one there will be a lot of errors; you need to sort by topic first, then apply lemmatization, and then evaluate similarity word by word.
As a result, it gets quite slow on samples of more than 50 phrases.
For this reason it may be more efficient to obtain the lemmas by simply cutting off suffixes, prefixes, and endings, keeping only the roots (as is done for English phrases), and then look for similar ones. Performance will be acceptable, but there will be more errors.
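As a very rough illustration of the "cut off endings, keep the roots" idea: the ending list below is deliberately incomplete and invented for the example, and a proper Russian stemmer (e.g. a Porter-style one) would do noticeably better.
<?php
// Very crude "stemming": strip a few common Russian endings and compare the
// remaining pseudo-roots. This is exactly the faster-but-less-accurate trade-off
// described above.
function crude_stem($word) {
    $word = mb_strtolower($word, 'UTF-8');
    $endings = ['ами', 'ями', 'ого', 'его', 'ов', 'ей', 'ам', 'ах',
                'ый', 'ий', 'ая', 'ие', 'ы', 'и', 'а', 'я', 'о', 'е', 'у', 'ю'];
    foreach ($endings as $ending) {
        $endLen = mb_strlen($ending, 'UTF-8');
        if (mb_strlen($word, 'UTF-8') - $endLen >= 3
            && mb_substr($word, -$endLen, null, 'UTF-8') === $ending) {
            return mb_substr($word, 0, mb_strlen($word, 'UTF-8') - $endLen, 'UTF-8');
        }
    }
    return $word;
}

// Example: "матрасы" and "матрас" reduce to the same pseudo-root "матрас".
var_dump(crude_stem('матрасы') === crude_stem('матрас')); // bool(true)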