What to read about word normalization algorithms?


Algorithms

Mercury13, 2014-03-14 03:23:06

I want to extract every possible base (dictionary) form of a word. That is, for the long-suffering “daughter of the general” the result should be “daughter” (verb), “daughter” (noun), “generate” (verb), “general” (noun), and, so be it, “daughter” as well, if the program happens to have a list of "wrong" words.
1. Where can I read about it?
2. What is the best way to formalize all these rules in data files (XML or similar)?
3. Is there a corpus of Russian words somewhere that lists the part of speech and all forms of each word?


1 answer(s)

Ivan Starkov, 2014-03-14
@Mercury13

There are several options for Russian. Some return only the base form; others also report the part of speech.
I'll go point by point, from the simplest to the most complex. I don't know whether these tools work under Windows; I use OS X and Linux myself.
1) Stemmers. A stemmer trims a word by stripping off, according to a set of rules, whatever it takes to be endings, suffixes, and prefixes.
Personally, I really like the Russian stemmer from the package https://github.com/NaturalNode/natural
Here is some simple code that shows how the stemmer works: https://github.com/NaturalNode/natural/blob/master...
Advantages: very fast; suitable for preliminary analysis in 100% of cases.
Disadvantages: the resulting stem is at times very far from the real base form of the word.
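To make the idea of rule-based stripping concrete, here is a minimal sketch in Python. It is a toy, not the Snowball algorithm and not the code from the natural package: the suffix list, the MIN_STEM limit, and the name toy_stem are all made up for illustration.

```python
# Toy suffix-stripping stemmer. This is NOT the Snowball/Porter algorithm
# and NOT the code from the "natural" package; the suffix list below is a
# small hypothetical sample chosen only to illustrate the idea.
SUFFIXES = sorted(
    ["ами", "ями", "ого", "его", "ать", "ять", "ала", "ила",
     "ов", "ев", "ой", "ый", "ий", "ая", "яя", "ое", "ее",
     "ам", "ям", "ах", "ях", "ом", "ем", "ть",
     "а", "я", "ы", "и", "у", "ю", "е", "о", "ь"],
    key=len,
    reverse=True,  # try the longest suffixes first
)
MIN_STEM = 2  # hypothetical limit: never cut a word below two letters


def toy_stem(word):
    """Strip the longest matching suffix, or return the word unchanged."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


print(toy_stem("женщинами"))  # женщин
print(toy_stem("посеяла"))    # посеял
print(toy_stem("горох"))      # горох
```

Note that "посеяла" becomes "посеял", which is neither the dictionary form "посеять" nor a linguistically meaningful root. That is exactly the drawback mentioned above: a stemmer gives you a stable key for matching, not a real base form.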
==========================================================
2) aspell, a Linux package for spell checking.
Example: echo chris rode a bike | aspell -a -d russian --sug-mode=ultra
Output:
+ ride
& krisa 13 6: kitty, rice, iris, beauty
*
+ bike
Disadvantages: slow; does not report the part of speech.
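If you want to consume that output from a program, the lines are easy to classify. Below is a rough sketch in Python, assuming the ispell-compatible pipe format that aspell -a prints: "*" for a word found as is, "+ root" for a word found through its root, "& word count offset: suggestions" for a miss with suggestions, and "# word offset" for a miss without any. The function name and the dictionary keys are my own invention.

```python
def parse_aspell_line(line):
    """Classify one result line of `aspell -a` (ispell-style pipe mode).

    The returned dict layout is a hypothetical convention for this sketch.
    """
    line = line.rstrip("\n")
    if not line or line.startswith("@"):
        return None  # blank separator line or the version banner
    if line == "*":
        return {"status": "ok"}  # word is in the dictionary as is
    if line.startswith("+ "):
        return {"status": "root", "root": line[2:]}  # found via its root
    if line.startswith("& "):
        # "& original count offset: suggestion, suggestion, ..."
        head, _, tail = line.partition(": ")
        _, original, count, offset = head.split(" ")
        return {
            "status": "miss",
            "word": original,
            "count": int(count),
            "offset": int(offset),
            "suggestions": tail.split(", "),
        }
    if line.startswith("# "):
        # "# original offset": unknown word, no suggestions
        _, original, offset = line.split(" ")
        return {"status": "unknown", "word": original, "offset": int(offset)}
    return {"status": "other", "raw": line}


for sample in ["+ ride", "& krisa 13 6: kitty, rice, iris, beauty", "*"]:
    print(parse_aspell_line(sample))
```

For normalization, the interesting lines are the "+" ones: the root that aspell reports there is what you would take as the base form.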
==========================================================
3) TreeTagger, a part-of-speech tagger.
Example: echo 'a woman sowed peas' | cmd/tree-tagger-russian
Output:
woman Ncfsny
sowed Vmis-sfa-e
peas Ncmsan
Tags such as Ncfsny are explained here: corpus.leeds.ac.uk/mocky/msd-ru.html
That is, besides the part of speech and the base form, this tool also outputs a lot of additional information, from grammatical case to ....
Advantages: great! It determines the part of speech even when the word is not in its dictionary.
Disadvantages: it determines the part of speech but does not always give the base form, so you have to use it together with aspell or a stemmer. It is the slowest of the three.
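The tags are positional: the first letter is the part of speech, and each following letter encodes one feature. Here is a small sketch in Python that decodes noun tags only. The position order (type, gender, number, case, animacy) and the letter codes are my reading of the MULTEXT-East style table linked above, so treat them as an assumption and check them against that page; the function and table names are made up.

```python
# Positions after the leading "N" in a noun tag. The order and the codes
# are an assumption based on the MULTEXT-East style tagset; verify them
# against the tagset description before relying on this.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter", "c": "common"}),
    ("number", {"s": "singular", "p": "plural"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative",
              "i": "instrumental"}),
    ("animate", {"n": "no", "y": "yes"}),
]


def decode_noun_tag(tag):
    """Turn a noun tag such as 'Ncfsny' into a feature dictionary."""
    if not tag.startswith("N"):
        raise ValueError("this sketch only handles noun tags: " + tag)
    features = {"pos": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, tag[1:]):
        if code != "-":  # "-" means the feature does not apply
            features[name] = values.get(code, code)
    return features


print(decode_noun_tag("Ncfsny"))
print(decode_noun_tag("Ncmsan"))
```

Under these assumptions, Ncfsny reads as a common feminine singular noun in the nominative case, animate, and Ncmsan as a common masculine singular noun in the accusative case, inanimate. Verbs and other parts of speech have their own position tables, which you would add in the same way.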
==========================================================
Yandex also has a product of its own, mystem: https://company.yandex.ru/technologies/mystem/
I have not used it.
Good luck!