Can anyone recommend a JS library for extracting the roots of Russian words?


JavaScript

spmbt, 2012-09-17 13:51:59

To count identical words regardless of case and declension, you need a library that knows the declension rules and, possibly, a list of exceptions for commonly used words (such as “ice-ice”, “go-go-go”). With it, one could build a word frequency counter for an article and attach it to Habr, where it would show what the article is about better than the tags and hubs chosen by the author. The general idea is this: it does not have to be very accurate (errors in the article will wipe out any extra accuracy anyway), as long as it gives an idea of which words are frequent. (Later we would filter out general vocabulary, but those are details; first we need an engine.)


2 answers


Andrew D.Laptev, 2012-09-17
@agsh

I can't suggest anything for extracting roots, but a stemming library may suit the task (it does much the same thing, except that it extracts the stem of the word rather than the morphological root): urim.googlecode.com/svn/jsSnowball/stemmer/src/ext/RussianStemmer.js
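
For illustration, here is a minimal sketch of the frequency counter from the question, built around a pluggable stemming function. The names `countStems` and `naiveStem` are hypothetical, and `naiveStem` is only a toy stand-in that strips a few common endings; it does not reproduce the API of the library linked above, which would be passed in as `stem` instead.

```javascript
// Count how often each word stem occurs in a text.
// `stem` is any function that maps a lowercase word to its stem;
// the real library's API is deliberately not assumed here.
function countStems(text, stem) {
  const counts = new Map();
  // Split on anything that is not a Latin or Cyrillic letter.
  const words = text.toLowerCase().split(/[^a-zа-яё]+/).filter(Boolean);
  for (const word of words) {
    const key = stem(word);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Most frequent stems first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical toy stemmer: strips a few common Russian noun endings.
// A real stemmer handles far more cases.
function naiveStem(word) {
  const stripped = word.replace(/(ами|ями|ов|ах|ях|ам|ям|ой|ом|а|я|ы|и|у|е|о)$/, "");
  // Keep very short words intact so they do not collapse to nothing.
  return stripped.length >= 2 ? stripped : word;
}

const result = countStems("Книга, книги и книгу: стол, столы", naiveStem);
console.log(result); // [ [ 'книг', 3 ], [ 'стол', 2 ], [ 'и', 1 ] ]
```

Passing the stemmer as a parameter keeps the counter independent of any particular library, so a toy stemmer can be swapped for a real one without touching the counting logic.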


Andrew D.Laptev, 2012-09-19
@agsh

Then you could probably do something like this:
exceptions -> the stem of the word, or a word suitable for stemming (from a dictionary, “ice -> ice” or “ice -> ice”),
then
stemming the words -> the stem of the word (“ice”, “ice” -> “ice”),
and then
the resulting word stems -> the standard form of the word (from a dictionary, “ice -> ice, horse -> horse”).
I have never done this myself. If you take it on and find suitable dictionaries, please let me know how it goes.
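
The three stages above might be wired together roughly as follows. This is only a sketch under assumptions: the `exceptions` and `canonical` tables hold a few illustrative Russian entries (forms of «лёд» and «конь») chosen for this example rather than taken from a real dictionary, and `toyStem` is a hypothetical stand-in for a proper stemmer.

```javascript
// Stage 1 data: irregular forms mapped to a word that stems well.
// Illustrative entries only, not taken from a real dictionary.
const exceptions = new Map([
  ["льда", "лёд"],
  ["льду", "лёд"],
  ["льдом", "лёд"],
]);

// Stage 3 data: stem -> standard (dictionary) form of the word.
const canonical = new Map([
  ["лёд", "лёд"],
  ["кон", "конь"],
]);

// Hypothetical toy stemmer; a real one would replace it.
function toyStem(word) {
  const stripped = word.replace(/(ями|ами|ем|ём|ю|я|и|ь)$/, "");
  return stripped.length >= 2 ? stripped : word;
}

// exceptions -> stemming -> standard form
function normalize(word, stem) {
  const lower = word.toLowerCase();
  const base = exceptions.get(lower) || lower; // stage 1: known exceptions
  const stemmed = stem(base);                  // stage 2: stemming
  return canonical.get(stemmed) || stemmed;    // stage 3: standard form
}

console.log(normalize("Льда", toyStem));   // лёд
console.log(normalize("конями", toyStem)); // конь
```

Words missing from both tables simply come back as their stem, so the tables can start small and grow as dictionaries are found.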