PHP
misterkust, 2014-03-06 16:15:47

How to normalize Russian names?

Good afternoon.
There is a text/CSV file (the format doesn't matter) with some user information, including a "Name" field.
For example: "sanya", "sasha", "aleksandr", and so on.
That is, the same name appears in a variety of spellings.
The task is to bring all the names to a single form, such as "Alexander".
I searched all over Google and Yandex and found nothing on the topic.
Has anyone faced similar issues?
Perhaps someone can tell me where to get a dictionary for this case?
The implementation language doesn't matter; I'm interested in the algorithm and, if one exists, the dictionary of names itself.
Thanks in advance to all who respond!


3 answer(s)
Ilya, 2014-03-06
@Gorily

I don't know of any ready-made solutions, but perhaps this approach will do:
1. Parse the list of names from Wikipedia: en.wikipedia.org/wiki/%D0%9A%D0%B0%D1%82%D0%B5%D0%...
2. Parse the page for each name, say: en.wikipedia.org/wiki/%D0%90%D0%BB%D0%B5%D0%BA%D1%... and extract the derived forms.
3. Put all of this into a simple database or an XML/JSON file.
4. Try it out, edit the database, and add exotic variants. Names that are missing from the database entirely (including typos) are left for manual editing.
Instead of parsing the live wiki, you can download a copy of it from torrents for this purpose. If you do parse it online, use the mobile version.
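
Steps 3 and 4 above could be sketched roughly like this in Python. This is only a minimal illustration, not a finished solution: the JSON layout, the sample names, and the function names `build_index` and `normalize` are all assumptions made for the example, and the real dictionary would come from the parsed pages.

```python
import json

# Hypothetical sample of the data produced by steps 1-3; in practice it would
# be built from the parsed name pages. Keys are full names, values are variants.
NAMES_JSON = """
{
  "Alexander": ["aleksandr", "sasha", "sanya"],
  "Maria": ["masha", "mariya"]
}
"""


def build_index(names_json):
    """Invert {full name: [variants]} into {case-folded variant: full name}."""
    index = {}
    for full_name, variants in json.loads(names_json).items():
        index[full_name.casefold()] = full_name
        for variant in variants:
            index[variant.casefold()] = full_name
    return index


def normalize(raw_names, index):
    """Split input into normalized names and unknown ones left for manual editing."""
    normalized, unknown = [], []
    for raw in raw_names:
        key = raw.strip().casefold()
        if key in index:
            normalized.append(index[key])
        else:
            unknown.append(raw)  # typos and missing names end up here
    return normalized, unknown


index = build_index(NAMES_JSON)
normalized, unknown = normalize(["sanya", " Sasha ", "aleksandr", "sahsa"], index)
```

Anything that lands in `unknown` is the material for the manual pass in step 4: either a typo to fix by hand or a variant to add to the dictionary.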

Michael Danilov, 2014-03-07
@MonkAlbino

You can supplement @Gorily's approach with the dictionary of names on Gramota.ru, but there is a big pitfall: some short forms correspond to several full names.
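
One possible way to cope with that pitfall, sketched in Python: keep a set of candidate full names for every short form and normalize automatically only when exactly one candidate exists. The sample data and the names `build_candidates` and `resolve` are hypothetical and used purely for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: full name -> short forms.
# "sasha" is deliberately listed under two full names.
FULL_TO_SHORT = {
    "Alexander": ["sasha", "sanya"],
    "Alexandra": ["sasha"],
}


def build_candidates(full_to_short):
    """Map every case-folded spelling to the set of full names it may stand for."""
    candidates = defaultdict(set)
    for full_name, shorts in full_to_short.items():
        candidates[full_name.casefold()].add(full_name)
        for short in shorts:
            candidates[short.casefold()].add(full_name)
    return candidates


def resolve(raw, candidates):
    """Return (full name or None, sorted candidate list).

    The full name is returned only when exactly one candidate exists;
    ambiguous and unknown spellings yield None and need a manual decision.
    """
    options = sorted(candidates.get(raw.strip().casefold(), ()))
    if len(options) == 1:
        return options[0], options
    return None, options


candidates = build_candidates(FULL_TO_SHORT)
```

Here `resolve("sanya", candidates)` picks a single full name, while `resolve("sasha", candidates)` returns `None` together with both candidates, so the record can be settled by hand or with other fields from the file.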

kompi, 2014-03-06
@kompi

This was all invented long ago.
Look at how specialized search engines (Sphinx, Lucene/Solr, etc.) handle it and which dictionaries they use.
