What methods are there for extracting a full name (surname, first name, patronymic) from text?

Programming

Fedor Malyshkin, 2015-12-29 21:49:12

Of course, a universal algorithm is unlikely to exist. I am more interested in practical implementations and articles that give an idea of which directions are worth pursuing.
I would like to be able to extract full names in forms such as "Putin V.V.", "Putin Vladimir Vladimirovich" or "Vladimir Vladimirovich", in any grammatical case.

6 answer(s)

xmoonlight, 2015-12-29
@fedor_malyshkin

Well, here is a template for a regexp, off the top of my head:
1. Two or three words, separated by one or more characters that are neither letters nor digits.
2. The first letter of every word is capitalized.
3. At least one word is longer than one letter.
4. If a word consists of a single letter, it must be followed by a "." (dot).
5. Check the Levenshtein distance against a dictionary of first names, surnames and patronymics (trying their combinations).
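
The five rules above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions that are not in the answer itself: the separator is narrowed to whitespace (so a comma does not glue two names together), the alphabet is limited to Latin and Russian Cyrillic letters, and the names `extract_candidates`, `known_names` and `max_distance` are made up for the example. A capitalized word at the start of a sentence will still be picked up if a name follows it directly, so treat the output as candidates rather than answers.

```python
import re

UPPER = "A-ZА-ЯЁ"   # assumed alphabet: Latin plus Russian Cyrillic capitals
LOWER = "a-zа-яё"
WORD = rf"[{UPPER}][{LOWER}]+\b"   # rule 2: capitalized word, e.g. "Putin"
INITIAL = rf"[{UPPER}]\."          # rule 4: one capital letter followed by a dot
TOKEN = rf"(?:{WORD}|{INITIAL})"
# After an initial the separator may be empty ("V.V."); after a word it must be whitespace.
SEP = r"(?:(?<=\.)\s*|\s+)"
# Rule 1: two or three tokens in a row, not starting in the middle of a word.
CANDIDATE = re.compile(rf"(?<!\w){TOKEN}(?:{SEP}{TOKEN}){{1,2}}")


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def extract_candidates(text, known_names=None, max_distance=2):
    """Return substrings that look like full names according to rules 1-5."""
    results = []
    for match in CANDIDATE.finditer(text):
        tokens = re.findall(TOKEN, match.group())
        words = [t for t in tokens if not t.endswith(".")]
        if not words:
            continue  # rule 3: initials only, e.g. "A. B."
        if known_names is not None:
            # Rule 5: at least one word must be close to a dictionary entry.
            close = any(
                levenshtein(w.lower(), n.lower()) <= max_distance
                for w in words
                for n in known_names
            )
            if not close:
                continue
        results.append(match.group())
    return results


text = ("The meeting was opened by Putin V.V., and later "
        "Vladimir Vladimirovich spoke with A. B. about it.")
print(extract_candidates(text))  # ['Putin V.V.', 'Vladimir Vladimirovich']
```

Passing a list as `known_names` turns on the dictionary check, which also tolerates small typos such as "Vladminir" (edit distance 2 from "Vladimir").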

Sergey, 2015-12-30
@begemot_sun

Try the Tomita parser from Yandex.

frees2, 2015-12-29
@frees2

Google is working on semantic search in which every name and surname is assigned its own gibberish-looking identifier with dashes (for example, "/m/05qt0" is Politics; Putin has one in Russian as well, and so does even Mizulina). The rest of the vocabulary gets identifiers too, and the search already works; I see the same result for several expressions. It is easy for English, but few Russian words are covered so far. It is supposed to be searchable there in v3. In 10 years this problem probably won't exist.

Alexey Yeletsky, 2015-12-29
@Tiendil

There should be corpora that list all first names, surnames and patronymics. At a minimum, such a list can be extracted automatically from Wikipedia or DBpedia (structured data from Wikipedia).
1. Find existing corpora or build your own.
2. Search the text for a match on at least one word (preferably a partial match, to handle typos and declensions).
3. Once a match is found, take the neighborhood of that word (a couple of words to the left and right) and analyze it with heuristics.
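
A minimal Python sketch of these three steps might look like this. Everything specific in it is an assumption made for illustration: the tiny `NAME_DICTIONARY` stands in for a list built from Wikipedia or DBpedia, fuzzy matching is done with the standard library's `difflib.get_close_matches`, and the "heuristic" is simply to extend the hit to adjacent capitalized words within the window.

```python
import re
from difflib import get_close_matches

# Hypothetical mini-dictionary; a real one would come from Wikipedia or DBpedia.
NAME_DICTIONARY = ["vladimir", "vladimirovich", "putin", "dmitry", "medvedev"]


def find_names(text, dictionary=NAME_DICTIONARY, window=2, cutoff=0.8):
    """Find dictionary hits, then grow each hit over capitalized neighbors."""
    # Words made of letters only, optionally followed by a dot (keeps initials like "V.").
    tokens = re.findall(r"[^\W\d_]+\.?", text)
    spans = []
    for i, tok in enumerate(tokens):
        # Step 2: partial match, so declined forms and small typos still hit.
        key = tok.lower().rstrip(".")
        if not get_close_matches(key, dictionary, n=1, cutoff=cutoff):
            continue
        # Step 3: look at most `window` tokens left and right of the hit,
        # keeping neighbors only while they start with a capital letter.
        start = end = i
        while start > 0 and i - (start - 1) <= window and tokens[start - 1][0].isupper():
            start -= 1
        while (end + 1 < len(tokens) and (end + 1) - i <= window
               and tokens[end + 1][0].isupper()):
            end += 1
        if (start, end) not in spans:
            spans.append((start, end))
    return [" ".join(tokens[s:e + 1]) for s, e in spans]


# "Vladimira Putina" imitates a declined (genitive) form.
print(find_names("Today the president Vladimira Putina met journalists."))
# ['Vladimira Putina']
```

The `cutoff` value trades recall for precision: lowering it catches more declensions and typos but also more ordinary words that merely resemble a name.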

to_climb, 2015-12-30
@to_climb

If the task is serious (not a hobby), then since Tomita has been mentioned, I will also mention another heavy-duty text processor: ABBYY Tagger. Dictionaries and rules are included, but it is not a budget option.

Nikita Zhiltsov, 2015-04-26
@nzhiltsov

This is known as the named entity recognition (NER) problem; in your case, recognition of person names. Our product, Textocat API, can do this: see for yourself on the demo page for Russian, or get a free API key after registering on our website.