How to get named entities from text?

Algorithms

litehaus, 2016-07-25 17:56:01

Good afternoon. Has anyone come across algorithms for extracting named entities from text? For example, from the text "Apple will release an iPhone Pro version instead of iPhone 7 Plus", how can I get Apple, iPhone 7 Plus and iPhone Pro? Or from the text "French Air Force reported escort of two Tu-160 over the English Channel", how can I get Air Force, France, Tu-160 and the English Channel?
The first thing that comes to mind is to pull out words that start with a capital letter and are not at the beginning of a sentence, but this only works for names, and even then only for names that do not come at the beginning of a sentence. The second option is a dictionary search, but then how do I extract whole phrases from the text, such as "iPhone 7 Plus"? If anyone can help, or if there is a ready-made library for this, I will be very grateful.

2 answer(s)

xmoonlight, 2016-07-25
@xmoonlight

The easiest way is to work by elimination:
1. Select the nouns.
2. Find the other parts of speech (everything except nouns) and "etch" them out of the text.
3. Combine the results of steps 1 and 2.
4. Clean the result of step 3 of "garbage" where possible (this takes some thought; probably something with regex).
You can try my lisaped (PHP) for picking out parts of speech.
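As a rough illustration of the "etching" in step 2, here is a minimal Python sketch. It does not use lisaped. Instead it assumes a tiny hand-made list of non-noun words (`NON_NOUN_WORDS` is hypothetical and covers only the example sentence); a real program would need a part-of-speech tagger or a much larger lexicon.

```python
import re

# Hypothetical, hand-made list of non-noun words for the example sentence.
# This is an assumption for illustration, not a real lexicon.
NON_NOUN_WORDS = {"will", "release", "an", "instead", "of"}


def noun_candidates(text, non_nouns=NON_NOUN_WORDS):
    """Cut known non-noun words out of the text and return what is left.

    Consecutive surviving words are kept together as one candidate phrase.
    """
    tokens = re.findall(r"[\w-]+", text)
    candidates, current = [], []
    for token in tokens:
        if token.lower() in non_nouns:
            # A non-noun word ends the phrase collected so far.
            if current:
                candidates.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        candidates.append(" ".join(current))
    return candidates


print(noun_candidates(
    "Apple will release an iPhone Pro version instead of iPhone 7 Plus"
))
```

With this word list, the call returns "Apple", "iPhone Pro version" and "iPhone 7 Plus". The leftover word "version" is the kind of "garbage" that step 4 still has to clean up.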
UPD: I'll add more, following the example in the question:
1. Scan for all words that contain at least one capital letter, that mix letters and digits, or that are written in a language other than the main language of the text (the boundary of such a fragment is the point where the text switches back to the main language). (Result-1)
2. Select those same words where they occur both at the beginning of a sentence and in the middle of any sentence, with at least one capital letter anywhere inside the word. (Result-2)
3. Combine Result-1 and Result-2 to get Result-3.
4. Look up each Result-3 word in the text. If there is a paired delimiter (quotes, brackets, etc.) to the left or right of the word (ignoring everything except words), find the matching second delimiter, glue everything between the two delimiters to the current word, and put it in Result-4. Otherwise, put the current word itself in Result-4.
5. Result-4 will contain everything that is needed.
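The steps above could be sketched in Python roughly as follows. This is one possible reading of them, not a definitive implementation. All function names are made up for illustration, and the sketch makes these assumptions:
- The language-switch check from step 1 is left out.
- A sentence-initial word is kept only if it has a capital letter after its first character or also appears mid-sentence.
- "Next to a paired delimiter" is simplified to "inside a pair of quotes or brackets".

```python
import re


def is_marked(word):
    """Step 1 (partial): a capital letter anywhere, or letters mixed with digits."""
    has_upper = any(ch.isupper() for ch in word)
    has_alpha = any(ch.isalpha() for ch in word)
    has_digit = any(ch.isdigit() for ch in word)
    return has_upper or (has_alpha and has_digit)


def marked_words(text):
    """Steps 1-3: collect marked words, treating sentence-initial words with care."""
    mid_sentence = set()   # marked words seen away from a sentence start
    first_words = []       # marked words seen at a sentence start
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        for i, word in enumerate(re.findall(r"[\w-]+", sentence)):
            if not is_marked(word):
                continue
            if i == 0:
                first_words.append(word)
            else:
                mid_sentence.add(word)
    result = set(mid_sentence)
    for word in first_words:
        # Every sentence starts with a capital, so ask for extra evidence.
        if word in mid_sentence or any(ch.isupper() for ch in word[1:]):
            result.add(word)
    return result


def expand_delimited(text, words):
    """Step 4: replace words found inside quotes or brackets by the whole span."""
    result = set(words)
    pattern = r'"([^"]+)"|\(([^)]+)\)|\[([^\]]+)\]'
    for match in re.finditer(pattern, text):
        inner = next(group for group in match.groups() if group)
        inner_words = set(re.findall(r"[\w-]+", inner))
        if inner_words & words:
            result -= inner_words
            result.add(inner)
    return result


def named_entity_candidates(text):
    return expand_delimited(text, marked_words(text))


sample = 'The shop sells the "iPhone 7 Plus" and the MacBook. Apple confirmed it.'
print(named_entity_candidates(sample))
```

For the sample text this yields "iPhone 7 Plus" and "MacBook". "Apple" is lost because it only occurs at the start of a sentence, which is exactly the weakness the question points out.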

abcd0x00, 2016-07-27
@abcd0x00

Write the program to do it the way you do it yourself (in your head).
Suppose you can tell that a word is a noun because there is a verb to its right. That means you know the verb is a verb. How do you know? You have a list of verbs in your memory, because you went to school and memorized them. If you are given a text in Chinese, you will not be able to pick anything out of it, because you have no list of Chinese verbs in memory, since you did not go to a Chinese school. This means your program should store a list of verbs and look at the word to the left of each one. But the word on the left can also be an adverb. And how do you know that the adverb "easily", for example, is an adverb? You need a list of adverbs as well. So the program should consult exactly the same lists that you have in your head.
You just need to analyze how you yourself decide which word is the right one and which is not, and then write a program that follows that procedure as closely as possible.
Analyzing real texts is a complex topic. But if you have some initial constraints, you can write something.
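To make the list idea concrete, here is a minimal Python sketch. The `VERBS` and `ADVERBS` sets are tiny hypothetical lists invented for this example, and the function only implements the single rule described above: take the word to the left of a known verb, skipping over any adverbs.

```python
import re

# Hypothetical word lists standing in for "the lists in your head".
VERBS = {"will", "release", "reported", "announced"}
ADVERBS = {"easily", "quickly", "officially"}


def words_before_verbs(text, verbs=VERBS, adverbs=ADVERBS):
    """Return the words standing directly to the left of known verbs."""
    words = re.findall(r"[\w-]+", text)
    found = []
    for i, word in enumerate(words):
        if word.lower() not in verbs:
            continue
        j = i - 1
        # Step over adverbs between the candidate word and the verb.
        while j >= 0 and words[j].lower() in adverbs:
            j -= 1
        # Ignore the case where another verb stands on the left ("will release").
        if j >= 0 and words[j].lower() not in verbs:
            found.append(words[j])
    return found


print(words_before_verbs("Apple officially announced the iPhone"))
```

Here the call returns "Apple". The rule finds only one word per verb, so multi-word names such as "French Air Force" would need further rules and further lists.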