How to determine the types of numbers in the text?

Sergey Tikhonov, 2017-11-29 19:01:05

Neural networks

Let's say there are two types of numbers in the text:

Salary
Phone number

Here is a source text: "We need programmers 25000 28315"
Here 25000 should be classified as "salary" and 28315 as "phone".
The salary can also be written like this: 25,000, 25t, 25 tr., 25,000 rubles, etc.
The phone can be written like this: tel. 28315, phone 28315, 2 83 15, 2-83-15, 283-15, etc.
There is another case as well: the last two numbers may be 28310, 28315, and both of them are phone numbers.
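Before any classification, the spelling variants listed above can be collapsed to one canonical form. This is only a minimal sketch: the function names are hypothetical, and the suffixes "t" and "tr." (assumed to mean "thousand") and the word "rubles" are taken from the examples in the question, not from any real rule set.

```python
import re
from typing import Optional

# Assumption: "t" / "tr." after a number mean "thousand", as in "25t", "25 tr."
THOUSAND_SUFFIX = re.compile(r"^(\d+)\s*(t|tr\.?)$")


def normalize_phone(raw: str) -> str:
    """Drop spaces and hyphens so every spelling becomes one digit string."""
    return re.sub(r"[\s-]", "", raw)


def normalize_salary(raw: str) -> Optional[int]:
    """Turn a salary spelling into an integer, or None if it is not recognized."""
    text = raw.strip().lower()
    text = re.sub(r"\s*rubles$", "", text)  # drop a trailing currency word
    match = THOUSAND_SUFFIX.match(text)
    if match:
        return int(match.group(1)) * 1000
    digits = re.sub(r"[,\s]", "", text)  # "25,000" -> "25000"
    return int(digits) if digits.isdigit() else None
```

Normalizing first means the later checks only have to deal with plain digit strings, but it does not by itself decide whether "2 83 15" is one phone number or three separate numbers; that still needs context.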
Right now such ads are parsed with regular expressions, but the result is a nightmarish mess that causes a lot of problems.
I see two solutions. The first is to actually write and train a neural network. The second is to describe an algorithm, roughly like this: in the first pass we run through all the ads and attach tokens to the text; in the second pass we look only at the numbers, and at that point we take the context of the whole text into account. For example, for each number we have several checks for a phone and several for a salary. After running the checks we add up the points for the number, and if the total exceeds a threshold, we assign the number the corresponding type.
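The point-accumulation idea could look roughly like the sketch below. Everything in it is an assumption made for illustration: the keyword sets, the individual checks (round numbers look like salaries, 5 to 7 digits look like a local phone), the point values and the threshold.

```python
import re

# Hypothetical context words and cut-off; tune them on real ads.
PHONE_WORDS = {"phone", "tel", "tel."}
SALARY_WORDS = {"salary", "rubles", "tr", "tr."}
THRESHOLD = 2


def score_number(tokens, i):
    """Return (salary_points, phone_points) for the numeric token at index i."""
    value = int(tokens[i])
    prev_word = tokens[i - 1].lower() if i > 0 else ""
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
    salary = phone = 0
    if prev_word in SALARY_WORDS or next_word in SALARY_WORDS:
        salary += 2
    if prev_word in PHONE_WORDS:
        phone += 2
    if value % 500 == 0:  # assumption: salaries tend to be round
        salary += 2
    else:
        phone += 1
    if 5 <= len(tokens[i]) <= 7:  # assumption: local phone length
        phone += 1
    return salary, phone


def classify_numbers(text):
    """Label every purely numeric token as 'salary', 'phone' or 'unknown'."""
    tokens = text.split()
    labels = []
    for i, token in enumerate(tokens):
        if not re.fullmatch(r"\d+", token):
            continue
        salary, phone = score_number(tokens, i)
        if max(salary, phone) < THRESHOLD:
            label = "unknown"
        else:
            label = "salary" if salary > phone else "phone"
        labels.append((token, label))
    return labels


print(classify_numbers("We need programmers 25000 28315"))
```

Keeping each check as a separate line that adds points makes it easy to add or reweight checks later, which is where most of the pitfalls will show up.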
The problem with the neural network is that I am not strong in this area and I don't understand how to pass the entire text of the ad to it so that it can read it.
The problem with the algorithm is that its complexity scares me. I still don't fully understand all the pitfalls that may appear.
There is not much time to solve the problem (there is a deadline).
I am asking for help: have you solved a similar problem, and how?
How do I feed the entire sentence to the neural network so that it reads it?
Or should I forget about the neural network and work out an algorithm the old-fashioned way?

2 answer(s)

ivodopyanov, 2017-11-30
@Keyon

This task is called Named Entity Recognition (NER), and the state-of-the-art solution to it is BiLSTM + CRF. There is an example here: https://github.com/farizrahman4u/keras-contrib/blo...
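As a rough illustration of how a whole sentence is fed to a sequence-labeling model: every token gets an integer id and every token gets a tag, so the input and the target are two sequences of equal length. The tag names (B-SALARY, B-PHONE, O), the helper names and the padding length below are assumptions for this sketch and are not taken from the linked example.

```python
def build_index(items):
    """Map each distinct item to an integer id; 0 is reserved for padding."""
    index = {}
    for item in items:
        index.setdefault(item, len(index) + 1)
    return index


def encode(tokens, tags, word_index, tag_index, max_len):
    """Convert tokens and tags to fixed-length id sequences padded with 0."""
    x = [word_index[token] for token in tokens][:max_len]
    y = [tag_index[tag] for tag in tags][:max_len]
    pad = max_len - len(x)
    return x + [0] * pad, y + [0] * pad


tokens = "We need programmers 25000 28315".split()
tags = ["O", "O", "O", "B-SALARY", "B-PHONE"]  # one tag per token

word_index = build_index(tokens)
tag_index = build_index(tags)
x, y = encode(tokens, tags, word_index, tag_index, max_len=8)
print(x)
print(y)
```

A real vocabulary would be built from the whole corpus, and the padded positions would normally be masked so they do not affect training.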
The main problem is how to label the dataset. When I recently solved the same problem, I came up with this approach:
1) Replace all digits with a single character, for example "0".
2) Split the text into words.
3) Build a frequency dictionary of n-grams (n from 1 to some k, chosen manually) that contain at least one word with a digit. N-grams are needed because many numeric facts span more than one word, such as phone numbers in the format 7 000 000 00 00 or passport numbers 0000 000000.
4) Generate vector representations for these n-grams using word2vec or an equivalent. That is, we split the phrase into words, then merge the n words around a word containing digits into one token and give that to the model. This way, n-grams for the various spellings of phone numbers end up more or less close together.
5) Start manually labeling the n-grams, sorted by frequency. If desired, you can then take a labeled cluster and label n-grams sorted by distance to the cluster center. That is, first roughly determine where the cluster of phone numbers is, and then precisely outline its boundary. I wrote a Telegram bot for all of this.
6) The result is a clustering of n-grams. From it, it is easy to obtain phrase masks with labels and train that neural model on them.
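Steps 1 to 3 above can be sketched with the standard library alone. The function names and the sample ads are made up for illustration, and whitespace splitting stands in for whatever tokenizer was actually used.

```python
import re
from collections import Counter


def mask_digits(text):
    """Step 1: replace every digit with '0'."""
    return re.sub(r"\d", "0", text)


def digit_ngrams(text, k):
    """Steps 2-3: yield all n-grams (n = 1..k) with at least one digit word."""
    words = mask_digits(text).split()
    for n in range(1, k + 1):
        for start in range(len(words) - n + 1):
            gram = words[start:start + n]
            # After masking, a word contains a digit iff it contains '0'.
            if any("0" in word for word in gram):
                yield " ".join(gram)


def ngram_frequencies(texts, k=3):
    """Count masked n-grams over a collection of ads."""
    counts = Counter()
    for text in texts:
        counts.update(digit_ngrams(text, k))
    return counts


ads = ["phone 2-83-15", "phone 3-11-07", "salary 25000"]  # made-up sample
counts = ngram_frequencies(ads, k=2)
for gram, freq in counts.most_common():
    print(freq, gram)
```

Because the digits are masked, "phone 2-83-15" and "phone 3-11-07" fall into the same entry, "phone 0-00-00", so one manual label covers every ad that uses that format.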

Vlad_Fedorenko, 2017-11-30
@Vlad_Fedorenko

Right now such ads are parsed with regular expressions, but the result is a nightmarish mess that causes a lot of problems.

And neural networks, of course, are just fine: they sit there gathering dust, waiting for someone to train them on ten examples.