How to determine the types of numbers in the text?
Let's say there are two types of numbers in the text:
This task is called Named Entity Recognition (NER), and a well-established strong solution to it is a BiLSTM + CRF tagger. There is an example here: https://github.com/farizrahman4u/keras-contrib/blo...
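To make the target format concrete: sequence taggers of this kind are usually trained on one tag per token, commonly in the BIO scheme (B = beginning of an entity, I = inside, O = outside). The sketch below is only an illustration; the helper `spans_to_bio`, the label names `PHONE` and `PASSPORT`, and the sample tokens are assumptions, not part of the linked example.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


# Hypothetical sample: digits are already masked with "0".
tokens = ["call", "7", "000", "000", "00", "00", "passport", "0000", "000000"]
spans = [(1, 6, "PHONE"), (7, 9, "PASSPORT")]
tags = spans_to_bio(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Each (token, tag) pair is one training example position for the tagger.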
The main problem is how to label the dataset. When I recently solved the same problem, I came up with the following approach:
1) Replace every digit with a single character, for example "0".
2) Split the text into words.
3) Build a frequency dictionary of n-grams (n from 1 to some manually chosen k) that contain at least one word with a digit. N-grams are needed because many numeric facts span more than one word, such as phone numbers in the format 7 000 000 00 00 or passport numbers in the format 0000 000000.
4) Generate vector representations for these n-grams using word2vec or an equivalent. That is, split the phrase into words, merge the n words around a word containing digits into a single token, and feed the result to the model. This way, n-grams for the various spellings of phone numbers end up more or less close to each other.
5) Start manually labeling the n-grams, sorted by frequency. Optionally, take an already labeled cluster and label n-grams sorted by distance to the cluster center. That is, first roughly determine where the cluster of phone numbers is located, then pin down its boundary precisely. I wrote a Telegram bot for all of this.
6) The result is a clustering of n-grams. From it, it is easy to get labeled phrase masks and train the neural model on them.
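Steps 1 to 3 can be sketched roughly as follows. This is a minimal illustration using only the standard library; the function names, the `\w+` tokenizer, and the sample texts are assumptions, and a real pipeline would likely need a tokenizer tuned to the data.

```python
import re
from collections import Counter
from typing import Iterable, Iterator, List, Tuple


def mask_digits(text: str) -> str:
    """Step 1: replace every digit with the character "0"."""
    return re.sub(r"\d", "0", text)


def tokenize(text: str) -> List[str]:
    """Step 2: split the text into lowercase word tokens (naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def digit_ngrams(tokens: List[str], max_n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every n-gram (1 <= n <= max_n) that has a token with a masked digit."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if any("0" in tok for tok in gram):
                yield gram


def ngram_frequencies(texts: Iterable[str], max_n: int = 3) -> Counter:
    """Step 3: frequency dictionary of digit-bearing n-grams over all texts."""
    counts: Counter = Counter()
    for text in texts:
        counts.update(digit_ngrams(tokenize(mask_digits(text)), max_n))
    return counts


# Made-up sample texts, for illustration only.
texts = [
    "Call 7 912 345 67 89 today",
    "Call 7 900 111 22 33 now",
    "Passport 4510 123456",
]
freq = ngram_frequencies(texts, max_n=2)
for gram, count in freq.most_common(5):
    print(count, " ".join(gram))
```

Because the digits are masked, different phone numbers collapse into the same n-grams, so the most frequent entries are the ones worth labeling first.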
Right now such ads are parsed with regular expressions, but that is a nightmare that causes a lot of problems.