Are there algorithms to automatically highlight similar numeric values in a set of texts?


data mining

ivodopyanov, 2017-10-26 15:17:29

Texts often contain all sorts of numeric values: dates, phone numbers, unique identifiers such as passport numbers. Their format varies: a user may write a date with a dot, a comma, or a slash as the separator. For phone numbers the variation is even greater. The same passport number may be written as one word or as two, and may optionally include the words "series" and "number" or just the symbol "#". The user may also insert an extra space in one place and omit one in another.
Are there any algorithms for automatic clustering of similar values?
So far I have been trying to come up with something using n-grams and word2vec, after first replacing all digits with a single one, but the results are not good.
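For reference, the digit-masking step described above can be sketched roughly as follows. This is only an illustration: the choice of "9" as the replacement digit and the helper names `shape` and `group_by_shape` are assumptions, not something taken from an existing library. Note that it groups values by written format, not by meaning, which may be part of why it does not cluster well.

```python
import re
from collections import defaultdict


def shape(token: str) -> str:
    """Mask every digit with '9' and collapse runs of whitespace.

    '9' is an arbitrary placeholder chosen for this sketch.
    """
    masked = re.sub(r"\d", "9", token)
    return re.sub(r"\s+", " ", masked).strip()


def group_by_shape(tokens):
    """Group raw strings by their masked shape."""
    groups = defaultdict(list)
    for token in tokens:
        groups[shape(token)].append(token)
    return dict(groups)
```

With this, "26.10.2017" and "01.02.2018" land in the same group, while "26/10/2017" gets a group of its own even though it is the same date as the first value.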


1 answer(s)

Vladimir Olohtonov, 2017-10-27
@sgjurano

In general, this sounds like a task for regular expressions; you shouldn't shove neural networks everywhere.
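
A minimal sketch of what the regular-expression route might look like, using only Python's standard `re` module. The pattern, the day-first date order, and the helper names `normalize_dates` and `digits_key` are assumptions made for illustration; real data would need patterns tuned to the formats that actually occur.

```python
import re

# Assumed date layout: day, month, four-digit year, separated by
# '.', ',' or '/', with optional spaces around the separator.
DATE_RE = re.compile(
    r"\b(\d{1,2})\s*[./,]\s*(\d{1,2})\s*[./,]\s*(\d{4})\b"
)


def normalize_dates(text: str) -> list[str]:
    """Return every date found in text as a canonical DD.MM.YYYY string."""
    return [
        f"{int(day):02d}.{int(month):02d}.{year}"
        for day, month, year in DATE_RE.findall(text)
    ]


def digits_key(value: str) -> str:
    """Drop everything except digits, so spacing and labels are ignored."""
    return re.sub(r"\D", "", value)
```

Values that normalize to the same string can then be treated as the same value: `digits_key("series 45 10 #123456")` and `digits_key("4510 123456")` give the same key, and the three date spellings from the question collapse to one canonical form.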