data mining
ivodopyanov, 2017-10-26 15:17:29

Are there algorithms to automatically identify similar numeric values in a set of texts?

Texts often contain all sorts of numeric values: dates, phone numbers, and unique identifiers such as passport numbers. Their format varies: a user may write a date with a dot, a comma, or a slash as the separator. For phone numbers the variability is even greater. The same passport number may be written as one token or as two, and may optionally include the words "series" and "number" or only the symbol "#". A user may also insert an extra space in one place and omit one in another.
Are there any algorithms for automatically clustering such similar values?
So far I have been experimenting with n-grams and word2vec, after first replacing all digits with a single digit, but the results are not good.


1 answer(s)
Vladimir Olohtonov, 2017-10-27
@sgjurano

In general, this sounds like a job for regular expressions; you shouldn't shove neural networks into everything.
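A minimal sketch of what the regular-expression route could look like: find each value, rewrite it into one canonical form, and group texts that share the same canonical value. The patterns, the day-month-year order, the Russian-style `+7`/`8` phone prefix, and the names `extract_values` and `group_by_value` are all assumptions made for illustration, not something taken from the question. Passport numbers could be handled with one more pattern in the same way.

```python
import re
from collections import defaultdict

# Assumed formats (illustrative only): day-month-year dates separated by
# ".", "/" or ",", and 11-digit phone numbers starting with "+7" or "8".
DATE_RE = re.compile(
    r"(?<!\d)(\d{1,2})\s*[./,]\s*(\d{1,2})\s*[./,]\s*(\d{4})(?!\d)"
)
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+7|8)[\s(-]*(\d{3})[\s)-]*(\d{3})[\s-]*(\d{2})[\s-]*(\d{2})(?!\d)"
)


def extract_values(text):
    """Return (kind, canonical_form) pairs for every value found in text."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        # Canonical date form: YYYY-MM-DD with zero-padded month and day.
        found.append(("date", f"{year}-{int(month):02d}-{int(day):02d}"))
    for parts in PHONE_RE.findall(text):
        # Canonical phone form: "+7" followed by the ten remaining digits.
        found.append(("phone", "+7" + "".join(parts)))
    return found


def group_by_value(texts):
    """Map each canonical value to the indices of the texts containing it."""
    groups = defaultdict(list)
    for index, text in enumerate(texts):
        for key in extract_values(text):
            groups[key].append(index)
    return dict(groups)


if __name__ == "__main__":
    samples = [
        "Call me at +7 (912) 345-67-89 on 26.10.2017",
        "phone 8 912 345 67 89, date 26/10/2017",
        "born 26, 10, 2017",
    ]
    for key, indices in group_by_value(samples).items():
        print(key, indices)
```

Because every spelling collapses to the same canonical string, the "clustering" reduces to an exact-match lookup. The comma separator is the riskiest assumption here, since it can also match ordinary number lists such as "1, 2, 2017".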
