SQL Server
bearpaw, 2017-11-30 07:43:43

What duplicate search algorithm would you recommend for SQL Server?

Hello!
There is a table with nomenclature names, about 70K records.
Each newly entered item has to be checked for uniqueness.
Sometimes employees type a Latin x (c) instead of the Russian х (с).
We have tried Jaro-Winkler similarity and Damerau-Levenshtein distance (implemented as functions on MS SQL Server). They work, and even work perfectly, except for one thing: time. The operation needs to take 10-20 seconds, but these algorithms run anywhere from 2-3 minutes to 10-15. Moreover, the time grows with the length of the new word (roughly logarithmically in our tests), because the metric is computed for every row. Normalizing the search string (removing spaces, commas, dots, etc.) gains 3-4 seconds.
Maybe someone knows or has seen fast similarity search algorithms, for example based on n-grams or hashes? The strings have a fixed maximum length of 150 characters. Perhaps the existing records can be decomposed in some way so that they can be searched quickly?
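
A minimal sketch of what such an n-gram prefilter could look like in T-SQL, assuming a hypothetical Nomenclature(Id, Name) table and a precomputed NameTrigram(NomenclatureId, Trigram) table with one row per distinct trigram of each name; the threshold is a tuning knob, not a recommendation:

-- Assumes @new is at least 3 characters long.
DECLARE @new NVARCHAR(150) = N'новая номенклатура';

;WITH Numbers AS (
    -- positions 1..LEN(@new)-2, enough to cut every trigram of @new
    SELECT TOP (LEN(@new) - 2)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects
),
NewTrigrams AS (
    SELECT DISTINCT SUBSTRING(@new, n, 3) AS Trigram
    FROM Numbers
)
-- Candidates = rows sharing "enough" trigrams with the new string;
-- only these few rows are then fed to the expensive similarity UDF.
SELECT t.NomenclatureId, COUNT(*) AS SharedTrigrams
FROM dbo.NameTrigram AS t
JOIN NewTrigrams     AS g ON g.Trigram = t.Trigram
GROUP BY t.NomenclatureId
HAVING COUNT(*) >= 3                 -- threshold to tune
ORDER BY SharedTrigrams DESC;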


4 answers
Alexander Taratin, 2017-11-30
@Taraflex

You have probably already read it, but just in case:
https://habrahabr.ru/post/342434/

Eugene, 2017-11-30
@klim76

Logging of record additions, plus inevitable punishment for the most egregious sloppiness, at your service.
But you will also need to implement a report that, once a day / week / month, combs through the names looking for duplicates.
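
For illustration, a minimal sketch of such a report, assuming a hypothetical Nomenclature(Id, Name) table and a crude normalization key (upper-case, spaces and punctuation stripped):

-- Periodic duplicate report: group by a crude normalized key and
-- list every name that shares its key with at least one other row.
;WITH Normalized AS (
    SELECT Id, Name,
           UPPER(REPLACE(REPLACE(REPLACE(Name, N' ', N''), N'.', N''), N',', N'')) AS NormKey
    FROM dbo.Nomenclature
),
Dupes AS (
    SELECT NormKey
    FROM Normalized
    GROUP BY NormKey
    HAVING COUNT(*) > 1
)
SELECT n.NormKey, n.Id, n.Name
FROM Normalized AS n
JOIN Dupes      AS d ON d.NormKey = n.NormKey
ORDER BY n.NormKey, n.Id;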

d-stream, 2017-11-30
@d-stream

When a single item is being created, it is perfectly reasonable to spend some time on the check.
That is, searching 100500 records for near-duplicates can take hours, but creating a product card is actually a rare and responsible operation (not available to everyone), so waiting 3-5-10 seconds for the check is quite acceptable.

basrach, 2017-12-02
@basrach

As an option: first bring the strings to some kind of "normalized form". For example, remove everything except letters, drop the characters where mistakes are most likely (the letters a and o), replace all Cyrillic lookalikes with Latin, and so on, then put it all into a dictionary where each such "hash" corresponds to 5-10 similar items. When adding a new record, first compute its "hash", then run the normal matching function over the 5-10-15 records in that bucket. The idea is not to comb through all 70K every time: over 90% of them obviously aren't even close. Essentially, you just need a way to filter out those 90+% in advance.
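
A minimal T-SQL sketch of this bucketing idea, assuming a hypothetical Nomenclature(Id, Name) table; the normalization rules below (which characters to strip, which lookalikes to map, and in which direction) are placeholders to adapt:

-- Persisted "normalized form" column acts as the bucket key.
ALTER TABLE dbo.Nomenclature ADD NormName NVARCHAR(150);
GO

-- Example normalization: upper-case, strip spaces/punctuation,
-- map a few Latin lookalikes onto Cyrillic (the opposite direction
-- works just as well, as long as it is consistent).
CREATE FUNCTION dbo.NormalizeName(@s NVARCHAR(150))
RETURNS NVARCHAR(150)
AS
BEGIN
    DECLARE @r NVARCHAR(150) = UPPER(@s);
    SET @r = REPLACE(REPLACE(REPLACE(REPLACE(@r, N' ', N''), N'.', N''), N',', N''), N'-', N'');
    SET @r = REPLACE(REPLACE(REPLACE(@r, N'X', N'Х'), N'C', N'С'), N'A', N'А'); -- lookalike letters
    RETURN @r;
END;
GO

UPDATE dbo.Nomenclature SET NormName = dbo.NormalizeName(Name);
CREATE INDEX IX_Nomenclature_NormName ON dbo.Nomenclature(NormName);
GO

-- On insert: fetch only the handful of rows in the same bucket and run
-- the expensive Jaro-Winkler / Levenshtein UDF on that small set alone.
DECLARE @new NVARCHAR(150) = N'Новая номенклатура 10х20';
DECLARE @key NVARCHAR(150) = dbo.NormalizeName(@new);

SELECT Id, Name
FROM dbo.Nomenclature
WHERE NormName = @key;   -- index seek; similarity check runs only on these rows

With the bucket key indexed, the expensive per-row metric is computed only for the few records that share the key, which is exactly the "filter out 90+% in advance" step described above.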
