How to implement smart string comparison?

Turtle_Onni2016-12-29 09:11:14

SQL

Turtle_Onni, 2016-12-29 09:11:14

Всем привет!
Возникла проблема со сравнением строк.
Задача: Из одной таблицы(она огромна) необходимо сравнить наименование из другой таблицы(тоже огромна, более 20 тысяч строк). Пример наименование1.таблица1 = "Автоматический выключатель KDJF39484", наименование2.таблица2 = "Выключатель автоматический KDJF39484". Или "гайка СБВ-4В ГОСТ.83" и "Гайка СБВ-4ВГОСТ.83". Могут быть разные разделители и присуствие или отсутствие пробелов.
Каким образом сравнить эти строки чтобы выявить что это одно и тоже название?
Буду признательна ссылкам и литературе, вашим советам.

Answer the question

In order to leave comments, you need to log in

8 answer(s)

�

Артем Воронов, 2016-12-29
@newross

Можно начать с расстояния Левенштейна.

Schullz, 2016-12-29
@Schullz

Без внятного ТЗ результат ХЗ

Alex, 2016-12-29
@streetflush

LCASE( REPLACE ( REPLACE ( REPLACE ( REPLACE ( )))) = LCASE( REPLACE ( REPLACE ( REPLACE ( REPLACE ( ))))

Dmitry Dart, 2016-12-29
@gobananas

In PHP there is, but in C # you will have to shift it yourself:

function GetUniqueStrings($a=array()){

  if(is_array($a)) {
    $out = array();
    foreach($a as $k=>$str){

             	$str = strip_tags(htmlspecialchars(trim($str)));
            	$str = str_replace('  ', ' ', $str);
            	$str = mb_strtolower($str, mb_detect_encoding($str));//не учитываем регистр

      $arr = explode(' ', $str);
  
      $out[$k] = array_reduce($arr, function($sum, $cur){
        $sum += sprintf("%u", crc32($cur));
        return $sum;
      });
    }
    return array_intersect_key($a, array_unique($out));
  }
}
 

$MyArray = [
'автоматический выключатель KDJF39484',
'выключатель автоматический KDJF39484',
'KDJF39484 автоматический выключатель'
];

print_r(GetUniqueStrings($MyArray));

Hyphens can be replaced with spaces. In the case of misprints like 4VGOST, something like Levenshtein is really needed.

xmoonlight, 2016-12-29
@xmoonlight

Fuzzy string comparison (including typos, phonetics, etc.): here .
Also, on the page you can test on your data.

al_gon, 2016-12-29
@al_gon

By big data only with a search engine.
Everything else is a 1:1 comparison. This is not industrial scale.
How to implement a similarity search for two strings (product names)?

Roman Mirilaczvili, 2016-12-30
@2ord

I came up with the following primitive algorithm:
According to this algorithm, from the lines "Automatic switch KDJF39484" and "Automatic switch KDJF39484" the output will be the same string "kdjf39484 automatic switch".
Thus, the Levenshtein distance between them will be zero.
For the strings "nut SBV-4V GOST.83", "Nut SBV-4VGOST.83" the result is worse than expected because the words "SBV-4V GOST.83" are sorted as "gost.83 sbv-4v". It so happened due to the fact that the catalog number and the name of the standard were written together.
To solve this problem, I would get from the database a representative selection of all possible variants of names in which "GOST" appears (no difference in capital or lower case) and then I would make rules for separating from other words through a space using regular expressions or other methods. The same is true for other standard names (ISO, ...).
Ruby example:
https://gist.github.com/romiras/386e3694a59949f6be...

simple_fuzzy_match(
  ' Автоматический выключатель KDJF39484  ',
  'Выключатель автоматический  KDJF39484'
)

выдаёт разницу 0 для "нормализованных" строк, следовательно, они идентичны.

igruschkafox, 2017-01-06
@igruschkafox

All those who wrote above without education :)
Sorry guys :)
If we are talking about MS SQL
, then Integration Services has Fussy Group and Fussy LookUP
There, the comparison is based on 7 algorithms (the distance between letters, how many characters need to be replaced, the distance between characters on the keyboard, etc. )
Here is a video for those who are too lazy to google
https://www.techdays.ru/videos/1378.html
Oracle also has this (in my opinion it is called the Date Director)