How to get the same hash of two similar strings?

D

Donskoy, 2012-08-02
@blare

Simhash or charikar's hash.
Used in Google to search for similar documents. Easily reworked for strings (not bigram tokens, but bigram symbols are taken as features).
Detailed algorithm here .
The theoretical justification is in the article "Similarity estimation techniques from rounding algorithms".

B

barker, 2012-08-01
@barker

see Levenshtein distance.
And the hash is hardly the same, of course ...

E

el_gato, 2012-08-01
@el_gato

It won't work, there are too many uncertainties, especially without string comparison, for example
Васюковая -> Масюковая -> Масюковае -> Масюковое -> Масуковое -> Мосуковое -> Мозуковое -> Мозуговое -> Мозугавое -> Мозугадое
Each word is similar to the next, and therefore should have the same hash, but the difference between the first and last is already huge.

O

Ololesha Ololoev, 2012-08-01
@alexeygrigorev

Absolutely without comparisons, as it seems to me, it is impossible. If it is not possible to compare the incoming data, compare the outgoing ones - the hashes themselves.

B

bachin, 2012-08-01
@bachin

we throw out all the numbers from the line (+ spaces, + punctuation marks).
from the rest we consider the hash by any algorithm (CRC32, MD5, etc)
your strings will be matched into one value.
did you want it?
if not, take the trouble to explain what “similar lines” means - in my opinion these are completely different lines - house 12 on Vasyukovskaya street is an 8-access high-rise building, and house 121 is a dilapidated shack.

D

dime, 2012-08-01
@dime

Clarify the question
Do you need a specific algorithm that returns the same hash for these two specific strings (and no matter what for all the others)?
Or do you need a hash function algorithm generator that is given two strings as input, and the generator generates a hash algorithm such that a collision is obtained for these given strings (and again, it doesn’t matter what happens for everyone else)?

T

TheHorse, 2012-08-01
@TheHorse

You can try a hash function, which is the sum of all the characters in div C, where C is a constant, large (for example, 100). Then with a high degree of probability, strings that differ by 1-2 characters will fall into one hash.
In the general case, for addresses, the Levenshtein distance does not save, and any hash functions do not save.

S

sergpenza, 2012-08-01
@sergpenza

First, they need to be brought to some kind of the same form.
Break the line into words, bring it to the same case, throw out insignificant and non-unique ones, such as "street", punctuation marks, and so on.
Merge the words into a string, like “Moscow Vasyukovskaya”, “Moscow Vasyukova”, take the phonetic code , it will turn out, for example, 479465.
With numbers, it is somewhat unclear what the options will be. But in this case - throw out all the repetitions and leave only the numbers included in the number of both the first line and the second.
Thus, we will get two identical strings (if the phonetic code matches) of the form
“479465 12”
“479465 12”
You can calculate the hash.

A

AR1ES, 2012-08-01
@AR1ES

Similar question: stackoverflow.com/questions/10599401/hashing-similar-strings-to-same-hash-value

S

Spamkit, 2012-08-02
@Spamkit

Have you thought about fuzzy search algorithms like Soundex or Metaphone? Ready-made implementation is in Apache Codec, Metaphone and Metaphone 2 algorithms.
See here . Online demo (appears to work only with the Latin alphabet).

E

egorinsk, 2012-08-02
@egorinsk

I once thought of a similar way to search for misspelled words. One hash is definitely not enough here, and here's why. Let's say (simplify the task) all words have the same length, for example, 4 characters, we want the words that differ by 1 letter to have the same hash. Then the words abcd and abce have the same hash, the words abcd and zbcd have the same hash... in the end, all words will have the same hash.
Therefore, one hash is not enough here. You need at least a few.
For example, a hash for all letters except the first. Hash for all but the second one, etc. Then words that differ by 1 letter will have 2 matching hashes.
Or another approach is splitting words into trigrams and searching through them. For similar words, most of the trigrams will be the same.

M

Mikhail Osher, 2012-08-01
@miraage

Write a function that accepts a='Moscow, st. Vasyukovskaya 12 'and b='Moscow, st. Vasyukova 121' and will return true for certain conditions (something like a house starts the same way && the first 5-6 characters of the streets start/end the same way && the same city ). Then take all the parts that matched and take a hash over them.
Let's say, for this example, take the hash on the selected line:
Moscow, st. Vasyukovskaya 12 Moscow, st .
Vasyukov aya 12 1
Here the number of collisions is not so small.
This algorithm was invented right after reading, and most likely there are much more elegant solutions.

M

MikhailEdoshin, 2012-08-01
@MikhailEdoshin

No. How do you imagine the work of such a function? And if the house is 122? Or 21? Should the hash be the same? And if the street "Masyukovskaya" - too? And when should it be made different?
There is phonetic indexing , which gives the same hash for words that are pronounced the same way - it is quite possible that a similar function for the Russian language would give the same result for "Vasyukovskaya" and "Vasyukova", but not for the entire address. There is a trigram index for finding similar strings, but it's not a hash.

A

AR1ES, 2012-08-01
@AR1ES

Actually, the question contradicts itself.
As far as I remember, one of the main features (requirements or xs how to call it) of a hash is that it must produce completely different values, even for the closest strings.
And so I don’t know the real algorithms, but if I write another bike, then TheHorse suggested the same thing that came into my head. I would only add a little that I would use the resulting number not as a hash, but to initialize the random number generator and from it I would already draw a hash of the required length.

A

AR1ES, 2012-08-01
@AR1ES

Actually, the question contradicts itself.
As far as I remember, one of the main features (requirements or xs how to call it) of a hash is that it must produce completely different values, even for the closest strings.
And so I don’t know the real algorithms, but if I write another bike, then TheHorse suggested the same thing that came into my head. I would only add a little that I would use the resulting number not as a hash, but to initialize the random number generator and from it I would already draw a hash of the required length.

A

AR1ES, 2012-08-01
@AR1ES

Actually, the question contradicts itself.
As far as I remember, one of the main features (requirements or xs how to call it) of a hash is that it must produce completely different values, even for the closest strings.
And so I don’t know the real algorithms, but if I write another bike, then TheHorse suggested the same thing that came into my head. I would only add a little that I would use the resulting number not as a hash, but to initialize the random number generator and from it I would already draw a hash of the required length.

E

Evgeny Leshchenko, 2012-08-01
@xsen

It is possible to optimize the shingles algorithm, but use n-th number of characters instead of words.

D

Dilon, 2012-08-01
@Dilon

At one time I happened to use ssdeep.sourceforge.net/ for a similar task.

E

EndUser, 2012-08-01
@EndUser

Mdya ... Why do you need such a function?
The hash function is characterized by minimal collisions.
You seem to need a different function in the task. For example "index".

D

Donskoy, 2012-08-02
@Donskoy

If you need to compare proper names, then instead of Levenshtein, Jaro-Winkler distance is better .
This metric was designed to find names with similar spellings (or spelling errors) in the US Census.

R

Ryadovoy, 2012-08-08
@Ryadovoy

See the method Signature hashing + Levenshtein distance
habrahabr.ru/post/114997/