Alexander Alekseev, 2021-05-27 08:47:55
Mathematics

How are perceptual hashes actually compared?

There are many good, accessible articles on the Internet about what perceptual hashes are. But what happens when we search a database of these hashes? On the one hand, the hashes can be indexed; in that case the search is very fast, but we cannot find slightly different images. On the other hand, we can scan all the hashes in the database, computing the Hamming distance between the given sample and each stored hash. That gives more flexibility in searching, but the search itself becomes very slow: even for a million hashes it takes tens of seconds! So how are they actually compared?


4 answer(s)
Wataru, 2021-05-27
@wataru

There are all sorts of indexes that allow searching for a match with several errors. There is some extra work, of course, but you still only need to examine a small fraction of all the hashes in the database.
For example, you can index all windows of n consecutive characters in each hash, keyed by their position. Then look up in this index every window of n characters of the query. The hashes in the resulting candidate list are then verified by Hamming distance. If we take n ≤ L/(k+1), where L is the length of the hash and k is the allowed number of errors, then every required hash is guaranteed to land in the list: by the pigeonhole principle, k errors cannot hit all k+1 disjoint blocks of length n, so at least one indexed window of the query matches exactly. The larger n, the fewer extraneous candidates end up in the list.
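The window-indexing idea above can be sketched as follows (an illustrative toy, not a production index; the names are made up):

```python
# Index every aligned window of n consecutive characters of each hash.
# A query shares at least one aligned window with any hash that is
# "close enough", so exact window lookups yield a small candidate set.
from collections import defaultdict

def build_index(hashes: list[str], n: int) -> dict:
    index = defaultdict(set)
    for i, h in enumerate(hashes):
        for pos in range(len(h) - n + 1):
            # key on (position, substring) so matches must be aligned
            index[(pos, h[pos:pos + n])].add(i)
    return index

def candidates(index: dict, query: str, n: int) -> set[int]:
    found = set()
    for pos in range(len(query) - n + 1):
        found |= index.get((pos, query[pos:pos + n]), set())
    return found

hashes = ["abcd1234", "abcf1234", "ffff0000"]
idx = build_index(hashes, n=4)
# "abcf1234" differs from the query in one character but still shares
# the aligned window "1234", so it becomes a candidate:
print(sorted(candidates(idx, "abcd1234", n=4)))  # [0, 1]
```

The candidates would then be verified with an exact Hamming-distance check, exactly as the answer says.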
Another example is using a trie. All hashes are inserted into the trie. Then a recursive search is run on it that is allowed to make k mistakes (a function parameter): at each step it either follows the edge matching the current symbol, or "makes a mistake" and follows any other edge, after which it may make at most k-1 further mistakes. This descends into some dead ends, of course, but it traverses only a small fraction of the whole tree.
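That recursive trie search can be sketched like this (a minimal version using nested dicts as trie nodes; hashes are stored as strings):

```python
# Trie search allowing up to k mismatches, per the recursive idea above.

def insert(trie: dict, word: str):
    node = trie
    for ch in word:
        node = node.setdefault(ch, {})
    node["$"] = True  # end-of-word marker

def search(node: dict, word: str, k: int, prefix: str = "", out=None) -> list:
    if out is None:
        out = []
    if not word:
        if "$" in node:
            out.append(prefix)
        return out
    for ch, child in node.items():
        if ch == "$":
            continue
        if ch == word[0]:
            search(child, word[1:], k, prefix + ch, out)      # follow the match
        elif k > 0:
            search(child, word[1:], k - 1, prefix + ch, out)  # spend one error
    return out

trie = {}
for h in ["0110", "0111", "1111"]:
    insert(trie, h)
print(sorted(search(trie, "0110", 1)))  # ['0110', '0111']
```

"1111" is pruned: it would need two mismatches, so the branch dies as soon as the error budget runs out, which is what keeps the traversal small.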
A third option, faster to search but slower to add to: store each hash together with all its possible error variants in the index. The index becomes much larger, of course, but lookups are quick.
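The precomputed-variants idea can be sketched as follows (shown for 8-bit hashes and one allowed error to keep it small; the names are illustrative):

```python
# "Fat index": for each stored hash, precompute every value within
# Hamming distance 1 and map it back to the original. A query is then
# a single exact probe instead of a distance scan.
from collections import defaultdict

BITS = 8

def neighbors(h: int):
    yield h
    for i in range(BITS):
        yield h ^ (1 << i)  # flip one bit

def build(hashes):
    index = defaultdict(set)
    for h in hashes:
        for v in neighbors(h):
            index[v].add(h)  # variant -> original hash(es)
    return index

idx = build([0b10110000, 0b00001111])
# a query one bit away from a stored hash hits immediately:
print(idx[0b10110001])  # {176}
```

This trades write amplification and storage (each hash is stored BITS+1 times here, far more for k=2) for O(1) exact lookups.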

Alexander Alekseev, 2021-05-27
@alekseev_ap

For example, you can index all windows of n consecutive characters in each hash, keyed by their position. Then look up in this index every window of n characters of the query. The hashes in the resulting candidate list are then verified by Hamming distance. If we take n ≤ L/(k+1), where L is the length of the hash and k is the allowed number of errors, then every required hash is guaranteed to land in the list. The larger n, the fewer extraneous candidates end up in the list.

How many sequences will that be? Hundreds? Thousands? For every hash?
Another example is using a trie. All hashes are inserted into the trie. Then a recursive search is run on it that is allowed to make k mistakes (a function parameter): at each step it either follows the edge matching the current symbol, or "makes a mistake" and follows any other edge, after which it may make at most k-1 further mistakes. This descends into some dead ends, of course, but it traverses only a small fraction of the whole tree.

Even if so, where would we store this trie? In RAM? Expensive, and outright impossible for a large database. Read it from disk? Even with an SSD that means many (very many) operations. And even assuming our hash is 8x8 bits (8 bytes), the number of variants for just 1 error is already 64. For two errors, several thousand!
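The arithmetic behind this objection can be checked directly: the number of 64-bit values within Hamming distance k of a given hash is the sum of binomial coefficients C(64, i) for i from 0 to k.

```python
# Count how many variants of a 64-bit hash exist within Hamming distance k.
from math import comb

def variants(bits: int, k: int) -> int:
    return sum(comb(bits, i) for i in range(k + 1))

print(variants(64, 1))  # 65  (the hash itself plus 64 one-bit flips)
print(variants(64, 2))  # 2081
print(variants(64, 3))  # 43745
```

So a precomputed-variants index does grow by a factor of roughly two thousand for k=2, which is the size blow-up being objected to here.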
A third option, faster to search but slower to add to: store each hash together with all its possible error variants in the index. The index becomes much larger, of course, but lookups are quick.

It's hopeless! I don't think you realize that the indexes would grow thousands of times even for two errors!
If the database is small, then of course an element-by-element comparison is easier. What interests me is how the big players like Yandex, Google or TinEye do it.
I wrote my own image search algorithm, fundamentally different from perceptual hashes, but I'm curious how they all work. Do they run image search across hundreds and thousands of servers?

Karpion, 2021-06-05
@Karpion

A perceptual hash is good because it stays the same for slightly different images.
You can also organize the hash database so that each hash stores links to all the hashes that differ from it only slightly. Or, alternatively, generate all hashes that differ from the given one by one bit, two bits, three bits, and so on, and then query the database for those variants.

Roman Mirilaczvili, 2021-06-06
@2ord

Calculating Hamming distance on a large dataset
