How to organize search across a million or more images?
The task is to search for similar images in a database. The solution is built on neural networks (Keras): we extract features from the images with a VGG16 network (the 4096-dimensional fully connected layer). It works acceptably on small volumes; we use cosine distance.
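For reference, a minimal Keras sketch of this extraction step, assuming the stock keras.applications VGG16 ('fc2' is its second 4096-unit fully connected layer; the file names are placeholders):

import numpy as np
from scipy.spatial.distance import cosine
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model

# cut the network at the 4096-dimensional fully connected layer
base = VGG16(weights='imagenet')
model = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]                        # shape (4096,)

dist = cosine(features('a.jpg'), features('b.jpg'))   # cosine distance between two images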
The problem is that the database is very large: about 1 million images, and that is not the limit. The vectors for 100,000 images take about 5-7 GB, so for 1 million images that is 50-70 GB. Search is slow, the database does not fit in RAM, and reading it from disk takes a very long time. Moreover, the database will be extended and changed, so storing it as a single HDF5 file is not convenient.
For now we are going this way: we put all the vectors into a database (MariaDB for now, we want PostgreSQL). Then the vectors are loaded in batches and hashed with NearPy (https://github.com/pixelogik/NearPy). What stays in memory is, in effect, the hashes plus the image IDs pointing back to the database. Search is fast and takes little memory, but it is not exact (cosine distances are not compared; similar images are found via a locality-sensitive hash).
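A minimal sketch of that NearPy step, assuming random binary projections as the hash; load_vectors_from_db and query_vec are placeholders for our own data access:

import numpy as np
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

dim = 4096                                     # VGG16 feature dimension
rbp = RandomBinaryProjections('rbp', 10)       # 10 random hyperplanes -> up to 2^10 buckets
engine = Engine(dim, lshashes=[rbp])

# indexing: each feature vector is hashed into a bucket together with its image id
for image_id, vec in load_vectors_from_db():   # placeholder generator
    engine.store_vector(vec, image_id)

# query: only vectors that landed in the same bucket are returned,
# as (vector, data, distance) tuples -- fast, but approximate
candidates = engine.neighbours(query_vec)      # query_vec: (4096,) numpy array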
Maybe someone else knows other solutions?
Judging by https://www.cs.toronto.edu/~frossard/post/vgg16/vg...
I would do the following.
Store:
- the image (possibly a reduced copy)
- the 4096-component feature vector
- the output vector (the 1000-component classification layer)
The dimensionality could be reduced with an extra layer, but that would already require additional training of the network.
Then:
- first, extract both vectors from the query image (the 1000- and 4096-component ones)
- compute the cosine distance on the smaller vector
- discard candidates whose cosine distance is above a certain threshold
- compute the distance on the larger vector
- discard candidates with a large distance
- among the rest, compare the images themselves (possibly the reduced copies)
In theory, discarding images whose classification results differ strongly should reduce the amount of computation. But to know for sure, you either have to experiment or do the math.
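A minimal NumPy sketch of that cascade, assuming both feature matrices are already in memory; the thresholds coarse_thr and fine_thr are placeholders to be tuned experimentally:

import numpy as np

def cosine_dist(query, mat):
    # cosine distance between one query vector and every row of mat
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return 1.0 - mat @ (query / np.linalg.norm(query))

def candidate_search(q_small, q_big, small_vecs, big_vecs, coarse_thr=0.5, fine_thr=0.3):
    # stage 1: cheap filter on the 1000-component classification vector
    keep = np.where(cosine_dist(q_small, small_vecs) < coarse_thr)[0]
    # stage 2: precise filter on the 4096-component vector, survivors only
    d = cosine_dist(q_big, big_vecs[keep])
    mask = d < fine_thr
    return keep[mask][np.argsort(d[mask])]     # indices of candidates, best first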
P.S. And of course, get ready to parallelize the task :-)
To compare images I use a perceptual hash (via a binding to the libphash0 library) and the Hamming distance (MySQL: bit_count(), PostgreSQL: hamming_distance). Each hash is a 64-bit number, so it takes up very little space.
Links:
https://habrahabr.ru/search/?q=%5Bphash%5D&target_...
https://habrahabr.ru/post/211773/
You can detect duplicates like this:
-- pairs of images whose 64-bit perceptual hashes differ by at most 8 bits
-- (in MySQL use bit_count(t1.phash ^ t2.phash) instead of hamming_distance)
select t1.name, t2.name,
       hamming_distance(t1.phash, t2.phash) as dist
from images t1
inner join images t2 on t1.name < t2.name   -- each pair once, no self-comparison
where hamming_distance(t1.phash, t2.phash) between 0 and 8
order by dist
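For the hashing side, a small Python illustration; it uses the imagehash package rather than the libphash0 binding mentioned above, so treat it as an approximation of pHash, not the same implementation:

from PIL import Image
import imagehash

# 64-bit perceptual hash of each image (8x8 DCT-based hash)
h1 = imagehash.phash(Image.open('a.jpg'))
h2 = imagehash.phash(Image.open('b.jpg'))

dist = h1 - h2                  # Hamming distance: number of differing bits
print(int(str(h1), 16), dist)   # the hash as a 64-bit integer for the database, and the distance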
Take N servers and do sharding; the main thing is that each one has enough memory.
Moreover, as far as I understand, every vector has to be checked individually (indexes won't help), so search time grows linearly with the number of images in a single database, not even counting loading from disk. With sharding the time does not grow, because the shards are searched in parallel.
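A minimal single-machine sketch of that idea, assuming each shard is just a NumPy matrix searched in its own process; with real servers each shard would live remotely and only the merge step would change:

import numpy as np
from multiprocessing import Pool

def search_shard(args):
    # brute-force cosine distance over one shard, return its top-10 (distance, local index)
    query, shard = args
    shard = shard / np.linalg.norm(shard, axis=1, keepdims=True)
    d = 1.0 - shard @ (query / np.linalg.norm(query))
    best = np.argsort(d)[:10]
    return list(zip(d[best].tolist(), best.tolist()))

def sharded_search(query, shards):
    with Pool(len(shards)) as pool:
        parts = pool.map(search_shard, [(query, s) for s in shards])
    # all shards run in parallel, so wall-clock time is roughly that of one shard
    merged = sorted((d, shard_id, idx)
                    for shard_id, part in enumerate(parts)
                    for d, idx in part)
    return merged[:10]            # global top-10: (distance, shard, index within shard)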
https://ru.wikipedia.org/wiki/Locality-sensitive_h... -- the idea might help.
As an option: make low-resolution copies of the images and compare those. For example, scale the pictures down to 50 pixels wide and compare the reduced versions.
Reduce the feature space even a little and you will fit everything into memory. I think you will lose almost no variance in the process.
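A minimal sketch of such a reduction with scikit-learn's PCA; the target of 256 components and the file names are assumptions for illustration:

import numpy as np
from sklearn.decomposition import PCA

features = np.load('features.npy')             # (n_images, 4096) matrix of VGG16 vectors (placeholder file)

pca = PCA(n_components=256)                    # 4096 -> 256 dimensions, roughly 16x less memory
reduced = pca.fit_transform(features)

print(pca.explained_variance_ratio_.sum())     # fraction of variance kept by the 256 components
np.save('features_256.npy', reduced.astype(np.float32))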
Option 1: set up a server with 64-94 GB of memory. It will be a serious investment, but in the future it will pay off in speed and so on.
Option 2: rent such a server in the cloud: the start-up costs are lower, you can run it for a couple of months, clarify your needs, and only then think about local hardware.
Distributed storage and parallel processing of the data will probably help you.