How to organize search across a million or more images?
The task is to search for similar images in a database. The solution is built on neural networks (Keras): we extract features from the images with a VGG16 network (the 4096-dimensional fully connected layer). It works acceptably on small volumes; we use cosine distance.
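For reference, a minimal Keras sketch of this extraction step, assuming the stock keras.applications VGG16 ('fc2' is its second 4096-unit fully connected layer; the file names are placeholders):

import numpy as np
from scipy.spatial.distance import cosine
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model

# cut the network at the 4096-dimensional fully connected layer
base = VGG16(weights='imagenet')
model = Model(inputs=base.input, outputs=base.get_layer('fc2').output)

def features(path):
    img = image.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return model.predict(x)[0]                        # shape (4096,)

dist = cosine(features('a.jpg'), features('b.jpg'))   # cosine distance between two images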
The problem is that the database is very large: about 1 million images, and that is not the limit. The vectors for 100,000 images take about 5-7 GB, so for 1 million images that is 50-70 GB. Search is slow, the database does not fit in RAM, and reading it from disk takes a very long time. Moreover, the database will be extended and changed, so storing it as a single HDF5 file is not convenient.
For now we are going this way: we put all the vectors into a database (MariaDB for now, we want PostgreSQL). Then the vectors are loaded in batches and hashed with NearPy (https://github.com/pixelogik/NearPy). What stays in memory is, in effect, the hashes plus the image IDs pointing back to the database. Search is fast and takes little memory, but it is not exact (cosine distances are not compared; similar images are found via a locality-sensitive hash).
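A minimal sketch of that NearPy step, assuming random binary projections as the hash; load_vectors_from_db and query_vec are placeholders for our own data access:

import numpy as np
from nearpy import Engine
from nearpy.hashes import RandomBinaryProjections

dim = 4096                                     # VGG16 feature dimension
rbp = RandomBinaryProjections('rbp', 10)       # 10 random hyperplanes -> up to 2^10 buckets
engine = Engine(dim, lshashes=[rbp])

# indexing: each feature vector is hashed into a bucket together with its image id
for image_id, vec in load_vectors_from_db():   # placeholder generator
    engine.store_vector(vec, image_id)

# query: only vectors that landed in the same bucket are returned,
# as (vector, data, distance) tuples -- fast, but approximate
candidates = engine.neighbours(query_vec)      # query_vec: (4096,) numpy array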
Maybe someone else knows other solutions?
Judging by https://www.cs.toronto.edu/~frossard/post/vgg16/vg...
I would do the following.
Store:
- the image (possibly a reduced copy)
- the 4096-component feature vector
- the output vector (the 1000-component classification layer)
The dimensionality could be reduced with an extra layer, but that would already require additional training of the network.
Then:
- first, extract both vectors from the query image (the 1000- and 4096-component ones)
- compute the cosine distance on the smaller vector
- discard candidates whose cosine distance is above a certain threshold
- compute the distance on the larger vector
- discard candidates with a large distance
- among the rest, compare the images themselves (possibly the reduced copies)
In theory, discarding images whose classification results differ strongly should reduce the amount of computation. But to know for sure, you either have to experiment or do the math.
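A minimal NumPy sketch of that cascade, assuming both feature matrices are already in memory; the thresholds coarse_thr and fine_thr are placeholders to be tuned experimentally:

import numpy as np

def cosine_dist(query, mat):
    # cosine distance between one query vector and every row of mat
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    return 1.0 - mat @ (query / np.linalg.norm(query))

def candidate_search(q_small, q_big, small_vecs, big_vecs, coarse_thr=0.5, fine_thr=0.3):
    # stage 1: cheap filter on the 1000-component classification vector
    keep = np.where(cosine_dist(q_small, small_vecs) < coarse_thr)[0]
    # stage 2: precise filter on the 4096-component vector, survivors only
    d = cosine_dist(q_big, big_vecs[keep])
    mask = d < fine_thr
    return keep[mask][np.argsort(d[mask])]     # indices of candidates, best first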
P.S. And of course, get ready to parallelize the task :-)
To compare images I use a perceptual hash (via a binding to the libphash0 library) and the Hamming distance (MySQL: bit_count(), PostgreSQL: hamming_distance). Each hash is a 64-bit number, so it takes up very little space.
Links:
https://habrahabr.ru/search/?q=%5Bphash%5D&target_...
https://habrahabr.ru/post/211773/
You can detect duplicates like this:
-- pairs of images whose 64-bit perceptual hashes differ by at most 8 bits
-- (in MySQL use bit_count(t1.phash ^ t2.phash) instead of hamming_distance)
select t1.name, t2.name,
       hamming_distance(t1.phash, t2.phash) as dist
from images t1
inner join images t2 on t1.name < t2.name   -- each pair once, no self-comparison
where hamming_distance(t1.phash, t2.phash) between 0 and 8
order by dist
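For the hashing side, a small Python illustration; it uses the imagehash package rather than the libphash0 binding mentioned above, so treat it as an approximation of pHash, not the same implementation:

from PIL import Image
import imagehash

# 64-bit perceptual hash of each image (8x8 DCT-based hash)
h1 = imagehash.phash(Image.open('a.jpg'))
h2 = imagehash.phash(Image.open('b.jpg'))

dist = h1 - h2                  # Hamming distance: number of differing bits
print(int(str(h1), 16), dist)   # the hash as a 64-bit integer for the database, and the distance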
Take N servers and do sharding; the main thing is that each one has enough memory.
Moreover, as far as I understand, every vector has to be checked individually (indexes won't help), so search time grows linearly with the number of images in a single database, not even counting loading from disk. With sharding the time does not grow, because the shards are searched in parallel.
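A minimal single-machine sketch of that idea, assuming each shard is just a NumPy matrix searched in its own process; with real servers each shard would live remotely and only the merge step would change:

import numpy as np
from multiprocessing import Pool

def search_shard(args):
    # brute-force cosine distance over one shard, return its top-10 (distance, local index)
    query, shard = args
    shard = shard / np.linalg.norm(shard, axis=1, keepdims=True)
    d = 1.0 - shard @ (query / np.linalg.norm(query))
    best = np.argsort(d)[:10]
    return list(zip(d[best].tolist(), best.tolist()))

def sharded_search(query, shards):
    with Pool(len(shards)) as pool:
        parts = pool.map(search_shard, [(query, s) for s in shards])
    # all shards run in parallel, so wall-clock time is roughly that of one shard
    merged = sorted((d, shard_id, idx)
                    for shard_id, part in enumerate(parts)
                    for d, idx in part)
    return merged[:10]            # global top-10: (distance, shard, index within shard)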
https://ru.wikipedia.org/wiki/Locality-sensitive_h... -- the idea might help.
As an option: make low-resolution copies of the images and compare those. For example, scale the pictures down to 50 pixels wide and compare the reduced versions.
Reduce the feature space even a little and you will fit everything into memory. I think you will lose almost no variance in the process.
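A minimal sketch of such a reduction with scikit-learn's PCA; the target of 256 components and the file names are assumptions for illustration:

import numpy as np
from sklearn.decomposition import PCA

features = np.load('features.npy')             # (n_images, 4096) matrix of VGG16 vectors (placeholder file)

pca = PCA(n_components=256)                    # 4096 -> 256 dimensions, roughly 16x less memory
reduced = pca.fit_transform(features)

print(pca.explained_variance_ratio_.sum())     # fraction of variance kept by the 256 components
np.save('features_256.npy', reduced.astype(np.float32))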
Option 1: set up a server with 64-94 GB of memory. It will be a serious investment, but in the future it will pay off in speed and so on.
Option 2: rent such a server in the cloud: the start-up costs are lower, you can run it for a couple of months, clarify your needs, and only then think about local hardware.
Distributed storage and parallel processing of the data will probably help you.