big data
SemenDemon, 2019-06-17 02:12:27

How do you solve the problem of finding duplicates among 1 billion images?

I run each image through GoogleNet and take the output of the last layer (loss3/classifier). How do I search on top of these vectors?
Options:

  1. use faiss or SPTAG (difficult, resource-intensive)
  2. binarize the vectors and search for the nearest ones by Hamming distance, as ok.ru does (an option, but where do you store and search such a volume of photos?)
  3. cluster with k-means (sure, similar images end up together, but faiss already does that better)
  4. use a perceptual hash (precision is close to zero; it would be nice if something like it existed for vectors)

Is it possible to get something like SimHash? I.e., from the feature vector produced by the network, compute a hash that comes out identical for photos that are ~80% similar, and then store that hash in Cassandra.
At search time, look up by hash only, i.e. quickly retrieve all photos that are 80% similar to the uploaded image (a rough sketch of this idea is below).
The images are of real estate.
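A minimal sketch of what such a hash could look like: random-hyperplane hashing (SimHash for dense vectors) over the network embeddings. The embedding size and hash length here are assumptions for illustration, not something stated in the question.

import numpy as np

DIM = 1000    # assumed size of the loss3/classifier output
BITS = 64     # hash length

rng = np.random.default_rng(0)
planes = rng.normal(size=(BITS, DIM))   # random hyperplanes, fixed once for the whole collection

def simhash(vec):
    # bit i is the sign of the projection of the embedding onto hyperplane i
    bits = (planes @ vec) > 0
    h = 0
    for b in bits:
        h = (h << 1) | int(b)
    return h

def hamming(a, b):
    # embeddings with a small angle between them get hashes with a small Hamming distance
    return bin(a ^ b).count("1")

This gives a fixed-length hash that can be stored as a key, with Hamming distance used for the final similarity check.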

3 answers
SemenDemon, 2019-09-09
@SemenDemon

faiss published an article on searching 1 billion images; it uses clustering and is quite interesting.
But in the end the simplest approach turned out to be LSH hashing plus Cassandra.
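The answer doesn't give details, so here is a hypothetical sketch of what "LSH + Cassandra" could mean in practice, assuming the 64-bit SimHash from the question: split the hash into 4 bands of 16 bits, use (band, band_hash) as the partition key, and treat any photo sharing at least one band with the query as a candidate. The keyspace, table and column names are made up for illustration.

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("images")   # assumed keyspace

# Assumed schema:
# CREATE TABLE images.lsh_buckets (
#     band int, band_hash int, photo_id bigint, full_hash varint,
#     PRIMARY KEY ((band, band_hash), photo_id));

def bands(h64):
    # split the 64-bit hash into 4 bands of 16 bits each
    return [(i, (h64 >> (16 * i)) & 0xFFFF) for i in range(4)]

def insert(photo_id, h64):
    for band, band_hash in bands(h64):
        session.execute(
            "INSERT INTO lsh_buckets (band, band_hash, photo_id, full_hash) "
            "VALUES (%s, %s, %s, %s)",
            (band, band_hash, photo_id, h64))

def candidates(h64):
    # any photo that collides with the query in at least one band is a candidate;
    # re-rank candidates by Hamming distance on the full hash afterwards
    found = {}
    for band, band_hash in bands(h64):
        rows = session.execute(
            "SELECT photo_id, full_hash FROM lsh_buckets "
            "WHERE band = %s AND band_hash = %s",
            (band, band_hash))
        for r in rows:
            found[r.photo_id] = r.full_hash
    return found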

Vladimir Olohtonov, 2019-06-17
@sgjurano

1) I haven’t read about this particular network, but "building a space suitable for similarity search" and "recognizing object classes" are different tasks that require different training procedures;
2) searching a dataset is a separate task that requires pre-training a network so that it produces embeddings;
3) if you need to search a static dataset, HNSW is a great option, but building the index will take a couple of weeks; if the dataset is dynamic, then as far as I know nothing better than faiss has been invented yet.
With faiss I got the following results: the index builds in 5 hours and takes 67 GB. Here is a quality estimate on benchmark data (BigANN, SIFT), index type IVF262k_HNSW32,PQ64:

                                        R@1     R@10    R@100   time (ms/query)
nprobe=16,efSearch=128,ht=246           0.6546  0.8006  0.8006     4.231
nprobe=32,efSearch=128,ht=246           0.7107  0.8818  0.8818     7.783
nprobe=64,efSearch=128,ht=246           0.7435  0.9343  0.9346    14.691
nprobe=128,efSearch=128,ht=246          0.7653  0.9687  0.9692    28.326
nprobe=256,efSearch=128,ht=246          0.7726  0.9829  0.9834    55.375

The metric is the fraction of queries for which the true nearest neighbour appears among the top-k results returned by the index, measured over 10k queries for each set of search parameters.
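For reference, a small-scale sketch of building and querying a faiss index of the same family. The factory string is shrunk (IVF1024 instead of IVF262k) and the data is random, so the dimensions and parameters here are purely illustrative, not the setup from the table above.

import numpy as np
import faiss

d = 128                                              # SIFT-like dimensionality
xb = np.random.rand(100_000, d).astype("float32")    # "database" vectors
xq = np.random.rand(10, d).astype("float32")         # query vectors

# Same index family as above, but with 1024 IVF lists instead of 262k
index = faiss.index_factory(d, "IVF1024_HNSW32,PQ64")
index.train(xb)          # trains the coarse quantizer and the PQ codebooks
index.add(xb)

# nprobe is the main search-time knob from the table above; efSearch (HNSW
# coarse quantizer) and ht (polysemous threshold) are further knobs
index.nprobe = 64

distances, ids = index.search(xq, 10)   # distances and ids of the top-10 neighbours per query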

Den S, 2019-06-17
@mvd19

How about simplifying?
Total Commander, for example, has a duplicate finder (first by file size, then, among files of the same size, by content).
Another option: compute a checksum for every file, paste the list into Excel and use the Remove Duplicates button, plus build a query in the Access query wizard for the difference between the full list and the de-duplicated table (a sketch of the checksum approach is below).
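A minimal sketch of the checksum idea in Python: group files by size first, then by SHA-256 of the content. Note that this only catches byte-identical copies, not visually similar photos.

import hashlib
import os
from collections import defaultdict

def find_exact_duplicates(root):
    # first pass: group files by size, since different sizes can never be duplicates
    by_size = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # second pass: within each size group, hash the content
    dupes = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            dupes[digest].append(path)

    # keep only hashes shared by more than one file
    return {h: p for h, p in dupes.items() if len(p) > 1}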
