How can I check whether a file already exists in the database?
Good afternoon!
Please help me with this task.
There are several hundred thousand PDF files. New files are added constantly, and I need to quickly determine whether a matching file already exists among them, where "matching" means a match accuracy of, say, 99% rather than byte-for-byte identity. That is why ordinary hashes are not suitable.
How would you approach this problem? Are there any ready-made tools for it? Ideally something for Node.js.
Keep a database of MD5 and SHA1 hashes for all files. When a new file is added, calculate both hashes for it and check the database for a file where both hashes match. If there is one, the file has already been uploaded and does not need to be uploaded again.
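For Node.js, a minimal sketch of this check could look like the following. It uses the built-in `node:crypto` module. The `knownFiles` set and the names `fileKey` and `registerIfNew` are made up for illustration; the set is only a stand-in for a real database table with a unique index on the hash pair.

```typescript
import { createHash } from "node:crypto";

// Hypothetical in-memory stand-in for the hash database. A real setup
// would use a table with a unique index on the (md5, sha1) pair.
const knownFiles = new Set<string>();

// Compute both digests of the file content and join them into one lookup key.
function fileKey(content: Buffer): string {
  const md5 = createHash("md5").update(content).digest("hex");
  const sha1 = createHash("sha1").update(content).digest("hex");
  return `${md5}:${sha1}`;
}

// Returns true if the file was new and has been registered,
// false if a file with both hashes matching is already known.
function registerIfNew(content: Buffer): boolean {
  const key = fileKey(content);
  if (knownFiles.has(key)) return false;
  knownFiles.add(key);
  return true;
}
```

Note that this only catches byte-identical files: changing a single byte of the PDF produces completely different hashes.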
"with a match accuracy of, say, 99%"
"There are several hundred thousand PDF files."
The obvious route is PDF -> image -> one of the many image search techniques; Habr is full of articles on this.
But the task itself makes no sense. A PDF is a document, and documents follow a set format and look alike by default. The difference between a document and the same document with the CEO's signature is technically minimal but practically absolute.
So drop that idea and use hashes.
You could use something like a perceptual hash, and to measure how similar a document is to the existing set, run a search in the DBMS by Hamming distance.
Judging by the description, this overlaps with the question "How do perceptual hashes compare?"
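To illustrate the idea, here is a rough sketch of one simple perceptual hash (the "average hash") together with a Hamming-distance lookup. It assumes the first page of each PDF has already been rendered and downscaled to an 8x8 grayscale thumbnail; that step is not shown. The names `averageHash`, `hammingDistance` and `findSimilar` are hypothetical, and the linear scan only stands in for the DBMS-side search.

```typescript
// 64-bit average hash of an 8x8 grayscale thumbnail.
// `pixels` is assumed to hold 64 brightness values in the range 0-255.
function averageHash(pixels: number[]): bigint {
  if (pixels.length !== 64) throw new Error("expected 64 pixels");
  const mean = pixels.reduce((sum, p) => sum + p, 0) / 64;
  let hash = 0n;
  for (const p of pixels) {
    // Shift in one bit per pixel: 1 if the pixel is brighter than the mean.
    hash = (hash << 1n) | (p > mean ? 1n : 0n);
  }
  return hash;
}

// Number of bit positions in which two hashes differ.
function hammingDistance(a: bigint, b: bigint): number {
  let diff = a ^ b;
  let count = 0;
  while (diff > 0n) {
    count += Number(diff & 1n);
    diff >>= 1n;
  }
  return count;
}

// Linear scan over known hashes; keeps those within maxDistance bits.
function findSimilar(hash: bigint, known: bigint[], maxDistance: number): bigint[] {
  return known.filter((k) => hammingDistance(hash, k) <= maxDistance);
}
```

The `maxDistance` threshold is something to tune on real data: the smaller it is, the stricter the match. With several hundred thousand hashes, even a full scan like this is cheap, although doing the comparison inside the database avoids loading all hashes into memory.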