How can I check whether a file already exists in the database?


Algorithms

AstonMartin, 2021-06-23 15:56:14

Good afternoon!

Please help with this task.

There are several hundred thousand PDF files. New files are constantly being added, and I need to determine quickly whether a new file already exists among them, with a match accuracy of, say, 99%. That is why hashes are not suitable.

How would you approach this problem? Are there ready-made tools for it? Ideally for Node.js.

5 answer(s)

Satisfied IT specialist, 2021-06-23
@borisdenis

Keep a database of MD5 and SHA1 hashes for all files. When a new file is added, compute both hashes for it and look in the database for a file where both hashes match. If there is one, the file has already been uploaded and does not need to be uploaded again.
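
The check described above could be sketched roughly as follows. This is only an illustration in Python using the standard `hashlib` module (the same digests are available in Node.js). The names `file_digests` and `HashIndex` are made up for the example, and the in-memory set stands in for a real database table, which is an assumption rather than part of the answer.

```python
import hashlib


def file_digests(data: bytes) -> tuple[str, str]:
    """Return the (MD5, SHA-1) hex digests of a file's bytes."""
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


class HashIndex:
    """Hypothetical in-memory stand-in for the database of known digests."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def add_if_new(self, data: bytes) -> bool:
        """Record the file and return True, or return False if both digests already exist."""
        key = file_digests(data)
        if key in self._seen:
            return False  # both hashes matched: the file was uploaded before
        self._seen.add(key)
        return True
```

Note that this only catches byte-for-byte duplicates. A PDF that was re-saved or differs by a single byte gets completely different digests.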

12rbah, 2021-06-23
@12rbah

with a match accuracy of, say, 99%.

Take a look at "How to compare two texts in JS?": https://stackoverflow.com/questions/5042873/javasc...
A hash check can also work in your case, because PDFs are rarely edited and everyone submits the same book.

There are several hundred thousand PDF files.

Is this a real problem, or just an exercise? There is too little information about the documents themselves. In some fields with many standard reports, it is common for one document to be 95% similar to another (for example, only one digit in a date changes), and such a document cannot be rejected. You also cannot always extract text from a PDF for comparison. In general, there are quite a lot of pitfalls, so it is worth stating the task more specifically.
Upd: you also need to choose the right text-extraction tool, because many tools do not always extract all of the text. It is also interesting how quickly you would compare one PDF against 200-300k others; in my opinion the cost of this process will be too high. You can, of course, compare only part of the text. By the way, extracting text from a PDF is slow: on an ordinary processor, some pages can take more than a second (tested on documents of 600-700 pages). I used only non-commercial solutions, so maybe you will find something faster, but PDF parsing should obviously not be done in Node, because it will be too slow.
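
To make the text-comparison idea concrete, here is a minimal sketch in Python. It assumes the text has already been extracted from both PDFs by some tool, which is the hard part discussed above. The similarity measure (Jaccard overlap of word shingles) is my choice for illustration, not something named in this thread, and the function names are hypothetical.

```python
import re


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Lower-case the text, split it into words, and return the set of k-word windows."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Share of k-word shingles that the two texts have in common (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty texts are considered identical
    return len(sa & sb) / len(sa | sb)
```

A threshold such as 0.99 on this score is one possible reading of "99% match". On short texts a single changed word lowers the score a lot, and comparing one new document against every stored one pairwise is exactly the cost problem mentioned above.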

Aetae, 2021-06-23
@Aetae

The obvious route is PDF -> image -> one of the many image search technologies; Habr is full of articles on them.
But the task itself is nonsense. A PDF is a document, and documents follow a certain format and are similar to each other by default. The difference between a document and the same document with the CEO's signature is technically minimal but in practice absolute.
So get that idea out of your head and use hashes.

Roman Mirilaczvili, 2021-06-24
@2ord

You can use something like a perceptual hash, and to find how similar a document is to a set of others, run a DBMS search by Hamming distance.
Judging by the description, this problem echoes the question "How do perceptual hashes compare?"
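
A rough sketch of the perceptual-hash idea, in Python. It uses a simple average hash as one example of a perceptual hash and assumes each page has already been rendered and downscaled to a small grayscale grid (for instance 8x8 values from 0 to 255); rendering the PDF is not shown. The linear scan in `nearest` only stands in for the DBMS search, and all names here are hypothetical.

```python
def average_hash(pixels: list[list[int]]) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the grid mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which the two hashes differ."""
    return bin(a ^ b).count("1")


def nearest(query: int, known: dict[str, int], max_distance: int) -> list[tuple[str, int]]:
    """Return (name, distance) for stored hashes within max_distance, closest first."""
    hits = [(name, hamming(query, h)) for name, h in known.items()]
    return sorted((hit for hit in hits if hit[1] <= max_distance), key=lambda hit: hit[1])
```

A small distance means the pages look alike, and the cut-off (`max_distance`) has to be tuned on real documents. With several hundred thousand stored hashes, the scan would normally be replaced by an index or a database-side bit-count query.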