Data storage
jxdoe6ff, 2014-01-09 20:39:18

How to check the integrity of a large number of files?

What I have:
1. A system running Windows XP SP2. I can also boot into Mandriva and mount the needed partition, but I'm a complete beginner with Linux.
2. Several thousand files downloaded from online libraries (in case it matters: lib.ru, az.lib.ru, publ.lib.ru, ihtik.lib.ru, lib.rus.ec, flibusta.net, reeed.ru, litmir.net, gutenberg.org, feedbooks.com, manybooks.net, amazon.com and, of course, rutracker.org). The files come in various formats, mostly fb2, pdf, djvu, rtf, epub, mobi, lit and azw, plus a small amount of mp3.
In gigabytes this is not much, no more than 50 GB. The directories are more or less organized by author and library, with a nesting depth of no more than 3.
3. All of this was downloaded with Download Accelerator Plus, μTorrent, Free Download Manager and other programs whose authors claim that every file is checked for integrity.
In reality, when I set out to put everything in order and started opening files by hand, a fairly large percentage turned out to be broken. The HDD is healthy; files are copied and moved with exactly the same errors they had when they were saved to disk.
Task:
How can I identify the broken files without going through them all by hand?
For those who like to advise "compare checksums", a hint: a comparison needs TWO or more items. I add this caveat because over the past week I have searched Google and Yandex thoroughly and heard enough checksum recipes to last a year, but found nothing adequate, which is why I'm turning to the respected community. First, for several thousand files it is not always possible to establish where each one was downloaded from, so what would I compare against? Secondly, not all releasers publish an md5 or sha hash, or anything at all, alongside the file. Thirdly, if a file was downloaded with an error once (and I'm 99.95% sure the problem is in the download, because not all servers support download managers, resuming, etc.), what is the probability that the second transfer will be error-free?
A couple of hints:
When Foxit Reader reports “Format Error. Not a PDF or corrupted”, it evidently reaches that verdict without comparing the file against anything.
And a trick I discovered by accident: ESET NOD32 detects broken archives immediately (archives only) with the message "the archive is damaged", again without comparing anything.
Sorry if this is too long; I tried to describe the problem in as much detail as possible. I won't mind if the moderators cut whatever they consider superfluous.


2 answers
Rsa97, 2014-01-10
@Rsa97

I see only one way: take the specification of each file format you care about and check the file's structure against it. If the format stores checksums, you're in luck; if not, you still cannot guarantee integrity. RTF, as far as I remember, has no checksums, so as long as the file structure is undamaged, both "execute, cannot be pardoned" and "execute cannot be pardoned" are equally valid versions of the file.
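
A minimal Python sketch of such a structural check, limited to the cheapest tests: the leading signature of a few formats, the %%EOF trailer of a PDF, and the closing brace of an RTF. The function name `looks_intact` and the signature table are illustrative assumptions, not part of any tool mentioned in the question, and passing these tests does not prove a file is intact.

```python
from pathlib import Path

# Leading "magic" bytes for a few formats. Illustrative, not exhaustive.
SIGNATURES = {
    ".pdf": b"%PDF-",
    ".rtf": b"{\\rtf",
    ".djvu": b"AT&TFORM",
    ".epub": b"PK\x03\x04",  # EPUB is a ZIP container
}

def looks_intact(path):
    """Return True or False for a known extension, None for an unknown one."""
    path = Path(path)
    suffix = path.suffix.lower()
    magic = SIGNATURES.get(suffix)
    if magic is None:
        return None  # no rule for this format
    data = path.read_bytes()
    if not data.startswith(magic):
        return False
    if suffix == ".pdf":
        # A complete PDF carries an %%EOF marker near its end;
        # a truncated download usually lacks it.
        return b"%%EOF" in data[-1024:]
    if suffix == ".rtf":
        # An RTF document is one brace-delimited group, so after
        # trailing whitespace or NUL padding it should end with "}".
        return data.rstrip(b"\x00\r\n\t ").endswith(b"}")
    return True
```

A file that fails here is almost certainly damaged; a file that passes may still be corrupted in the middle, which is exactly the limitation described above for formats without checksums.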

Ilya Evseev, 2014-01-10
@IlyaEvseev

Archives can be checked: all popular archive formats store checksums, and archivers can also verify that the headers are well-formed.
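
For ZIP-based files (plain .zip archives, and also .epub, which is a ZIP container), the archive check can be sketched with Python's standard `zipfile` module. The helper name `zip_problem` is an assumption made for this example; other archive formats such as rar or 7z would need their own tools.

```python
import zipfile
import zlib

def zip_problem(path):
    """Return None if every member passes its CRC-32 check, else a short reason."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() reads every member and returns the name of the
            # first one with a bad CRC or header, or None if all are fine.
            bad_member = archive.testzip()
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError) as exc:
        return f"unreadable archive: {exc}"
    if bad_member is not None:
        return f"damaged member: {bad_member}"
    return None
```

Calling `zip_problem("book.epub")` (a hypothetical file name) returns `None` for a healthy file and a short description otherwise.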
You can download the file twice and compare the two copies: if the contents match, we can assume that is exactly what is on the server.
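
Comparing two downloads of the same file needs no published checksum, only a hash of each local copy. A short sketch, assuming both copies are already on disk; the function names are illustrative.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never sit in memory whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_content(first_copy, second_copy):
    """True when the two downloaded copies are byte-for-byte identical."""
    return sha256_of(first_copy) == sha256_of(second_copy)
```

Matching copies only show that both transfers delivered the same bytes; if the file on the server is itself broken, both copies will match and still be broken.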
Any document can be run through a converter, for example rtf2txt, pdf2txt, etc.
If the conversion finishes without errors, the file can be considered fine.
There is a risk, though, that an error comes from a buggy converter rather than from a corrupted document.
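
The same idea can be applied to a whole directory tree without opening anything by hand. The sketch below handles only fb2 and uses Python's built-in XML parser in place of a converter, on the assumption that an fb2 file is an XML document; `find_broken_fb2` is a name invented for this example. A truncated or garbled fb2 normally fails to parse.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

def find_broken_fb2(root_dir):
    """Yield (path, error text) for every .fb2 under root_dir that fails to parse."""
    # rglob descends into all subdirectories; the pattern is
    # case-sensitive on case-sensitive file systems.
    for path in sorted(Path(root_dir).rglob("*.fb2")):
        try:
            ET.parse(str(path))  # parses the whole document or raises
        except (ET.ParseError, OSError) as exc:
            yield path, str(exc)
```

Iterating over `find_broken_fb2(library_root)`, where `library_root` is whatever directory holds the collection, gives a list of suspect files to re-download. As with any converter, a parse error could also point to a non-conforming but readable file, so the output is a list of candidates rather than a final verdict.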
