C++ / C#
Termir988, 2017-04-12 00:30:32

C#. Which checksum algorithm to choose?

Task: compare two files and determine whether they are copies of each other.
The file type is not restricted (most are project files from various IDEs, documents, etc.). The number of files is not limited either (tested on 10,000 files).
As I understand it, the best approach is to compute checksums and compare them. The main criterion is speed; which algorithm is worth choosing?
P.S. I tried CRC32 and MD5. MD5 turned out to be about 2 times faster, but I suspect my CRC32 implementation was not the best...

3 answers
GavriKos, 2017-04-12
@Termir988

MD5 and CRC32 do not guarantee the absence of collisions, so relying on checksums alone is incorrect. At the very least, compare the file sizes as well, and do that first.
In fact, I would choose an algorithm that you do not have to implement by hand, because the task is to compare two files, not to write a checksum calculation.
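
A minimal sketch of that advice, assuming the built-in `System.Security.Cryptography.MD5` class is acceptable for the task. The class name `QuickFileCheck` and its method names are made up for illustration. The size check runs first because it is far cheaper than reading the file contents.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of any library.
public static class QuickFileCheck
{
    // Hashes a file with the framework's MD5 implementation, streaming the
    // content so that large files are not loaded into memory in one piece.
    public static byte[] HashFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }

    // Returns false as soon as the sizes differ; hashing happens only when
    // the sizes match. Equal hashes mean "very likely copies", not a proof.
    public static bool ProbablySame(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;

        return HashFile(pathA).SequenceEqual(HashFile(pathB));
    }
}
```

A call such as `QuickFileCheck.ProbablySame(first, second)` then answers "different for sure" or "probably the same"; a collision is still possible in the second case.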

d-stream, 2017-04-12
@d-stream

Actually, a table suggests itself, with the columns
"full file name"
"CRC"
"MD5"
and if the task does not rule out SQL, the duplicate files can be found neatly with

-- "files" is a placeholder name for the table described above ("table" itself is a reserved word)
select * from files where MD5 in (
            select MD5 from files group by MD5 having count(*)>1
)
order by MD5

MD5 can be replaced by CRC32, or the two can even be combined as MD5 + CRC32. Assuming the two checksums behave independently, the probability of a simultaneous collision in both is the product of the individual collision probabilities (roughly 2^-32 × 2^-128 = 2^-160 for two random, different files). That is most likely enough even for military or aerospace acceptance :-)
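
If a database is not at hand, the same grouping as in the query above can be sketched in C# with LINQ. This is only an illustration: `DuplicateFinder`, `Md5Hex` and `FindCandidateGroups` are hypothetical names, and the extra pre-grouping by file size is an assumption added here so that files with a unique size are never hashed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of any library.
public static class DuplicateFinder
{
    // Hex-encoded MD5 of the file content, used as the grouping key.
    public static string Md5Hex(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return BitConverter.ToString(md5.ComputeHash(stream));
        }
    }

    // Mirrors "group by MD5 having count(*) > 1": returns only the groups
    // that contain more than one file. Each group is a list of candidate
    // duplicates, since equal hashes do not strictly prove equal content.
    public static List<List<string>> FindCandidateGroups(IEnumerable<string> paths)
    {
        return paths
            .GroupBy(p => new FileInfo(p).Length)      // cheap pre-filter by size
            .Where(sizeGroup => sizeGroup.Count() > 1)
            .SelectMany(sizeGroup => sizeGroup.GroupBy(p => Md5Hex(p)))
            .Where(hashGroup => hashGroup.Count() > 1)
            .Select(hashGroup => hashGroup.ToList())
            .ToList();
    }
}
```

It could be fed with something like `Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories)`, where `root` is whatever folder is being scanned.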

AxisPod, 2017-04-13
@AxisPod

All hash functions are subject to collisions. It makes sense to compare the size, then the hash, and if they are equal, then compare the entire files.
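
The three stages could look roughly like this. It is a sketch under assumptions, not a tuned implementation: the names `FileComparer`, `ContentEquals` and `AreCopies` are invented for the example, MD5 stands in for whichever hash is chosen, and the 80 KB buffer size is arbitrary.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical helper, not part of any library.
public static class FileComparer
{
    static byte[] Md5Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }

    // Fills the buffer as far as possible. The loop is needed because
    // Stream.Read may return fewer bytes than requested.
    static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break; // end of stream
            total += read;
        }
        return total;
    }

    // Stage 3: compares the two files byte by byte, chunk by chunk.
    public static bool ContentEquals(string pathA, string pathB)
    {
        var bufA = new byte[81920];
        var bufB = new byte[81920];
        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            while (true)
            {
                int readA = FillBuffer(a, bufA);
                int readB = FillBuffer(b, bufB);
                if (readA != readB) return false; // one file ended earlier
                if (readA == 0) return true;      // both ended, no difference found
                for (int i = 0; i < readA; i++)
                {
                    if (bufA[i] != bufB[i]) return false;
                }
            }
        }
    }

    // Size first, then hash, then the full comparison as the final word.
    public static bool AreCopies(string pathA, string pathB)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length) return false;
        if (!Md5Of(pathA).SequenceEqual(Md5Of(pathB))) return false;
        return ContentEquals(pathA, pathB);
    }
}
```

For a single pair of files the hash stage costs an extra read of both files, so it pays off mainly when the hashes are computed once and reused across many comparisons.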
