T
T
tronix2862013-10-04 22:56:15
tronix286, 2013-10-04 22:56:15

8-bit checksum or hash function

Greetings.
Many strings are being compared (millions and billions). The lines are not sorted, the size always jumps. Before comparing a string with a string, the size is compared. Next, there is a free byte (8 bits), which I want to use as a kind of checksum or like a hash function in order to speed up the comparison process. Please advise something?

I tried CRC8 - a lot of incorrect hits when the strings are different and the CRC is the same. So far I have settled on a simple sum of all the bytes in a string, but maybe there is something better?

Answer the question

In order to leave comments, you need to log in

7 answer(s)
D
Denis Zagaevsky, 2013-10-04
@zagayevskiy

8 bits is just 256 different values. Of course, when you map millions to 256, there are a lot of collisions, obviously. No matter how good a function you choose, there will still be only 256 different hashes.

V
vsespb, 2013-10-05
@vsespb

Try the last one (edge) byte of some cryptographic hash function like md5

M
Max, 2013-10-04
@7workers

you should use this byte as an additional check anyway. that is, if the length matched, CRC8 matched, only then check bit by bit. this will win if you have few matches compared to the total and strings often match at the beginning. Alternatively, try to compare bit by bit backwards if the first character matches.

R
rozhik, 2013-10-05
@rozhik

If we talk about English text (as in other things and about any textual information), then CRC is not effective (especially for short strings). As an example: a simple hash function sum( str[i] + str[i+1] << 4) for 0 and even i has an order of magnitude more uniform distribution on short strings.
But in general see http://www.cse.yorku.ca/~oz/hash.html . There are several smart functions here - evaluate according to your data which one is better.

V
Vladimir Martyanov, 2013-10-05
@vilgeforce

x = 0
for i range(len(inData)):
x = x^inData[i]
x = _rotl(x,7)
try something like this… A mixture of serpentine and visual :-) _rotl is a cyclic shift.

M
mayorovp, 2013-10-06
@mayorovp

Author, there are not so many options:
1. CRC8
2. Polynomial hash (it's nice to try several options with different odd multipliers).
3. A hash based on the cyclic shift operation (there are few options here - you should shift by 3 or 5, you can use addition or xor).
All these options can be tested sequentially and choose the best one.

D
dimarick, 2013-10-06
@dimarick

CRC8 gives a fairly good distribution at a good speed.
If the number of collisions is unacceptable, then increase the bit depth of the hash. Other algorithms will not significantly reduce the number of collisions.

Similar questions

Didn't find what you were looking for?

Ask your question

Ask a Question

731 491 924 answers to any question