How to do hash based clustering?

T

tsvikm2015-02-04 04:12:39

data mining

tsvikm, 2015-02-04 04:12:39

Given:
There are many inputs, let it be 1TB. Let's represent them in the form of a binary code. The resulting binary sequence is divided into parts of length N. That is, there will be a very large number of binary sequences (hereinafter segments) of a fixed length N (length N can be different, from 8 to 65536).
Purpose:
It is necessary to select K clusters, each of which will contain similar segments (the degree of similarity must be set).
Obviously, representing them as N-dimensional vectors and doing classical clustering by distance (Hamming, Euclid, etc.) will be too expensive, because you will have to calculate the distances between all vectors (bitwise comparison), which are very many. We need some algorithm that will allow clustering by hash codes. That is, you first need a hash function that will give similar hash codes to similar vectors. Or such a hash function is not necessarily needed, but only an algorithm is needed that will determine the degree of similarity of vectors through their hash codes. Any ideas how to do it? Or suggest some algorithms. There are such, but in Russian I did not find a description suitable for this task, and in English I still cannot digest such literature. Thanks in advance.
upd. I found something that seems to fit, but I can't figure it out (technical English suffers): roussev.net/pubs/2010-IFIP--sdhash-design.pdf
Rabin Fingerprinting, Fuzzy Hashing and how it works are covered here. Does anyone know similar articles in Russian or can help with adapting this one?

Reply

Answer the question

In order to leave comments, you need to log in

2 answer(s)

S

Sergey, 2015-02-04
@begemot_sun

What is your understanding of the degree of similarity of vectors?
If you take a hash from the data, then it has nothing to do with "vector similarity".
For hash clustering, it's enough to use the hash as a 32 (64, or other) bit number, and take the remainder of the division by the number of groups you need.

X

xmoonlight, 2016-02-10
@xmoonlight

Let binary sequences be sentences.
And the set of characteristics in them is words.
Let the separator be a space character.
Then using this algorithm How to determine the similarity of two strings?
we can select the records we want with the least performance cost.