Clustering Similar Arrays

data mining

axmakarov, 2014-02-04 23:18:20

Greetings! I need to perform fuzzy clustering of user requests based on the similarity of the people who made them. In my data, each request corresponds one-to-one to an array of the people who contacted the system with that request. As an illustration, here is an example (xls). In the file, each Query is assigned an array of 25 elements, but in practice the length varies, depending on how many people submitted that request. The requests need to be clustered, fuzzily, by how similar their arrays are. Which clustering algorithm is best suited to this task, ideally one available in an existing data mining library (C# or Python), and where should I start, for example, how do I calculate the distance between objects?

3 answer(s)

Andrew, 2014-02-04
@OLS

If the amount of data allows, count the number of shared users "C" between the analyzed query and the reference one. If the lengths of the original samples (25 in your example) can vary greatly (denote them "N[0]" for the analyzed query and "N[i]" for the reference one), it probably makes sense to normalize this count, for example as "2*C/(N[0]+N[i])", "C/SQRT(N[0]*N[i])", or "C/N[0]+C/N[i]".
If the amount of data does not allow this ("C" is statistically close to "0"), it may be justified to "expand the circle" of the analyzed query and the reference query by including in their samples (weighted by frequency, of course) the other queries that interested the people who made the analyzed query and the reference query, respectively. Whether this maneuver works depends on the subject area, that is, on the semantic relationships between users and queries.
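
To make the three normalizations concrete, here is a minimal Python sketch. The function name `overlap_scores`, the dictionary keys, and the sample user IDs are made up for illustration, and it assumes each user appears at most once per query, so the arrays can be treated as sets. The first formula is commonly known as the Dice coefficient and the second as the cosine (Ochiai) similarity of two sets; the third ranges from 0 to 2 rather than 0 to 1.

```python
from math import sqrt


def overlap_scores(users_a, users_b):
    """Return the three normalized overlap scores for two collections of user IDs."""
    a, b = set(users_a), set(users_b)  # assumption: duplicates carry no meaning
    if not a or not b:
        # An empty array shares nothing with anything; avoid division by zero.
        return {"dice": 0.0, "cosine": 0.0, "sum_of_shares": 0.0}
    c = len(a & b)  # "C": number of users present in both arrays
    return {
        "dice": 2 * c / (len(a) + len(b)),         # 2*C/(N[0]+N[i])
        "cosine": c / sqrt(len(a) * len(b)),       # C/SQRT(N[0]*N[i])
        "sum_of_shares": c / len(a) + c / len(b),  # C/N[0]+C/N[i]
    }


# Hypothetical data: two queries and the IDs of the users who made them.
query_users = [101, 102, 103, 104]
reference_users = [103, 104, 105]

scores = overlap_scores(query_users, reference_users)
# One possible way to turn a similarity in [0, 1] into a distance.
distance = 1.0 - scores["dice"]
```

A clustering routine that accepts a precomputed distance matrix could then be fed `1 - similarity` for every pair of queries; whether that suits a particular library depends on what input it expects.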

niosus, 2014-02-05
@niosus

As far as I understand, your only real problem is defining a similarity measure for two arrays, a "distance" so to speak. After that, you can use any clustering algorithm, for example k-means or fuzzy c-means (more complex ones are worth a look only if these two do not work). Any self-respecting clustering and data processing library should have implementations of them.
As for the measure itself, I was going to propose roughly the same thing as @OLS.
I would also suggest storing each array of users in sorted order; then the "distance" between any two arrays can be computed in linear time.
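
The linear-time idea can be sketched as a two-pointer walk over the two sorted arrays. This is only an illustration: the function name `count_common_sorted` is made up, and it assumes both arrays are sorted in ascending order and contain no duplicate user IDs.

```python
def count_common_sorted(a, b):
    """Count the IDs present in both a and b.

    Assumes a and b are sorted ascending with no duplicates.
    Runs in O(len(a) + len(b)) time.
    """
    i = j = common = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Same user in both arrays: count it and advance both pointers.
            common += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, skip it
        else:
            j += 1  # b[j] cannot appear later in a, skip it
    return common


# Hypothetical sorted user-ID arrays for two queries.
c = count_common_sorted([1, 3, 5, 7], [3, 4, 5, 8])
```

The resulting count is the shared-user number that then gets normalized by the array lengths.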

xmoonlight, 2016-02-10
@xmoonlight

Look here: How to determine the similarity of two strings?