How to divide "weights" into clusters CORRECTLY?

X

xmoonlight2021-01-12 18:37:22

Clustering

xmoonlight, 2021-01-12 18:37:22

There are meanings of "weights" of words:
1, 3, 5, 6, 8, 8.7, 9.5, 10, 12.
And how to divide them into clusters CORRECTLY?

Reply

Answer the question

In order to leave comments, you need to log in

1 answer(s)

W

Wataru, 2021-01-12
@xmoonlight

First you need to decide if you need a fixed number of clusters or a variable number. Then you need to come up with a metric that tells which clustering is better than the other.
Metric options:
- For each cluster, the largest distance between two elements is considered, and this is summed over all clusters. It is possible to sum the squares of these distances, then clusterings with very large clusters will be penalized.
- the ratio of the maximum distance between neighboring points in any cluster and the minimum distance between clusters.
- It can be a qualitative metric as well. Any clustering where the distance between neighboring points in a cluster is less than the distance between clusters is considered good. This is a special case of the previous metric, but you just need to look not for the minimum, but for any value <1.
Some metrics only make sense with a fixed number of clusters, like the first one.
Different metrics give different clusterings, and they are all good in some way. What exactly suits you in your task - you can only judge empirically.
On the line, you can optimize it quite quickly. For example, the third metric is generally solved by greed - we sort all the segments between neighboring points by length and eagerly merge the clusters until there is the required number.
Many metrics, if they are additive like the first one, can be considered dynamic programming: f(i,k) is the value of the metric if we split the first i points into k clusters.
For others, as for the second one, you will have to mix the answer dichotomy and dynamic programming (binary search by answer, then we check if there is a split with the same or better metric. Inside the dynamics, the minimum achievable maximum value between adjacent points in the class among the first i when splitting into k clusters.When sorting through the last cluster, you need to make sure that the distance between it and its neighbors does not exceed the response of the dynamics divided by the coefficient being sorted out).
You can also use standard methods without optimizations based on the fact that we have a one-dimensional space - stupidly use the k nearest neighbors method, for example.
You will have to try different methods on your real data and choose what works best.