Is there more than 1 cluster?

Sergey Sokolov, 2018-03-20 11:33:54

Analytics

There is a one-dimensional set of values. For example:
[1, 2, 1, 0, 1, 1, 0, -1, 0, 21, 22]
It may contain two clearly distinguishable clusters, as in this example, where the pair of values [21, 22] differs sharply from the rest of the data. Or it may not.
How can such data be analyzed correctly, without setting a threshold by hand? Clearly, even random data can be split into two groups somehow, but there is not always sufficient reason to treat those groups as separate clusters.
upd. It turns out this is a standard problem: determining the number of clusters (in English)

4 answers

dmshar, 2018-03-20
@dmshar

In fact, there is a whole branch of Data Mining called cluster analysis, and probably around fifty different methods have been developed for problems like yours. Some of them work "without manually setting the threshold" (I'll let you in on a secret: you can do without setting any threshold at all).
Your case is the simplest one: one-dimensional. Real-life problems are much harder. In any case, the choice of approach and of a specific clustering method depends on the data: the scale on which it is measured, how much of it there is, whether its distribution is known, and which proximity measures can be defined in the feature space. Cluster analysis also covers how to JUSTIFY the choice of the number of groups into which the sample is divided, and how to identify anomalies (outliers). Finally, there are methods for comparing clustering methods with one another.
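As a rough illustration of what "justifying" a split can look like in the one-dimensional case, here is a minimal sketch that is not taken from any of the sources below. It assumes two things: that the candidate split is the contiguous cut of the sorted values with the smallest within-group sum of squares, and that the mean silhouette width is used as the quality score (one of several possible criteria). The function names are made up for the example. The silhouette is not defined for a single cluster, so the score only says how convincing the two-group split is: values close to 1 mean well-separated groups, noticeably lower values mean the split is doubtful.

```python
def best_two_way_split(values):
    """Split the sorted values into two contiguous groups so that the
    total within-group sum of squared deviations is as small as possible.
    Needs at least two values."""
    xs = sorted(values)

    def sse(group):
        m = sum(group) / len(group)
        return sum((x - m) ** 2 for x in group)

    cut = min(range(1, len(xs)), key=lambda i: sse(xs[:i]) + sse(xs[i:]))
    return xs[:cut], xs[cut:]


def mean_silhouette(low, high):
    """Mean silhouette width of a two-group split of 1-D data.
    A point in a single-element group gets a score of 0 by convention."""
    scores = []
    for own, other in ((low, high), (high, low)):
        for x in own:
            if len(own) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other points of the same group
            a = sum(abs(x - y) for y in own) / (len(own) - 1)
            # b: mean distance to the points of the other group
            b = sum(abs(x - y) for y in other) / len(other)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


data = [1, 2, 1, 0, 1, 1, 0, -1, 0, 21, 22]
low, high = best_two_way_split(data)
score = mean_silhouette(low, high)
print(low, high, round(score, 3))

# For comparison: evenly spaced values with no real cluster structure
even_low, even_high = best_two_way_split(list(range(10)))
even_score = mean_silhouette(even_low, even_high)
print(even_low, even_high, round(even_score, 3))
```

On the data from the question the cut separates [21, 22] from the rest and the score is above 0.9, while the evenly spaced values score much lower. Where exactly to draw the line between "convincing" and "doubtful" is still a judgment call.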
As for your specific example, the data set is so small and the clusters are so pronounced that there is no doubt about them. But if you want to go deeper into the problem, you cannot do without studying the theory. Clustering is often taught as one of the areas of machine learning and is covered in the relevant books and courses. As a starting point I can recommend:
https://habrahabr.ru/post/101338/
https://habrahabr.ru/company/ods/blog/325654/
Flach "Machine Learning: The Science and Art of Building Algorithms"
Barseghyan "Data and Process Analysis"
and more serious sources:
Mandel "Cluster Analysis"
Kim "Factor, Discriminant and Cluster Analysis"
Mirkin "
Aggarwal, Chandan K. Reddy "Data Clustering: Algorithms and Applications"
and others. There is a sea of sources on the topic.
Good luck.

codemania, 2018-03-20
@codemania

Standard deviation?

Danil, 2018-03-20
@DanilBaibak

The three-sigma rule: for a normal distribution, about 99.7% of values lie within 3*sigma of the mean, so anything farther away can be considered an outlier.
https://basegroup.ru/community/glossary/3-sigma
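
A small sketch of how the rule behaves on the data from the question, using only the Python standard library. The function names are made up for the example. One caveat: the outliers themselves inflate sigma, so on this sample the plain rule flags nothing. The second function is a different, more robust variant (not the three-sigma rule itself): it replaces the mean with the median and sigma with the median absolute deviation scaled by 1.4826, a factor that makes it comparable to sigma for normal data. The cutoff of 3 is kept as an assumption by analogy.

```python
from statistics import mean, median, pstdev

data = [1, 2, 1, 0, 1, 1, 0, -1, 0, 21, 22]


def three_sigma_outliers(values):
    """Values farther than 3 population standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [x for x in values if abs(x - mu) > 3 * sigma]


def mad_outliers(values, cutoff=3.0):
    """Robust variant: distance from the median measured in units of the
    scaled median absolute deviation (MAD). The cutoff is an assumption."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    scale = 1.4826 * mad  # consistency factor for normally distributed data
    if scale == 0:
        # More than half of the values are identical: flag everything else
        return [x for x in values if x != med]
    return [x for x in values if abs(x - med) / scale > cutoff]


print(three_sigma_outliers(data))  # [] : sigma is about 8.1, so 3*sigma is about 24.3
print(mad_outliers(data))          # [21, 22]
```

So the rule is a reasonable first check when the bulk of the data is roughly normal and outliers are rare, but on a small sample like this one the robust version is the one that picks out [21, 22].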

Andrey Fedoseev, 2018-04-01
@itlen

Plot the data and look for spikes.