What is the best algorithm to choose for clustering a large amount of data?
I would like to find an algorithm that will split a set of 10 million rows into roughly 12 clusters. The well-known k-means, for example, will not work here because of the algorithm's exponential worst-case complexity. Is there an algorithm that can handle this amount of data and still finish in a reasonable amount of time?
You left out the main thing: how many features describe your data?
With two or three features, the runtime is unlikely to be catastrophically long.
However.
Try DBSCAN, for example. It does not need to process the whole dataset at every step. Its average computational complexity is O(N log N), and O(N²) in the worst case. Here https://habr.com/ru/post/322034/ it is recommended for datasets on the order of 10⁶ points and even larger, especially if you can parallelize the implementation.
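A minimal sketch of what this could look like with scikit-learn's DBSCAN; the eps and min_samples values below are placeholders that you would have to tune for your actual data, and the random array just stands in for your 10 million rows:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder data: a smaller random sample with 3 features stands in
# for the real 10-million-row dataset.
rng = np.random.default_rng(0)
X = rng.random((100_000, 3))

db = DBSCAN(
    eps=0.05,        # neighborhood radius; dataset-specific, must be tuned
    min_samples=10,  # minimum points required to form a dense region
    n_jobs=-1,       # parallelize the neighbor queries across all cores
)
labels = db.fit_predict(X)

# DBSCAN picks the number of clusters itself; the label -1 marks noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```

One caveat: unlike k-means, DBSCAN does not take the desired number of clusters (your 12) as a parameter; the count that comes out depends on the density settings, so you would steer it indirectly through eps and min_samples.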