What is the best algorithm to choose for clustering a large amount of data?
I would like to find an algorithm that will split a set of 10 million rows into roughly 12 clusters. The well-known k-means, for example, will not work here because of the algorithm's exponential worst-case complexity. Is there an algorithm that can handle this amount of data and still finish in a reasonable amount of time?
You left out the main thing: how many features describe your data?
With two or three features, the runtime is unlikely to be catastrophically long.
However.
Try DBSCAN, for example. It does not need to process the whole dataset at every step. Its average computational complexity is O(N log N), and O(N²) in the worst case. Here https://habr.com/ru/post/322034/ it is recommended for datasets on the order of 10⁶ points and even larger, especially if you can parallelize the implementation.
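A minimal sketch of what this could look like with scikit-learn's DBSCAN; the eps and min_samples values below are placeholders that you would have to tune for your actual data, and the random array just stands in for your 10 million rows:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder data: a smaller random sample with 3 features stands in
# for the real 10-million-row dataset.
rng = np.random.default_rng(0)
X = rng.random((100_000, 3))

db = DBSCAN(
    eps=0.05,        # neighborhood radius; dataset-specific, must be tuned
    min_samples=10,  # minimum points required to form a dense region
    n_jobs=-1,       # parallelize the neighbor queries across all cores
)
labels = db.fit_predict(X)

# DBSCAN picks the number of clusters itself; the label -1 marks noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} clusters, {np.sum(labels == -1)} noise points")
```

One caveat: unlike k-means, DBSCAN does not take the desired number of clusters (your 12) as a parameter; the count that comes out depends on the density settings, so you would steer it indirectly through eps and min_samples.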