Which text clustering method works best with a large number of topics?
I am interested in algorithms that can give a probabilistic estimate of a text belonging to several clusters at once, for example:
"Label": "Science & Mathematics",
"Probability": 0.148,
"Label": "Astronomy & Space",
"Probability": 0.713
Does anyone have similar experience?
Strictly speaking, this is better called text classification, not clustering. And what you get is a "similarity" measure rather than a probabilistic score.
Although colloquially I would also say that the probability of a document belonging to a given class or category is such-and-such.
Clusters first have to be formed, whereas you are talking about ready-made categories, which presumably come with a labeled collection.
In short, start here: https://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B4%D...
and if you have no initial categories at all, then here:
citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1....
As was correctly noted, if labels for the texts are given, this is classification. I advise starting with logistic regression on tf-idf features (optionally adding bigrams and trigrams).
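To make the tf-idf plus logistic regression suggestion concrete, here is a minimal sketch in plain Python that returns a probability per label, in the spirit of the Label/Probability output from the question. The four training texts, the function names (`ngrams`, `fit_idf`, `tfidf`, `train`, `predict_proba`) and the hyperparameters are all assumptions made for illustration. In practice you would more likely reach for a library such as scikit-learn than hand-roll this.

```python
import math
from collections import Counter

def ngrams(text, n_max=3):
    """Lowercased word n-grams: unigrams, bigrams and trigrams by default."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(words) - n + 1)]

def fit_idf(docs):
    """Smoothed inverse document frequency for every n-gram seen in training."""
    df = Counter(term for doc in docs for term in set(ngrams(doc)))
    return {t: math.log((1 + len(docs)) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised tf-idf vector as a sparse dict; unseen n-grams are dropped."""
    tf = Counter(t for t in ngrams(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def predict_proba(vec, weights, labels):
    """Softmax over the per-label linear scores."""
    scores = [sum(weights[l].get(t, 0.0) * v for t, v in vec.items()) for l in labels]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {l: e / total for l, e in zip(labels, exps)}

def train(docs, targets, epochs=200, lr=1.0):
    """Multinomial logistic regression fitted by stochastic gradient descent."""
    idf = fit_idf(docs)
    labels = sorted(set(targets))
    weights = {l: {} for l in labels}
    vectors = [tfidf(d, idf) for d in docs]
    for _ in range(epochs):
        for vec, target in zip(vectors, targets):
            probs = predict_proba(vec, weights, labels)
            for l in labels:
                grad = probs[l] - (1.0 if l == target else 0.0)
                for t, v in vec.items():
                    weights[l][t] = weights[l].get(t, 0.0) - lr * grad * v
    return idf, weights, labels

# Hypothetical toy corpus; a real one needs many texts per label.
docs = [
    "the telescope observed a distant galaxy",
    "astronauts orbit the planet in a space station",
    "the proof uses algebra and geometry",
    "solve the equation with calculus",
]
targets = ["Astronomy & Space", "Astronomy & Space",
           "Science & Mathematics", "Science & Mathematics"]

idf, weights, labels = train(docs, targets)
probs = predict_proba(tfidf("a new telescope for galaxy surveys", idf), weights, labels)
for label, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {p:.3f}")
```

The sketch has no bias term and no regularisation, so the probabilities it prints on such a tiny corpus are overconfident. It is meant to show the mechanics, not to be used as is.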
If there are no labels and you want to obtain a given number of topics, look at latent Dirichlet allocation (LDA) or latent semantic analysis (LSA).
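For the unlabeled case, the following is a rough sketch of latent Dirichlet allocation fitted with a collapsed Gibbs sampler, again in plain Python. It returns, for each document, a mixture over topics that sums to 1, which is the "belongs to several clusters with some probability" output the question asks about. The corpus, the function name `lda_gibbs` and the values of `alpha`, `beta` and `iters` are assumptions chosen for illustration. Library implementations (for example scikit-learn's `LatentDirichletAllocation`) use other inference methods and are the better choice for real data.

```python
import random
from collections import Counter

def lda_gibbs(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    tokens = [doc.lower().split() for doc in docs]
    vocab_size = len({w for doc in tokens for w in doc})
    # Random initial topic for every token, plus the count tables Gibbs needs.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in tokens]
    doc_topic = [Counter(a) for a in assign]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for doc, a in zip(tokens, assign):
        for w, k in zip(doc, a):
            topic_word[k][w] += 1
            topic_total[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                p = [(doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                     / (topic_total[t] + vocab_size * beta)
                     for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=p)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed, normalised topic mixture for each document.
    return [[(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
             for t in range(n_topics)] for d, doc in enumerate(tokens)]

# Hypothetical toy corpus; topics found on so little text are not meaningful.
docs = [
    "telescope galaxy star orbit telescope",
    "planet orbit star galaxy",
    "algebra equation proof geometry",
    "equation calculus proof algebra",
]
mixtures = lda_gibbs(docs, n_topics=2)
for doc, mix in zip(docs, mixtures):
    print([round(p, 3) for p in mix], doc)
```

The topics themselves come out unnamed (topic 0, topic 1, ...), so labels like "Astronomy & Space" would have to be assigned by hand after inspecting the most frequent words of each topic.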