Mathematics
MIsternik, 2016-12-19 22:01:47

Which text clustering method works best with a large number of topics?

I'm interested in algorithms that can give a probabilistic estimate of a text belonging to several clusters, for example:
"Label": "Science & Mathematics",
"Probability": 0.148,
"Label": "Astronomy & Space",
"Probability": 0.713
Does anyone have experience with something similar?


2 answer(s)
al_gon, 2016-12-19

Strictly speaking, this is better called text classification, not clustering, and the score is a "similarity" metric rather than a probability.
That said, in everyday language I would also say that the probability of a document belonging to a given class or category is such-and-such.
Clusters first have to be formed, whereas you are talking about ready-made categories, which presumably come with a ready-made labeled collection.
So in general, this is where you should look: https://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B4%D...
and if you have no initial categories at all, then here:
citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1....

Vlad_Fedorenko, 2016-12-19

As was correctly noted, if labels for the texts are given, then this is classification. I advise you to start with logistic regression on tf-idf features (optionally adding bigrams and trigrams).
If there are no labels and you want to get a given number of topics, then look towards latent Dirichlet allocation (LDA) or latent semantic analysis (LSA).
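For the labeled case, something like the following could be a starting point. It is only a rough sketch assuming scikit-learn is available; the toy texts, their labels and the helper name `label_probabilities` are made up for illustration and are not from the question. `predict_proba` gives one probability per known label, which is roughly the Label/Probability output asked about.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled collection (invented for this sketch; use your own data).
texts = [
    "the telescope observed a distant galaxy and its stars",
    "astronauts docked with the orbiting space station",
    "the rover landed on mars and photographed the planet",
    "the theorem was proved using linear algebra",
    "students solved the integral and the differential equation",
    "the proof relies on prime numbers and group theory",
]
labels = [
    "Astronomy & Space",
    "Astronomy & Space",
    "Astronomy & Space",
    "Science & Mathematics",
    "Science & Mathematics",
    "Science & Mathematics",
]

# tf-idf over unigrams, bigrams and trigrams, then logistic regression.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)


def label_probabilities(text):
    """Return every known label with its predicted probability, best first."""
    classifier = model.named_steps["logisticregression"]
    probabilities = model.predict_proba([text])[0]
    scored = [
        {"Label": label, "Probability": float(p)}
        for label, p in zip(classifier.classes_, probabilities)
    ]
    return sorted(scored, key=lambda item: item["Probability"], reverse=True)


for item in label_probabilities("the telescope observed a distant galaxy"):
    print(item)
```

With many topics the same pipeline still applies, since logistic regression in scikit-learn handles more than two classes; the probabilities over all labels sum to 1.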
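For the unlabeled case, here is a minimal sketch of latent Dirichlet allocation, again assuming scikit-learn. The documents, the choice of 2 topics and the helper name `topic_mixture` are illustrative assumptions. Each document gets a distribution over topics, so one text can belong to several topics with different weights.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Unlabeled toy documents (invented for this sketch).
docs = [
    "the telescope observed a distant galaxy and its stars",
    "astronauts docked with the orbiting space station",
    "the rover landed on mars and photographed the planet",
    "the theorem was proved using linear algebra",
    "students solved the integral and the differential equation",
    "the proof relies on prime numbers and group theory",
]

# LDA works on raw word counts rather than tf-idf weights.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

# n_components is the number of topics you want to get.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # one row per document, rows sum to 1


def topic_mixture(text):
    """Return the topic distribution of a new text under the fitted model."""
    return lda.transform(vectorizer.transform([text]))[0]


for doc, mixture in zip(docs, doc_topics):
    print([round(float(p), 3) for p in mixture], doc)
```

The topics come out unnamed (topic 0, topic 1, ...); to interpret them you would look at the highest-weighted words in `lda.components_`.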
