data mining
andyN, 2014-01-03 08:04:50

How to organize clustering of tens of millions of texts?

Happy new year, ladies and gentlemen!
At work, I suddenly needed to classify several tens of millions of articles; in other words, to determine the category of input texts. The output should be something like:
"Article 1" -> {"Politics": 99%, "Society": 85%}
"Article 2" -> {"Sports": 58%, "Society": 13%}
And so on. Here "Article 1" is the title of the article (it also has text, of course). The data in curly brackets are the couple of top categories. That is, I need to calculate which category the article most likely belongs to, and this coefficient of "probability", or rather "similarity", is expressed as a percentage.
There will be dozens of categories.
I'm not familiar with machine learning, unfortunately. Logically I understand that we need a labeled sample to "train" the clusterer. We have such a sample: tens of thousands of articles that already have an assigned category (a news site). That is, the whole software package must first run through this sample, "learn" (identify some marker words), and then classify our "raw" data.
I have no time to study the topic, watch long videos, write my own code, etc. Unfortunately, because the topic is interesting. But the fact remains: I would be very grateful for links to ready-made solutions for this task (clusterizers), so that I don't have to spend a lot of time adapting them. Surely something similar has already been written and posted on GitHub or elsewhere, but unfortunately I couldn't google anything.
I really hope for the help of the habra community, which has already helped me out more than once.
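For concreteness, here is a minimal sketch of the described train-then-classify workflow and the desired "top categories with percentages" output. Python with scikit-learn is assumed purely for illustration; none of these names come from the question itself.

```python
# Hedged sketch only: the library (scikit-learn), model choice and the toy data
# below are assumptions for illustration, not something taken from the question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Labeled sample from the news site: article texts and their known categories.
train_texts = ["elections parliament president vote", "match goal team championship"]
train_labels = ["Politics", "Sports"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

def top_categories(text, n=2):
    """Return the n most likely categories with their scores as percentages."""
    probs = model.predict_proba([text])[0]
    ranked = sorted(zip(model.classes_, probs), key=lambda pair: -pair[1])[:n]
    return {category: round(100 * p) for category, p in ranked}

print(top_categories("president signs a new law"))  # e.g. {"Politics": 97, "Sports": 3}
```

Note that predict_proba gives a distribution over all categories (the scores sum to 100%), so to get independent per-category "similarity" percentages like in the example above, one binary classifier per category would be needed instead.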


6 answer(s)
andyN, 2014-01-03
@andyN

P.S. The task looks like a classic one for neural networks, but 1) my practical knowledge in this area is close to zero, and 2) I am afraid that a neural network would work too slowly.

Andrew, 2014-01-03
@OLS

This is the problem of measuring how much an article deviates from a corpus. It is solved by literally one formula, provided you have ready-made corpora on the subject (and you have them, as I understand it):
habrahabr.ru/post/204104
(instead of the NKRP, use a thematic selection of already classified articles;
instead of Habr, the specific article under study)
I can offer program code, unfortunately in a retro language.
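Since the exact formula from the linked post is not quoted here, below is only a hedged illustration of the general idea in Python: score an article against each thematic corpus by how strongly its word frequencies deviate from that corpus (a unigram comparison with add-one smoothing; the post itself may use a different measure).

```python
# Illustration only: the concrete formula in the linked post may differ.
# Idea: the lower the average "surprise" of the article's words under a category
# corpus, the closer the article is to that category.
from collections import Counter
import math
import re

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def corpus_counts(corpus_texts):
    """Word counts over a thematic selection of already classified articles."""
    counts = Counter()
    for t in corpus_texts:
        counts.update(tokenize(t))
    return counts

def deviation(article, counts):
    """Average negative log-probability of the article's words under the corpus
    (add-one smoothing); a smaller value means a smaller deviation."""
    total = sum(counts.values())
    vocab = len(counts) + 1
    words = tokenize(article)
    score = sum(-math.log((counts.get(w, 0) + 1) / (total + vocab)) for w in words)
    return score / max(len(words), 1)

# For each category: build corpus_counts(...) once, then pick the category
# with the smallest deviation(article, counts) for a given article.
```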

Andrey Belov, 2014-01-03
@Andrey_Belov

You need classification, not clustering; these are different methods. So when you google, look for material on automatic text classification. As a ready-made solution, you can try something like Weka ( www.cs.waikato.ac.nz/~ml/weka ), but I don't know how well it works with Russian texts.

Sergey, 2014-01-06
@begemot_sun

You could also try my home-grown solution: https://github.com/loguntsov/bayes

iHun, 2014-08-18
@iHun

For similar tasks I use the LDA algorithm implemented in gensim. A given number of topics is created automatically, and the probability with which each document belongs to each topic is calculated.
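A minimal sketch of what this answer describes, using gensim's LdaModel; the documents, tokenization and number of topics below are placeholder assumptions.

```python
# Minimal gensim LDA sketch; documents, tokenization and num_topics are placeholders.
from gensim import corpora, models

raw_documents = ["first article text ...", "second article text ..."]  # your articles
texts = [doc.lower().split() for doc in raw_documents]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=10, passes=5)

# Probability with which a new document belongs to each discovered topic
new_bow = dictionary.doc2bow("text of a new article".lower().split())
print(lda.get_document_topics(new_bow))  # e.g. [(3, 0.71), (7, 0.12), ...]
```

Keep in mind that LDA topics are discovered unsupervised, so mapping them onto the site's named categories (Politics, Society, ...) would still require an additional step.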

xmoonlight, 2016-02-10
@xmoonlight

andyN: this is neither similarity nor clustering; this is SEGMENTATION.
It is done like this: take a word and assign it a weight for each of the relevant categories.
For example, "president":
[politics]:0.5
[law]:0.5
[society]: 0.4
[leisure]:0.1
[children]:0.1
and so on, for each ROOT word (stem). A dictionary of synonyms is also built, which links synonymous words to the known weights in the table. Repeated roots are not counted again when summing the weights.
After that, we transform the text according to the dictionary of synonyms and then calculate the weights for each category.
Profit!
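A small sketch of this scheme; all words, weights and the synonym entry below are made-up examples, not data from the answer.

```python
# Illustrative sketch of the described scheme: per-root category weights,
# a synonym dictionary, each root counted once, weights summed per category.
weights = {
    "president": {"politics": 0.5, "law": 0.5, "society": 0.4, "leisure": 0.1, "children": 0.1},
    # ... one entry per ROOT word
}
synonyms = {"head-of-state": "president"}  # maps a synonym to a known root

def categorize(words):
    seen_roots = set()
    totals = {}
    for w in words:
        root = synonyms.get(w, w)
        if root in seen_roots:      # repeated roots are not counted again
            continue
        seen_roots.add(root)
        for category, weight in weights.get(root, {}).items():
            totals[category] = totals.get(category, 0.0) + weight
    return sorted(totals.items(), key=lambda item: -item[1])

print(categorize(["president", "president", "head-of-state"]))
# -> [('politics', 0.5), ('law', 0.5), ('society', 0.4), ...]
```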
