How to determine if a text belongs to a topic using Python?

Python

Gudsaf, 2017-11-09 09:42:22

Imagine this situation: there is a text, and you need to determine which topic from a list of topics it belongs to.
What do we have? A list of topics, where each topic is a set of words, for example:
football: football, ball, field, fan, gate, referee, goalkeeper, football player ...
biathlon: biathlon, rifle, skiing, skier, biathlete, snow, target ...
...
There is also some text that relates, in one way or another, to the topics defined earlier: that is, it belongs to a topic according to some measure, say from 0 to 1.
I assume this problem is solved by latent semantic analysis. There are articles on the subject, but their examples usually boil down to extracting the main theme of a text or measuring the similarity of two texts. I did not find this exact problem anywhere.
Surely there are Python libraries that solve this problem, and surely some of you have already solved it. Tell me about your experience and what guided your choices.

7 answer(s)

kzoper, 2017-11-09
@kzoper

www.nltk.org
scikit-learn.org

Sergey Nizhny Novgorod, 2017-11-09
@Terras

1) Decide whether this is a classification task (supervised) or a clustering task (unsupervised)
2) Use the bag-of-words method
3) scikit-learn.org
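
A minimal sketch of the bag-of-words idea, written with the standard library only rather than scikit-learn. The topic lists come from the question (the multi-word entry "football player" is left out for simplicity); the function names and the scoring rule (the share of tokens in the text that are topic words) are assumptions made for illustration, not something the answer prescribes.

```python
import re
from collections import Counter

# Topic word lists from the question; single-word entries only.
TOPICS = {
    "football": {"football", "ball", "field", "fan", "gate", "referee", "goalkeeper"},
    "biathlon": {"biathlon", "rifle", "skiing", "skier", "biathlete", "snow", "target"},
}


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def topic_scores(text, topics=TOPICS):
    """Return, per topic, the share of tokens in the text that are topic words (0..1)."""
    bag = bag_of_words(text)
    total = sum(bag.values())
    if total == 0:
        return {name: 0.0 for name in topics}
    return {
        name: sum(count for word, count in bag.items() if word in words) / total
        for name, words in topics.items()
    }


scores = topic_scores("The referee stopped play after the ball hit a fan")
best_topic = max(scores, key=scores.get)
```

Here `scores` maps each topic to a value between 0 and 1, and `best_topic` is the highest-scoring one. A real solution would probably add stemming or lemmatization and a trained classifier on top of the same representation.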

Alexey Cheremisin, 2017-11-09
@leahch

Well, it is not exactly Python, but there is Elasticsearch and its percolator query: https://www.elastic.co/guide/en/elasticsearch/refe...
The idea is this: we have ES with an index that stores our queries, for example "football: football, ball, field, fan, gate, referee, goalkeeper, football player".
We take a document and ask ES, via a percolator query, which stored queries this document matches. In response, ES returns the most relevant queries.
You can talk to ES from Python.
This is how we sort products into catalogs.

asd111, 2017-11-09
@asd111

A complete example of how to do this is in the scikit-learn tutorial: scikit-learn.org/stable/tutorial/text_analytics/wo...
Google "machine learning text classification" and "text categorization".

Andrey Fedoseev, 2017-11-09
@itlen

LSI and search in python

Dimonchik, 2017-11-09
@dimonchik2013

If you want to do it completely head-on:
1) extract the nouns from the sentences (POS tag Noun; there is also the main noun of a sentence)
2) then use Counter.most_common() to count them over the entire text and take the first N
3) and compute the cosine similarity of these N with the nouns of the topic
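
The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions: the standard library has no part-of-speech tagger, so a tiny hand-written stopword list stands in for real noun extraction (a POS tagger would replace it in practice), and the names `top_words`, `cosine`, and `topic_similarity` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a tiny stopword filter stands in for real noun extraction.
STOPWORDS = {"the", "a", "an", "and", "at", "in", "on", "of", "to", "his", "then", "he"}


def top_words(text, n=5):
    """Count the non-stopword tokens and keep the n most common ones."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
    return Counter(dict(Counter(tokens).most_common(n)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_similarity(text, topic_words, n=5):
    """Compare the top-n words of the text with a topic, each topic word weighted 1."""
    return cosine(top_words(text, n), Counter(topic_words))


text = "The biathlete took his rifle, hit the target, and then the biathlete skied on the snow"
biathlon = ["biathlon", "rifle", "skiing", "skier", "biathlete", "snow", "target"]
football = ["football", "ball", "field", "fan", "gate", "referee", "goalkeeper"]

sim_biathlon = topic_similarity(text, biathlon)
sim_football = topic_similarity(text, football)
```

Because all counts are non-negative, the similarity falls between 0 and 1, which matches the measure asked for in the question. Exact word matching means that "skied" does not match "skiing"; lemmatization would be needed for that.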

ivodopyanov, 2017-11-10
@ivodopyanov

For text clustering, there is also the BigARTM library from Vorontsov, which lets you "grow" clusters around predefined sets of words.