How to determine if a text belongs to a topic using python?
Imagine this situation: there is a text, and you need to determine which topic from a list of topics it belongs to.
What we have: a list of topics, where each topic is a set of words, for example:
football: football, ball, field, fan, gate, referee, goalkeeper, football player ...
biathlon: biathlon, rifle, skiing, skier, biathlete, snow, target ...
...
There is also some text that relates, to a greater or lesser degree, to the topics defined above: that is, it belongs to each topic with some measure, say from 0 to 1.
I assume this problem is solved with latent semantic analysis. There are articles on the subject, but their examples usually boil down to extracting the main theme of a text or comparing how close two texts are. I did not find my problem addressed as stated.
Surely there are Python libraries that solve this problem, and surely some of you have already solved it. Tell me about your experience and what guided your choices.
1) This is a classification task (supervised) or a clustering task (unsupervised)
2) Use the bag-of-words method
3) See scikit-learn.org
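The three points above can be sketched as a tiny supervised classifier. This is only an illustration, assuming scikit-learn is installed: the training sentences, the choice of `MultinomialNB`, and the helper name `topic_scores` are my assumptions, not something the answer prescribes. `predict_proba` gives a score between 0 and 1 per topic, which matches the measure asked for in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labeled texts (made up for illustration; real data would be much larger).
train_texts = [
    "the goalkeeper saved the ball near the gate",
    "the referee stopped the football match on the field",
    "fans cheered the football player",
    "the biathlete missed the target with his rifle",
    "the skier crossed the snow before the shooting range",
    "biathlon combines skiing and rifle shooting",
]
train_labels = ["football"] * 3 + ["biathlon"] * 3

# Bag of words (word counts) followed by a naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)


def topic_scores(text):
    """Return a {topic: probability} dict; the probabilities sum to 1."""
    probs = model.predict_proba([text])[0]
    return {label: float(p) for label, p in zip(model.classes_, probs)}


scores = topic_scores("the referee gave the ball to the goalkeeper")
print(max(scores, key=scores.get), scores)
```

With only topic word lists and no labeled texts, this supervised setup does not apply directly; you would first need example documents for each topic.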
Not exactly a Python solution, but there is Elasticsearch and its percolator query: https://www.elastic.co/guide/en/elasticsearch/refe...
The idea is this: we have ES with an index that stores our queries, e.g. "football: football, ball, field, fan, gate, referee, goalkeeper, football player".
We take a document and ask ES, via a percolator query, which stored queries this document matches. In response, ES returns the most relevant queries.
You can communicate with ES through python.
This is how we organize products into catalogs.
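As a rough sketch of the request bodies involved, assuming a recent Elasticsearch version with typeless mappings: the field names `query`, `text`, and `topic` and the three helper functions are hypothetical, chosen for illustration. The code only builds Python dicts; sending them with a client (for example the official `elasticsearch` package) is left out because it needs a running cluster.

```python
def percolator_mapping():
    """Index mapping: 'query' holds stored queries, 'text' is the field they search."""
    return {
        "mappings": {
            "properties": {
                "query": {"type": "percolator"},
                "text": {"type": "text"},
                "topic": {"type": "keyword"},
            }
        }
    }


def topic_document(topic, words):
    """A stored query for one topic: match any of its words in 'text'."""
    return {"topic": topic, "query": {"match": {"text": " ".join(words)}}}


def percolate_request(text):
    """Search body asking which stored queries match the given document."""
    return {
        "query": {
            "percolate": {"field": "query", "document": {"text": text}}
        }
    }


football = topic_document("football", ["football", "ball", "field", "referee"])
request = percolate_request("The referee stopped the match on the field.")
print(football)
print(request)
```

The mapping would be used when creating the index, each topic document indexed once, and the percolate body sent as a normal search against that index.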
A complete example of how to do this is in the scikit-learn tutorial: scikit-learn.org/stable/tutorial/text_analytics/wo...
Search Google for "machine learning text classification" and "text categorization".
If you want the most straightforward approach:
1) extract the nouns from the sentences (the Noun tag; there is also the main noun)
2) then use Counter.most_common() to count them across the entire text and take the first N
3) and compute the cosine similarity of these N words with the nouns of each topic
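The steps above can be sketched with the standard library alone. One loud assumption: real noun extraction needs a part-of-speech tagger, so this sketch swaps it for plain tokenization plus a small hand-written stop-word list. The names `TOPICS`, `STOP_WORDS`, `top_words`, `cosine`, and `topic_scores` are hypothetical, and the topic lists are shortened to single words.

```python
import math
import re
from collections import Counter

TOPICS = {
    "football": ["football", "ball", "field", "fan", "gate", "referee", "goalkeeper"],
    "biathlon": ["biathlon", "rifle", "skiing", "skier", "biathlete", "snow", "target"],
}

# Stand-in for noun extraction: drop a few common function words.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "with", "his", "is", "was"}


def top_words(text, n=10):
    """Count the words of the text and keep the N most common ones."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return dict(counts.most_common(n))


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * b.get(word, 0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topic_scores(text, n=10):
    """Score the text against every topic; each score lies between 0 and 1."""
    words = top_words(text, n)
    return {
        topic: cosine(words, dict.fromkeys(vocab, 1))
        for topic, vocab in TOPICS.items()
    }


text = "The goalkeeper caught the ball and the referee stopped the match on the field."
print(topic_scores(text))
```

Because all weights are non-negative, the cosine stays in the 0 to 1 range. Word forms are not normalized here, so "skiers" would not match "skier" without stemming or lemmatization.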
For text clustering, there is also the BigARTM library by Vorontsov, which lets you "grow" clusters around sets of predefined words.