How to classify text using a neural network?

Python

skiesx, 2016-11-25 22:04:08

Good day to all. I have text messages that need to be classified.
For example, these fit:

Good evening. I am looking for a makeup artist for Sunday. Thank you.

Tonight you need to install a TV on the wall. Help the masters????)))))

Good afternoon! Dear parents, please tell me a good orthopedist!

But these do not fit:

Hello. Do we sell such garlands?

Please tell me where (not necessarily in the residential complex) there is a good atelier. I need to change the fasteners on the fur coat.

1. First, I need to determine whether a message fits at all, that is, whether the author is looking for a person, a specialist. If they are looking for anything else, it does not fit.
2. Then I need to determine which category the message belongs to, for example medicine, beauty, cleaning, and so on.
As I understand it, I need to build a neural network with supervised learning: feed it such messages and correct it until the weights settle at reasonable values.
Please let me know if a ready-made solution to this problem exists, preferably a Python library that I can train so that it then classifies text correctly. Thanks in advance.

5 answer(s)

Roman Mirilaczvili, 2016-11-26
@2ord

If your knowledge of computational linguistics is only slightly above zero, this problem will not be solved quickly...
For some reason, the vast majority of developers cherish the hope that neural networks will magically solve any problem in artificial intelligence.
And what examples would you train the network on? Surely not plain text... Suppose you tell the network: "look, a person was mentioned here." So what? To a computer, text is just a sequence of bytes, so it cannot grasp what interests us humans. Everything has to be chewed up for the computer and explained in the language of numbers.
After all, how do native speakers understand that a text is about people? A person has a memorized set of words (a dictionary); on reading or hearing a word, they compare it against that vocabulary and decide which category it belongs to in the given context.
To conclude from a text that “they are looking for some kind of person, a master”, you need to pick out key words taken from a dictionary: “I am looking for”, “help”, “prompt”, “advise”, “required”, etc., combined with mentions of people (synonyms) and professions (a dictionary of professions).
A neural network is not needed at this stage. It helps with classification once you are working with numbers and facts (Boolean logic). So first extract facts and relations from the text, and only then feed them to a classifier. Besides neural networks, there are other classifiers that are simpler and easier to use, such as the Bayesian classifier. Neural networks, too, can be trained with or without supervision.
As an introduction, it makes sense to start with the lecture Yandex - Small ShAD - Linguistics in Search.pdf
For the practical part: What is the Tomita-parser, how Yandex can be used with it ...
The problem may well be solvable more simply, without neural networks.
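
The keyword-plus-profession-dictionary check described above can be sketched roughly as follows. This is only an illustration: the phrase list, the profession dictionary, the category names and the function name `find_service_request` are all hypothetical, and the code assumes English text.

```python
import re

# Hypothetical mini-dictionaries; real ones would be far larger.
REQUEST_PHRASES = ("looking for", "help", "tell me", "advise", "recommend", "need")
PROFESSIONS = {
    "makeup artist": "beauty",
    "orthopedist": "medicine",
    "master": "repair",
    "cleaner": "cleaning",
}


def find_service_request(text):
    """Return a category if the text asks for a specialist, else None."""
    # Lowercase, replace everything except letters with spaces, squeeze spaces.
    normalized = re.sub(r"[^a-z\s]", " ", text.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()

    # A message qualifies only if it contains a request phrase...
    if not any(phrase in normalized for phrase in REQUEST_PHRASES):
        return None

    # ...and also mentions a known profession.
    for profession, category in PROFESSIONS.items():
        if profession in normalized:
            return category
    return None
```

With these toy dictionaries, the makeup artist message from the question maps to "beauty", while the garlands and atelier messages return None because they mention no profession. Plain substring matching is crude (for example, "help" also matches inside "helpful"), which is one reason the answer points to proper linguistic tooling.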

Arseny Kravchenko, 2016-11-26
@Arseny_Info

1) Clean the data (remove stop words and special characters, normalize, etc.).
2) Vectorize the data (bag of words, tf-idf, n-grams...).
3) Split the sample into train/test.
4) Train the classifier (don't start with neural networks; start with something simpler, like a random forest).
5) Run cross-validation, be horrified by the result, and start fixing problems at every step.
A very basic tutorial: scikit-learn.org/stable/tutorial/text_analytics/wo... A much less basic one: nlp.stanford.edu/IR-book/.
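
To show what steps 1 and 2 actually compute, here is a dependency-free sketch of cleaning and tf-idf vectorization. The stop-word list, the function names and the smoothed idf formula are assumptions made for illustration, and the tokenizer assumes English text; in practice a library such as scikit-learn provides ready-made vectorizers.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"a", "the", "i", "am", "for", "to", "on", "you", "me", "is", "do", "we"}


def tokenize(text):
    """Step 1: lowercase, drop special characters and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf_vectors(documents):
    """Step 2: turn each document into a fixed-length tf-idf vector."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(documents)

    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    # Smoothed idf (an assumption of this sketch) keeps every weight positive.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[w] / total * idf[w] for w in vocabulary])
    return vocabulary, vectors
```

Calling `tfidf_vectors` on a list of messages returns the sorted vocabulary and one vector per message, all of the same length, which is the form a classifier in step 4 expects.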

rPman, 2016-11-25
@rPman

Raw text is not suitable as input to a neural network. You need criteria (features) whose number does not change from sample to sample and whose values are normalized (kept within limits, usually 0..1 or -1..+1). Criteria with categorical values (plain enums) can be encoded either as a single criterion with fixed values (a poor option, suitable only for values that can be meaningfully compared) or as a vector in which every element is 0 except one, which is 1. The same requirements apply to the network's outputs (for enums this is usually a probability vector).
Criteria for texts can be as simple as the presence of keywords or phrases, or as unusual as the number of characters between punctuation marks, the number of punctuation marks, the number of characters or words before a punctuation mark (for example, a question mark), etc. Even the number of syntax errors is a good criterion.
That is if you reinvent the wheel. I can't recommend existing solutions, as I haven't done this myself yet.
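
A minimal sketch of such criteria, assuming a hypothetical keyword list, a hypothetical set of categories and an arbitrary length cap of 500 characters. Every value stays within 0..1 and the vector length never changes.

```python
import re

# Hypothetical keyword list and category set, chosen only for illustration.
KEYWORDS = ("looking for", "help", "tell me", "need")
CATEGORIES = ("medicine", "beauty", "cleaning")


def one_hot(category):
    """Encode an enum value as a vector of zeros with a single 1."""
    return [1.0 if category == c else 0.0 for c in CATEGORIES]


def text_features(text, max_len=500):
    """Map a text to a fixed-length vector with every value in 0..1."""
    lowered = text.lower()
    n_chars = len(text)
    n_punct = len(re.findall(r"[.,!?;:]", text))

    # One binary criterion per keyword: present or not.
    features = [1.0 if kw in lowered else 0.0 for kw in KEYWORDS]
    # Text length, capped and scaled into 0..1.
    features.append(min(n_chars, max_len) / max_len)
    # Share of punctuation characters in the text.
    features.append(n_punct / n_chars if n_chars else 0.0)
    # Whether the text contains a question mark.
    features.append(1.0 if "?" in text else 0.0)
    return features
```

Here `text_features` would produce the network's input and `one_hot` the target vector for a labelled message; the network's output can then be read as one score per category.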

xmoonlight, 2016-11-26
@xmoonlight

it2manager, 2016-11-29
@it2manager

To solve your problem, study the naive Bayes algorithm ( bazhenov.me/blog/2012/06/11/naive-bayes) and use the pymorphy2 library to reduce words to their normal form. The whole thing can be implemented in Python in a couple of days. It works with an accuracy of up to 90% in the cases you described.
P.S. We use it to automatically classify requests from users :) The classifier is quite branched. We did not use ready-made libraries for the algorithm; we wanted to tinker with it ourselves. SQLite is used as the database.
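
For orientation, a hand-written naive Bayes classifier of the kind described above might look roughly like this. It is a sketch under assumptions, not the code from this answer: the class and method names are hypothetical, counts are kept in memory rather than in SQLite, and `normalize` only lowercases English words as a stand-in for the pymorphy2 lemmatization step.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(text):
    """Stand-in for lemmatization: lowercase and split into words."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        """Add one labelled message to the counts."""
        words = normalize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        words = normalize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # log P(label) + sum of log P(word | label), with add-one smoothing
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Typical use is to call `train(text, label)` once per labelled message and then `classify(text)` on new ones. Working in log space avoids numeric underflow on long messages, and add-one smoothing keeps unseen words from zeroing out a category.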