Programming
egens, 2012-06-28 15:00:00

How to improve sentiment analysis of text in Russian?

I need to analyze the sentiment of a set of user comments on a certain topic. For now, I have decided to classify into three classes: negative, neutral and positive. To study the algorithms, 1500 comments were manually labeled; the class sizes in this sample differ by no more than a factor of two. Following the example of foreign colleagues, I applied a support vector machine over binary features that indicate the presence of words in a comment. Classification accuracy is below 60%, whereas sentiment analysis of English texts reportedly reaches around 80%.
One presumably significant problem is the large number of spelling and grammatical errors in the comments. The greater complexity of the Russian language also matters, as does the small number of open tools for processing Russian. I write my code in Python, and so far I have only found an implementation of the Porter stemmer and the pymorphy morphological analysis library.
I would appreciate advice of any kind. Are there other convenient, proven tools for analyzing Russian, preferably with a Python implementation? Is SVM the right choice of classification algorithm, or are there more effective classifiers? Are better feature spaces known?
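For reference, a minimal sketch of the baseline described above: binary word-presence features fed to a linear SVM with scikit-learn. The toy comments, labels and parameters are placeholders for illustration, not the real data or tuned settings.

# Baseline sketch: binary bag-of-words features + linear SVM (scikit-learn assumed)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

train_texts = [
    "отличный телефон, всем доволен",   # positive
    "ужасное качество, не покупайте",   # negative
    "получил заказ вчера",              # neutral
]
train_labels = ["pos", "neg", "neu"]

# binary=True gives the 0/1 "word present in the comment" features from the question
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_texts)

clf = LinearSVC()
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["качество ужасное, не советую"])))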


4 answer(s)
lightcaster, 2012-06-29
@lightcaster

I did this at work. Try:
- SVM with a linear kernel
- (1,2,3)-grams as features
- normalize the text, but try keeping some punctuation (! and ?)
- I worked with English, where a stemmer did not improve quality; for Russian you will have to try, but I won't be surprised if it doesn't help either.
In addition, you can play around with various transformations of the feature vectors. It did not help me, although in theory it should have; maybe I did something wrong. Try LSA (pLSA if the vocabulary is large).
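A rough sketch of these suggestions, assuming scikit-learn: unigram/bigram/trigram features, a token pattern that keeps ! and ? as separate tokens, an optional LSA step via truncated SVD, and a linear SVM. Parameter values are illustrative, not tuned.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    # (1,2,3)-grams; the token pattern keeps words plus ! and ? as tokens
    ("vec", TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\w+|[!?]")),
    # optional LSA: project the sparse n-gram space onto a few hundred dense
    # components (n_components must be smaller than the vocabulary size)
    ("lsa", TruncatedSVD(n_components=200)),
    ("svm", LinearSVC(C=1.0)),
])

# pipeline.fit(train_texts, train_labels)
# predicted = pipeline.predict(test_texts)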
Notes:
- good corpus is very important
- feature selection is important
- spelling doesn't have a strong effect
- deleting stop words didn't improve quality either
- indeed, irony and ambiguity are very hard to catch with this method; when you try to capture long-range dependencies, quality drops.
If you want extreme sports, you can try a classifier with a string kernel. I tried it and it didn't work for me, but in theory it can work.

Irokez, 2012-06-29
@Irokez

About the classifier:
SVM is often used for sentiment analysis, so the choice is sound. There is no definitive answer as to which classification algorithm will work best; experiment with different algorithms and pick the one that gives the best results. Offhand, I would suggest Naive Bayes and MaxEnt.
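A sketch of this "try several algorithms" advice with scikit-learn: the same binary bag-of-words features scored by cross-validation for SVM, Naive Bayes and MaxEnt (logistic regression). The texts and labels arguments stand in for the labelled comments.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

def compare_classifiers(texts, labels):
    # identical features for every classifier, so only the algorithm varies
    X = CountVectorizer(binary=True).fit_transform(texts)
    for name, clf in [("SVM", LinearSVC()),
                      ("Naive Bayes", MultinomialNB()),
                      ("MaxEnt", LogisticRegression(max_iter=1000))]:
        scores = cross_val_score(clf, X, labels, cv=5)
        print("%-12s %.3f" % (name, scores.mean()))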
About the features:
As was correctly suggested above, try bigrams, and also try character 2-, 3- and 4-grams; this can help with the spelling problem. Regarding the construction of the feature vector, there are more effective weighting functions than binary presence; I usually use delta tf-idf. As additional features, you can try morphological tags (part-of-speech tags); sometimes they help. Combinations of words with their tags also help occasionally (e.g.: I/pronoun, love/verb, tea/noun).
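A possible sketch of the character-n-gram idea with scikit-learn: word unigrams/bigrams combined with character 2-4-grams, which are more robust to misspellings. Plain tf-idf weighting is used here; delta tf-idf is not built into scikit-learn and would need a custom transformer.

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

features = FeatureUnion([
    ("word", TfidfVectorizer(ngram_range=(1, 2))),
    # char_wb builds character n-grams only inside word boundaries,
    # which keeps the feature space smaller than plain "char"
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
])

model = Pipeline([("features", features), ("svm", LinearSVC())])
# model.fit(train_texts, train_labels)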
About the data:
1500 comments - is that the test sample or the training data? Or everything together? You will need much more data for training. Depending on the topic, it can be collected from particular sites (for films, for example, Kinopoisk).
You can also build a sentiment dictionary, i.e. a list of words with their tone values (an affective lexicon): either manually, or by translating one from another language (English, for example) manually or automatically, or by using one of the methods for automatically compiling a sentiment lexicon. In general, classifying into three classes is quite hard. Try two classes first, and then either add an extra classifier or extend the model to three classes. There are many options here. Good luck!
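A hedged sketch of the "two classifiers" option: the first stage separates neutral from subjective comments, the second splits the subjective ones into positive and negative. Label names and vectorizer settings are assumptions for illustration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def train_cascade(texts, labels):
    """labels are 'neu', 'pos' or 'neg'; both stages share one vectorizer."""
    vec = TfidfVectorizer(ngram_range=(1, 2))
    X = vec.fit_transform(texts)

    # stage 1: neutral vs. everything else
    stage1 = LinearSVC().fit(X, [lab == "neu" for lab in labels])

    # stage 2: positive vs. negative, trained on subjective comments only
    subj = [i for i, lab in enumerate(labels) if lab != "neu"]
    stage2 = LinearSVC().fit(X[subj], [labels[i] for i in subj])
    return vec, stage1, stage2

def predict_cascade(vec, stage1, stage2, texts):
    X = vec.transform(texts)
    labels = []
    for i in range(X.shape[0]):
        if stage1.predict(X[i])[0]:
            labels.append("neu")
        else:
            labels.append(stage2.predict(X[i])[0])
    return labels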

kmike, 2012-06-28
@kmike

I warn you: I have no practical experience here.
You can watch lectures here: class.coursera.org/nlp/lecture/preview (Sentiment Analysis). As far as I understand, the critical thing here is handling negation (like vs. dislike), which a plain bag-of-words approach cannot do. For English, fairly simple approaches work: mark every word after a negation word ('not', for example) with a negative tag until a punctuation mark is encountered, and then apply the bag-of-words classifier to the words processed this way.
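A small sketch of that negation trick, adapted to Russian with a tiny illustrative list of negation words (by no means complete); the marked text can then be fed to the usual bag-of-words classifier.

import re

NEGATIONS = {"не", "нет", "ни"}          # illustrative, not exhaustive
PUNCT = re.compile(r"[.,!?;:]")

def mark_negation(text):
    """Append _NEG to every token after a negation word, until punctuation."""
    tokens, negated = [], False
    for tok in text.split():
        tokens.append(tok + "_NEG" if negated else tok)
        if tok.lower().strip(".,!?;:") in NEGATIONS:
            negated = True
        elif PUNCT.search(tok):
            negated = False              # punctuation ends the negation scope
    return " ".join(tokens)

print(mark_negation("не люблю этот фильм, но актеры хорошие"))
# -> не люблю_NEG этот_NEG фильм,_NEG но актеры хорошие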
The choice of features should have the biggest influence on the result (SVM vs. some other classifier will give a few percent difference at most).
Here is another paper on the topic that may be useful: www.dialog-21.ru/digests/dialog2011/materials/ru/pdf/50.pdf

ilya_volodin, 2012-06-29
@ilya_volodin

Interesting problem. I was chatting with my wife on Skype and she got offended. I meant the phrase I wrote quite positively (at least that is how I thought of it while writing), but she read it as very negative. When I re-read it later myself, it really could be understood either way.
