Classification of large texts through supervised learning - what approaches exist?
Hello comrades.
I recently started studying neural networks, but I'm already hooked on the topic and have begun thinking about pet projects.
Specifically, I really want to build a neural network that classifies large (from 5 kB to 100-200 kB) texts in Russian into several known categories. The problem is that I can't find any information about supervised learning on large texts: the articles I found describe cases with small (under 1 kB) texts. Will those examples still work when scaled up by a factor of 100?
A secondary problem is that I don't quite know what additional difficulties processing texts in Russian, rather than English, will bring. Cases, genders, and numbers are sure to confuse the network without special preprocessing.
And finally, I'm not entirely sure my hardware is up to such a task. Training on (tens of?) thousands of texts ranging from a couple of kilobytes to a couple of hundred kilobytes: is this feasible on an average computer, or does this scale require dozens of servers, meaning I should scale back my ambitions?
So I'm asking experienced comrades for advice: is this task feasible for me, and what approaches can be used to solve it?
What's the point of worrying about whether it will work? Just start doing it; no one will be worse off for it.
As for pitfalls: if you feed, say, one word into the neural network as a signal, then increasing the number of words amplifies that signal. Texts with more words will produce larger signal values, and texts with fewer words smaller ones. So you need to normalize these signals: roughly speaking, divide each signal by the number of words in the text, for every text, so that texts of different lengths can be compared fairly.
Regarding cases and gender: for this, stemming is used, a preliminary cleaning of the text from such specifics (reducing words to a neutral form). Regarding English versus Russian: there is no difference (only the stemming has to be adapted for Russian); the training itself works the same way.
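A minimal sketch of the normalization idea above, assuming the simplest possible setup (raw word counts as the signal): dividing each count by the total number of words makes texts of different sizes comparable.

```python
# Raw word counts grow with text length, so we divide each count by the
# total number of words to get length-independent frequencies.
from collections import Counter

def normalized_counts(text: str) -> dict[str, float]:
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty texts
    return {word: count / total for word, count in counts.items()}

short = normalized_counts("cat dog cat")
long = normalized_counts("cat dog cat " * 100)
# Both texts have the same word proportions, so the normalized signals
# match even though the long text has 100x more raw occurrences.
```

Here `short["cat"]` and `long["cat"]` both equal 2/3, which is exactly the "more honest" comparison across texts of different volumes that the answer describes.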
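To make the stemming idea concrete, here is a deliberately toy illustration (not a real stemmer): strip a few common Russian inflectional endings so that case, gender, and number variants of a word collapse into one token. The suffix list below is a made-up minimal example; in practice you would use a proper tool such as the Snowball Russian stemmer or a lemmatizer like pymorphy2.

```python
# Toy suffix stripper: a crude stand-in for real Russian stemming.
# The suffix list is illustrative only, not linguistically complete.
SUFFIXES = ("ами", "ями", "ого", "его", "ому", "ему",
            "ая", "яя", "ый", "ий", "ой", "ах", "ях",
            "а", "я", "ы", "и", "у", "ю", "е", "о")

def crude_stem(word: str) -> str:
    # Try longer suffixes first; keep at least a 3-letter stem.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Different case forms of "книга" (book) collapse to the same stem:
print(crude_stem("книга"), crude_stem("книги"), crude_stem("книгу"))
```

After this preprocessing, the classifier sees one token ("книг") instead of three, which is exactly the "neutral form" the answer refers to.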
For training on texts you could even use a computer from the 90s; everything will work.
I would like to know before answering:
1. What have you already read?
2. What have you tried to do?
3. What happened and where did you stop?
Wanting is not harmful, but I still recommend going the classic route and mastering the standard tools and concepts first; this will save you from incorrect (and often idiotic) assumptions. For example, a large text can easily be classified with plain TF-IDF plus vector proximity.
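A from-scratch sketch of what "TF-IDF + vector proximity" means here: represent each text as TF-IDF weights and classify a new text by cosine similarity to labeled examples (1-nearest-neighbor). The tiny corpus and the "finance"/"sport" labels are made up for illustration; a real project would use a library such as scikit-learn and far more data.

```python
# TF-IDF weighting plus cosine similarity, implemented in pure Python.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (tf * idf)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (c / len(tokens)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

train = ["котировки акций выросли на бирже",
         "команда выиграла матч чемпионата",
         "банк снизил ставку по кредитам",
         "вратарь отразил пенальти в финале"]
labels = ["finance", "sport", "finance", "sport"]

# Vectorize training texts together with the query text.
vecs = tfidf_vectors(train + ["акции банка упали на бирже"])
query = vecs[-1]
best = max(range(len(train)), key=lambda i: cosine(vecs[i], query))
print(labels[best])  # prints "finance"
```

Note that this toy tokenizer treats "акций" and "акции" as different terms, which is precisely why the stemming step discussed above matters for Russian: without it, the query only matches the finance document through the shared words "на" and "бирже".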