Where am I going wrong in my text classification algorithm?
Hello. I have the following problem: information on classifying text messages in Russian is scarce, so a few questions have come up and I don't fully understand the overall algorithm.
Given: a CSV file with 10,000 requests that need to be sorted into categories. As I understand it, the algorithm is as follows (a rough code sketch is given after the list):
1) Take the file and normalize the text: remove stop words and punctuation marks, and reduce every word to a single form (or, more precisely, to its base form, i.e. perform stemming). Then split the whole sample into test and training sets (30/70).
2) Does this mean the corpus has to be labeled by category manually? Or can TF-IDF be used to pick out the frequently occurring words instead?
3) Convert the words into vector form. Here, too, I have a question: which is better? Bag of Words? Do we build a separate vector for each request out of the words found in it, or build one at once for an entire category (or possibly for the whole sample)? That is, should the output be several vectors, or one large vector (of frequently occurring words?) per category?
4) Feed the resulting vector(s) into one of the classification algorithms and train it.
5) Take a request from the test sample, normalize it the same way, feed it to the trained algorithm, and look at the answer.
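Here is roughly how I picture steps 1–5 in code. This is only a minimal sketch with scikit-learn; the file name requests.csv and the column names text/category are placeholders for my real data, and the real normalization for Russian would need its own preprocessor (e.g. a stemmer or pymorphy2):

```python
# Minimal sketch of steps 1-5 with scikit-learn.
# "requests.csv" with columns "text" and "category" is a placeholder.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

df = pd.read_csv("requests.csv")

# Step 1: split the sample into training and test sets (70/30).
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.3, random_state=42
)

# Steps 3-4: TF-IDF builds one vector per request over a shared
# vocabulary; a linear classifier is then trained on those vectors.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),  # stop_words can take a Russian stop list
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

# Step 5: predict on the held-out test requests and inspect the quality.
print(classification_report(y_test, model.predict(X_test)))
```

If this is right, then each request gets its own vector, and the vocabulary (one column per word) is shared across the whole sample rather than built per category.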
It seems like something along those lines. And one last question: so that I don't have to train the classifier ten times over and keep everything in memory, is it possible (say, if we take a neural network) to save the weights and simply load them at startup so it's ready to use, or does training have to be repeated on every run (see the sketch below)? Thanks to everyone in advance.
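For that last question, I'm imagining something like the following, assuming the trained scikit-learn pipeline from the sketch above can be serialized with joblib (for a neural network, I suppose the framework's own weight-saving mechanism would play the same role):

```python
# Save the trained pipeline once, then reload it on later runs
# instead of retraining (joblib is installed with scikit-learn).
import joblib

joblib.dump(model, "classifier.joblib")   # right after training
model = joblib.load("classifier.joblib")  # on every subsequent run
print(model.predict(["новый запрос для классификации"]))
```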