Python
lexstile, 2020-04-28 15:44:13

What should I use for training (machine learning) in Python?

There are two corpora of texts: some texts are suitable, the others are not.
The network needs to be trained to determine which texts are suitable and which are not (Russian/English).
I need the network to distinguish the presentation (how the texts are written) of the suitable texts from that of the unsuitable ones.
Can a problem like this be solved with Python libraries? Which one is best to use?


2 answer(s)
Sergey Pankov, 2020-04-28
@lexstile

so that the network distinguishes the presentation (how the texts are written)

Holy innocence!
With a problem statement as "clear" as that, it would be a sin not to solve the whole thing in one line.
You won't find a ready-made library for "comparing how texts are written".
Try computing overall n-gram statistics for all the suitable and all the unsuitable texts. Take the top of those statistics (the m most frequent n-grams across the labelled corpus), fix the order of the n-grams, and generate an m-dimensional vector for each text.
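A minimal sketch of that vectorization step (the corpus variable names, character n-grams of size 3, and m = 1000 are my assumptions, not part of the original answer):

from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams of a text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_vocab(texts, n=3, m=1000):
    """Take the m most frequent n-grams over the whole labelled corpus
    and fix their order, so every text maps to the same m dimensions."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    return [gram for gram, _ in counts.most_common(m)]

def vectorize(text, vocab, n=3):
    """Turn one text into an m-dimensional relative-frequency vector."""
    counts = Counter(char_ngrams(text, n))
    total = sum(counts.values()) or 1
    return [counts[gram] / total for gram in vocab]

# texts_good / texts_bad are the two corpora from the question (assumed names):
# vocab = build_vocab(texts_good + texts_bad)
# X = [vectorize(t, vocab) for t in texts_good + texts_bad]
# y = [1] * len(texts_good) + [0] * len(texts_bad)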
Train a multilayer perceptron on the resulting vectors, for example as sketched below.
Experiment with the hidden layers, the vector size m, and the training-sample size to avoid overfitting and get a good rate of correct answers.
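A minimal training sketch using scikit-learn's MLPClassifier (the library choice is my assumption; the tiny toy vectors are only there so the snippet runs on its own, in practice X and y come from the vectorization sketch above):

from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for the n-gram vectors and labels built above.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))  # watch this for overfitting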
If the corpora are small, you can try augmenting them by splitting the texts into sentences and reshuffling those sentences (a sketch follows). But don't expect miracles from this: you won't dramatically increase the sample this way, although it may improve the predictions by a few percent.
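A minimal augmentation sketch, assuming sentences can be split on '.', '!' and '?':

import random
import re

def augment(text, copies=3, seed=0):
    """Produce extra training texts by reshuffling the sentences of one text."""
    rng = random.Random(seed)
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    out = []
    for _ in range(copies):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        out.append(' '.join(shuffled))
    return out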
Yandex also has some sort of toolkit for this.
There is also https://www.nltk.org/
Finding a neural network library for Python is not a problem at all; take whichever one is easiest for you to get comfortable with.
You can try playing with word normalization before vectorization (sketched below), but important things can be lost along with the endings: familiarity of register, cases, grammatical person, participles and other turns of phrase.
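A minimal normalization sketch using NLTK's Snowball stemmer (the choice of stemmer is my assumption; the original answer names no specific tool):

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")  # use "english" for the English texts

def normalize(text):
    """Strip word endings; note that case, person, etc. are lost this way."""
    return ' '.join(stemmer.stem(w) for w in text.split())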
Ideally, you should give the texts to a philologist to read, so that they can tell you by which main criteria the corpus is actually divided. Even if the criteria are fuzzy, you may at least understand whether normalization is acceptable and which fixed features are worth adding ...
It may be effective to add features for swear words, colloquial expressions, neologisms, signs of compound sentences, signs of overcomplicated word formation, and so on; a sketch of such hand-made features follows.
Depending on the purpose of your system, this can either help or hurt.
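A minimal sketch of appending such hand-made features to each text vector (the word lists are empty placeholders, and the specific feature formulas are my guesses at what the answer describes):

import re

SWEAR_WORDS = set()   # fill with a real list for your language
COLLOQUIAL = set()    # colloquial expressions / neologisms

def extra_features(text):
    words = re.findall(r'\w+', text.lower())
    n = len(words) or 1
    return [
        sum(w in SWEAR_WORDS for w in words) / n,   # share of swear words
        sum(w in COLLOQUIAL for w in words) / n,    # share of colloquialisms
        text.count(',') / n,                        # rough sign of compound sentences
        sum(len(w) > 12 for w in words) / n,        # overcomplicated word formation
    ]

# full_vector = vectorize(text, vocab) + extra_features(text)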
Consult with experts. You can't solve everything in the world by throwing neurons at it.

yuraafanasiev, 2020-04-28
@yuraafanasiev

Man, I understand, of course, that maybe I'm not experienced enough in this area of programming yet, and many people will pounce on me with bared teeth and arguments about my illiteracy, but here is my advice: just take one line of the text, then iterate over that line character by character in a loop, and during the iteration check whether the characters match English letters (you can create a list of the letters written out separately: 'a', 'b', 'c', and so on). I hope I helped you :) Honestly, this is a very crude way of doing it, but it does work :) A sketch is below.
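A minimal sketch of that character-by-character check (treating a line as English if it contains any Latin letters; the function name is mine):

ENGLISH_LETTERS = set('abcdefghijklmnopqrstuvwxyz')

def contains_english(line):
    for ch in line.lower():
        if ch in ENGLISH_LETTERS:
            return True
    return False

# print(contains_english("Привет, мир"))   # False
# print(contains_english("Hello, мир"))    # True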
If you need help, you can contact me here: vk.com/yuraafanasiev (almost always online).
