How to properly generate data for a neural network?

Neural networks

retr0, 2020-08-01 18:56:55

Out of curiosity, I decided to try writing a primitive neural network that classifies text sentences by type (question, statement, joke, etc.). I manually prepared about 400 messages to train the network and assigned each one the appropriate type. I ran into a problem when designing the network, because to form the input layer I decided to use a database of Russian words in all their morphological forms (about 1,500,000 words). That is, the number of input neurons equals the number of words in the database, and each input is either 0 (the word is not in the message) or 1 (the word is in the message).
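The encoding described above can be sketched in a few lines. This is only an illustration with a made-up five-word vocabulary; `toy_vocabulary`, `build_index` and `encode_message` are hypothetical names, and splitting on whitespace stands in for whatever tokenization is actually used.

```python
def build_index(vocabulary):
    """Map each vocabulary word to the position of its input neuron."""
    return {word: position for position, word in enumerate(vocabulary)}


def encode_message(message, index):
    """Return a 0/1 list with one entry per vocabulary word."""
    vector = [0] * len(index)
    for token in message.lower().split():
        position = index.get(token)
        if position is not None:  # words outside the vocabulary are ignored
            vector[position] = 1
    return vector


# Hypothetical toy vocabulary; the real one would have ~1,500,000 entries.
toy_vocabulary = ["how", "are", "you", "today", "fine"]
toy_index = build_index(toy_vocabulary)
encoded = encode_message("How are you", toy_index)
```

With the full vocabulary every message becomes a list of about 1,500,000 values, almost all of them zero, which is where the memory problem comes from.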
And obviously I found that my PC cannot handle this much work, and neither can any PC in the world, I suppose :) So I became curious how smart people deal with such situations, for example when they need to process a very high-quality, high-resolution image.
Please don't throw tomatoes at me: I am only starting to learn the topic and understand parts of it only superficially :)

1 answer(s)

freeExec, 2020-08-01
@retr0

I don't know why you didn't look this up yourself; there is a sea of information out there.
But still, here is a brief summary:
On text: yes, words are encoded, but not as a plain number. Each word becomes a vector in a high-dimensional space (say, twelve dimensions, which is already almost impossible to picture; pretrained vectors usually have a few hundred). A neural network is also used to build such vectors. However, they were built long ago and can simply be downloaded; there are even sets for the Russian language. Their main feature is that if you subtract the "man" vector from the "woman" vector and add that difference to the "king" vector, you get (approximately) the "queen" vector.
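To make the vector arithmetic concrete, here is a minimal sketch. The three-dimensional `toy_vectors` are made up by hand so that the arithmetic works out exactly; real embeddings are learned from text, and there the analogy holds only approximately. The helper names `cosine` and `analogy` are hypothetical as well.

```python
import math

# Hand-made toy vectors (an assumption for illustration, not real embeddings).
toy_vectors = {
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "apple": [0.2, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word closest to vectors[b] - vectors[a] + vectors[c]."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # The three query words themselves are excluded from the candidates.
    candidates = [word for word in vectors if word not in (a, b, c)]
    return max(candidates, key=lambda word: cosine(vectors[word], target))


result = analogy("man", "woman", "king", toy_vectors)
```

With downloaded pretrained vectors the same lookup is done over the whole vocabulary, usually through the library the vectors ship with.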
Second, text is processed with neural networks that have "memory" (recurrent networks). That is, at each step the input is the next word of the sentence together with the state left over from the previous step.
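A rough sketch of that idea, assuming the simplest recurrent cell (a tanh layer over the current word vector plus the previous state). The weights `toy_w_x` and `toy_w_h` and the three-word `toy_sentence` are arbitrary made-up numbers; in a real network the weights are learned, and the function names are hypothetical.

```python
import math


def rnn_step(x, h_prev, w_x, w_h):
    """Compute the new state: tanh(w_x @ x + w_h @ h_prev), row by row."""
    return [
        math.tanh(
            sum(w * value for w, value in zip(w_x_row, x))
            + sum(w * value for w, value in zip(w_h_row, h_prev))
        )
        for w_x_row, w_h_row in zip(w_x, w_h)
    ]


def run_rnn(word_vectors, w_x, w_h):
    """Feed the words one by one, carrying the state between steps."""
    state = [0.0] * len(w_h)  # the "memory" starts out empty
    for x in word_vectors:
        state = rnn_step(x, state, w_x, w_h)
    return state


# Arbitrary toy weights: 3-dimensional word vectors, 2-dimensional state.
toy_w_x = [[0.5, -0.3, 0.1], [0.2, 0.4, -0.6]]
toy_w_h = [[0.1, 0.7], [-0.5, 0.3]]
toy_sentence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

final_state = run_rnn(toy_sentence, toy_w_x, toy_w_h)
reversed_state = run_rnn(toy_sentence[::-1], toy_w_x, toy_w_h)
```

The final state has a fixed size no matter how long the sentence is, and it depends on word order, so the same words in reverse produce a different state. A classifier layer would then be applied to that state.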
As for images: nobody stuffs a 4K photo into a network. The image is either cut into small pieces or downscaled. Let me remind you that CIFAR-10, one of the early image classification benchmarks, uses 32x32 pixel images. The VGG network, for example, takes 224x224 pixel images as input.
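Both options can be sketched on a toy "image" stored as a nested list of grayscale values. This is only an illustration: `split_into_tiles` and `downscale` are hypothetical helpers, the downscaling here is plain block averaging, and real pipelines would normally rely on an image library instead.

```python
def split_into_tiles(image, tile_size):
    """Cut the image into non-overlapping tile_size x tile_size pieces."""
    height, width = len(image), len(image[0])
    tiles = []
    for top in range(0, height - tile_size + 1, tile_size):
        for left in range(0, width - tile_size + 1, tile_size):
            tiles.append(
                [row[left:left + tile_size]
                 for row in image[top:top + tile_size]]
            )
    return tiles


def downscale(image, factor):
    """Shrink the image by averaging each factor x factor block."""
    height, width = len(image), len(image[0])
    return [
        [
            sum(
                image[top + dy][left + dx]
                for dy in range(factor)
                for dx in range(factor)
            ) / factor ** 2
            for left in range(0, width - factor + 1, factor)
        ]
        for top in range(0, height - factor + 1, factor)
    ]


# Toy 4x4 image with pixel values 0..15 (made up for illustration).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
tiles = split_into_tiles(toy_image, 2)
small = downscale(toy_image, 2)
```

Tiling keeps every pixel but gives the network several small inputs per photo; downscaling gives one small input but discards fine detail.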