Neural networks
DamskiyUgodnik, 2018-06-14 15:22:48

How to properly prepare data for network training?

Hello!
While studying neural networks I have accumulated a lot of questions whose answers probably require practical experience rather than theory.
Most articles describe standard examples (such as cat-vs-dog classification) or how neurons work internally, but there are practically no real examples of applied tasks that use your own data rather than ready-made datasets.
As a result, after reading such an article you are left with more questions than answers. So if anyone has practical experience they are willing to share, I would be very glad to hear answers to the following questions.

  1. How much training data is considered normal, and how can you tell that the problem is a lack of data rather than a badly constructed network?
  2. What logic is usually used to decide what exactly the images should contain? For example, if we want to determine whether the person in a picture is wearing glasses, should the training set consist of portrait photos of people wearing glasses, of faces with glasses, or just of cropped image regions containing the glasses?
  3. Following from the previous point: how do you correctly analyze which feature of the dataset the network has latched onto? It is quite possible, for example, that many people happen to have a similar nose shape and the network ends up detecting "correct noses" rather than glasses.
  4. How strongly does a difference in the amount of data per class affect classification? For example, if we "feed" the network 5,000 images of cats and 3,000 images of dogs and want to determine whether an image shows a cat or a dog?

Obviously all of this can be tested on your own, but I suspect there is no need to reinvent the wheel :)


3 answers
Therapyx, 2018-06-14
@Therapyx

1) It varies from case to case. You cannot say in advance that more data (say, 10,000 samples instead of 1,000) will necessarily give a better result. What people usually do is track how the quality metric changes as data is added and plot it, and the behaviour of that curve tells you what is going on. To answer your question specifically: imagine a graph where quality keeps improving as the amount of data grows, and the same steady improvement is still visible at the right end of the graph. From that you can conclude that adding even more data should improve the result (a minimal sketch of such a learning curve follows below).
More data is not always better, though; sometimes the curves actually show deterioration.
2) Glasses alone are enough, but photos with people are better, because other elements (nose, eyes, mouth and so on) will then be taken into account as well.
3) There are many open problems in ML, and this is one of them)) I have heard, though, that well-trained models for this already exist (although I doubt they are free).
4) It does not matter much; here we are really asking: is this a dog? Yes or no. If not, is it a cat? Yes or no. The better the models for cat and dog, the more accurate the results will be, but never expect a score of 1.00))
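
For illustration, a minimal sketch of the learning curve described in point 1, assuming scikit-learn; the digits dataset and logistic-regression classifier here are just placeholders, not something from this thread:

```python
# Learning-curve sketch: plot train/validation accuracy against training-set
# size to see whether adding more data is still helping.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5,
    scoring="accuracy",
)

plt.plot(sizes, train_scores.mean(axis=1), label="train")
plt.plot(sizes, val_scores.mean(axis=1), label="validation")
plt.xlabel("training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# If the validation curve is still rising at the right edge, more data is
# likely to help; if it has flattened (or drops), look at the model or the
# features instead of collecting more samples.
```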

ivodopyanov, 2018-06-14
@ivodopyanov

1. The point is not only the volume of the dataset but also its completeness. When the network is built incorrectly, it usually learns nothing at all. An insufficient layer size is easy to spot by simply increasing it and running the training again. As for volume, I saw somewhere, for text classification, that SVM gives the best results with sample sizes roughly from 2,000 to 50,000, and neural networks from 50,000 upward.
3. This is called the interpretability of the neural network/algorithm. There is some research in this direction (how to look at neuron activations and understand what the network is doing and why), but there do not seem to be any mature solutions yet. The best practical approach is a good test set that reveals clusters of errors (a small example of this kind of error analysis follows below).
4. It all depends heavily on the complexity of the specific task. There is no ready-made formula.
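
As a concrete illustration of "a good test set that shows error clusters" from point 3, a minimal sketch assuming scikit-learn; the random-forest model and the digits dataset are placeholders chosen only to make the example self-contained:

```python
# Error-analysis sketch: a confusion matrix on a held-out test set makes
# "error clusters" visible, e.g. one class being systematically confused
# with another.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
pred = model.predict(X_test)

# Rows = true class, columns = predicted class; large off-diagonal cells
# are the error clusters worth inspecting by hand (e.g. by looking at the
# misclassified images themselves).
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
```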

Artem, 2018-07-17
@artymail

Most articles describe standard examples (such as cat-vs-dog classification) or how neurons work internally, but there are practically no real examples of applied tasks that use your own data rather than ready-made datasets.

Start by reading books rather than copy-pasted articles. Begin at least with "Make Your Own Neural Network" by Tariq Rashid; it will answer some of your questions.
