What is the idea behind optimizing a network's optimization method?

Mathematics

salmon4788348, 2020-03-13 17:54:25

I am a beginner in the study of neural networks, and I am currently getting acquainted with the theoretical minimum, such as backpropagation. I have been given a problem to work on. Please help me understand what is being asked of me and in which direction to dig.

Problem: In most cases, optimization is done by iterating a transformation F(x)=y that takes an approximate solution x and returns a better approximation y (for example, when searching for the minimum of an energy E).
Let's assume that the function F is parameterized by a parameter w; for example, it is a neural network with weights w. We select the parameters w so that the rate of convergence is maximal. I.e., we optimize a network's optimization method using machine learning: a network learns a network.
To do this, you need to be able to calculate high-order derivatives of E, but fortunately for us, the problem is simple enough to do this.
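
One possible reading of the problem statement, as a minimal sketch: take a toy energy E(x) = 0.5*A*x^2 (an assumption, it is not given in the task), let the iterated map be F(x) = x - w*E'(x) with a single learnable parameter w, and tune w by gradient descent on the energy reached after a fixed number of iterations. The derivative of that final energy with respect to w contains E'', which is where the higher-order derivatives come from. All names here (A, steps, meta_lr, learn_step_size) are hypothetical.

```python
# Toy energy E(x) = 0.5 * A * x**2 (assumed for illustration only).
A = 4.0  # curvature, i.e. E''(x) = A

def energy(x):
    return 0.5 * A * x * x

def inner_step(x, w):
    # One iteration of F(x) = x - w * E'(x), with E'(x) = A * x.
    return x - w * A * x

def final_energy_and_grad(w, x0, steps):
    # Unroll the inner iteration and track dx/dw alongside x (chain rule).
    x, dx_dw = x0, 0.0
    for _ in range(steps):
        # x_new = x - w*A*x  =>  d(x_new)/dw = dx_dw*(1 - w*A) - A*x
        # The factor A is the second derivative E''.
        dx_dw = dx_dw * (1.0 - w * A) - A * x
        x = inner_step(x, w)
    # dE/dw = E'(x) * dx/dw
    return energy(x), A * x * dx_dw

def learn_step_size(w=0.05, x0=1.0, steps=3, meta_lr=0.01, meta_iters=200):
    # Outer loop: gradient descent on w to lower the energy after `steps` inner iterations.
    for _ in range(meta_iters):
        _, grad = final_energy_and_grad(w, x0, steps)
        w -= meta_lr * grad
    return w
```

Calling learn_step_size() moves w from its starting value toward 1/A, the step size at which this particular quadratic is minimized in a single iteration. For a real network F the same unrolling would normally be done with automatic differentiation rather than by hand.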

2 answer(s)

dmshar, 2020-03-13
@dmshar

You have gone off track somewhere. You started out right, then you skidded.
At the training stage of a neural network, we actually minimize some function F(x). This minimization consists of selecting the parameters of this function, w.
There are different ways to select these parameters: from a simple brute-force enumeration of all their possible values (an absolutely inefficient way, of course) to methods based on the idea of gradient descent. Very roughly, this class of methods works as follows. Being at a certain point, the method calculates the values of the function under small changes in each parameter and tries to work out in which direction the parameters really need to be changed in order to move toward the optimum point. Please note that here we minimize over w, not over x. And there is no question of any "network learning a network".
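
The idea described above can be sketched roughly like this. It is only an illustration: the model y = w*x, the mean squared error loss, and the names loss, numeric_grad and fit are all assumptions made for the example. The slope is estimated by probing the loss with a small change in w, and w is then moved against that slope.

```python
def loss(w, xs, ys):
    # Mean squared error of the model y = w * x over the training pairs.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def numeric_grad(w, xs, ys, eps=1e-6):
    # Probe the loss with a small change in w (central difference).
    return (loss(w + eps, xs, ys) - loss(w - eps, xs, ys)) / (2 * eps)

def fit(xs, ys, w=0.0, lr=0.05, iters=200):
    for _ in range(iters):
        w -= lr * numeric_grad(w, xs, ys)  # step against the slope
    return w

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2 * x
w_fit = fit(xs, ys)   # approaches 2
```

Note that x and y stay fixed throughout; only w changes.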
If this search (gradient descent) is done exactly as I described, finding the solution can take a long time, the search may "jump" over the optimum point, and other situations may arise that at best lengthen the search for the optimum and sometimes make it impossible altogether. Various more advanced methods try to get around these situations.
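
The "jump" over the optimum is easy to see on a toy example. As a sketch, assume the loss L(w) = (w - 2)^2 and three hypothetical step sizes. The function descend and its parameters are made up for the illustration.

```python
def descend(lr, w=0.0, iters=20):
    # Minimize L(w) = (w - 2)**2, whose gradient is 2 * (w - 2).
    path = [w]
    for _ in range(iters):
        w -= lr * 2.0 * (w - 2.0)
        path.append(w)
    return path

small = descend(lr=0.1)      # creeps toward w = 2 from one side
large = descend(lr=0.9)      # jumps across w = 2 on every step, but the error still shrinks
too_large = descend(lr=1.1)  # every jump lands farther from w = 2: the search diverges
```

The only thing that differs between the three runs is the step size, which is why more advanced methods put so much effort into choosing or adapting it.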
Backpropagation is just a way of "transferring backward" the error: from the error measured at the output of the network being trained to the parameters w being selected.
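
As a minimal sketch of that backward transfer, assume a tiny network with one hidden unit, y_hat = w2 * tanh(w1 * x), and the loss 0.5 * (y_hat - y)^2. The architecture and the name forward_backward are hypothetical, chosen only to keep the chain rule visible.

```python
import math

def forward_backward(w1, w2, x, y):
    # Forward pass: h = tanh(w1 * x), y_hat = w2 * h, loss = 0.5 * (y_hat - y)**2
    h = math.tanh(w1 * x)
    y_hat = w2 * h
    loss = 0.5 * (y_hat - y) ** 2
    # Backward pass: start from the output error and apply the chain rule layer by layer.
    d_yhat = y_hat - y              # dL/dy_hat
    d_w2 = d_yhat * h               # dL/dw2
    d_h = d_yhat * w2               # error passed back to the hidden unit
    d_w1 = d_h * (1.0 - h * h) * x  # dL/dw1, using tanh'(z) = 1 - tanh(z)**2
    return loss, d_w1, d_w2
```

The returned d_w1 and d_w2 are exactly the slopes that gradient descent needs, obtained in one backward sweep instead of by probing each parameter separately.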
Again, there is no talk of "optimizing the network optimization method".
The fact that you asked such a question is very good. The bad thing is that such a distorted view occurs quite often, as a rule among those who try to "bite into" neural networks right away instead of going the normal way: figuring out what optimization is, how it is implemented numerically, how it is applied, and only after all this, how it is used in neural networks. Alas, these are the costs of trying to cheat the normal path of (self-)education in the field of Machine Learning.

Griboks, 2020-03-13
@Griboks

Let's just represent the problem in matrix form.
Matrices X and Y are given.
There is some function F with parameter matrix W that transforms X->Y.
In the simplest case, the function F(X,W)=X*W=Y.
It would be nice to obtain a weight vector w from the weight matrix W, so that an arbitrary vector x can be transformed into y for a given dataset.
This raises the question of the transformation W->w. That is exactly what the learning function H(W)=w does.
It is usually defined inductively: at step zero an initial weight vector w is chosen, then -k*dL(w*x, y)/dw is added to it at every iteration, where k is the learning rate and L(a,b) is the loss function between a and b.
We get H(W)={H[0]=w[0]; H[i]=w[i-1]-k*dL(w[i-1]*x[i], y[i])/dw}.
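
The recursion above can be sketched as a plain loop. This is only an illustration under assumptions: the weight is a single number, the loss is taken to be L(a,b) = (a - b)^2, and the loss term in the update is read as the derivative of L with respect to w (the usual gradient step). The name train and its parameters are hypothetical.

```python
def train(xs, ys, w0=0.0, k=0.05, epochs=50):
    w = w0                                # H[0] = w[0]
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            grad = 2.0 * (w * x - y) * x  # d/dw of L(w*x, y) = (w*x - y)**2
            w = w - k * grad              # H[i] = w[i-1] - k * grad
    return w

xs = [1.0, 2.0, 3.0]
ys = [3.0, 6.0, 9.0]  # generated by y = 3 * x
w_learned = train(xs, ys)
```

Everything that can be tuned to make this converge faster (k, the form of the update, the order of the samples) lives inside train, that is, inside H.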
Your task is to transform the function H(W) so that convergence is maximal (whatever that means). I think this refers to the speed of convergence: the number of training iterations (steps), or the algorithmic complexity in operations.