Machine learning
Prizm, 2021-04-03 00:13:29

How to train with reinforcement learning?

I tried to write an algorithm for training a bot that evaluates the possible future game states, picks the best one, and "moves" into it.
The algorithm looks something like this:
1. Play N games, making a random (exploratory) move with probability eps.
2. Record each game as two sequences of states, one per player: the even-indexed states go into one sequence, the odd-indexed into the other. (I also "flip" the final state to the opponent's perspective and append it to the list opposite the one holding the ordinary final state.)
3. For each state, compute the target value v'[i] = v[i] + alpha*(r[i] + gamma*v[i+1] - v[i]), where v[i] is the estimate of state i produced by the neural network built into the bot, and r[i] is the reward for that state.
4. Train the network as an ordinary feed-forward model using a small number (about 10) of gradient-descent iterations on the loss (v - v')^2.
5. Clear the list of recorded games.
6. Repeat many times.

For a Nim-style "sticks" game (there are N sticks; on your turn you take 1, 2, or 3 and pass the move to your opponent; whoever takes the last stick loses), this algorithm worked quite reasonably. But when I tried to apply the same logic to tic-tac-toe, it didn't work. As models I tried perceptrons with one or two hidden layers (each input is +1, -1, or 0 depending on the cell), among others. As the non-linearity I used a leaky ReLU with extra "bends" at x = ±1, since the network should output values close to ±1. After a truly huge number of games the bot does start drawing when playing against me, but it never finishes learning to the point where it draws five games in a row.
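To make the input scheme and the "flipping" from step 2 concrete, here is a small sketch of how the tic-tac-toe board could be encoded for the network; the enum, names, and the negation-based flip are my illustrative assumptions:

```cpp
#include <array>

// +1 for the current player's mark, -1 for the opponent's, 0 for an empty cell,
// matching the +1/-1/0 input scheme described in the question.
enum Cell { Empty = 0, Mine = 1, Theirs = -1 };

// Encode a 3x3 board as the 9 network inputs.
std::array<float, 9> encode(const std::array<Cell, 9>& board) {
    std::array<float, 9> x{};
    for (std::size_t i = 0; i < 9; ++i)
        x[i] = static_cast<float>(board[i]);
    return x;
}

// "Flip" a state to the opponent's perspective (step 2's turned-over final
// state): every mark simply changes sign.
std::array<Cell, 9> flip(const std::array<Cell, 9>& board) {
    std::array<Cell, 9> out{};
    for (std::size_t i = 0; i < 9; ++i)
        out[i] = static_cast<Cell>(-static_cast<int>(board[i]));
    return out;
}
```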

What could be the problem? The algorithm itself? Badly chosen training hyperparameters? The network architecture?

PS I'm writing all of this in C++, so pseudocode works best for examples.


1 answer
imageman, 2021-08-11
@PrizmMARgh

1. The activation function must always be non-linear (so a leaky ReLU has to be a broken line, never a simple straight line).
2. Different layers can use different activations: leaky ReLU in the hidden layers, a sigmoid in the output layer.
3. One hidden layer is not enough; try at least 2-3.
4. Try different learning rates (e.g. 0.001).
5. The number of gradient-descent iterations can and should be increased.
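A minimal sketch of the two activations suggested in points 1 and 2, assuming a standard leaky ReLU slope of 0.01 (the slope value is a common default, not something the answer prescribes):

```cpp
#include <cmath>

// Leaky ReLU for hidden layers: a broken line with slope 1 above zero and a
// small non-zero slope below, so the function is non-linear but never "dead".
double leakyRelu(double x) {
    return x > 0.0 ? x : 0.01 * x;
}

// Sigmoid for the output layer: squashes the result into (0, 1).
double sigmoid(double x) {
    return 1.0 / (1.0 + std::exp(-x));
}
```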
Do you really need to write all of this yourself in C++? It may be better to use a ready-made learning library (PyTorch, TensorFlow). The ONNX project https://ru.wikipedia.org/wiki/ONNX lets you export models between frameworks (though I haven't tried it myself yet). Also, nothing stops you from calling a Python script from a C++ program (as an external program).
