Python
Oleg Petrov, 2018-10-02 00:57:14

How does reinforcement learning save the optimal strategy it has found?

I am continuing to analyze the code of the program https://github.com/Smeilz/Tic-Tac-Toe-Reinforcemen...
1) The program simulates 200,000 games of 3x3 tic-tac-toe between 2 opponents
2) It saves the resulting strategy to a file using pickle
3) You can then play against the trained strategy, which is loaded again with pickle
When I print the saved object, a huge text dump is displayed:

'X', '0', '0', 'X', 'X', '0'), 3): 1.0, (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 3): 1.203194073499, (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 4): 0.97, (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 5): 1.0, (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 6): 1.0, (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 7): 1.8822040593129998, (('X', 'X', ' ', '0', 'X', ' ', ' ', '0', '0'), 3): 0.92401, (('X', 'X', ' ', '0', 'X', ' ', ' ', '0', '0'), 6): 0.43899999999999995, (('X', 'X', ' ', '0', 'X', ' ', ' ', '0', '0'), 7): 1.8999999669669685, (('X', 'X', ' ', '0', 'X', ' ', '0', '0', '0'), 3): 1.0, (('X', 'X', ' ', '0', 'X', ' ', '0', '0', '0'), 6): 1.0, (('0', ' ', '0', ' ', 'X', ' ', 'X', ' ', ' '), 2): 1.899999952809955, (('0', ' ', '0', ' ', 'X', ' ', 'X', ' ', ' '), 4): 0.707281, (('0', ' ', '0', ' ', 'X', ' ', 'X', ' ', ' '), 6): 1.6262611862579543, .............

upd: The save format is this:
(Situation1 on the board, cell number to move to this round): Result1 of the Q-function, (Situation2 on the board, cell number to move to this round): Result2 of the Q-function, etc.
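Judging by the printed dump, the saved object seems to be nothing more than a plain Python dict keyed by (board tuple, cell number) with the accumulated Q-value as the value, serialized with pickle. A minimal sketch of that idea (the file name and variable names below are my assumptions, not taken from the repository):

import pickle

# Assumed structure, inferred from the printed dump:
#   key   = (board as a tuple of 9 cells, cell number to move to)
#   value = accumulated Q-value for making that move in that state
q_table = {
    (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 7): 1.8822040593129998,
    (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 4): 0.97,
}

# Save the learned strategy to disk (hypothetical file name)
with open('policy.pkl', 'wb') as f:
    pickle.dump(q_table, f)

# Load it back for the next session
with open('policy.pkl', 'rb') as f:
    q_table = pickle.load(f)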
Please explain: if Q-learning is essentially a function that accumulates the utility of the player's actions, how can the agent's saved optimal strategy be recreated for participation in subsequent games?
With a neural network everything is clear - we save the weights and then recreate the network for another data set.
But what about the result of training with Q-learning? Does it save the entire chain of games for all positions and, in the end, simply compare all possible continuations from any point in the game, then select only the continuation whose utility (the Q-function value) is maximal?
Have I understood the logic of the program correctly?
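If I read it right, at play time the agent recreates its strategy from the table by looking up the Q-value of every legal move in the current position and choosing the move with the maximum value. A minimal sketch of that greedy lookup, assuming the dict structure above; the function name and the 0-based cell numbering are mine, and the repository may number cells differently:

def best_move(q_table, board, default=0.0):
    """Greedy policy: among the free cells, pick the one with the highest stored Q-value."""
    free_cells = [i for i, cell in enumerate(board) if cell == ' ']
    # (state, action) pairs never seen during training fall back to a default value
    return max(free_cells, key=lambda a: q_table.get((board, a), default))

# Illustrative values only, not taken from the real training run
q_table = {
    (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 2): 1.2,
    (('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0'), 6): 1.88,
}
board = ('X', 'X', ' ', ' ', ' ', ' ', ' ', '0', '0')
print(best_move(q_table, board))  # -> 6, the highest-valued free cell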
But what happens if the game is harder than this? The list of possible moves there will be simply huge.
For example, chess. Does it turn out that Q-learning is impossible to apply to chess? And there are games orders of magnitude more complex still, such as poker. It would take years to save all the combinations of cards.
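To put rough numbers on that intuition: the table works for 3x3 tic-tac-toe only because the state space is tiny, while for chess it cannot even be enumerated; that is why large games replace the lookup table with a function approximator (for example a neural network, which is exactly the "save the weights" case above). A back-of-the-envelope sketch:

# Upper bound on raw board configurations in 3x3 tic-tac-toe:
# each of the 9 cells is 'X', '0' or empty.
tic_tac_toe_states = 3 ** 9
print(tic_tac_toe_states)  # 19683 -- small enough for a lookup table

# Chess, by contrast, is estimated to have on the order of 10**43 legal
# positions (and roughly 10**120 possible games), so a (state, action)
# table is out of the question there.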



1 answer
xmoonlight, 2018-10-02
@Smeilz1

With a neural network everything is clear - we save the weights and then recreate the network for another data set.
But what about the result of training with Q-learning?
learn Q-learning
Slides
PS: The code is clearer and more correct here!
