
Python plays CartPole (DQN)


Jun 01, 2021



This article is reproduced from the Zhihu personal column of Charles (Bai Lu).


Lead

In this article we use Python to build a simple deep reinforcement learning network (DQN) to play the small game CartPole...

This is a simple example from the official PyTorch tutorial.

It is still quite a fun exercise.

Let's get straight to the point.

The article is fairly long, so be prepared.


Related documents

Baidu Netdisk download link: https://pan.baidu.com/s/1G-Z_Mpdd4aXnsTqNKvZbZw

Password: xf7x

References

Official English tutorial link:

http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html

In addition:

Readers who find the English documentation hard going need not worry: I have translated this tutorial into Chinese and included it in the related documents above.

I have also shared it on my WeChat public account "Charles's Pikachu", under the bottom menu bar: "Information Sharing" → "Consolidated Summary" (as Reinforcement Learning Example 1).

There is also a PyTorch 60-minute quick start tutorial translated from the official documentation, as well as my own introductory reinforcement learning tutorial compiled from various lectures and tutorials.


Development tools

System: Windows 10

Python version: 3.6.4

Related modules:

gym module;

numpy module;

matplotlib module;

PIL module;

torch module;

torchvision module;

and some of Python's built-in modules.

The PyTorch version is:

0.3.0


Environment construction

Install Python, add it to the PATH environment variable, and use pip to install the required modules.

Additional notes:

For the time being, PyTorch does not support direct installation with pip.

There are two options:

(1) Install Anaconda3 first, and then install PyTorch inside the Anaconda3 environment (there a direct pip installation works);

(2) Install from a pre-compiled whl file; the download link is:

https://pan.baidu.com/s/1dF6ayLr#list/path=%2Fpytorch
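Once everything is installed, a quick sanity check like the following can be run (a minimal sketch; the modules are the ones listed above, and the versions on your machine may well differ from the 0.3.0 used in this article):

```python
# Quick sanity check that the required modules are importable.
# Version numbers on your machine may differ from those used in the article.
import torch
import torchvision
import gym
import numpy
import matplotlib
import PIL

print("PyTorch version:", torch.__version__)
print("gym version:", gym.__version__)
print("CUDA available:", torch.cuda.is_available())
```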


Introduction to the principles

(1) Reinforcement learning

To make things easier to understand, I decided to start with reinforcement learning itself; of course, only the main ideas, without an in-depth discussion.

In AI, we typically use an Agent to represent an object that is capable of acting, such as the self-driving cars that have been so popular recently. The problem that reinforcement learning considers, then, is the task of interaction between the Agent and the Environment. Let's say we have an AI Pikachu:

[Figure 1: AI Pikachu]

Now we want this Pikachu to pick up the rice ball in the upper left corner. The objects around Pikachu, including the rice ball, constitute the Environment. Pikachu perceives the environment through external sensors such as a camera (assume Pikachu's eyes are a pair of cameras), and then Pikachu needs to output a series of actions to pick up the rice ball.

Of course you can get Pikachu to do other things.

However, whatever the task, it involves a series of Actions, Observations, and feedback values called Rewards.

The so-called Reward works like this: when the Agent performs an action and interacts with the environment, the environment changes, and whether that change is good or bad is represented by the Reward. For example, in the example above:

If Pikachu gets closer to the rice balls, then Reward should be positive, otherwise it should be negative.

The word Observation is used here instead of Environment because the Agent does not necessarily obtain all the information about the environment; for example, Pikachu's camera can only capture a picture from a certain angle. Therefore, Observation is used to represent the perceived information the Agent obtains. In fact, the same is true of human interaction with the environment.

At each point in time, the Agent selects an action to execute from the set of available actions A. This action set can be continuous or discrete, and its size directly affects how difficult the whole task is to solve.

Knowing the whole process, the goal of the task becomes clear: to obtain as much Reward as possible. Without a goal, control is impossible, so the Reward serves as a quantitative criterion: the more Reward, the better the performance. At each time step, the Agent determines its next move based on the current observation, and each observation acts as the Agent's state. There is therefore a mapping from state to action: a state can correspond to one action, or to a probability distribution over different actions. This mapping from states to actions is called a Policy. Of course, the mapping can also take a whole preceding sequence of states and actions as its input.

In summary, the task of reinforcement learning is to find an optimal policy that maximizes the reward.

Of course, we do not know the optimal policy at first, so we often start with a random policy and use it to collect a range of states, actions, and feedback, that is, a series of samples. A reinforcement learning algorithm then improves the policy based on these samples, so that the samples obtained later are better. It is precisely this property of making the reward better and better that gives reinforcement learning its name.
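To make this concrete, here is a minimal sketch of that idea using the CartPole environment from gym: a random policy is rolled out and the resulting (state, action, reward, next state) samples are collected. It is written against the older gym API of the article's era, where step() returns four values; newer gym/gymnasium versions differ slightly.

```python
# A random policy interacting with the environment and collecting samples,
# as described above.
import gym

env = gym.make('CartPole-v0')
samples = []

for episode in range(5):
    state = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()                  # random policy
        next_state, reward, done, info = env.step(action)   # environment feedback
        samples.append((state, action, reward, next_state))
        state = next_state

print("Collected", len(samples), "transitions")
env.close()
```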

(2)Q-Learning

Since reinforcement learning samples form a time series, the Markov decision process is introduced to model the reinforcement learning problem.

Simply put, we can assume that:

the Agent's next state depends only on the current state and the current action. Note that the state here is the fully observable state of the environment. That is, given an initial state, the successor states are fully determined (the vast majority of reinforcement learning problems can be modeled as Markov decision processes).

Of course, real environments are generally not fully observable and involve some randomness, in which case the state can only be estimated.

Since a state corresponds to an action (or a probability distribution over actions), and given an action the next state is determined, each state can be described by a definite value. From this you can tell whether a state is good or bad. For example, in the earlier example, Pikachu being near the upper left corner is clearly a good state, while being in the lower right corner is a bad one. So whether a state is good or bad is actually equivalent to the expectation of future returns. We therefore introduce the Return to represent the cumulative reward that follows from a state at a given time. In addition, we introduce the concept of a value function, which represents the potential future value of a state; intuitively, Pikachu looks at the upper left corner, senses that the rice ball is there, and so assigns the upper left corner a high value. In other words, the value function is the expectation of the return. Having defined the value function, the next question is how to compute it.

I have decided to skim over this part rather than scare people with a pile of formulas.

In fact, solving for the value function only requires a simple derivation from its definition; in the end it can be shown that the value function can be computed by iteration.

Considering that each state offers a variety of actions to choose from, and the successor state differs for each action, we are more interested in the value of each action in a given state, that is, in choosing the best action to execute according to its value. This is the action-value function Q. Note that the reward here is the reward obtained after executing a specific action, whereas the earlier state value was the expected reward of a state, i.e. averaged over multiple actions.

Obviously, what we are after now is the optimal action-value function. This can be solved using value iteration, the core idea of which is to update the current Q value each time based on the newly obtained reward and the previous Q value. Q-Learning is based entirely on this idea of value iteration.

In practice, however, we cannot traverse all states and all actions; we can only obtain a limited number of samples. Q-Learning therefore uses a method similar to gradient descent to reduce the value estimation error: each update takes a small step toward the target, and eventually the estimates converge to the optimal Q values.
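As a minimal illustration of that update (a tabular sketch with illustrative names; alpha is the step size and gamma the discount factor; in DQN the table below is replaced by a neural network):

```python
# Tabular Q-Learning update: nudge the current Q value a small step toward
# the target reward + gamma * max_a' Q(next_state, a').
from collections import defaultdict

alpha, gamma = 0.1, 0.99                  # step size and discount factor
Q = defaultdict(float)                    # Q[(state, action)] -> estimated value

def q_update(state, action, reward, next_state, actions):
    best_next = max(Q[(next_state, a)] for a in actions)   # best successor value
    target = reward + gamma * best_next                    # the "goal" of this step
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```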

(3)DQN

To put it bluntly, DQN simply uses a deep neural network as the Q-value function.
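For illustration only, here is a minimal sketch of such a Q network in PyTorch, mapping the 4-dimensional CartPole state to one Q value per action. Note that the official tutorial instead feeds rendered screen images into a small convolutional network, and this sketch uses a more recent PyTorch API than the 0.3.0 mentioned above; the layer sizes are arbitrary.

```python
# A deep network used as the Q-value function: input a state, output one
# Q value per action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=4, num_actions=2, hidden=64):
        super(QNetwork, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),   # one Q value per action
        )

    def forward(self, state):
        return self.net(state)

q_net = QNetwork()
print(q_net(torch.randn(1, 4)))               # Q values for both CartPole actions
```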

(4) Finally, take a look at our CartPole mini-game

[Figure 2: The CartPole game]

We need to control the movement of the cart at the bottom so that the pole attached to it stays upright. The task is simplified to just two discrete actions: applying a force to the left or to the right.

If the pole tilts too far, or the cart moves out of a certain range, the game is over.

For the specific modeling and implementation process, please refer to the official documentation I translated.

Here I will only talk about the main ideas.

At first, we know nothing; that is, the parameters of the DQN network are completely random. Practice is the only criterion for testing truth, and we need practice, so we choose the next action based on the current state, and the action is chosen in one of two ways:

(1) Use the DQN network to select an action (of course, at this point the DQN network is of little use, because all of its parameters are randomly initialized);

(2) Randomly select an action.

Why select randomly? Simply put, to explore the unknown: if we always follow the model and never try anything new, we will clearly never make progress. A sketch of this selection rule is shown below.
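Here is a minimal sketch of this epsilon-greedy choice, assuming the illustrative q_net from the earlier sketch (the official tutorial additionally decays epsilon over time):

```python
# With probability epsilon pick a random action (explore),
# otherwise pick the action with the highest predicted Q value (exploit).
import random
import torch

def select_action(state, q_net, epsilon=0.1, num_actions=2):
    if random.random() < epsilon:
        return random.randrange(num_actions)                  # (2) random exploration
    with torch.no_grad():
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        return int(q_net(state_t).argmax(dim=1).item())       # (1) use the DQN network
```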

Once we have selected an action, we get feedback from the environment, and at the same time we enter a new state.

If the new state does not end the game, we keep going.

If the game is over, we start a new episode.

Throughout this process, the DQN network is continually updated, and the goal of the updates is of course to make the actions taken in different states yield the best possible reward. A rough sketch of such an update step is given below.
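The following is a rough sketch of one update step in the spirit of the official tutorial (which samples stored transitions from a replay memory and uses the Huber loss); the batch handling and terminal-state details are simplified here, and refinements such as a separate target network are omitted. q_net, optimizer, and the batch layout are assumptions for illustration.

```python
# One gradient step pushing Q(state, action) toward
# reward + gamma * max_a' Q(next_state, a').
import torch
import torch.nn.functional as F

def optimize(q_net, optimizer, batch, gamma=0.999):
    states, actions, rewards, next_states, dones = batch   # batched tensors

    # Q values currently predicted for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: reward plus discounted best next Q value (zero if the episode ended)
    with torch.no_grad():
        next_q = q_net(next_states).max(dim=1)[0]
    targets = rewards + gamma * next_q * (1 - dones)

    loss = F.smooth_l1_loss(q_values, targets)   # Huber loss, as in the tutorial
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```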

In this way, apart from the very first choices where the DQN is picking blindly, later choices are grounded in some accumulated experience.

Acting again and again, our model eventually becomes better and better; after all, failure is the mother of success, and learning from mistakes matters.

That's all.


Usage demo

Ordinary CartPole (i.e. completely random actions):



We can see that it behaves erratically and the game is over very quickly.

DQN plays CartPole (see website):

At the beginning of training:


After a period of training:


This is the source code I wrote following the official tutorial, with fairly detailed comments added.

We can see that as the number of training episodes increases, so does the survival time.

In addition:

The download also includes source code from a YouTuber who streamlined the official tutorial's code, for those who need it.

More

T_T Some of the content in this article comes from DQN introductory material I compiled myself (the information is mostly drawn from lectures and tutorials by experts in related fields abroad).

So if anything is wrong, feel free to leave me a message and I will correct it.