CN108549237B - Preset control humanoid robot gait planning method based on deep reinforcement learning - Google Patents

Preset control humanoid robot gait planning method based on deep reinforcement learning Download PDF

Info

Publication number
CN108549237B
CN108549237B (application CN201810465382.3A)
Authority
CN
China
Prior art keywords
humanoid robot
output
reinforcement learning
deep reinforcement
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810465382.3A
Other languages
Chinese (zh)
Other versions
CN108549237A (en)
Inventor
毕盛
刘云达
董敏
张英杰
闵华清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810465382.3A priority Critical patent/CN108549237B/en
Publication of CN108549237A publication Critical patent/CN108549237A/en
Application granted granted Critical
Publication of CN108549237B publication Critical patent/CN108549237B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a preview control humanoid robot gait planning method based on deep reinforcement learning, which comprises the following steps: 1) acquiring state information through sensors mounted on the humanoid robot; 2) improving the existing deep reinforcement learning network and defining new states, action vectors and a reward function; 3) correcting the output of the preview controller with the defined action vector, calculating the angles of the steering engines (servos) of the legs of the humanoid robot, and guiding the humanoid robot to walk; 4) during walking of the humanoid robot, updating the improved deep reinforcement learning network with the values of the state, the action vector and the reward function. The method can effectively solve the walking problem of the humanoid robot in complex environments; tests carried out on a simulation platform and on a physical robot verify its effectiveness.

Description

Preset control humanoid robot gait planning method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of humanoid robots, and in particular to a preview control humanoid robot gait planning method based on deep reinforcement learning.
Background
A basic function of a humanoid robot is stable walking. However, because the structure of a humanoid robot is complex, the coupling between its parts is strong and the independence of its modules is poor, stable walking is difficult to achieve, and gait control and planning of humanoid robots has therefore become a research hotspot in the field. Conventional gait control methods can be broadly divided into two categories: methods based on modern control theory and methods based on walking mechanisms. Most of these methods, however, are dated and poorly suited to today's increasingly complex models and mechanisms. Recently, a variety of machine learning methods have been proposed and refined, stimulating the development of dynamic gait control. Compared with traditional control theory, machine-learning-based methods do not require a large amount of prior knowledge about a complex model, are easy to implement, and can reach a level comparable to traditional control theory.
Deep reinforcement learning methods have proven effective in complex control problems. Through learning, they compensate for a designer's incomplete knowledge of the system dynamics and may provide complete solutions beyond the designer's field of knowledge. At the same time, such methods can learn and improve continuously, adapting to complex environments.
Disclosure of Invention
The invention mainly studies the gait planning of a humanoid robot walking in complex ground environments. Aiming at the problem that existing control theory cannot effectively solve the walking problem of the humanoid robot in complex environments, a preview control humanoid robot gait planning method based on deep reinforcement learning is proposed. The method can effectively solve this walking problem, and tests carried out on a simulation platform and on a physical robot verify its effectiveness.
In order to achieve this purpose, the technical scheme provided by the invention is as follows: the preview control humanoid robot gait planning method based on deep reinforcement learning comprises the following steps:
1) acquiring state information through sensors mounted on the humanoid robot;
2) improving the existing deep reinforcement learning network and defining new states, action vectors and a reward function;
3) correcting the output of the preview controller with the defined action vector, calculating the angles of the steering engines of the legs of the humanoid robot, and guiding the humanoid robot to walk;
4) during walking of the humanoid robot, updating the improved deep reinforcement learning network with the values of the state, the action vector and the reward function.
In step 1), state information is acquired through sensors mounted on the humanoid robot. The walking stability of the humanoid robot is mainly affected by the pitch-direction steering engines (servos) on the supporting foot, so the defined state information includes the supporting-foot information and the angles of the pitch steering engines on the supporting foot. In addition, acceleration and angular velocity values are needed to judge the stability of the humanoid robot during walking, so that the offline gait can be adjusted in real time to adapt to uneven terrain. The state is:
[α, ω, θ_lhip, θ_rhip, θ_lankle, θ_rankle]
where α is the square root of the sum of the squares of the accelerations of the humanoid robot in the x-axis and y-axis directions, ω is the square root of the sum of the squares of the angular velocities in the x-axis and y-axis directions, and θ_lhip, θ_rhip, θ_lankle, θ_rankle are the pitch-direction steering engine angles of the hip and ankle joints of the left and right legs.
In step 2), the improved deep reinforcement learning network adopts the deep deterministic policy gradient (DDPG) method, which specifically comprises the following steps:
2.1) definition of deep reinforcement learning related variables
The control output of the preview controller is compensated by a deep reinforcement learning method; to use deep reinforcement learning, the relevant variables, namely the state vector, the action vector and the reward function, need to be defined first;
the output of the preview controller is a two-dimensional vector corresponding to the output values of the centroid coordinates in the x-axis and y-axis directions, so the action of the deep reinforcement learning network is defined as:
a(t) = [Δμ_x, Δμ_y]
where Δμ_x and Δμ_y correspond to the output variations of the two dimensions of the preview controller;
considering the walking requirement of the humanoid robot and the desire that it remain stable while walking as far as possible, the reward function is defined as:
r(t) = 50 if the humanoid robot smoothly reaches the end point; r(t) = −50 if the humanoid robot falls during walking; otherwise r(t) is computed from the robot's current state, using r_α(t), r_ω(t) and x_dis defined below;
the square root of the sum of squares of the acceleration, r_α(t), is defined as:
r_α(t) = √(α_x(t)² + α_y(t)²)
where α_x(t) and α_y(t) are the accelerations of the humanoid robot in the x-axis and y-axis directions at time t;
the square root of the sum of squares of the angular velocities, r_ω(t), is defined as:
r_ω(t) = √(ω_x(t)² + ω_y(t)²)
where ω_x(t) and ω_y(t) are the angular velocities of the humanoid robot in the x-axis and y-axis directions at time t;
x_dis denotes the distance traveled by the humanoid robot;
2.2) construction of deep reinforcement learning network
When implementing DDPG, an Actor network and a Critic network need to be constructed and trained, wherein the Critic network is used to parameterize the action-value function, and the Actor network is used to guide the update of the policy function according to the value obtained by the Critic network. The specific structure of the Critic network is:
an input layer: s(t), the state input to the Q function at time t in Q-learning, with 9 dimensions in total;
hidden layers: there are 2 hidden layers; the first layer has 402 nodes, including 2 nodes representing the action; the second layer has 300 nodes; the activation function of each neuron is the rectified linear (ReLU) activation function, whose output is calculated as:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: Q(t), the output of the action-value function, with 1 dimension;
the specific structure of the Actor network is:
an input layer: s(t), the state input at time t, with 9 dimensions in total;
hidden layers: there are 2 hidden layers, the first with 400 nodes and the second with 300 nodes; the activation function of each neuron is the ReLU activation function:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: a(t), the output action value, with 2 dimensions;
the Critic and Actor networks are updated by the back-propagation (BP) algorithm with gradient descent, and the weight w_i of each neuron is updated as:
w_i ← w_i − η · ∂E/∂w_i
where w_i is the i-th weight, η is the learning rate, and E is the learning performance index of the corresponding network;
In step 3), the output of the preview controller is corrected with the correction computed by the improved deep reinforcement learning network, and the angles of all steering engines of the two legs of the humanoid robot are calculated from the corrected preview-controller output to guide the robot to walk. The theoretical emphasis of the conventional preview controller is to use future information for control, where the future information refers to the target ZMP reference values within the next N_p steps. If the current time is k, the poses of the two feet within the next N_p steps are obtained from the three-dimensional walking pattern, giving the target ZMP references within those N_p steps: ZMP*_{k+1}, …, ZMP*_{k+N_p}. These future target ZMP reference values are stored in a FIFO buffer whose output is used as the current reference value, and the preview controller computes the control output from the ZMP reference values in the FIFO buffer and the state of the humanoid robot:
u_k = −K_s · Σ_{i=0}^{k} (c·x_i − ZMP*_i) − K_x·x_k − [f_1, …, f_{N_p}] · [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T
where u_k is the controller output at time k; c, K_s, K_x and the preview gains f_1, …, f_{N_p} are the controller coefficients; x_k is the centroid state of the humanoid robot at time k; and [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T are the reference ZMP values from time k+1 to time k+N_p;
the correction Δu_k of the preview-control output value is trained through the deep reinforcement learning network:
u′_k = u_k + Δu_k
after the control input is obtained, the centroid coordinate at time k+1 is calculated using the standard preview-control state update
x_{k+1} = A·x_k + B·u′_k
where A and B are the discrete-time matrices of the cart-table (linear inverted pendulum) model;
from the centroid coordinate (x_{k+1}, y_{k+1}) at time k+1, the centroid pose G_cobpresent and the left- and right-foot poses G_lpresent and G_rpresent at time k+1 are obtained; finally, the steering engine angles of the legs of the humanoid robot are calculated according to the inverse kinematics principle, giving the steering engine angle of each leg joint at time k+1 and thereby guiding the walking of the humanoid robot.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method adopts the idea of deep reinforcement learning on the basis of the existing preview control theory and accelerates the convergence speed.
2. The method is simple and easy to implement. It can control the walking motion of the humanoid robot online and adjust the robot's gait in time, helping the humanoid robot walk stably on uneven ground, and therefore has practical significance and application value.
Drawings
FIG. 1 is a diagram of a Critic network architecture.
Fig. 2 is a diagram of an Actor network structure.
Fig. 3 is a preview control flowchart.
Fig. 4 is a flow chart of the preview control based on the deep reinforcement learning.
Fig. 5 is a graph showing the effect of the walking experiment.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The preview control humanoid robot gait planning method based on deep reinforcement learning provided by this embodiment is as follows:
1) acquisition of humanoid robot states
State information is acquired by sensors mounted on the humanoid robot. The walking stability of the humanoid robot is mainly affected by the pitch-direction steering engines on the supporting foot, so the defined state information should include the supporting-foot information and the angles of the pitch steering engines on the supporting foot. In addition, acceleration and angular velocity values are needed to determine the stability of the humanoid robot during walking; the offline gait is then adjusted in real time to adapt to uneven terrain. The state is:
[α, ω, θ_lhip, θ_rhip, θ_lankle, θ_rankle]
where α is the square root of the sum of the squares of the accelerations of the humanoid robot in the x-axis and y-axis directions, ω is the square root of the sum of the squares of the angular velocities in the x-axis and y-axis directions, and θ_lhip, θ_rhip, θ_lankle, θ_rankle are the pitch-direction steering engine angles of the hip and ankle joints of the left and right legs.
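As an illustration only (not part of the claimed method), the following Python sketch shows how such a state vector could be assembled from raw IMU and servo readings; the function and argument names are hypothetical.

```python
import math

def build_state(ax, ay, wx, wy, theta_lhip, theta_rhip, theta_lankle, theta_rankle):
    """Assemble [alpha, omega, theta_lhip, theta_rhip, theta_lankle, theta_rankle]."""
    alpha = math.sqrt(ax ** 2 + ay ** 2)  # resultant acceleration in the x-y plane
    omega = math.sqrt(wx ** 2 + wy ** 2)  # resultant angular velocity in the x-y plane
    # The supporting-foot information mentioned in the text would be appended to this
    # list to reach the 9-dimensional network input described later.
    return [alpha, omega, theta_lhip, theta_rhip, theta_lankle, theta_rankle]
```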
2.1) definition of deep reinforcement learning related variables
A walking-pattern generation method based only on the preview controller cannot guarantee stability for motions that are difficult to describe with such a simple model, for example large swings of the upper-body posture or of the arms, which cause a large discrepancy between the reference and actual ZMP values. It is therefore necessary to compensate the control output of the preview controller by deep reinforcement learning. The deep reinforcement learning method adopted in this embodiment is the Deep Deterministic Policy Gradient (DDPG) method, which can output continuous actions and performs better than comparable methods in complex scenarios.
To use deep reinforcement learning, the relevant variables, namely the state vector, the action vector and the reward function, need to be defined first. The state has already been described in step 1) above and is not repeated here.
The output of the preview controller is a two-dimensional vector corresponding to the output values of the centroid coordinates in the x-axis and y-axis directions. The action of the deep reinforcement learning network is therefore defined as:
a(t) = [Δμ_x, Δμ_y]
where Δμ_x and Δμ_y correspond to the output variations of the two dimensions of the preview controller.
Considering the walking requirement of the humanoid robot and the desire that it remain stable while walking as far as possible, the reward function is defined as:
r(t) = 50 if the humanoid robot smoothly reaches the end point; r(t) = −50 if the humanoid robot falls during walking; otherwise r(t) is computed from the robot's current state, using r_α(t), r_ω(t) and x_dis defined below.
The square root of the sum of squares of the acceleration, r_α(t), is defined as:
r_α(t) = √(α_x(t)² + α_y(t)²)
where α_x(t) and α_y(t) are the accelerations of the humanoid robot in the x-axis and y-axis directions at time t.
The square root of the sum of squares of the angular velocities, r_ω(t), is defined as:
r_ω(t) = √(ω_x(t)² + ω_y(t)²)
where ω_x(t) and ω_y(t) are the angular velocities of the humanoid robot in the x-axis and y-axis directions at time t.
x_dis denotes the distance traveled by the humanoid robot.
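The following Python sketch illustrates one way the reward described above could be computed. The terminal values (+50/−50) follow the text; the exact form of the intermediate return is not spelled out here, so combining r_α(t), r_ω(t) and x_dis in the final branch is an assumption made purely for illustration.

```python
import math

def r_alpha(ax_t, ay_t):
    """Square root of the sum of squares of the accelerations at time t."""
    return math.sqrt(ax_t ** 2 + ay_t ** 2)

def r_omega(wx_t, wy_t):
    """Square root of the sum of squares of the angular velocities at time t."""
    return math.sqrt(wx_t ** 2 + wy_t ** 2)

def reward(reached_goal, fallen, ax_t, ay_t, wx_t, wy_t, x_dis):
    if reached_goal:
        return 50.0
    if fallen:
        return -50.0
    # Assumed intermediate shaping: reward distance traveled, penalize body oscillation.
    return x_dis - r_alpha(ax_t, ay_t) - r_omega(wx_t, wy_t)
```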
2.2) construction of deep reinforcement learning network
When implementing DDPG, an Actor network and a Critic network need to be constructed and trained. The Critic network parameterizes the action-value function; the Actor network guides the update of the policy function according to the value obtained by the Critic network (an illustrative sketch of both networks follows the Actor structure below). As shown in Fig. 1, the specific structure of the Critic network is:
An input layer: s(t), the state input to the Q function at time t in Q-learning, with 9 dimensions in total;
hidden layers: there are 2 hidden layers; the first layer has 402 nodes, including 2 nodes representing the action; the second layer has 300 nodes. The activation function of each neuron is the rectified linear (ReLU) activation function, whose output is calculated as:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0.
An output layer: Q(t), the output of the action-value function, with 1 dimension.
As shown in Fig. 2, the specific structure of the Actor network is:
An input layer: s(t), the state input at time t, with 9 dimensions in total;
hidden layers: there are 2 hidden layers, the first with 400 nodes and the second with 300 nodes. The activation function of each neuron is the ReLU activation function:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0.
An output layer: a(t), the output action value, with 2 dimensions.
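A minimal PyTorch sketch of networks with the layer sizes described above is given below. It interprets the Critic's 402-node first hidden layer as 400 state units concatenated with the 2-dimensional action; this is an illustration under that assumption, not the patented implementation.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): 9-dim state, 400 (+2 action) -> 300 -> 1, ReLU activations."""
    def __init__(self, state_dim=9, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)          # state part of the first hidden layer
        self.fc2 = nn.Linear(400 + action_dim, 300)   # the 2 action nodes join here (402 in total)
        self.q_out = nn.Linear(300, 1)                # Q(t), 1 dimension

    def forward(self, state, action):
        h1 = torch.relu(self.fc1(state))
        h2 = torch.relu(self.fc2(torch.cat([h1, action], dim=-1)))
        return self.q_out(h2)

class Actor(nn.Module):
    """a(t) = [delta_mu_x, delta_mu_y]: 9 -> 400 -> 300 -> 2, ReLU activations."""
    def __init__(self, state_dim=9, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.a_out = nn.Linear(300, action_dim)

    def forward(self, state):
        h1 = torch.relu(self.fc1(state))
        h2 = torch.relu(self.fc2(h1))
        return self.a_out(h2)
```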
The Critic and Actor networks are updated by the back-propagation (BP) algorithm with gradient descent, and the weight w_i of each neuron is updated as:
w_i ← w_i − η · ∂E/∂w_i
where w_i is the i-th weight, η is the learning rate, and E is the learning performance index of the corresponding network.
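For illustration, this gradient-descent update corresponds to a plain SGD step in PyTorch. The sketch below reuses the Critic class from the previous sketch; E stands for whichever scalar performance index (for example the Critic's temporal-difference error) is being minimized, which is an assumption rather than a detail given in the text.

```python
import torch

critic = Critic()                       # from the previous sketch
eta = 1e-3                              # learning rate
optimizer = torch.optim.SGD(critic.parameters(), lr=eta)

def update_step(E):
    """One BP + gradient-descent step on a scalar performance index E."""
    optimizer.zero_grad()
    E.backward()      # back-propagation computes dE/dw_i for every weight
    optimizer.step()  # w_i <- w_i - eta * dE/dw_i
```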
3) The output of the preview controller is corrected with the correction computed by the improved deep reinforcement learning network, and the angles of all steering engines of the two legs of the humanoid robot are calculated from the corrected preview-controller output to guide the robot to walk.
The theoretical emphasis of a conventional preview controller is to use future information for control. In the present embodiment, the future information is the target ZMP reference values within the next N_p steps. Let the current time be k; the target ZMP references within the next N_p steps are ZMP*_{k+1}, …, ZMP*_{k+N_p}. These future target ZMP reference values are stored in a FIFO (first-in-first-out) buffer, the output of which serves as the current reference value. The preview controller calculates the control output from the ZMP reference values in the FIFO buffer and the state of the humanoid robot. The control output is:
u_k = −K_s · Σ_{i=0}^{k} (c·x_i − ZMP*_i) − K_x·x_k − [f_1, …, f_{N_p}] · [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T
where u_k is the controller output at time k; c, K_s, K_x and the preview gains f_1, …, f_{N_p} are the controller coefficients; x_k is the centroid state of the humanoid robot at time k; and [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T are the reference ZMP values from time k+1 to time k+N_p.
The correction Δu_k of the preview-control output value is trained through the deep reinforcement learning network, giving the corrected output u′_k:
u′_k = u_k + Δu_k
After the control input is obtained, the centroid coordinate at time k+1 can be calculated using the standard preview-control state update
x_{k+1} = A·x_k + B·u′_k
where A and B are the discrete-time matrices of the cart-table (linear inverted pendulum) model.
From the centroid coordinate (x_{k+1}, y_{k+1}) at time k+1, the centroid pose G_cobpresent and the left- and right-foot poses G_lpresent and G_rpresent at time k+1 are obtained. The steering engine angles of the two legs of the humanoid robot are then calculated according to the inverse kinematics principle, giving the steering engine angle of each leg joint at time k+1 and thereby guiding the walking of the humanoid robot; the specific flow is shown in Fig. 3.
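The following Python sketch illustrates one such corrected control step. Here preview_control, actor, inverse_kinematics and the matrices A, B are placeholders for the preview controller, the DDPG actor, a robot-specific inverse kinematics solver and the discretized cart-table dynamics; none of these names come from the patent, and the sketch is an assumption-laden illustration rather than the claimed implementation.

```python
def control_step(preview_control, actor, ddpg_state, x_state, y_state, A, B, inverse_kinematics):
    # A, B, x_state, y_state are assumed to be NumPy arrays (A: 3x3, B and states: 3-vectors).
    # 1) nominal preview-controller outputs u_k for the x and y axes
    u_x, u_y = preview_control(x_state, y_state)
    # 2) DDPG correction a(t) = [delta_mu_x, delta_mu_y] of the preview output
    delta_u_x, delta_u_y = actor(ddpg_state)
    u_x_corr, u_y_corr = u_x + delta_u_x, u_y + delta_u_y
    # 3) propagate the centroid state with the discretized cart-table dynamics
    x_next = A @ x_state + B * u_x_corr
    y_next = A @ y_state + B * u_y_corr
    # 4) steering engine (servo) angles at time k+1 from the new centroid position
    joint_angles = inverse_kinematics(x_next[0], y_next[0])
    return x_next, y_next, joint_angles
```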
During walking of the humanoid robot, for each output (u_x, u_y) of the preview controller, the current state is calculated and the deep reinforcement learning DDPG learns a group of correction quantities for the output, which are also used to update the deep reinforcement learning network. Meanwhile, the walking posture of the humanoid robot is calculated from the output of the preview controller. In summary, the algorithm steps are as follows (see Fig. 4; an illustrative sketch of the loop follows the list):
① Initialize the deep reinforcement learning DDPG framework and the preview controller;
② obtain the current state from the sensor information, and use the deep reinforcement learning DDPG to calculate a group of correction values for the preview controller;
③ add the correction values to the output of the preview controller, and guide the humanoid robot to walk according to the corrected output value and the inverse kinematics principle;
④ acquire the immediate return value of the current system, and update the deep reinforcement learning framework;
⑤ judge the current state of the humanoid robot: if the humanoid robot has fallen or has walked to the target location, end the loop; otherwise, jump to step ②.
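An illustrative Python sketch of this loop is given below. The names env, ddpg and preview_controller are hypothetical interfaces to the simulation or robot, the DDPG agent (actor, critic and replay buffer) and the preview controller, and reward() refers to the earlier reward sketch; none of them are prescribed by the patent.

```python
def walk_episode(env, ddpg, preview_controller, max_steps=1000):
    state = env.reset()                                    # step 1: initialize
    for _ in range(max_steps):
        delta_u = ddpg.act(state)                          # step 2: correction from DDPG
        u = preview_controller.output(env) + delta_u       # step 3: corrected control output
        next_state, fallen, reached_goal = env.apply(u)    #         walk via inverse kinematics
        ax, ay, wx, wy = env.imu()                         # accelerations and angular velocities
        r = reward(reached_goal, fallen, ax, ay, wx, wy, env.x_dis())  # step 4: immediate return
        ddpg.store_and_update(state, delta_u, r, next_state)           #         update DDPG
        if fallen or reached_goal:                         # step 5: termination check
            break
        state = next_state
```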
Fig. 5 shows the experimental walking effect of the humanoid robot.
The above-described embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; changes made according to the shape and principle of the present invention shall be covered by the protection scope of the present invention.

Claims (3)

1. A preview control humanoid robot gait planning method based on deep reinforcement learning, characterized by comprising the following steps:
1) acquiring state information through sensors mounted on the humanoid robot;
2) improving the existing deep reinforcement learning network and defining new states, action vectors and a reward function;
3) correcting the output of the preview controller with the defined action vector, calculating the angles of all steering engines of the two legs of the humanoid robot, and guiding the humanoid robot to walk, which specifically comprises:
correcting the output of the preview controller with the correction computed by the improved deep reinforcement learning network, and calculating the angles of all steering engines of the legs of the humanoid robot on the basis of the corrected preview controller to guide the robot to walk; the theoretical emphasis of the conventional preview controller is to use future information for control, where the future information refers to the target ZMP reference values within the next N_p steps; if the current time is k, the poses of the two feet within the next N_p steps are obtained from the three-dimensional walking pattern, giving the target ZMP references within those N_p steps: ZMP*_{k+1}, …, ZMP*_{k+N_p}; these future target ZMP reference values are then stored in a FIFO buffer whose output is used as the current reference value, and the control output is calculated from the ZMP reference values in the FIFO buffer and the state of the humanoid robot as:
u_k = −K_s · Σ_{i=0}^{k} (c·x_i − ZMP*_i) − K_x·x_k − [f_1, …, f_{N_p}] · [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T
wherein u_k is the controller output at time k; c, K_s, K_x and the preview gains f_1, …, f_{N_p} are the controller coefficients; x_k is the centroid state of the humanoid robot at time k; and [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T are the reference ZMP values from time k+1 to time k+N_p;
training the correction Δu_k of the preview-control output value through the deep reinforcement learning network:
u′_k = u_k + Δu_k
after the control input is obtained, calculating the centroid coordinate at time k+1 by the standard preview-control state update
x_{k+1} = A·x_k + B·u′_k
wherein A and B are the discrete-time matrices of the cart-table model;
using the centroid coordinate (x_{k+1}, y_{k+1}) at time k+1 to obtain the centroid pose G_cobpresent and the left- and right-foot poses G_lpresent and G_rpresent at time k+1; finally, calculating the steering engine angles of the two legs of the humanoid robot according to the inverse kinematics principle to obtain the steering engine angle of each joint of the two legs at time k+1, so as to guide the walking of the humanoid robot;
4) and in the walking process of the humanoid robot, updating the improved deep reinforcement learning network by using the values of the state, the motion vector and the reward function.
2. The preview control humanoid robot gait planning method based on deep reinforcement learning according to claim 1, characterized in that: in step 1), state information is acquired through sensors mounted on the humanoid robot; the walking stability of the humanoid robot is mainly affected by the pitch-direction steering engines on the supporting foot, so the defined state information includes the supporting-foot information and the angles of the pitch steering engines on the supporting foot; in addition, acceleration and angular velocity values are needed to judge the stability of the humanoid robot during walking, so that the offline gait can be adjusted in real time to adapt to uneven terrain; the state is:
[α, ω, θ_lhip, θ_rhip, θ_lankle, θ_rankle]
wherein α is the square root of the sum of the squares of the accelerations of the humanoid robot in the x-axis and y-axis directions, ω is the square root of the sum of the squares of the angular velocities in the x-axis and y-axis directions, and θ_lhip, θ_rhip, θ_lankle, θ_rankle are the pitch-direction steering engine angles of the hip and ankle joints of the left and right legs.
3. The preview control humanoid robot gait planning method based on deep reinforcement learning according to claim 1, characterized in that: in step 2), the improved deep reinforcement learning network adopts the deep deterministic policy gradient (DDPG) method, which specifically comprises the following steps:
2.1) definition of deep reinforcement learning related variables
compensating the control output of the preview controller by a deep reinforcement learning method, wherein the relevant variables, namely the state vector, the action vector and the reward function, need to be defined first;
the output of the preview controller is a two-dimensional vector corresponding to the output values of the centroid coordinates in the x-axis and y-axis directions, so the action of the deep reinforcement learning network is defined as:
a(t) = [Δμ_x, Δμ_y]
wherein Δμ_x and Δμ_y correspond to the output variations of the two dimensions of the preview controller;
considering the walking requirement of the humanoid robot and the desire that it remain stable while walking as far as possible, the reward function is defined as:
r(t) = 50 if the humanoid robot smoothly reaches the end point; r(t) = −50 if the humanoid robot falls during walking; otherwise r(t) is computed from the robot's current state, using r_α(t), r_ω(t) and x_dis defined below;
the square root of the sum of squares of the acceleration, r_α(t), is defined as:
r_α(t) = √(α_x(t)² + α_y(t)²)
wherein α_x(t) and α_y(t) are the accelerations of the humanoid robot in the x-axis and y-axis directions at time t;
the square root of the sum of squares of the angular velocities, r_ω(t), is defined as:
r_ω(t) = √(ω_x(t)² + ω_y(t)²)
wherein ω_x(t) and ω_y(t) are the angular velocities of the humanoid robot in the x-axis and y-axis directions at time t;
x_dis denotes the distance traveled by the humanoid robot;
2.2) construction of deep reinforcement learning network
when implementing DDPG, an Actor network and a Critic network need to be constructed and trained respectively, wherein the Critic network is used to parameterize the action-value function, and the Actor network is used to guide the update of the policy function according to the value obtained by the Critic network; the specific structure of the Critic network is:
an input layer: s(t), the state input to the Q function at time t in Q-learning, with 9 dimensions in total;
hidden layers: there are 2 hidden layers; the first layer has 402 nodes, including 2 nodes representing the action; the second layer has 300 nodes; the activation function of each neuron is the rectified linear (ReLU) activation function, whose output is calculated as:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: Q(t), the output of the action-value function, with 1 dimension;
the specific structure of the Actor network is:
an input layer: s(t), the state input at time t, with 9 dimensions in total;
hidden layers: there are 2 hidden layers, the first with 400 nodes and the second with 300 nodes; the activation function of each neuron is the ReLU activation function:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: a(t), the output action value, with 2 dimensions;
updating the Critic and Actor networks by the back-propagation (BP) algorithm with gradient descent, the weight w_i of each neuron being updated as:
w_i ← w_i − η · ∂E/∂w_i
wherein w_i is the i-th weight, η is the learning rate, and E is the learning performance index of the corresponding network.
CN201810465382.3A 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning Expired - Fee Related CN108549237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810465382.3A CN108549237B (en) 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810465382.3A CN108549237B (en) 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108549237A CN108549237A (en) 2018-09-18
CN108549237B true CN108549237B (en) 2020-04-28

Family

ID=63495020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465382.3A Expired - Fee Related CN108549237B (en) 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108549237B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109483530B (en) * 2018-10-18 2020-11-20 北京控制工程研究所 Foot type robot motion control method and system based on deep reinforcement learning
CN109719721B (en) * 2018-12-26 2020-07-24 北京化工大学 Adaptive gait autonomous emerging method of snake-like search and rescue robot
CN109709967B (en) * 2019-01-22 2022-05-31 深圳市幻尔科技有限公司 Method for realizing dynamic gait of robot with low operation requirement
CN109871892A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of robot vision cognitive system based on small sample metric learning
CN110238839B (en) * 2019-04-11 2020-10-20 清华大学 Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN110308727A (en) * 2019-07-12 2019-10-08 沈阳城市学院 A kind of control method for eliminating biped robot's upper body posture shaking
CN110562301B (en) * 2019-08-16 2020-12-01 北京交通大学 Subway train energy-saving driving curve calculation method based on Q learning
CN110496377B (en) * 2019-08-19 2020-07-28 华南理工大学 Virtual table tennis player ball hitting training method based on reinforcement learning
CN110764415B (en) * 2019-10-31 2022-04-15 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN112782973B (en) * 2019-11-07 2022-10-18 四川省桑瑞光辉标识***股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN110909859B (en) * 2019-11-29 2023-03-24 中国科学院自动化研究所 Bionic robot fish motion control method and system based on antagonistic structured control
CN111027143B (en) * 2019-12-18 2020-12-04 四川大学 Shipboard aircraft approach guiding method based on deep reinforcement learning
CN111191399B (en) * 2019-12-24 2021-11-05 北京航空航天大学 Control method, device and equipment of robot fish and storage medium
CN111142378A (en) * 2020-01-07 2020-05-12 四川省桑瑞光辉标识***股份有限公司 Neural network optimization method of biped robot neural network controller
CN111360834B (en) * 2020-03-25 2023-04-07 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN113627584B (en) * 2020-05-08 2024-04-12 南京大学 Mechanical arm inverse kinematics solving method based on neural network, electronic equipment and storage medium
CN112162554B (en) * 2020-09-23 2021-10-01 吉林大学 Data storage and backtracking platform for N3 sweeper
CN112666939B (en) * 2020-12-09 2021-09-10 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN113031528B (en) * 2021-02-25 2022-03-15 电子科技大学 Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient
CN113156892B (en) * 2021-04-16 2022-04-08 西湖大学 Four-footed robot simulated motion control method based on deep reinforcement learning
KR102518406B1 (en) * 2022-12-30 2023-04-05 엑소아틀레트아시아 주식회사 Method of identifying gait pattern
CN117062280B (en) * 2023-08-17 2024-03-08 北京美中爱瑞肿瘤医院有限责任公司 Automatic following system of neurosurgery self-service operating lamp

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1393866A1 (en) * 2001-06-07 2004-03-03 Japan Science and Technology Corporation Apparatus walking with two legs; walking control apparatus; and walking control method thereof
CN104217107A (en) * 2014-08-27 2014-12-17 华南理工大学 Method for detecting tumbling state of humanoid robot based on multi-sensor information
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN106584460A (en) * 2016-12-16 2017-04-26 浙江大学 Vibration suppression method in walking of humanoid robot
CN106842925A (en) * 2017-01-20 2017-06-13 清华大学 A kind of locomotive smart steering method and system based on deeply study
CN107944476A (en) * 2017-11-10 2018-04-20 大连理工大学 A kind of yellow peach stoning machine device people's behaviour control method based on deeply study

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1393866A1 (en) * 2001-06-07 2004-03-03 Japan Science and Technology Corporation Apparatus walking with two legs; walking control apparatus; and walking control method thereof
CN104217107A (en) * 2014-08-27 2014-12-17 华南理工大学 Method for detecting tumbling state of humanoid robot based on multi-sensor information
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN106584460A (en) * 2016-12-16 2017-04-26 浙江大学 Vibration suppression method in walking of humanoid robot
CN106842925A (en) * 2017-01-20 2017-06-13 清华大学 A kind of locomotive smart steering method and system based on deeply study
CN107944476A (en) * 2017-11-10 2018-04-20 大连理工大学 A kind of yellow peach stoning machine device people's behaviour control method based on deeply study

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimal trajectory control of an underwater robot based on deep reinforcement learning; Ma Qiongxiong et al.; Journal of South China Normal University (Natural Science Edition); 2018-02-28; Vol. 50, No. 1; pp. 118-123 *

Also Published As

Publication number Publication date
CN108549237A (en) 2018-09-18

Similar Documents

Publication Publication Date Title
CN108549237B (en) Preset control humanoid robot gait planning method based on deep reinforcement learning
CN112666939B (en) Robot path planning algorithm based on deep reinforcement learning
JP5330138B2 (en) Reinforcement learning system
CN109240091B (en) Underwater robot control method based on reinforcement learning and tracking control method thereof
CN106607910B (en) A kind of robot imitates method in real time
CN104932264B (en) The apery robot stabilized control method of Q learning frameworks based on RBF networks
CN109760761B (en) Four-footed robot motion control method based on bionics principle and intuition
CN114995479A (en) Parameter control method of quadruped robot virtual model controller based on reinforcement learning
CN112596534A (en) Gait training method and device for quadruped robot based on deep reinforcement learning, electronic equipment and medium
Kim et al. Stability control for dynamic walking of bipedal robot with real-time capture point trajectory optimization
CN114047697B (en) Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
Pandala et al. Robust predictive control for quadrupedal locomotion: Learning to close the gap between reduced-and full-order models
JP6781101B2 (en) Non-linear system control method, biped robot control device, biped robot control method and its program
CN106094817B (en) Intensified learning humanoid robot gait's planing method based on big data mode
CN114397810B (en) Motion control method of four-foot robot based on self-adaptive virtual model control
CN110737195A (en) Biped robot walking foot placement point planning method and device based on speed control
CN116619382A (en) Robot motion control method and system and electronic equipment
CN114397899A (en) Bionic robot fish three-dimensional path tracking control method and device
CN114625129B (en) Motion control method and system of position control leg-foot robot
CN113568422B (en) Four-foot robot control method based on model predictive control optimization reinforcement learning
CN112757301B (en) Robot anti-disturbance control method and device, electronic equipment and storage medium
CN116859969A (en) Disturbance suppression and stability control method and device for foot robot
CN116237943A (en) Four-foot robot control method combined with terrain constraint
CN115951696A (en) Pavement attitude self-adaption method for quadruped robot
Xie et al. Gait optimization and energy-based stability for biped locomotion using large-scale programming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200428