CN108549237B - Preset control humanoid robot gait planning method based on deep reinforcement learning - Google Patents

Preset control humanoid robot gait planning method based on deep reinforcement learning Download PDF

Info

Publication number
CN108549237B
CN108549237B (application CN201810465382.3A)
Authority
CN
China
Prior art keywords
humanoid robot
output
reinforcement learning
deep reinforcement
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810465382.3A
Other languages
Chinese (zh)
Other versions
CN108549237A (en)
Inventor
毕盛
刘云达
董敏
张英杰
闵华清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201810465382.3A priority Critical patent/CN108549237B/en
Publication of CN108549237A publication Critical patent/CN108549237A/en
Application granted granted Critical
Publication of CN108549237B publication Critical patent/CN108549237B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a preview control humanoid robot gait planning method based on deep reinforcement learning, which comprises the following steps: 1) acquiring state information through sensors mounted on the humanoid robot; 2) improving the existing deep reinforcement learning network and defining new states, action vectors and a reward function; 3) correcting the output of the preview controller with the defined action vector, calculating the angles of the steering engines (servos) of the legs of the humanoid robot, and guiding the humanoid robot to walk; 4) during walking of the humanoid robot, updating the improved deep reinforcement learning network with the values of the state, the action vector and the reward function. The method can effectively solve the walking problem of the humanoid robot in complex environments; tests carried out on a simulation platform and on a physical robot verify its effectiveness.

Description

Preset control humanoid robot gait planning method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of humanoid robots, and in particular to a preview control humanoid robot gait planning method based on deep reinforcement learning.
Background
A basic function of a humanoid robot is stable walking. However, because the structure of a humanoid robot is complex, the coupling between its parts is strong and the independence of its modules is poor, stable walking is difficult to achieve, and gait control and planning of humanoid robots has therefore become a research hotspot in the field. Conventional gait control methods can be broadly divided into two categories: methods based on modern control theory and methods based on walking mechanisms. Most of these methods, however, are dated and poorly suited to today's increasingly complex models and mechanisms. Recently, a variety of machine learning methods have been proposed and refined, stimulating the development of dynamic gait control. Compared with traditional control theory, machine-learning-based methods do not require a large amount of prior knowledge about a complex model, are easy to implement, and can reach a level comparable to traditional control theory.
Deep reinforcement learning methods have proven effective in complex control problems. Through learning, they compensate for a designer's incomplete knowledge of the system dynamics and may provide complete solutions beyond the designer's field of knowledge. At the same time, such methods can learn and improve continuously, adapting to complex environments.
Disclosure of Invention
The invention mainly studies the gait planning of a humanoid robot walking in complex ground environments. Aiming at the problem that existing control theory cannot effectively solve the walking problem of the humanoid robot in complex environments, a preview control humanoid robot gait planning method based on deep reinforcement learning is proposed. The method can effectively solve this walking problem, and tests carried out on a simulation platform and on a physical robot verify its effectiveness.
In order to achieve this purpose, the technical scheme provided by the invention is as follows: the preview control humanoid robot gait planning method based on deep reinforcement learning comprises the following steps:
1) acquiring state information through sensors mounted on the humanoid robot;
2) improving the existing deep reinforcement learning network and defining new states, action vectors and a reward function;
3) correcting the output of the preview controller with the defined action vector, calculating the angles of the steering engines of the legs of the humanoid robot, and guiding the humanoid robot to walk;
4) during walking of the humanoid robot, updating the improved deep reinforcement learning network with the values of the state, the action vector and the reward function.
In step 1), state information is acquired through sensors mounted on the humanoid robot. The walking stability of the humanoid robot is mainly affected by the pitch-direction steering engines (servos) on the supporting foot, so the defined state information includes the supporting-foot information and the angles of the pitch steering engines on the supporting foot. In addition, acceleration and angular velocity values are needed to judge the stability of the humanoid robot during walking, so that the offline gait can be adjusted in real time to adapt to uneven terrain. The state is:
[α, ω, θ_lhip, θ_rhip, θ_lankle, θ_rankle]
where α is the square root of the sum of the squares of the accelerations of the humanoid robot in the x-axis and y-axis directions, ω is the square root of the sum of the squares of the angular velocities in the x-axis and y-axis directions, and θ_lhip, θ_rhip, θ_lankle, θ_rankle are the pitch-direction steering engine angles of the hip and ankle joints of the left and right legs.
In step 2), the improved deep reinforcement learning network adopts the deep deterministic policy gradient (DDPG) method, which specifically comprises the following steps:
2.1) definition of deep reinforcement learning related variables
The control output of the preview controller is compensated by a deep reinforcement learning method; to use deep reinforcement learning, the relevant variables, namely the state vector, the action vector and the reward function, need to be defined first;
the output of the preview controller is a two-dimensional vector corresponding to the output values of the centroid coordinates in the x-axis and y-axis directions, so the action of the deep reinforcement learning network is defined as:
a(t) = [Δμ_x, Δμ_y]
where Δμ_x and Δμ_y correspond to the output variations of the two dimensions of the preview controller;
considering the walking requirement of the humanoid robot and the desire that it remain stable while walking as far as possible, the reward function is defined as:
r(t) = 50 if the humanoid robot smoothly reaches the end point; r(t) = −50 if the humanoid robot falls during walking; otherwise r(t) is computed from the robot's current state, using r_α(t), r_ω(t) and x_dis defined below;
the square root of the sum of squares of the acceleration, r_α(t), is defined as:
r_α(t) = √(α_x(t)² + α_y(t)²)
where α_x(t) and α_y(t) are the accelerations of the humanoid robot in the x-axis and y-axis directions at time t;
the square root of the sum of squares of the angular velocities, r_ω(t), is defined as:
r_ω(t) = √(ω_x(t)² + ω_y(t)²)
where ω_x(t) and ω_y(t) are the angular velocities of the humanoid robot in the x-axis and y-axis directions at time t;
x_dis denotes the distance traveled by the humanoid robot;
2.2) construction of deep reinforcement learning network
When implementing DDPG, an Actor network and a Critic network need to be constructed and trained, wherein the Critic network is used to parameterize the action-value function, and the Actor network is used to guide the update of the policy function according to the value obtained by the Critic network. The specific structure of the Critic network is:
an input layer: s(t), the state input to the Q function at time t in Q-learning, with 9 dimensions in total;
hidden layers: there are 2 hidden layers; the first layer has 402 nodes, including 2 nodes representing the action; the second layer has 300 nodes; the activation function of each neuron is the rectified linear (ReLU) activation function, whose output is calculated as:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: Q(t), the output of the action-value function, with 1 dimension;
the specific structure of the Actor network is:
an input layer: s(t), the state input at time t, with 9 dimensions in total;
hidden layers: there are 2 hidden layers, the first with 400 nodes and the second with 300 nodes; the activation function of each neuron is the ReLU activation function:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: a(t), the output action value, with 2 dimensions;
the Critic and Actor networks are updated by the back-propagation (BP) algorithm with gradient descent, and the weight w_i of each neuron is updated as:
w_i ← w_i − η · ∂E/∂w_i
where w_i is the i-th weight, η is the learning rate, and E is the learning performance index of the corresponding network;
In step 3), the output of the preview controller is corrected with the correction computed by the improved deep reinforcement learning network, and the angles of all steering engines of the two legs of the humanoid robot are calculated from the corrected preview-controller output to guide the robot to walk. The theoretical emphasis of the conventional preview controller is to use future information for control, where the future information refers to the target ZMP reference values within the next N_p steps. If the current time is k, the poses of the two feet within the next N_p steps are obtained from the three-dimensional walking pattern, giving the target ZMP references within those N_p steps: ZMP*_{k+1}, …, ZMP*_{k+N_p}. These future target ZMP reference values are stored in a FIFO buffer whose output is used as the current reference value, and the preview controller computes the control output from the ZMP reference values in the FIFO buffer and the state of the humanoid robot:
u_k = −K_s · Σ_{i=0}^{k} (c·x_i − ZMP*_i) − K_x·x_k − [f_1, …, f_{N_p}] · [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T
where u_k is the controller output at time k; c, K_s, K_x and the preview gains f_1, …, f_{N_p} are the controller coefficients; x_k is the centroid state of the humanoid robot at time k; and [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T are the reference ZMP values from time k+1 to time k+N_p;
the correction Δu_k of the preview-control output value is trained through the deep reinforcement learning network:
u′_k = u_k + Δu_k
after the control input is obtained, the centroid coordinate at time k+1 is calculated using the standard preview-control state update
x_{k+1} = A·x_k + B·u′_k
where A and B are the discrete-time matrices of the cart-table (linear inverted pendulum) model;
from the centroid coordinate (x_{k+1}, y_{k+1}) at time k+1, the centroid pose G_cobpresent and the left- and right-foot poses G_lpresent and G_rpresent at time k+1 are obtained; finally, the steering engine angles of the legs of the humanoid robot are calculated according to the inverse kinematics principle, giving the steering engine angle of each leg joint at time k+1 and thereby guiding the walking of the humanoid robot.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The method adopts the idea of deep reinforcement learning on the basis of the existing preview control theory and accelerates the convergence speed.
2. The method is simple and easy to implement. It can control the walking motion of the humanoid robot online and adjust the robot's gait in time, helping the humanoid robot walk stably on uneven ground, and therefore has practical significance and application value.
Drawings
FIG. 1 is a diagram of a Critic network architecture.
Fig. 2 is a diagram of an Actor network structure.
Fig. 3 is a preview control flowchart.
Fig. 4 is a flow chart of the preview control based on the deep reinforcement learning.
Fig. 5 is a graph showing the effect of the walking experiment.
Detailed Description
The present invention will be further described with reference to the following specific examples.
The preview control humanoid robot gait planning method based on deep reinforcement learning provided by this embodiment is as follows:
1) acquisition of humanoid robot states
State information is acquired by sensors mounted on the humanoid robot. The walking stability of the humanoid robot is mainly affected by the pitch-direction steering engines on the supporting foot, so the defined state information should include the supporting-foot information and the angles of the pitch steering engines on the supporting foot. In addition, acceleration and angular velocity values are needed to determine the stability of the humanoid robot during walking; the offline gait is then adjusted in real time to adapt to uneven terrain. The state is:
[α, ω, θ_lhip, θ_rhip, θ_lankle, θ_rankle]
where α is the square root of the sum of the squares of the accelerations of the humanoid robot in the x-axis and y-axis directions, ω is the square root of the sum of the squares of the angular velocities in the x-axis and y-axis directions, and θ_lhip, θ_rhip, θ_lankle, θ_rankle are the pitch-direction steering engine angles of the hip and ankle joints of the left and right legs.
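As an illustration only (not part of the claimed method), the following Python sketch shows how such a state vector could be assembled from raw IMU and servo readings; the function and argument names are hypothetical.

```python
import math

def build_state(ax, ay, wx, wy, theta_lhip, theta_rhip, theta_lankle, theta_rankle):
    """Assemble [alpha, omega, theta_lhip, theta_rhip, theta_lankle, theta_rankle]."""
    alpha = math.sqrt(ax ** 2 + ay ** 2)  # resultant acceleration in the x-y plane
    omega = math.sqrt(wx ** 2 + wy ** 2)  # resultant angular velocity in the x-y plane
    # The supporting-foot information mentioned in the text would be appended to this
    # list to reach the 9-dimensional network input described later.
    return [alpha, omega, theta_lhip, theta_rhip, theta_lankle, theta_rankle]
```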
2.1) definition of deep reinforcement learning related variables
A walking-pattern generation method based only on the preview controller cannot guarantee stability for motions that are difficult to describe with such a simple model, for example large swings of the upper-body posture or of the arms, which cause a large discrepancy between the reference and actual ZMP values. It is therefore necessary to compensate the control output of the preview controller by deep reinforcement learning. The deep reinforcement learning method adopted in this embodiment is the Deep Deterministic Policy Gradient (DDPG) method, which can output continuous actions and performs better than comparable methods in complex scenarios.
To use deep reinforcement learning, the relevant variables, namely the state vector, the action vector and the reward function, need to be defined first. The state has already been described in step 1) above and is not repeated here.
The output of the preview controller is a two-dimensional vector corresponding to the output values of the centroid coordinates in the x-axis and y-axis directions. The action of the deep reinforcement learning network is therefore defined as:
a(t) = [Δμ_x, Δμ_y]
where Δμ_x and Δμ_y correspond to the output variations of the two dimensions of the preview controller.
Considering the walking requirement of the humanoid robot and the desire that it remain stable while walking as far as possible, the reward function is defined as:
r(t) = 50 if the humanoid robot smoothly reaches the end point; r(t) = −50 if the humanoid robot falls during walking; otherwise r(t) is computed from the robot's current state, using r_α(t), r_ω(t) and x_dis defined below.
The square root of the sum of squares of the acceleration, r_α(t), is defined as:
r_α(t) = √(α_x(t)² + α_y(t)²)
where α_x(t) and α_y(t) are the accelerations of the humanoid robot in the x-axis and y-axis directions at time t.
The square root of the sum of squares of the angular velocities, r_ω(t), is defined as:
r_ω(t) = √(ω_x(t)² + ω_y(t)²)
where ω_x(t) and ω_y(t) are the angular velocities of the humanoid robot in the x-axis and y-axis directions at time t.
x_dis denotes the distance traveled by the humanoid robot.
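The following Python sketch illustrates one way the reward described above could be computed. The terminal values (+50/−50) follow the text; the exact form of the intermediate return is not spelled out here, so combining r_α(t), r_ω(t) and x_dis in the final branch is an assumption made purely for illustration.

```python
import math

def r_alpha(ax_t, ay_t):
    """Square root of the sum of squares of the accelerations at time t."""
    return math.sqrt(ax_t ** 2 + ay_t ** 2)

def r_omega(wx_t, wy_t):
    """Square root of the sum of squares of the angular velocities at time t."""
    return math.sqrt(wx_t ** 2 + wy_t ** 2)

def reward(reached_goal, fallen, ax_t, ay_t, wx_t, wy_t, x_dis):
    if reached_goal:
        return 50.0
    if fallen:
        return -50.0
    # Assumed intermediate shaping: reward distance traveled, penalize body oscillation.
    return x_dis - r_alpha(ax_t, ay_t) - r_omega(wx_t, wy_t)
```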
2.2) construction of deep reinforcement learning network
When implementing DDPG, an Actor network and a Critic network need to be constructed and trained. The Critic network parameterizes the action-value function; the Actor network guides the update of the policy function according to the value obtained by the Critic network (an illustrative sketch of both networks follows the Actor structure below). As shown in Fig. 1, the specific structure of the Critic network is:
An input layer: s(t), the state input to the Q function at time t in Q-learning, with 9 dimensions in total;
hidden layers: there are 2 hidden layers; the first layer has 402 nodes, including 2 nodes representing the action; the second layer has 300 nodes. The activation function of each neuron is the rectified linear (ReLU) activation function, whose output is calculated as:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0.
An output layer: Q(t), the output of the action-value function, with 1 dimension.
As shown in Fig. 2, the specific structure of the Actor network is:
An input layer: s(t), the state input at time t, with 9 dimensions in total;
hidden layers: there are 2 hidden layers, the first with 400 nodes and the second with 300 nodes. The activation function of each neuron is the ReLU activation function:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0.
An output layer: a(t), the output action value, with 2 dimensions.
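A minimal PyTorch sketch of networks with the layer sizes described above is given below. It interprets the Critic's 402-node first hidden layer as 400 state units concatenated with the 2-dimensional action; this is an illustration under that assumption, not the patented implementation.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Q(s, a): 9-dim state, 400 (+2 action) -> 300 -> 1, ReLU activations."""
    def __init__(self, state_dim=9, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)          # state part of the first hidden layer
        self.fc2 = nn.Linear(400 + action_dim, 300)   # the 2 action nodes join here (402 in total)
        self.q_out = nn.Linear(300, 1)                # Q(t), 1 dimension

    def forward(self, state, action):
        h1 = torch.relu(self.fc1(state))
        h2 = torch.relu(self.fc2(torch.cat([h1, action], dim=-1)))
        return self.q_out(h2)

class Actor(nn.Module):
    """a(t) = [delta_mu_x, delta_mu_y]: 9 -> 400 -> 300 -> 2, ReLU activations."""
    def __init__(self, state_dim=9, action_dim=2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.a_out = nn.Linear(300, action_dim)

    def forward(self, state):
        h1 = torch.relu(self.fc1(state))
        h2 = torch.relu(self.fc2(h1))
        return self.a_out(h2)
```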
The Critic and Actor networks are updated by the back-propagation (BP) algorithm with gradient descent, and the weight w_i of each neuron is updated as:
w_i ← w_i − η · ∂E/∂w_i
where w_i is the i-th weight, η is the learning rate, and E is the learning performance index of the corresponding network.
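For illustration, this gradient-descent update corresponds to a plain SGD step in PyTorch. The sketch below reuses the Critic class from the previous sketch; E stands for whichever scalar performance index (for example the Critic's temporal-difference error) is being minimized, which is an assumption rather than a detail given in the text.

```python
import torch

critic = Critic()                       # from the previous sketch
eta = 1e-3                              # learning rate
optimizer = torch.optim.SGD(critic.parameters(), lr=eta)

def update_step(E):
    """One BP + gradient-descent step on a scalar performance index E."""
    optimizer.zero_grad()
    E.backward()      # back-propagation computes dE/dw_i for every weight
    optimizer.step()  # w_i <- w_i - eta * dE/dw_i
```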
3) The output of the preview controller is corrected with the correction computed by the improved deep reinforcement learning network, and the angles of all steering engines of the two legs of the humanoid robot are calculated from the corrected preview-controller output to guide the robot to walk.
The theoretical emphasis of a conventional preview controller is to use future information for control. In the present embodiment, the future information is the target ZMP reference values within the next N_p steps. Let the current time be k; the target ZMP references within the next N_p steps are ZMP*_{k+1}, …, ZMP*_{k+N_p}. These future target ZMP reference values are stored in a FIFO (first-in-first-out) buffer, the output of which serves as the current reference value. The preview controller calculates the control output from the ZMP reference values in the FIFO buffer and the state of the humanoid robot. The control output is:
u_k = −K_s · Σ_{i=0}^{k} (c·x_i − ZMP*_i) − K_x·x_k − [f_1, …, f_{N_p}] · [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T
where u_k is the controller output at time k; c, K_s, K_x and the preview gains f_1, …, f_{N_p} are the controller coefficients; x_k is the centroid state of the humanoid robot at time k; and [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T are the reference ZMP values from time k+1 to time k+N_p.
The correction Δu_k of the preview-control output value is trained through the deep reinforcement learning network, giving the corrected output u′_k:
u′_k = u_k + Δu_k
After the control input is obtained, the centroid coordinate at time k+1 can be calculated using the standard preview-control state update
x_{k+1} = A·x_k + B·u′_k
where A and B are the discrete-time matrices of the cart-table (linear inverted pendulum) model.
From the centroid coordinate (x_{k+1}, y_{k+1}) at time k+1, the centroid pose G_cobpresent and the left- and right-foot poses G_lpresent and G_rpresent at time k+1 are obtained. The steering engine angles of the two legs of the humanoid robot are then calculated according to the inverse kinematics principle, giving the steering engine angle of each leg joint at time k+1 and thereby guiding the walking of the humanoid robot; the specific flow is shown in Fig. 3.
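The following Python sketch illustrates one such corrected control step. Here preview_control, actor, inverse_kinematics and the matrices A, B are placeholders for the preview controller, the DDPG actor, a robot-specific inverse kinematics solver and the discretized cart-table dynamics; none of these names come from the patent, and the sketch is an assumption-laden illustration rather than the claimed implementation.

```python
def control_step(preview_control, actor, ddpg_state, x_state, y_state, A, B, inverse_kinematics):
    # A, B, x_state, y_state are assumed to be NumPy arrays (A: 3x3, B and states: 3-vectors).
    # 1) nominal preview-controller outputs u_k for the x and y axes
    u_x, u_y = preview_control(x_state, y_state)
    # 2) DDPG correction a(t) = [delta_mu_x, delta_mu_y] of the preview output
    delta_u_x, delta_u_y = actor(ddpg_state)
    u_x_corr, u_y_corr = u_x + delta_u_x, u_y + delta_u_y
    # 3) propagate the centroid state with the discretized cart-table dynamics
    x_next = A @ x_state + B * u_x_corr
    y_next = A @ y_state + B * u_y_corr
    # 4) steering engine (servo) angles at time k+1 from the new centroid position
    joint_angles = inverse_kinematics(x_next[0], y_next[0])
    return x_next, y_next, joint_angles
```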
During walking of the humanoid robot, for each output (u_x, u_y) of the preview controller, the current state is calculated and the deep reinforcement learning DDPG learns a group of correction quantities for the output, which are also used to update the deep reinforcement learning network. Meanwhile, the walking posture of the humanoid robot is calculated from the output of the preview controller. In summary, the algorithm steps are as follows (see Fig. 4; an illustrative sketch of the loop follows the list):
① Initialize the deep reinforcement learning DDPG framework and the preview controller;
② obtain the current state from the sensor information, and use the deep reinforcement learning DDPG to calculate a group of correction values for the preview controller;
③ add the correction values to the output of the preview controller, and guide the humanoid robot to walk according to the corrected output value and the inverse kinematics principle;
④ acquire the immediate return value of the current system, and update the deep reinforcement learning framework;
⑤ judge the current state of the humanoid robot: if the humanoid robot has fallen or has walked to the target location, end the loop; otherwise, jump to step ②.
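An illustrative Python sketch of this loop is given below. The names env, ddpg and preview_controller are hypothetical interfaces to the simulation or robot, the DDPG agent (actor, critic and replay buffer) and the preview controller, and reward() refers to the earlier reward sketch; none of them are prescribed by the patent.

```python
def walk_episode(env, ddpg, preview_controller, max_steps=1000):
    state = env.reset()                                    # step 1: initialize
    for _ in range(max_steps):
        delta_u = ddpg.act(state)                          # step 2: correction from DDPG
        u = preview_controller.output(env) + delta_u       # step 3: corrected control output
        next_state, fallen, reached_goal = env.apply(u)    #         walk via inverse kinematics
        ax, ay, wx, wy = env.imu()                         # accelerations and angular velocities
        r = reward(reached_goal, fallen, ax, ay, wx, wy, env.x_dis())  # step 4: immediate return
        ddpg.store_and_update(state, delta_u, r, next_state)           #         update DDPG
        if fallen or reached_goal:                         # step 5: termination check
            break
        state = next_state
```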
Fig. 5 shows the experimental walking effect of the humanoid robot.
The above-described embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; changes made according to the shape and principle of the present invention shall be covered by the protection scope of the present invention.

Claims (3)

1. A preview control humanoid robot gait planning method based on deep reinforcement learning, characterized by comprising the following steps:
1) acquiring state information through sensors mounted on the humanoid robot;
2) improving the existing deep reinforcement learning network and defining new states, action vectors and a reward function;
3) correcting the output of the preview controller with the defined action vector, calculating the angles of all steering engines of the two legs of the humanoid robot, and guiding the humanoid robot to walk, which specifically comprises:
correcting the output of the preview controller with the correction computed by the improved deep reinforcement learning network, and calculating the angles of all steering engines of the legs of the humanoid robot on the basis of the corrected preview controller to guide the robot to walk; the theoretical emphasis of the conventional preview controller is to use future information for control, where the future information refers to the target ZMP reference values within the next N_p steps; if the current time is k, the poses of the two feet within the next N_p steps are obtained from the three-dimensional walking pattern, giving the target ZMP references within those N_p steps: ZMP*_{k+1}, …, ZMP*_{k+N_p}; these future target ZMP reference values are then stored in a FIFO buffer whose output is used as the current reference value, and the control output is calculated from the ZMP reference values in the FIFO buffer and the state of the humanoid robot as:
u_k = −K_s · Σ_{i=0}^{k} (c·x_i − ZMP*_i) − K_x·x_k − [f_1, …, f_{N_p}] · [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T
wherein u_k is the controller output at time k; c, K_s, K_x and the preview gains f_1, …, f_{N_p} are the controller coefficients; x_k is the centroid state of the humanoid robot at time k; and [ZMP*_{k+1}, …, ZMP*_{k+N_p}]^T are the reference ZMP values from time k+1 to time k+N_p;
training the correction Δu_k of the preview-control output value through the deep reinforcement learning network:
u′_k = u_k + Δu_k
after the control input is obtained, calculating the centroid coordinate at time k+1 by the standard preview-control state update
x_{k+1} = A·x_k + B·u′_k
wherein A and B are the discrete-time matrices of the cart-table model;
using the centroid coordinate (x_{k+1}, y_{k+1}) at time k+1 to obtain the centroid pose G_cobpresent and the left- and right-foot poses G_lpresent and G_rpresent at time k+1; finally, calculating the steering engine angles of the two legs of the humanoid robot according to the inverse kinematics principle to obtain the steering engine angle of each joint of the two legs at time k+1, so as to guide the walking of the humanoid robot;
4) and in the walking process of the humanoid robot, updating the improved deep reinforcement learning network by using the values of the state, the motion vector and the reward function.
2. The preview control humanoid robot gait planning method based on deep reinforcement learning according to claim 1, characterized in that: in step 1), state information is acquired through sensors mounted on the humanoid robot; the walking stability of the humanoid robot is mainly affected by the pitch-direction steering engines on the supporting foot, so the defined state information includes the supporting-foot information and the angles of the pitch steering engines on the supporting foot; in addition, acceleration and angular velocity values are needed to judge the stability of the humanoid robot during walking, so that the offline gait can be adjusted in real time to adapt to uneven terrain; the state is:
[α, ω, θ_lhip, θ_rhip, θ_lankle, θ_rankle]
wherein α is the square root of the sum of the squares of the accelerations of the humanoid robot in the x-axis and y-axis directions, ω is the square root of the sum of the squares of the angular velocities in the x-axis and y-axis directions, and θ_lhip, θ_rhip, θ_lankle, θ_rankle are the pitch-direction steering engine angles of the hip and ankle joints of the left and right legs.
3. The preview control humanoid robot gait planning method based on deep reinforcement learning according to claim 1, characterized in that: in step 2), the improved deep reinforcement learning network adopts the deep deterministic policy gradient (DDPG) method, which specifically comprises the following steps:
2.1) definition of deep reinforcement learning related variables
compensating the control output of the preview controller by a deep reinforcement learning method, wherein the relevant variables, namely the state vector, the action vector and the reward function, need to be defined first;
the output of the preview controller is a two-dimensional vector corresponding to the output values of the centroid coordinates in the x-axis and y-axis directions, so the action of the deep reinforcement learning network is defined as:
a(t) = [Δμ_x, Δμ_y]
wherein Δμ_x and Δμ_y correspond to the output variations of the two dimensions of the preview controller;
considering the walking requirement of the humanoid robot and the desire that it remain stable while walking as far as possible, the reward function is defined as:
r(t) = 50 if the humanoid robot smoothly reaches the end point; r(t) = −50 if the humanoid robot falls during walking; otherwise r(t) is computed from the robot's current state, using r_α(t), r_ω(t) and x_dis defined below;
the square root of the sum of squares of the acceleration, r_α(t), is defined as:
r_α(t) = √(α_x(t)² + α_y(t)²)
wherein α_x(t) and α_y(t) are the accelerations of the humanoid robot in the x-axis and y-axis directions at time t;
the square root of the sum of squares of the angular velocities, r_ω(t), is defined as:
r_ω(t) = √(ω_x(t)² + ω_y(t)²)
wherein ω_x(t) and ω_y(t) are the angular velocities of the humanoid robot in the x-axis and y-axis directions at time t;
x_dis denotes the distance traveled by the humanoid robot;
2.2) construction of deep reinforcement learning network
when implementing DDPG, an Actor network and a Critic network need to be constructed and trained respectively, wherein the Critic network is used to parameterize the action-value function, and the Actor network is used to guide the update of the policy function according to the value obtained by the Critic network; the specific structure of the Critic network is:
an input layer: s(t), the state input to the Q function at time t in Q-learning, with 9 dimensions in total;
hidden layers: there are 2 hidden layers; the first layer has 402 nodes, including 2 nodes representing the action; the second layer has 300 nodes; the activation function of each neuron is the rectified linear (ReLU) activation function, whose output is calculated as:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: Q(t), the output of the action-value function, with 1 dimension;
the specific structure of the Actor network is:
an input layer: s(t), the state input at time t, with 9 dimensions in total;
hidden layers: there are 2 hidden layers, the first with 400 nodes and the second with 300 nodes; the activation function of each neuron is the ReLU activation function:
y_i(t) = max(t, 0), i = 1, 2, …, n
i.e. the output y_i(t) of the i-th neuron is the greater of its input t and 0;
an output layer: a(t), the output action value, with 2 dimensions;
updating the Critic and Actor networks by the back-propagation (BP) algorithm with gradient descent, the weight w_i of each neuron being updated as:
w_i ← w_i − η · ∂E/∂w_i
wherein w_i is the i-th weight, η is the learning rate, and E is the learning performance index of the corresponding network.
CN201810465382.3A 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning Expired - Fee Related CN108549237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810465382.3A CN108549237B (en) 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810465382.3A CN108549237B (en) 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108549237A CN108549237A (en) 2018-09-18
CN108549237B true CN108549237B (en) 2020-04-28

Family

ID=63495020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810465382.3A Expired - Fee Related CN108549237B (en) 2018-05-16 2018-05-16 Preset control humanoid robot gait planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108549237B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109483530B (en) * 2018-10-18 2020-11-20 北京控制工程研究所 Foot type robot motion control method and system based on deep reinforcement learning
CN109719721B (en) * 2018-12-26 2020-07-24 北京化工大学 Adaptive gait autonomous emerging method of snake-like search and rescue robot
CN109709967B (en) * 2019-01-22 2022-05-31 深圳市幻尔科技有限公司 Method for realizing dynamic gait of robot with low operation requirement
CN109871892A (en) * 2019-02-18 2019-06-11 华南理工大学 A kind of robot vision cognitive system based on small sample metric learning
CN110238839B (en) * 2019-04-11 2020-10-20 清华大学 Multi-shaft-hole assembly control method for optimizing non-model robot by utilizing environment prediction
CN110308727A (en) * 2019-07-12 2019-10-08 沈阳城市学院 A kind of control method for eliminating biped robot's upper body posture shaking
CN110562301B (en) * 2019-08-16 2020-12-01 北京交通大学 Subway train energy-saving driving curve calculation method based on Q learning
CN110496377B (en) * 2019-08-19 2020-07-28 华南理工大学 Virtual table tennis player ball hitting training method based on reinforcement learning
CN110764415B (en) * 2019-10-31 2022-04-15 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN112782973B (en) * 2019-11-07 2022-10-18 四川省桑瑞光辉标识***股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN110909859B (en) * 2019-11-29 2023-03-24 中国科学院自动化研究所 Bionic robot fish motion control method and system based on antagonistic structured control
CN111027143B (en) * 2019-12-18 2020-12-04 四川大学 Shipboard aircraft approach guiding method based on deep reinforcement learning
CN111191399B (en) * 2019-12-24 2021-11-05 北京航空航天大学 Control method, device and equipment of robot fish and storage medium
CN111142378A (en) * 2020-01-07 2020-05-12 四川省桑瑞光辉标识***股份有限公司 Neural network optimization method of biped robot neural network controller
CN111360834B (en) * 2020-03-25 2023-04-07 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN113627584B (en) * 2020-05-08 2024-04-12 南京大学 Mechanical arm inverse kinematics solving method based on neural network, electronic equipment and storage medium
CN112162554B (en) * 2020-09-23 2021-10-01 吉林大学 Data storage and backtracking platform for N3 sweeper
CN112666939B (en) * 2020-12-09 2021-09-10 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN113031528B (en) * 2021-02-25 2022-03-15 电子科技大学 Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient
CN113156892B (en) * 2021-04-16 2022-04-08 西湖大学 Four-footed robot simulated motion control method based on deep reinforcement learning
KR102518406B1 (en) * 2022-12-30 2023-04-05 엑소아틀레트아시아 주식회사 Method of identifying gait pattern
CN117062280B (en) * 2023-08-17 2024-03-08 北京美中爱瑞肿瘤医院有限责任公司 Automatic following system of neurosurgery self-service operating lamp

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1393866A1 (en) * 2001-06-07 2004-03-03 Japan Science and Technology Corporation Apparatus walking with two legs; walking control apparatus; and walking control method thereof
CN104217107A (en) * 2014-08-27 2014-12-17 华南理工大学 Method for detecting tumbling state of humanoid robot based on multi-sensor information
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN106584460A (en) * 2016-12-16 2017-04-26 浙江大学 Vibration suppression method in walking of humanoid robot
CN106842925A (en) * 2017-01-20 2017-06-13 清华大学 A kind of locomotive smart steering method and system based on deeply study
CN107944476A (en) * 2017-11-10 2018-04-20 大连理工大学 A kind of yellow peach stoning machine device people's behaviour control method based on deeply study

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1393866A1 (en) * 2001-06-07 2004-03-03 Japan Science and Technology Corporation Apparatus walking with two legs; walking control apparatus; and walking control method thereof
CN104217107A (en) * 2014-08-27 2014-12-17 华南理工大学 Method for detecting tumbling state of humanoid robot based on multi-sensor information
CN106094817A (en) * 2016-06-14 2016-11-09 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN106584460A (en) * 2016-12-16 2017-04-26 浙江大学 Vibration suppression method in walking of humanoid robot
CN106842925A (en) * 2017-01-20 2017-06-13 清华大学 A kind of locomotive smart steering method and system based on deeply study
CN107944476A (en) * 2017-11-10 2018-04-20 大连理工大学 A kind of yellow peach stoning machine device people's behaviour control method based on deeply study

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Optimal trajectory control of an underwater robot based on deep reinforcement learning; Ma Qiongxiong et al.; Journal of South China Normal University (Natural Science Edition); 2018-02-28; Vol. 50, No. 1; pp. 118-123 *

Also Published As

Publication number Publication date
CN108549237A (en) 2018-09-18

Similar Documents

Publication Publication Date Title
CN108549237B (en) Preset control humanoid robot gait planning method based on deep reinforcement learning
CN112666939B (en) Robot path planning algorithm based on deep reinforcement learning
JP5330138B2 (en) Reinforcement learning system
CN109240091B (en) Underwater robot control method based on reinforcement learning and tracking control method thereof
CN106607910B (en) A kind of robot imitates method in real time
CN104932264B (en) The apery robot stabilized control method of Q learning frameworks based on RBF networks
CN109760761B (en) Four-footed robot motion control method based on bionics principle and intuition
CN114995479A (en) Parameter control method of quadruped robot virtual model controller based on reinforcement learning
CN112596534A (en) Gait training method and device for quadruped robot based on deep reinforcement learning, electronic equipment and medium
Kim et al. Stability control for dynamic walking of bipedal robot with real-time capture point trajectory optimization
CN114047697B (en) Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
Pandala et al. Robust predictive control for quadrupedal locomotion: Learning to close the gap between reduced-and full-order models
JP6781101B2 (en) Non-linear system control method, biped robot control device, biped robot control method and its program
CN106094817B (en) Intensified learning humanoid robot gait's planing method based on big data mode
CN114397810B (en) Motion control method of four-foot robot based on self-adaptive virtual model control
CN110737195A (en) Biped robot walking foot placement point planning method and device based on speed control
CN116619382A (en) Robot motion control method and system and electronic equipment
CN114397899A (en) Bionic robot fish three-dimensional path tracking control method and device
CN114625129B (en) Motion control method and system of position control leg-foot robot
CN113568422B (en) Four-foot robot control method based on model predictive control optimization reinforcement learning
CN112757301B (en) Robot anti-disturbance control method and device, electronic equipment and storage medium
CN116859969A (en) Disturbance suppression and stability control method and device for foot robot
CN116237943A (en) Four-foot robot control method combined with terrain constraint
CN115951696A (en) Pavement attitude self-adaption method for quadruped robot
Xie et al. Gait optimization and energy-based stability for biped locomotion using large-scale programming

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200428