CN113821045A - Leg and foot robot reinforcement learning action generation system - Google Patents

Leg and foot robot reinforcement learning action generation system

Info

Publication number
CN113821045A
CN113821045A
Authority
CN
China
Prior art keywords
robot
training
neural network
reinforcement learning
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110924651.XA
Other languages
Chinese (zh)
Other versions
CN113821045B (en)
Inventor
朱秋国
王志成
李岸荞
熊蓉
吴俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Supcon Group Co Ltd
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110924651.XA priority Critical patent/CN113821045B/en
Publication of CN113821045A publication Critical patent/CN113821045A/en
Application granted granted Critical
Publication of CN113821045B publication Critical patent/CN113821045B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0891Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for land vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a reinforcement learning action generation system for legged robots. The system uses deep reinforcement learning to train an action policy end to end, collecting input-output data from a simulation environment during training so as to learn a motion control strategy that maximizes a reward function. The initial policy is obtained by pre-training on data recorded while a traditional control method runs, so that throughout training the policy stays near the desired extremum in parameter space, avoiding the convergence to other local minima that randomly generated initial values can cause. A gait cycle reward signal encourages the agent being trained to perform the corresponding periodic actions, yielding an effective input vector. The system can be used to train quadruped robots of various leg configurations in multiple gaits and multiple actions.

Description

Leg and foot robot reinforcement learning action generation system
Technical Field
The invention belongs to the field of robot reinforcement learning, and in particular relates to a reinforcement learning action generation system for legged robots.
Background
Robots serve as important tools and powerful assistants in people's production and daily life. They can complete work that would otherwise require substantial manpower with little or no human assistance, saving working time across many industries, reducing workload and thereby raising the efficiency of productive labor. For example, different kinds of robots can handle freight logistics at dock warehouses or assembly on production lines, freeing workers from heavy physical and repetitive tasks. In recent years robots have also gradually gained interactive functions and entered the service industry to perform more detailed, person-oriented activities. This wide application value has made robotics one of the hot spots of automation research, and the field has maintained strong momentum for a long time.
Legged robots are a typical class of mobile robots, characterized by locomotion on legs composed of links. A particular advantage of a legged robot is that its high-degree-of-freedom leg structure can actively select contact points with the environment and move between a series of such contact points. Compared with wheeled robots, which can only move on relatively flat, structured surfaces, legged robots are extremely adaptable to uneven ground and unknown environments and do not require a dedicated road surface to be laid, so they can enter a working area directly and operate there. Legged robots are therefore well suited to service scenarios whose working conditions are complex, changeable and hard to standardize, and they have broad application prospects in special environments such as scientific exploration, disaster-area rescue and military reconnaissance.
Traditional legged robot control methods rely on an accurate physical model, but the modeling process can never be made perfectly accurate; control based on approximate equivalence performs poorly and limits the flexibility and stability of the robot's motion. These methods also require a large number of parameters to be set, almost all of it done manually by the designer, who must invest a great deal of work in parameter tuning. Whenever a new action is required, the designer must redo this work and design a completely different controller. Data-driven methods can make up for these problems: no additional physical model needs to be built for simulation or online training, and the robot can freely explore its workspace to obtain an optimal policy through simulation or on-robot training without imposed assumptions. Parameters are adjusted automatically during the machine learning training stage, which effectively removes the workload of manual tuning, and when a new action is needed only the network structure and the reward function are changed, not the overall architecture.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a reinforcement learning action generation system for legged robots.
The purpose of the invention is achieved by the following technical scheme: a legged robot reinforcement learning action generation system comprising a simulation training machine and an actual robot. The simulation training machine comprises a policy network pre-training module, a dynamics simulation engine, a reinforcement learning training module, a reward calculation module, a policy neural network controller and a simulation low-level controller. The actual robot comprises joints and their drive modules, sensors and their interfaces, a policy neural network controller and a robot low-level controller.
The system performs end-to-end training with deep reinforcement learning to generate an action policy, obtaining observation-action command data from the simulation environment of the dynamics simulation engine during training so as to learn a motion control strategy that maximizes the reward function value computed by the reward calculation module.
The system generates the initial values of the policy neural network controller by pre-training, which keeps the policy near the desired extremum in parameter space throughout training and avoids the convergence to other local minima that can occur with randomly generated initial values.
The system uses a gait cycle reward signal to encourage the agent being trained to perform the corresponding periodic actions, and thereby obtains an effective input vector.
Further, the policy neural network outputs the desired position of each joint; its input includes the robot center-of-mass height, the gravity direction in the robot frame, the joint angles, the angle differences between the left and right joints, the body linear velocity, the body angular velocity and the joint angular velocities.
Further, the policy neural network is pre-trained as follows:
(3.1) generate the robot's locomotion gait with a model-based control method, extract motion data covering several gait cycles close to the desired gait, and record the data;
(3.2) process and collate the recorded time-series data into the input-label format required by the neural network;
(3.3) randomly initialize the weights of the policy neural network, whose structure is identical to the network used in the subsequent training;
(3.4) pre-train on the collated data set with an optimization algorithm to obtain a preliminarily fitted neural network.
Further, the reinforcement learning training process in the simulation environment is specifically:
(4.1) start the dynamics simulation engine, initialize the whole environment and load the robot model with added disturbances;
(4.2) run the simulation and collect observation-action-reward trajectories using the serialization mechanism and the policy neural network in the simulation module;
(4.3) adjust the parameters of the neural network with the optimization algorithm in the training module;
(4.4) return to (4.2) and iterate until training is finished.
Further, to maintain robustness during training, noise is added to the dynamic parameters of the model at initialization; the perturbed dynamic parameters include the link masses, link inertias, link center-of-mass positions, ground friction coefficient and ground restitution coefficient.
Further, the reward function is obtained as the algebraic sum of the following terms.
Forward speed reward r_forward:
[equation image in the original]
where f1 is a speed clipping flag: f1 = 1 when the magnitude of the trunk velocity is below the speed limit set in the program, and f1 = 0 otherwise; k1 is a coefficient to be tuned; Δt is the time interval at which the simulation environment computes the reward; v is the velocity vector of the robot trunk expressed in the world coordinate frame; and d is the unit vector, in the world frame, of the desired direction of motion.
Joint torque cost r_torque:
[equation image in the original]
where c2 and k2 are coefficients to be tuned, Δt is the reward time interval, and τ is the vector of robot joint torques.
Trunk pitch cost r_pitch:
[equation image in the original]
where c3 and k3 are coefficients to be tuned, Δt is the reward time interval, and θ is the pitch angle of the robot trunk.
Joint velocity cost r_qdot:
[equation image in the original]
where c4 and k4 are coefficients to be tuned, Δt is the reward time interval, and q̇ is the vector of robot joint velocities.
Leg joint consistency cost r_gait:
[equation image in the original]
where k5 is a coefficient to be tuned and the quantities in the expression are the joint angles of the robot's left-front, left-rear, right-front and right-rear legs.
Contact point slip cost r_slip:
[equation image in the original]
where k6 is a coefficient to be tuned, Δt is the reward time interval, and the summed quantity is the sliding velocity of the contact point between each foot and the ground, taken as 0 when the foot is not in contact.
Gait periodicity cost r_contact:
[equation images in the original]
where k is a coefficient to be tuned, G(i, t) indicates whether the i-th foot is in contact with the ground at time t, P(i) is a gait template composed of 1 and -1 in which feet with the same value touch down simultaneously, and ω is the circular frequency of the desired gait.
Falling cost r_fall:
r_fall = k_fall
where k_fall is a parameter to be tuned.
Further, the serialization scheme used during training is as follows:
(7.1) create a sequence of single-agent environment objects and a data buffer of corresponding size;
(7.2) assign a separate thread to each environment;
(7.3) read the observation vector and reward value from each environment in turn and store them in the data buffer;
(7.4) check whether each environment has finished its iteration; if so, send the observation-action-reward trajectories in the buffer to the neural network optimization program for training, empty the buffer and reinitialize the environment;
(7.5) run the policy neural network forward on the data in the buffer to compute the action command for each environment and store it in the buffer;
(7.6) send the action commands to the corresponding environments;
(7.7) advance all environments by one time step and return to (7.3) for the next iteration.
Further, the policy neural network is an MLP with two hidden layers of 256 nodes each, using the hyperbolic tangent as the activation function.
Further, the policy neural network is trained within an Actor-Critic framework and optimized with the PPO algorithm.
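For reference, the clipped surrogate objective that the PPO algorithm maximizes, in its standard general form (not a formula specific to this patent), is

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \hat{A}_t is the advantage estimate supplied by the Critic and \epsilon is the clipping range.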
Further, the system migrates the policy neural network trained in simulation to the actual robot as follows:
(10.1) save the trained policy neural network as a pkl file;
(10.2) read the pkl file, extract the policy network weights and write the weights of each layer into a comma-separated csv file;
(10.3) configure the input/output interfaces and the neural network loading and inference module on the actual robot.
The invention has the following beneficial effects. The invention discloses a deep reinforcement learning action generation system whose process is fully self-learning and end to end, with no need to build an additional physical model, so the complex manual parameter-setting process is avoided, design efficiency is improved, and more biomimetic and efficient motion patterns can be generated. Pre-training on expert data before reinforcement learning avoids the blindness of random initialization, ensures that policy training converges to the desired state, and effectively raises the training success rate. Multi-threaded parallel training is used during training, which increases training speed. Transferring the policy from simulation to the actual robot saves the mechanical wear and training time of hardware debugging; on the actual robot only a forward pass of the neural network is required, which occupies few computational resources and is therefore a more efficient approach. The system can be used to train quadruped robots of various leg configurations in multiple gaits and multiple actions.
Drawings
FIG. 1 is a system configuration diagram of the action generation system of the present invention;
FIG. 2 is a flow chart illustrating the implementation of the action generating system of the present invention.
Detailed Description
The legged robot reinforcement learning action generation system of the invention is based on deep reinforcement learning, generates motion controllers without extensive manual parameter tuning, and improves training efficiency by pre-training on expert data. Running and jumping experiments were also carried out on the Jueying Mini quadruped robot.
Fig. 1 exemplarily shows the system structure of the present invention, which can be divided into two parts, namely a simulation training machine and an actual robot.
The simulation training machine comprises a simulation environment 101 (the dynamics simulation engine), a simulation low-level controller 102, a policy neural network 103, a reward calculation module 104 and a training module 105 for neural network reinforcement learning. The initial weights of the policy neural network 103 at the start of training are specified by the final weights of the pre-trained network 106; the pre-trained network 106 is fitted with a conventional deep learning method to data generated while a traditional control method runs, and thus provides the initial values.
The system components on the actual robot include a policy neural network 107, a robot low-level controller 108, joint drives 109 and a sensor interface 110.
On the simulation training machine, the simulation environment 101 receives the joint force input given by the simulation low-level controller 102 and, after dynamics simulation, outputs the robot's position, posture, velocity and related information in the environment, assembled into a vector called the observation. The observation vector is sent to the reward calculation module 104 and to the policy neural network 103. The reward calculation module 104 computes a reward value and combines the reward, observation and action values into an observation-action-reward trajectory, while the policy neural network 103 computes the action command for the next time step, namely the desired joint positions of the robot. The reward value is sent to the neural network training module 105, which computes the change of the network parameters and sends it to the policy neural network 103 to update it. The desired positions for the next time step are sent to the simulation low-level controller 102, which computes the joint forces required to reach those positions.
On the actual robot, the sensor interface 110 reads the robot's joint information and body posture, computes the observation and sends it to the policy neural network 107; the policy neural network 107 computes the desired positions for the next time step and passes them to the robot low-level controller 108, which computes the joint forces required to reach those positions; the joint force data are sent as commands to the joint drives 109 to control the joint motion.
It should be noted that the system structure shown in fig. 1 is only an example used for the training of motor skills, and the present invention is not limited thereto.
Based on the above description, fig. 2 shows in detail a specific process implemented by the present invention under the structure shown in fig. 1, further illustrating the technical content and features of the present invention:
step S201, selecting input-output according to the characteristics of the task action. The observations typically include the angle, velocity of the joints and pose, velocity of the robot body parts. In this example, a sequence as shown in table 1 is selected as an observation value (input of the network), and a desired angle of each joint is selected as an output (action command).
Table 1: observation value segmentation meaning table
Meaning of variables Starting sequence number Variable length Unit of Coarse range of data
Height of robot mass center 0 1 m [0.1,0.9]
Direction of gravity of robot system 1 3 Dimensionless [0.0,1.0]
Joint angle 4 12 rad [-π,π]
Angular difference between left and right side joints 16 4 rad [-π,π]
Linear velocity of body 20 3 m/s [-3.0,3.0]
Angular velocity of body 23 3 rad/s [-π,π]
Angular velocity of joint 27 12 rad/s [-14.5,14.5]
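As an illustration of how Table 1 translates into the network input, the following Python sketch assembles the observation vector using the start indices and lengths from the table; the field names of the state dictionary are illustrative and not identifiers from the patent.

import numpy as np

def build_observation(state):
    # indices and lengths follow Table 1; 27 + 12 = 39 entries in total
    obs = np.zeros(39)
    obs[0]     = state["com_height"]      # robot center-of-mass height, m
    obs[1:4]   = state["gravity_dir"]     # gravity direction in the robot frame
    obs[4:16]  = state["joint_angles"]    # 12 joint angles, rad
    obs[16:20] = state["lr_joint_diff"]   # left-right joint angle differences, rad
    obs[20:23] = state["body_lin_vel"]    # body linear velocity, m/s
    obs[23:26] = state["body_ang_vel"]    # body angular velocity, rad/s
    obs[27:39] = state["joint_vel"]       # 12 joint angular velocities, rad/s
    # index 26 is not listed in Table 1 and is left at zero here
    return obs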
Step S202: collect data while the robot performs an approximation of the desired action, and arrange the data into training data-label pairs according to the input-output selection of step S201.
Step S203: pre-train the policy neural network with a deep learning optimization algorithm.
In this example, pre-training is run for 1500 iterations using a stochastic gradient descent (Adam) optimizer, with the mean squared error (MSE) as the training criterion.
In this example, the policy neural network is trained under the TensorFlow framework as a multi-layer perceptron (MLP) with two hidden layers of 256 nodes each, using the hyperbolic tangent as the activation function.
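A minimal sketch of this pre-training step under TensorFlow/Keras is given below. The layer sizes, activation, optimizer, loss and epoch count follow the text; the input and output dimensions (39 and 12) follow Table 1, and the arrays obs_data and joint_targets stand for the recorded gait data arranged in step S202.

import tensorflow as tf

policy = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="tanh", input_shape=(39,)),
    tf.keras.layers.Dense(256, activation="tanh"),
    tf.keras.layers.Dense(12),                            # desired joint angles
])
policy.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # learning rate is illustrative
               loss="mse")                                # mean squared error criterion

# obs_data: (N, 39) observations; joint_targets: (N, 12) desired joint angles
# policy.fit(obs_data, joint_targets, epochs=1500, batch_size=256)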
Step S204: randomly initialize the sequence of simulation environments. The sequence consists of the simulation environments of multiple individual agents.
Further, the simulation environment of a single agent runs according to the following flow:
and (4.1) starting a dynamic simulation engine, initializing the whole environment, marking each variable and loading the robot model.
And (4.2) loading the pre-training strategy neural network obtained in the step S203.
And (4.3) adding random disturbance to various parameters of the robot model, such as mass, inertia and centroid position.
Specifically, the perturbations added to the model at initialization follow uniform random distributions with the ranges and units listed in Table 4 (an illustrative randomization sketch follows the table).
Table 4: noise adding kinetic parameters
Figure BDA0003208799650000061
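A minimal sketch of this randomization step is given below. The perturbation ranges are placeholders (the actual ranges are in Table 4, which is reproduced only as an image in the original), and the model attributes are illustrative rather than the API of a specific simulator.

import numpy as np

rng = np.random.default_rng()

def randomize_model(model):
    for link in model.links:
        link.mass       *= rng.uniform(0.8, 1.2)              # link mass
        link.inertia    *= rng.uniform(0.8, 1.2)              # link inertia
        link.com_offset += rng.uniform(-0.01, 0.01, size=3)   # center-of-mass position, m
    model.ground_friction    = rng.uniform(0.5, 1.25)         # ground friction coefficient
    model.ground_restitution = rng.uniform(0.0, 0.2)          # ground restitution coefficient
    return model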
Step S205: perform reinforcement learning training on the policy neural network in the simulation environment, adjusting the policy network weights with the optimization algorithm in the training module.
(5.1) read the robot joint and posture information from the simulation environment and normalize each physical quantity to a standard distribution (mean 0, variance 1) using its mean and variance.
(5.2) compute the value of the reward function from the information extracted from the environment.
Specifically, the reward function is obtained as the algebraic sum of the following terms (an illustrative sketch follows the list).
Forward speed reward r_forward:
[equation image in the original]
where f1 is a speed clipping flag: f1 = 1 when the magnitude of the trunk velocity is below the speed limit set in the program, and f1 = 0 when it is not; k1 is a coefficient to be tuned; Δt is the time interval at which the simulation environment computes the reward; v is the velocity vector of the robot trunk expressed in the world coordinate frame; and d is the unit vector, in the world frame, of the desired direction of motion.
Joint torque cost r_torque:
[equation image in the original]
where c2 and k2 are coefficients to be tuned, Δt is the reward time interval, and τ is the vector of robot joint torques.
Trunk pitch cost r_pitch:
[equation image in the original]
where c3 and k3 are coefficients to be tuned, Δt is the reward time interval, and θ is the pitch angle of the robot trunk.
Joint velocity cost r_qdot:
[equation image in the original]
where c4 and k4 are coefficients to be tuned, Δt is the reward time interval, and q̇ is the vector of robot joint velocities.
Leg joint consistency cost r_gait:
[equation image in the original]
where k5 is a coefficient to be tuned and the quantities in the expression are the joint angles of the robot's left-front, left-rear, right-front and right-rear legs.
Contact point slip cost r_slip:
[equation image in the original]
where k6 is a coefficient to be tuned, Δt is the reward time interval, and the summed quantity is the sliding velocity of the contact point between foot l and the ground, taken as 0 when the foot is not in contact; l ∈ {0, 1, 2, 3} indexes the left-front, left-rear, right-front and right-rear feet.
Gait periodicity cost r_contact:
[equation images in the original]
where k is a coefficient to be tuned, G(i, t) indicates whether the i-th foot is in contact with the ground at time t, P(i) is a gait template composed of 1 and -1 in which feet with the same value touch down simultaneously, and ω is the circular frequency of the desired gait.
Falling cost r_fall:
r_fall = k_fall
where k_fall is a parameter to be tuned.
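As an illustration of how these terms are combined, the sketch below sums the individual terms algebraically and shows one possible realization of the gait periodicity term. Because the per-term formulas are reproduced only as images in the original, the periodicity expression here (contact pattern matched against the template P(i) through a square wave of circular frequency ω) is an assumed representative form, and the trot template and coefficients are illustrative values.

import numpy as np

P_TROT = np.array([1, -1, -1, 1])   # LF, LR, RF, RR: diagonal pairs land together

def gait_periodicity_cost(contacts, t, omega, k=0.1, dt=0.005):
    # contacts: G(i, t), 1 if foot i touches the ground at time t, else 0
    phase = np.sign(np.sin(omega * t))            # which half of the gait cycle we are in
    G = np.where(np.asarray(contacts) > 0, 1, -1)
    return k * dt * float(np.sum(G * P_TROT * phase))

def total_reward(terms):
    # terms: dict of the individual reward/cost values (r_forward, r_torque, ...)
    return sum(terms.values())                    # algebraic sum, as described above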
(5.3) check whether the robot meets a termination condition, such as falling over or reaching the termination time. If so, terminate the simulation and update the policy neural network parameters once with the neural network training module; if not, the simulation program continues to simulate the robot's motion in the environment and keeps sending the generated observations, action commands and reward values into the buffer of the serialization mechanism to form a trajectory.
Specifically, the serialization scheme used during training of the reinforcement learning system is as follows (a code sketch is given at the end of this step):
(5.3.1) create a sequence of single-agent environment objects and a data buffer of corresponding size;
(5.3.2) assign a separate thread to each environment;
(5.3.3) read the normalized observation, action command and reward value from each environment in turn and store them in the data buffer to form a trajectory;
(5.3.4) check whether each environment has finished its iteration; if so, send the observation-action-reward trajectories in the buffer to the neural network optimization program in the training module for training, empty the buffer and reinitialize the environment;
(5.3.5) use the policy neural network and the data in the buffer to compute the action command for each environment and store it in the buffer;
(5.3.6) send the action commands to the corresponding environments as the desired joint positions;
(5.3.7) advance all environments by one time step and return to (5.3.3) for the next iteration.
(5.4) feed the normalized observation vector into the serialization layer and obtain the desired joint positions from it.
(5.5) the desired position vector is sent to a feedforward PD controller in the low-level control layer, which computes a torque command vector from it together with the current joint positions and velocities; the torque command is sent to the robot in the simulation environment to drive its motion (a sketch of such a controller follows).
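A minimal sketch of such a feedforward PD controller is given below; the gains and the feedforward term are placeholders, since the patent does not list their values.

import numpy as np

KP = 30.0    # proportional gain, N*m/rad (illustrative)
KD = 0.5     # derivative gain, N*m*s/rad (illustrative)

def pd_torque(q_des, q, qdot, tau_ff=None):
    # q_des: desired joint positions from the policy network
    # q, qdot: current joint positions and velocities
    # tau_ff: optional feedforward torque
    tau = KP * (np.asarray(q_des) - np.asarray(q)) - KD * np.asarray(qdot)
    if tau_ff is not None:
        tau = tau + np.asarray(tau_ff)
    return tau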
Specifically, the policy neural network is trained within an Actor-Critic framework and optimized with the PPO algorithm.
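The serialization scheme of step (5.3) can be pictured with the following sketch: a set of single-agent environments is stepped in lockstep, observation-action-reward triples are gathered into per-environment buffers, and a finished environment flushes its trajectory to the optimizer and is reinitialized. The names env.reset, env.step, policy.act and optimizer.update are illustrative and do not refer to a specific library; per-environment threading is omitted for brevity.

def collect_rollouts(envs, policy, optimizer, steps_per_iter):
    buffers = [[] for _ in envs]                    # one trajectory buffer per environment
    obs = [env.reset() for env in envs]
    for _ in range(steps_per_iter):
        actions = [policy.act(o) for o in obs]      # forward pass of the policy network
        for i, (env, a) in enumerate(zip(envs, actions)):
            next_obs, reward, done = env.step(a)    # advance one time step
            buffers[i].append((obs[i], a, reward))  # observation-action-reward triple
            if done:                                # environment finished its iteration
                optimizer.update(policy, buffers[i])  # e.g. a PPO update on the trajectory
                buffers[i] = []                     # empty the buffer
                next_obs = env.reset()              # reinitialize the environment
            obs[i] = next_obs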
Step S206: export the trained policy neural network.
Specifically, the trained policy neural network is exported from the training framework and saved as a pkl file; another process then reads the pkl file, extracts the policy network weights and writes the weights of each layer into a comma-separated csv file following the network structure. This decouples the network from the heavyweight training framework: further use only requires basic arithmetic operations, no other complex algorithms need to be loaded, and the online control frequency of the robot is preserved.
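A minimal sketch of this export step is given below. The layout of the pickled weights (a list of (W, b) pairs) is an assumption; the actual layout depends on how the training framework saves the network.

import csv
import pickle

with open("policy.pkl", "rb") as f:
    layers = pickle.load(f)          # e.g. [(W0, b0), (W1, b1), (W2, b2)]

for idx, (W, b) in enumerate(layers):
    with open(f"layer_{idx}_weight.csv", "w", newline="") as f:
        csv.writer(f).writerows(W.tolist())
    with open(f"layer_{idx}_bias.csv", "w", newline="") as f:
        csv.writer(f).writerow(b.tolist())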
Step S207: build the input/output interfaces of the real robot.
Specifically, the input and output interfaces and the neural network loading and inference module on the actual robot are configured according to the inputs and outputs designed in step S201.
Physical quantities that cannot be read directly from the actual robot, or that are very noisy, require additional filtering or computation. In this example, the center-of-mass linear velocity, linear acceleration and angular velocity read from the robot's inertial measurement unit suffer from large noise and drift, so a Kalman filter fuses the directly read data with the velocity computed from the robot kinematics to obtain a stable velocity estimate. Likewise, the robot cannot measure its body height directly, so the body height is computed from the leg joint angles using the robot kinematics.
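A minimal per-axis sketch of such a fusion filter is given below: the IMU acceleration drives the prediction step and the velocity computed from leg kinematics serves as the measurement. The noise parameters q_acc and r_kin are placeholders that would have to be tuned on the real robot; the patent does not specify the filter structure beyond calling it a Kalman filter.

class VelocityFilter:
    def __init__(self, q_acc=0.05, r_kin=0.02):
        self.v = 0.0        # velocity estimate
        self.p = 1.0        # estimate covariance
        self.q = q_acc      # process noise driven by the IMU acceleration
        self.r = r_kin      # measurement noise of the kinematic velocity

    def update(self, acc, v_kin, dt):
        # predict with the IMU acceleration
        self.v += acc * dt
        self.p += self.q * dt
        # correct with the kinematic velocity measurement
        k = self.p / (self.p + self.r)      # Kalman gain
        self.v += k * (v_kin - self.v)
        self.p *= (1.0 - k)
        return self.v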
Step S208: deploy the policy neural network 107 on the actual robot. To guarantee the online control frequency, the system does not perform further learning on the actual robot; it only runs the network forward to obtain the control output.
Specifically, the data structures required by the neural network are initialized, and the csv-format network weights exported in step S206 are read through the neural network loading interface built in step S207 and stored into the network structure. At run time, the policy neural network obtains the observation vector through the measurement input interface 110 built in step S207, computes the desired joint positions for the next time step, and sends them to the robot low-level controller 108. The robot low-level controller 108 outputs the desired joint torques and transmits them to the joint drives 109.
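A minimal sketch of the on-robot forward pass is given below: the csv weights exported in step S206 are loaded into plain arrays and only basic matrix operations are used online, matching the statement that no training framework is needed at run time. File names follow the export sketch above and are illustrative.

import numpy as np

def load_layers(n_layers=3):
    return [(np.loadtxt(f"layer_{i}_weight.csv", delimiter=",", ndmin=2),
             np.loadtxt(f"layer_{i}_bias.csv", delimiter=","))
            for i in range(n_layers)]

def policy_forward(obs, layers):
    x = np.asarray(obs)
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)       # hidden layers: tanh activation
    W, b = layers[-1]
    return x @ W + b                 # output layer: desired joint positions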
The structure and parameters of the robot low-level controller 108 used on the actual robot should be consistent with those of the simulation low-level controller 102 in the simulation environment.
In this example, the control signal period on the actual robot is 5 ms (200 Hz).

Claims (10)

1. A legged robot reinforcement learning action generation system, characterized by comprising a simulation training machine and an actual robot, wherein the simulation training machine comprises a policy network pre-training module, a dynamics simulation engine, a reinforcement learning training module, a reward calculation module, a policy neural network controller and a simulation low-level controller, and the actual robot comprises joints and their drive modules, sensors and their interfaces, a policy neural network controller and a robot low-level controller.
The system performs end-to-end training with deep reinforcement learning to generate an action policy, obtaining observation-action command data from the simulation environment of the dynamics simulation engine during training so as to learn a motion control strategy that maximizes the reward function value computed by the reward calculation module.
The system generates the initial values of the policy neural network controller by pre-training, which keeps the policy near the desired extremum in parameter space throughout training and avoids the convergence to other local minima that can occur with randomly generated initial values.
The system uses a gait cycle reward signal to encourage the agent being trained to perform the corresponding periodic actions, and thereby obtains an effective input vector.
2. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network outputs the desired position of each joint, and its input includes the robot center-of-mass height, the gravity direction in the robot frame, the joint angles, the angle differences between the left and right joints, the body linear velocity, the body angular velocity and the joint angular velocities.
3. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network is pre-trained as follows:
(3.1) generate the robot's locomotion gait with a model-based control method, extract motion data covering several gait cycles close to the desired gait, and record the data;
(3.2) process and collate the recorded time-series data into the input-label format required by the neural network;
(3.3) randomly initialize the weights of the policy neural network;
(3.4) pre-train on the collated data set with an optimization algorithm to obtain a preliminarily fitted neural network.
4. The legged robot reinforcement learning action generation system according to claim 1, wherein the reinforcement learning training process in the simulation environment is specifically:
(4.1) start the dynamics simulation engine, initialize the whole environment and load the robot model with added disturbances;
(4.2) run the simulation and collect observation-action-reward trajectories using the serialization mechanism and the policy neural network in the simulation module;
(4.3) adjust the parameters of the neural network with the optimization algorithm in the training module;
(4.4) return to (4.2) and iterate until training is finished.
5. The legged robot reinforcement learning action generation system according to claim 4, wherein, to maintain robustness during training, the system adds noise to the dynamic parameters of the model at initialization; the perturbed dynamic parameters include the link masses, link inertias, link center-of-mass positions, ground friction coefficient and ground restitution coefficient.
6. The legged robot reinforcement learning action generation system according to claim 4, wherein the reward function is obtained as the algebraic sum of the following terms:
Forward speed reward r_forward:
[equation image in the original]
where f1 is a speed clipping flag: f1 = 1 when the magnitude of the trunk velocity is below the speed limit set in the program, and f1 = 0 otherwise; k1 is a coefficient to be tuned; Δt is the time interval at which the simulation environment computes the reward; v is the velocity vector of the robot trunk expressed in the world coordinate frame; and d is the unit vector, in the world frame, of the desired direction of motion.
Joint torque cost r_torque:
[equation image in the original]
where c2 and k2 are coefficients to be tuned, Δt is the reward time interval, and τ is the vector of robot joint torques.
Trunk pitch cost r_pitch:
[equation image in the original]
where c3 and k3 are coefficients to be tuned, Δt is the reward time interval, and θ is the pitch angle of the robot trunk.
Joint velocity cost r_qdot:
[equation image in the original]
where c4 and k4 are coefficients to be tuned, Δt is the reward time interval, and q̇ is the vector of robot joint velocities.
Leg joint consistency cost r_gait:
[equation image in the original]
where k5 is a coefficient to be tuned and the quantities in the expression are the joint angles of the robot's left-front, left-rear, right-front and right-rear legs.
Contact point slip cost r_slip:
[equation image in the original]
where k6 is a coefficient to be tuned, Δt is the reward time interval, and the summed quantity is the sliding velocity of the contact point between each foot and the ground, taken as 0 when the foot is not in contact.
Gait periodicity cost r_contact:
[equation images in the original]
where k is a coefficient to be tuned, G(i, t) indicates whether the i-th foot is in contact with the ground at time t, P(i) is a gait template composed of 1 and -1 in which feet with the same value touch down simultaneously, and ω is the circular frequency of the desired gait.
Falling cost r_fall:
r_fall = k_fall
where k_fall is a parameter to be tuned.
7. The legged robot reinforcement learning action generation system according to claim 4, wherein the serialization scheme used during training is as follows:
(7.1) create a sequence of single-agent environment objects and a data buffer of corresponding size;
(7.2) assign a separate thread to each environment;
(7.3) read the observation vector and reward value from each environment in turn and store them in the data buffer;
(7.4) check whether each environment has finished its iteration; if so, send the observation-action-reward trajectories in the buffer to the neural network optimization program for training, empty the buffer and reinitialize the environment;
(7.5) run the policy neural network forward on the data in the buffer to compute the action command for each environment and store it in the buffer;
(7.6) send the action commands to the corresponding environments;
(7.7) advance all environments by one time step and return to (7.3) for the next iteration.
8. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network is an MLP with two hidden layers of 256 nodes each, using the hyperbolic tangent as the activation function.
9. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network is trained within an Actor-Critic framework and optimized with the PPO algorithm.
10. The legged robot reinforcement learning action generation system according to claim 1, wherein the system migrates the policy neural network trained in simulation to the actual robot as follows:
(10.1) save the trained policy neural network as a pkl file;
(10.2) read the pkl file, extract the policy network weights and write the weights of each layer into a comma-separated csv file;
(10.3) configure the input/output interfaces and the neural network loading and inference module on the actual robot.
CN202110924651.XA 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot Active CN113821045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110924651.XA CN113821045B (en) 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110924651.XA CN113821045B (en) 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot

Publications (2)

Publication Number Publication Date
CN113821045A true CN113821045A (en) 2021-12-21
CN113821045B CN113821045B (en) 2023-07-07

Family

ID=78913157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110924651.XA Active CN113821045B (en) 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot

Country Status (1)

Country Link
CN (1) CN113821045B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114563954A (en) * 2022-02-28 2022-05-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increment
CN114667852A (en) * 2022-03-14 2022-06-28 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN116627041A (en) * 2023-07-19 2023-08-22 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN117215204A (en) * 2023-11-09 2023-12-12 中国科学院自动化研究所 Robot gait training method and system based on reinforcement learning

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657881A (en) * 2019-01-14 2019-04-19 南京国电南自电网自动化有限公司 A kind of neural network photovoltaic power generation prediction technique and system suitable for small sample
CN110637308A (en) * 2017-05-10 2019-12-31 瑞典爱立信有限公司 Pre-training system for self-learning agents in a virtualized environment
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111360834A (en) * 2020-03-25 2020-07-03 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN113110442A (en) * 2021-04-09 2021-07-13 深圳阿米嘎嘎科技有限公司 Method, system and medium for controlling multi-skill movement of quadruped robot
CN113190029A (en) * 2021-04-06 2021-07-30 北京化工大学 Adaptive gait autonomous generation method of quadruped robot based on deep reinforcement learning

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110637308A (en) * 2017-05-10 2019-12-31 瑞典爱立信有限公司 Pre-training system for self-learning agents in a virtualized environment
CN109657881A (en) * 2019-01-14 2019-04-19 南京国电南自电网自动化有限公司 A kind of neural network photovoltaic power generation prediction technique and system suitable for small sample
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111360834A (en) * 2020-03-25 2020-07-03 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning
CN113190029A (en) * 2021-04-06 2021-07-30 北京化工大学 Adaptive gait autonomous generation method of quadruped robot based on deep reinforcement learning
CN113110442A (en) * 2021-04-09 2021-07-13 深圳阿米嘎嘎科技有限公司 Method, system and medium for controlling multi-skill movement of quadruped robot

Non-Patent Citations (14)

* Cited by examiner, † Cited by third party
Title
JEEVES LOPES DOS SANTOS: "Gait Synthesis of a Hybrid Legged Robot Using Reinforcement Learning", 《2015 ANNUAL IEEE SYSTEMS CONFERENCE (SYSCON) PROCEEDINGS》 *
JEEVES LOPES DOS SANTOS: "Gait Synthesis of a Hybrid Legged Robot Using Reinforcement Learning", 《2015 ANNUAL IEEE SYSTEMS CONFERENCE (SYSCON) PROCEEDINGS》, 16 April 2015 (2015-04-16) *
SEHOON HA等: "Automated Deep Reinforcement Learning Environment for Hardware of a Modular Legged Robot", 《2018 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS (UR)》 *
SEHOON HA等: "Automated Deep Reinforcement Learning Environment for Hardware of a Modular Legged Robot", 《2018 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS (UR)》, 30 June 2018 (2018-06-30) *
TAISUKE KOBAYASHI: "Reinforcement learning for quadrupedal locomotion with design of continual–hierarchical curriculum", 《ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE》 *
TAISUKE KOBAYASHI: "Reinforcement learning for quadrupedal locomotion with design of continual–hierarchical curriculum", 《ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE》, 31 August 2020 (2020-08-31) *
傅汇乔等: "基于深度强化学习的六足机器人运动规划", 《智能科学与技术学报》 *
傅汇乔等: "基于深度强化学习的六足机器人运动规划", 《智能科学与技术学报》, 31 December 2020 (2020-12-31) *
崔俊文: "基于分层学习的四足机器人运动自适应控制模型", 《计算机测量与控制》 *
崔俊文: "基于分层学习的四足机器人运动自适应控制模型", 《计算机测量与控制》, 31 December 2020 (2020-12-31) *
张浩昱等: "基于近端策略优化算法的四足机器人步态控制研究", 《空间控制技术与应用》 *
张浩昱等: "基于近端策略优化算法的四足机器人步态控制研究", 《空间控制技术与应用》, 30 June 2019 (2019-06-30) *
郭宪: "仿生机器人运动步态控制:强化学习方法综述", 《智能系统学报》 *
郭宪: "仿生机器人运动步态控制:强化学习方法综述", 《智能系统学报》, 31 January 2020 (2020-01-31) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114563954A (en) * 2022-02-28 2022-05-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increment
WO2023159978A1 (en) * 2022-02-28 2023-08-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increment
CN114667852A (en) * 2022-03-14 2022-06-28 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN114667852B (en) * 2022-03-14 2023-04-14 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN116627041A (en) * 2023-07-19 2023-08-22 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN116627041B (en) * 2023-07-19 2023-09-29 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN117215204A (en) * 2023-11-09 2023-12-12 中国科学院自动化研究所 Robot gait training method and system based on reinforcement learning
CN117215204B (en) * 2023-11-09 2024-02-02 中国科学院自动化研究所 Robot gait training method and system based on reinforcement learning

Also Published As

Publication number Publication date
CN113821045B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN113821045B (en) Reinforced learning action generating system of leg-foot robot
CN109343341B (en) Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN114603564B (en) Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
WO2022012265A1 (en) Robot learning from demonstration via meta-imitation learning
CN110502033B (en) Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN112936290B (en) Quadruped robot motion planning method based on layered reinforcement learning
Farchy et al. Humanoid robots learning to walk faster: From the real world to simulation and back
CN107234617A (en) A kind of obstacle-avoiding route planning method of the unrelated Artificial Potential Field guiding of avoidance task
CN113031528B (en) Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient
CN112034888A (en) Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle
CN113478486A (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
CN113190029B (en) Adaptive gait autonomous generation method of four-footed robot based on deep reinforcement learning
CN117215204B (en) Robot gait training method and system based on reinforcement learning
Hu et al. Learning a faster locomotion gait for a quadruped robot with model-free deep reinforcement learning
Gutzeit et al. The besman learning platform for automated robot skill learning
CN117606490A (en) Collaborative search path planning method for autonomous underwater vehicle
CN114779661B (en) Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
Shim et al. Evolving flying creatures with path following behaviors
Jiang et al. Motion sequence learning for robot walking based on pose optimization
RU2816639C1 (en) Method for creating controllers for controlling walking robots based on reinforcement learning
Danielsen Vision-based robotic grasping in simulation using deep reinforcement learning
Chen et al. Learning hardware dynamics model from experiments for locomotion optimization
CN115097853B (en) Unmanned aerial vehicle maneuvering flight control method based on fine granularity repetition strategy
CN117572877B (en) Biped robot gait control method, biped robot gait control device, storage medium and equipment
Raja Mohamed et al. Biologically inspired design framework for robot in dynamic environments using Framsticks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240625

Address after: Building 5, 5th Floor, No. 309 Liuhe Road, Binjiang District, Hangzhou City, Zhejiang Province, 310000

Patentee after: Supcon Group Co.,Ltd.

Country or region after: China

Address before: 310058 Yuhang Tang Road, Xihu District, Hangzhou, Zhejiang 866

Patentee before: ZHEJIANG University

Country or region before: China