CN113821045A - Leg and foot robot reinforcement learning action generation system - Google Patents

Leg and foot robot reinforcement learning action generation system

Info

Publication number
CN113821045A
CN113821045A
Authority
CN
China
Prior art keywords
robot
training
neural network
reinforcement learning
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110924651.XA
Other languages
Chinese (zh)
Other versions
CN113821045B (en)
Inventor
朱秋国
王志成
李岸荞
熊蓉
吴俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Supcon Group Co Ltd
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110924651.XA priority Critical patent/CN113821045B/en
Publication of CN113821045A publication Critical patent/CN113821045A/en
Application granted granted Critical
Publication of CN113821045B publication Critical patent/CN113821045B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0891Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for land vehicles
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a reinforcement learning action generation system for legged robots. The system uses deep reinforcement learning to train an action policy end to end, collecting input-output data from a simulation environment during training so as to learn a motion control strategy that maximizes a reward function. The initial policy is obtained by pre-training on data recorded while a traditional control method runs, so that throughout training the policy stays near the desired extremum in parameter space, avoiding the convergence to other local minima that randomly generated initial values can cause. A gait cycle reward signal encourages the agent being trained to perform the corresponding periodic actions, yielding an effective input vector. The system can be used to train quadruped robots of various leg configurations in multiple gaits and multiple actions.

Description

Leg and foot robot reinforcement learning action generation system
Technical Field
The invention belongs to the field of robot reinforcement learning, and in particular relates to a reinforcement learning action generation system for legged robots.
Background
Robots serve as important tools and powerful assistants in people's production and daily life. They can complete work that would otherwise require substantial manpower with little or no human assistance, saving working time across many industries, reducing workload and thereby raising the efficiency of productive labor. For example, different kinds of robots can handle freight logistics at dock warehouses or assembly on production lines, freeing workers from heavy physical and repetitive tasks. In recent years robots have also gradually gained interactive functions and entered the service industry to perform more detailed, person-oriented activities. This wide application value has made robotics one of the hot spots of automation research, and the field has maintained strong momentum for a long time.
Legged robots are a typical class of mobile robots, characterized by locomotion on legs composed of links. A particular advantage of a legged robot is that its high-degree-of-freedom leg structure can actively select contact points with the environment and move between a series of such contact points. Compared with wheeled robots, which can only move on relatively flat, structured surfaces, legged robots are extremely adaptable to uneven ground and unknown environments and do not require a dedicated road surface to be laid, so they can enter a working area directly and operate there. Legged robots are therefore well suited to service scenarios whose working conditions are complex, changeable and hard to standardize, and they have broad application prospects in special environments such as scientific exploration, disaster-area rescue and military reconnaissance.
Traditional legged robot control methods rely on an accurate physical model, but the modeling process can never be made perfectly accurate; control based on approximate equivalence performs poorly and limits the flexibility and stability of the robot's motion. These methods also require a large number of parameters to be set, almost all of it done manually by the designer, who must invest a great deal of work in parameter tuning. Whenever a new action is required, the designer must redo this work and design a completely different controller. Data-driven methods can make up for these problems: no additional physical model needs to be built for simulation or online training, and the robot can freely explore its workspace to obtain an optimal policy through simulation or on-robot training without imposed assumptions. Parameters are adjusted automatically during the machine learning training stage, which effectively removes the workload of manual tuning, and when a new action is needed only the network structure and the reward function are changed, not the overall architecture.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention aims to provide a reinforcement learning action generation system for legged robots.
The purpose of the invention is achieved by the following technical scheme: a legged robot reinforcement learning action generation system comprising a simulation training machine and an actual robot. The simulation training machine comprises a policy network pre-training module, a dynamics simulation engine, a reinforcement learning training module, a reward calculation module, a policy neural network controller and a simulation low-level controller. The actual robot comprises joints and their drive modules, sensors and their interfaces, a policy neural network controller and a robot low-level controller.
The system performs end-to-end training with deep reinforcement learning to generate an action policy, obtaining observation-action command data from the simulation environment of the dynamics simulation engine during training so as to learn a motion control strategy that maximizes the reward function value computed by the reward calculation module.
The system generates the initial values of the policy neural network controller by pre-training, which keeps the policy near the desired extremum in parameter space throughout training and avoids the convergence to other local minima that can occur with randomly generated initial values.
The system uses a gait cycle reward signal to encourage the agent being trained to perform the corresponding periodic actions, and thereby obtains an effective input vector.
Further, the policy neural network outputs the desired position of each joint; its input includes the robot center-of-mass height, the gravity direction in the robot frame, the joint angles, the angle differences between the left and right joints, the body linear velocity, the body angular velocity and the joint angular velocities.
Further, the policy neural network is pre-trained as follows:
(3.1) generate the robot's locomotion gait with a model-based control method, extract motion data covering several gait cycles close to the desired gait, and record the data;
(3.2) process and collate the recorded time-series data into the input-label format required by the neural network;
(3.3) randomly initialize the weights of the policy neural network, whose structure is identical to the network used in the subsequent training;
(3.4) pre-train on the collated data set with an optimization algorithm to obtain a preliminarily fitted neural network.
Further, the reinforcement learning training process in the simulation environment is specifically:
(4.1) start the dynamics simulation engine, initialize the whole environment and load the robot model with added disturbances;
(4.2) run the simulation and collect observation-action-reward trajectories using the serialization mechanism and the policy neural network in the simulation module;
(4.3) adjust the parameters of the neural network with the optimization algorithm in the training module;
(4.4) return to (4.2) and iterate until training is finished.
Further, to maintain robustness during training, noise is added to the dynamic parameters of the model at initialization; the perturbed dynamic parameters include the link masses, link inertias, link center-of-mass positions, ground friction coefficient and ground restitution coefficient.
Further, the reward function is obtained as the algebraic sum of the following terms.
Forward speed reward r_forward:
[equation image in the original]
where f1 is a speed clipping flag: f1 = 1 when the magnitude of the trunk velocity is below the speed limit set in the program, and f1 = 0 otherwise; k1 is a coefficient to be tuned; Δt is the time interval at which the simulation environment computes the reward; v is the velocity vector of the robot trunk expressed in the world coordinate frame; and d is the unit vector, in the world frame, of the desired direction of motion.
Joint torque cost r_torque:
[equation image in the original]
where c2 and k2 are coefficients to be tuned, Δt is the reward time interval, and τ is the vector of robot joint torques.
Trunk pitch cost r_pitch:
[equation image in the original]
where c3 and k3 are coefficients to be tuned, Δt is the reward time interval, and θ is the pitch angle of the robot trunk.
Joint velocity cost r_qdot:
[equation image in the original]
where c4 and k4 are coefficients to be tuned, Δt is the reward time interval, and q̇ is the vector of robot joint velocities.
Leg joint consistency cost r_gait:
[equation image in the original]
where k5 is a coefficient to be tuned and the quantities in the expression are the joint angles of the robot's left-front, left-rear, right-front and right-rear legs.
Contact point slip cost r_slip:
[equation image in the original]
where k6 is a coefficient to be tuned, Δt is the reward time interval, and the summed quantity is the sliding velocity of the contact point between each foot and the ground, taken as 0 when the foot is not in contact.
Gait periodicity cost r_contact:
[equation images in the original]
where k is a coefficient to be tuned, G(i, t) indicates whether the i-th foot is in contact with the ground at time t, P(i) is a gait template composed of 1 and -1 in which feet with the same value touch down simultaneously, and ω is the circular frequency of the desired gait.
Falling cost r_fall:
r_fall = k_fall
where k_fall is a parameter to be tuned.
Further, the serialization scheme used during training is as follows:
(7.1) create a sequence of single-agent environment objects and a data buffer of corresponding size;
(7.2) assign a separate thread to each environment;
(7.3) read the observation vector and reward value from each environment in turn and store them in the data buffer;
(7.4) check whether each environment has finished its iteration; if so, send the observation-action-reward trajectories in the buffer to the neural network optimization program for training, empty the buffer and reinitialize the environment;
(7.5) run the policy neural network forward on the data in the buffer to compute the action command for each environment and store it in the buffer;
(7.6) send the action commands to the corresponding environments;
(7.7) advance all environments by one time step and return to (7.3) for the next iteration.
Further, the policy neural network is an MLP with two hidden layers of 256 nodes each, using the hyperbolic tangent as the activation function.
Further, the policy neural network is trained within an Actor-Critic framework and optimized with the PPO algorithm.
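For reference, the clipped surrogate objective that the PPO algorithm maximizes, in its standard general form (not a formula specific to this patent), is

L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where \hat{A}_t is the advantage estimate supplied by the Critic and \epsilon is the clipping range.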
Further, the system migrates the policy neural network trained in simulation to the actual robot as follows:
(10.1) save the trained policy neural network as a pkl file;
(10.2) read the pkl file, extract the policy network weights and write the weights of each layer into a comma-separated csv file;
(10.3) configure the input/output interfaces and the neural network loading and inference module on the actual robot.
The invention has the following beneficial effects. The invention discloses a deep reinforcement learning action generation system whose process is fully self-learning and end to end, with no need to build an additional physical model, so the complex manual parameter-setting process is avoided, design efficiency is improved, and more biomimetic and efficient motion patterns can be generated. Pre-training on expert data before reinforcement learning avoids the blindness of random initialization, ensures that policy training converges to the desired state, and effectively raises the training success rate. Multi-threaded parallel training is used during training, which increases training speed. Transferring the policy from simulation to the actual robot saves the mechanical wear and training time of hardware debugging; on the actual robot only a forward pass of the neural network is required, which occupies few computational resources and is therefore a more efficient approach. The system can be used to train quadruped robots of various leg configurations in multiple gaits and multiple actions.
Drawings
FIG. 1 is a system configuration diagram of the action generation system of the present invention;
FIG. 2 is a flow chart illustrating the implementation of the action generating system of the present invention.
Detailed Description
The legged robot reinforcement learning action generation system of the invention is based on deep reinforcement learning, generates motion controllers without extensive manual parameter tuning, and improves training efficiency by pre-training on expert data. Running and jumping experiments were also carried out on the Jueying Mini quadruped robot.
Fig. 1 exemplarily shows the system structure of the present invention, which can be divided into two parts, namely a simulation training machine and an actual robot.
The simulation training machine comprises a simulation environment 101 (the dynamics simulation engine), a simulation low-level controller 102, a policy neural network 103, a reward calculation module 104 and a training module 105 for neural network reinforcement learning. The initial weights of the policy neural network 103 at the start of training are specified by the final weights of the pre-trained network 106; the pre-trained network 106 is fitted with a conventional deep learning method to data generated while a traditional control method runs, and thus provides the initial values.
The system components on the actual robot include a policy neural network 107, a robot low-level controller 108, joint drives 109 and a sensor interface 110.
On the simulation training machine, the simulation environment 101 receives the joint force input given by the simulation low-level controller 102 and, after dynamics simulation, outputs the robot's position, posture, velocity and related information in the environment, assembled into a vector called the observation. The observation vector is sent to the reward calculation module 104 and to the policy neural network 103. The reward calculation module 104 computes a reward value and combines the reward, observation and action values into an observation-action-reward trajectory, while the policy neural network 103 computes the action command for the next time step, namely the desired joint positions of the robot. The reward value is sent to the neural network training module 105, which computes the change of the network parameters and sends it to the policy neural network 103 to update it. The desired positions for the next time step are sent to the simulation low-level controller 102, which computes the joint forces required to reach those positions.
On the actual robot, the sensor interface 110 reads the robot's joint information and body posture, computes the observation and sends it to the policy neural network 107; the policy neural network 107 computes the desired positions for the next time step and passes them to the robot low-level controller 108, which computes the joint forces required to reach those positions; the joint force data are sent as commands to the joint drives 109 to control the joint motion.
It should be noted that the system structure shown in fig. 1 is only an example used for the training of motor skills, and the present invention is not limited thereto.
Based on the above description, fig. 2 shows in detail a specific process implemented by the present invention under the structure shown in fig. 1, further illustrating the technical content and features of the present invention:
step S201, selecting input-output according to the characteristics of the task action. The observations typically include the angle, velocity of the joints and pose, velocity of the robot body parts. In this example, a sequence as shown in table 1 is selected as an observation value (input of the network), and a desired angle of each joint is selected as an output (action command).
Table 1: observation value segmentation meaning table
Meaning of variables Starting sequence number Variable length Unit of Coarse range of data
Height of robot mass center 0 1 m [0.1,0.9]
Direction of gravity of robot system 1 3 Dimensionless [0.0,1.0]
Joint angle 4 12 rad [-π,π]
Angular difference between left and right side joints 16 4 rad [-π,π]
Linear velocity of body 20 3 m/s [-3.0,3.0]
Angular velocity of body 23 3 rad/s [-π,π]
Angular velocity of joint 27 12 rad/s [-14.5,14.5]
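As an illustration of how Table 1 translates into the network input, the following Python sketch assembles the observation vector using the start indices and lengths from the table; the field names of the state dictionary are illustrative and not identifiers from the patent.

import numpy as np

def build_observation(state):
    # indices and lengths follow Table 1; 27 + 12 = 39 entries in total
    obs = np.zeros(39)
    obs[0]     = state["com_height"]      # robot center-of-mass height, m
    obs[1:4]   = state["gravity_dir"]     # gravity direction in the robot frame
    obs[4:16]  = state["joint_angles"]    # 12 joint angles, rad
    obs[16:20] = state["lr_joint_diff"]   # left-right joint angle differences, rad
    obs[20:23] = state["body_lin_vel"]    # body linear velocity, m/s
    obs[23:26] = state["body_ang_vel"]    # body angular velocity, rad/s
    obs[27:39] = state["joint_vel"]       # 12 joint angular velocities, rad/s
    # index 26 is not listed in Table 1 and is left at zero here
    return obs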
Step S202: collect data while the robot performs an approximation of the desired action, and arrange the data into training data-label pairs according to the input-output selection of step S201.
Step S203: pre-train the policy neural network with a deep learning optimization algorithm.
In this example, pre-training is run for 1500 iterations using a stochastic gradient descent (Adam) optimizer, with the mean squared error (MSE) as the training criterion.
In this example, the policy neural network is trained under the TensorFlow framework as a multi-layer perceptron (MLP) with two hidden layers of 256 nodes each, using the hyperbolic tangent as the activation function.
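A minimal sketch of this pre-training step under TensorFlow/Keras is given below. The layer sizes, activation, optimizer, loss and epoch count follow the text; the input and output dimensions (39 and 12) follow Table 1, and the arrays obs_data and joint_targets stand for the recorded gait data arranged in step S202.

import tensorflow as tf

policy = tf.keras.Sequential([
    tf.keras.layers.Dense(256, activation="tanh", input_shape=(39,)),
    tf.keras.layers.Dense(256, activation="tanh"),
    tf.keras.layers.Dense(12),                            # desired joint angles
])
policy.compile(optimizer=tf.keras.optimizers.Adam(1e-3),  # learning rate is illustrative
               loss="mse")                                # mean squared error criterion

# obs_data: (N, 39) observations; joint_targets: (N, 12) desired joint angles
# policy.fit(obs_data, joint_targets, epochs=1500, batch_size=256)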
Step S204: randomly initialize the sequence of simulation environments. The sequence consists of the simulation environments of multiple individual agents.
Further, the simulation environment of a single agent runs according to the following flow:
and (4.1) starting a dynamic simulation engine, initializing the whole environment, marking each variable and loading the robot model.
And (4.2) loading the pre-training strategy neural network obtained in the step S203.
And (4.3) adding random disturbance to various parameters of the robot model, such as mass, inertia and centroid position.
Specifically, the perturbations added to the model at initialization follow uniform random distributions with the ranges and units listed in Table 4 (an illustrative randomization sketch follows the table).
Table 4: noise adding kinetic parameters
Figure BDA0003208799650000061
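A minimal sketch of this randomization step is given below. The perturbation ranges are placeholders (the actual ranges are in Table 4, which is reproduced only as an image in the original), and the model attributes are illustrative rather than the API of a specific simulator.

import numpy as np

rng = np.random.default_rng()

def randomize_model(model):
    for link in model.links:
        link.mass       *= rng.uniform(0.8, 1.2)              # link mass
        link.inertia    *= rng.uniform(0.8, 1.2)              # link inertia
        link.com_offset += rng.uniform(-0.01, 0.01, size=3)   # center-of-mass position, m
    model.ground_friction    = rng.uniform(0.5, 1.25)         # ground friction coefficient
    model.ground_restitution = rng.uniform(0.0, 0.2)          # ground restitution coefficient
    return model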
Step S205: perform reinforcement learning training on the policy neural network in the simulation environment, adjusting the policy network weights with the optimization algorithm in the training module.
(5.1) read the robot joint and posture information from the simulation environment and normalize each physical quantity to a standard distribution (mean 0, variance 1) using its mean and variance.
(5.2) compute the value of the reward function from the information extracted from the environment.
Specifically, the reward function is obtained as the algebraic sum of the following terms (an illustrative sketch follows the list).
Forward speed reward r_forward:
[equation image in the original]
where f1 is a speed clipping flag: f1 = 1 when the magnitude of the trunk velocity is below the speed limit set in the program, and f1 = 0 when it is not; k1 is a coefficient to be tuned; Δt is the time interval at which the simulation environment computes the reward; v is the velocity vector of the robot trunk expressed in the world coordinate frame; and d is the unit vector, in the world frame, of the desired direction of motion.
Joint torque cost r_torque:
[equation image in the original]
where c2 and k2 are coefficients to be tuned, Δt is the reward time interval, and τ is the vector of robot joint torques.
Trunk pitch cost r_pitch:
[equation image in the original]
where c3 and k3 are coefficients to be tuned, Δt is the reward time interval, and θ is the pitch angle of the robot trunk.
Joint velocity cost r_qdot:
[equation image in the original]
where c4 and k4 are coefficients to be tuned, Δt is the reward time interval, and q̇ is the vector of robot joint velocities.
Leg joint consistency cost r_gait:
[equation image in the original]
where k5 is a coefficient to be tuned and the quantities in the expression are the joint angles of the robot's left-front, left-rear, right-front and right-rear legs.
Contact point slip cost r_slip:
[equation image in the original]
where k6 is a coefficient to be tuned, Δt is the reward time interval, and the summed quantity is the sliding velocity of the contact point between foot l and the ground, taken as 0 when the foot is not in contact; l ∈ {0, 1, 2, 3} indexes the left-front, left-rear, right-front and right-rear feet.
Gait periodicity cost r_contact:
[equation images in the original]
where k is a coefficient to be tuned, G(i, t) indicates whether the i-th foot is in contact with the ground at time t, P(i) is a gait template composed of 1 and -1 in which feet with the same value touch down simultaneously, and ω is the circular frequency of the desired gait.
Falling cost r_fall:
r_fall = k_fall
where k_fall is a parameter to be tuned.
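As an illustration of how these terms are combined, the sketch below sums the individual terms algebraically and shows one possible realization of the gait periodicity term. Because the per-term formulas are reproduced only as images in the original, the periodicity expression here (contact pattern matched against the template P(i) through a square wave of circular frequency ω) is an assumed representative form, and the trot template and coefficients are illustrative values.

import numpy as np

P_TROT = np.array([1, -1, -1, 1])   # LF, LR, RF, RR: diagonal pairs land together

def gait_periodicity_cost(contacts, t, omega, k=0.1, dt=0.005):
    # contacts: G(i, t), 1 if foot i touches the ground at time t, else 0
    phase = np.sign(np.sin(omega * t))            # which half of the gait cycle we are in
    G = np.where(np.asarray(contacts) > 0, 1, -1)
    return k * dt * float(np.sum(G * P_TROT * phase))

def total_reward(terms):
    # terms: dict of the individual reward/cost values (r_forward, r_torque, ...)
    return sum(terms.values())                    # algebraic sum, as described above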
(5.3) check whether the robot meets a termination condition, such as falling over or reaching the termination time. If so, terminate the simulation and update the policy neural network parameters once with the neural network training module; if not, the simulation program continues to simulate the robot's motion in the environment and keeps sending the generated observations, action commands and reward values into the buffer of the serialization mechanism to form a trajectory.
Specifically, the serialization scheme used during training of the reinforcement learning system is as follows (a code sketch is given at the end of this step):
(5.3.1) create a sequence of single-agent environment objects and a data buffer of corresponding size;
(5.3.2) assign a separate thread to each environment;
(5.3.3) read the normalized observation, action command and reward value from each environment in turn and store them in the data buffer to form a trajectory;
(5.3.4) check whether each environment has finished its iteration; if so, send the observation-action-reward trajectories in the buffer to the neural network optimization program in the training module for training, empty the buffer and reinitialize the environment;
(5.3.5) use the policy neural network and the data in the buffer to compute the action command for each environment and store it in the buffer;
(5.3.6) send the action commands to the corresponding environments as the desired joint positions;
(5.3.7) advance all environments by one time step and return to (5.3.3) for the next iteration.
(5.4) feed the normalized observation vector into the serialization layer and obtain the desired joint positions from it.
(5.5) the desired position vector is sent to a feedforward PD controller in the low-level control layer, which computes a torque command vector from it together with the current joint positions and velocities; the torque command is sent to the robot in the simulation environment to drive its motion (a sketch of such a controller follows).
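A minimal sketch of such a feedforward PD controller is given below; the gains and the feedforward term are placeholders, since the patent does not list their values.

import numpy as np

KP = 30.0    # proportional gain, N*m/rad (illustrative)
KD = 0.5     # derivative gain, N*m*s/rad (illustrative)

def pd_torque(q_des, q, qdot, tau_ff=None):
    # q_des: desired joint positions from the policy network
    # q, qdot: current joint positions and velocities
    # tau_ff: optional feedforward torque
    tau = KP * (np.asarray(q_des) - np.asarray(q)) - KD * np.asarray(qdot)
    if tau_ff is not None:
        tau = tau + np.asarray(tau_ff)
    return tau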
Specifically, the policy neural network is trained within an Actor-Critic framework and optimized with the PPO algorithm.
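The serialization scheme of step (5.3) can be pictured with the following sketch: a set of single-agent environments is stepped in lockstep, observation-action-reward triples are gathered into per-environment buffers, and a finished environment flushes its trajectory to the optimizer and is reinitialized. The names env.reset, env.step, policy.act and optimizer.update are illustrative and do not refer to a specific library; per-environment threading is omitted for brevity.

def collect_rollouts(envs, policy, optimizer, steps_per_iter):
    buffers = [[] for _ in envs]                    # one trajectory buffer per environment
    obs = [env.reset() for env in envs]
    for _ in range(steps_per_iter):
        actions = [policy.act(o) for o in obs]      # forward pass of the policy network
        for i, (env, a) in enumerate(zip(envs, actions)):
            next_obs, reward, done = env.step(a)    # advance one time step
            buffers[i].append((obs[i], a, reward))  # observation-action-reward triple
            if done:                                # environment finished its iteration
                optimizer.update(policy, buffers[i])  # e.g. a PPO update on the trajectory
                buffers[i] = []                     # empty the buffer
                next_obs = env.reset()              # reinitialize the environment
            obs[i] = next_obs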
Step S206: export the trained policy neural network.
Specifically, the trained policy neural network is exported from the training framework and saved as a pkl file; another process then reads the pkl file, extracts the policy network weights and writes the weights of each layer into a comma-separated csv file following the network structure. This decouples the network from the heavyweight training framework: further use only requires basic arithmetic operations, no other complex algorithms need to be loaded, and the online control frequency of the robot is preserved.
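A minimal sketch of this export step is given below. The layout of the pickled weights (a list of (W, b) pairs) is an assumption; the actual layout depends on how the training framework saves the network.

import csv
import pickle

with open("policy.pkl", "rb") as f:
    layers = pickle.load(f)          # e.g. [(W0, b0), (W1, b1), (W2, b2)]

for idx, (W, b) in enumerate(layers):
    with open(f"layer_{idx}_weight.csv", "w", newline="") as f:
        csv.writer(f).writerows(W.tolist())
    with open(f"layer_{idx}_bias.csv", "w", newline="") as f:
        csv.writer(f).writerow(b.tolist())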
Step S207: build the input/output interfaces of the real robot.
Specifically, the input and output interfaces and the neural network loading and inference module on the actual robot are configured according to the inputs and outputs designed in step S201.
Physical quantities that cannot be read directly from the actual robot, or that are very noisy, require additional filtering or computation. In this example, the center-of-mass linear velocity, linear acceleration and angular velocity read from the robot's inertial measurement unit suffer from large noise and drift, so a Kalman filter fuses the directly read data with the velocity computed from the robot kinematics to obtain a stable velocity estimate. Likewise, the robot cannot measure its body height directly, so the body height is computed from the leg joint angles using the robot kinematics.
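A minimal per-axis sketch of such a fusion filter is given below: the IMU acceleration drives the prediction step and the velocity computed from leg kinematics serves as the measurement. The noise parameters q_acc and r_kin are placeholders that would have to be tuned on the real robot; the patent does not specify the filter structure beyond calling it a Kalman filter.

class VelocityFilter:
    def __init__(self, q_acc=0.05, r_kin=0.02):
        self.v = 0.0        # velocity estimate
        self.p = 1.0        # estimate covariance
        self.q = q_acc      # process noise driven by the IMU acceleration
        self.r = r_kin      # measurement noise of the kinematic velocity

    def update(self, acc, v_kin, dt):
        # predict with the IMU acceleration
        self.v += acc * dt
        self.p += self.q * dt
        # correct with the kinematic velocity measurement
        k = self.p / (self.p + self.r)      # Kalman gain
        self.v += k * (v_kin - self.v)
        self.p *= (1.0 - k)
        return self.v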
Step S208: deploy the policy neural network 107 on the actual robot. To guarantee the online control frequency, the system does not perform further learning on the actual robot; it only runs the network forward to obtain the control output.
Specifically, the data structures required by the neural network are initialized, and the csv-format network weights exported in step S206 are read through the neural network loading interface built in step S207 and stored into the network structure. At run time, the policy neural network obtains the observation vector through the measurement input interface 110 built in step S207, computes the desired joint positions for the next time step, and sends them to the robot low-level controller 108. The robot low-level controller 108 outputs the desired joint torques and transmits them to the joint drives 109.
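A minimal sketch of the on-robot forward pass is given below: the csv weights exported in step S206 are loaded into plain arrays and only basic matrix operations are used online, matching the statement that no training framework is needed at run time. File names follow the export sketch above and are illustrative.

import numpy as np

def load_layers(n_layers=3):
    return [(np.loadtxt(f"layer_{i}_weight.csv", delimiter=",", ndmin=2),
             np.loadtxt(f"layer_{i}_bias.csv", delimiter=","))
            for i in range(n_layers)]

def policy_forward(obs, layers):
    x = np.asarray(obs)
    for W, b in layers[:-1]:
        x = np.tanh(x @ W + b)       # hidden layers: tanh activation
    W, b = layers[-1]
    return x @ W + b                 # output layer: desired joint positions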
The structure and parameters of the robot low-level controller 108 used on the actual robot should be consistent with those of the simulation low-level controller 102 in the simulation environment.
In this example, the control signal period on the actual robot is 5 ms (200 Hz).

Claims (10)

1. A legged robot reinforcement learning action generation system, characterized by comprising a simulation training machine and an actual robot, wherein the simulation training machine comprises a policy network pre-training module, a dynamics simulation engine, a reinforcement learning training module, a reward calculation module, a policy neural network controller and a simulation low-level controller, and the actual robot comprises joints and their drive modules, sensors and their interfaces, a policy neural network controller and a robot low-level controller.
The system performs end-to-end training with deep reinforcement learning to generate an action policy, obtaining observation-action command data from the simulation environment of the dynamics simulation engine during training so as to learn a motion control strategy that maximizes the reward function value computed by the reward calculation module.
The system generates the initial values of the policy neural network controller by pre-training, which keeps the policy near the desired extremum in parameter space throughout training and avoids the convergence to other local minima that can occur with randomly generated initial values.
The system uses a gait cycle reward signal to encourage the agent being trained to perform the corresponding periodic actions, and thereby obtains an effective input vector.
2. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network outputs the desired position of each joint, and its input includes the robot center-of-mass height, the gravity direction in the robot frame, the joint angles, the angle differences between the left and right joints, the body linear velocity, the body angular velocity and the joint angular velocities.
3. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network is pre-trained as follows:
(3.1) generate the robot's locomotion gait with a model-based control method, extract motion data covering several gait cycles close to the desired gait, and record the data;
(3.2) process and collate the recorded time-series data into the input-label format required by the neural network;
(3.3) randomly initialize the weights of the policy neural network;
(3.4) pre-train on the collated data set with an optimization algorithm to obtain a preliminarily fitted neural network.
4. The legged robot reinforcement learning action generation system according to claim 1, wherein the reinforcement learning training process in the simulation environment is specifically:
(4.1) start the dynamics simulation engine, initialize the whole environment and load the robot model with added disturbances;
(4.2) run the simulation and collect observation-action-reward trajectories using the serialization mechanism and the policy neural network in the simulation module;
(4.3) adjust the parameters of the neural network with the optimization algorithm in the training module;
(4.4) return to (4.2) and iterate until training is finished.
5. The legged robot reinforcement learning action generation system according to claim 4, wherein, to maintain robustness during training, the system adds noise to the dynamic parameters of the model at initialization; the perturbed dynamic parameters include the link masses, link inertias, link center-of-mass positions, ground friction coefficient and ground restitution coefficient.
6. The legged robot reinforcement learning action generation system according to claim 4, wherein the reward function is obtained as the algebraic sum of the following terms:
Forward speed reward r_forward:
[equation image in the original]
where f1 is a speed clipping flag: f1 = 1 when the magnitude of the trunk velocity is below the speed limit set in the program, and f1 = 0 otherwise; k1 is a coefficient to be tuned; Δt is the time interval at which the simulation environment computes the reward; v is the velocity vector of the robot trunk expressed in the world coordinate frame; and d is the unit vector, in the world frame, of the desired direction of motion.
Joint torque cost r_torque:
[equation image in the original]
where c2 and k2 are coefficients to be tuned, Δt is the reward time interval, and τ is the vector of robot joint torques.
Trunk pitch cost r_pitch:
[equation image in the original]
where c3 and k3 are coefficients to be tuned, Δt is the reward time interval, and θ is the pitch angle of the robot trunk.
Joint velocity cost r_qdot:
[equation image in the original]
where c4 and k4 are coefficients to be tuned, Δt is the reward time interval, and q̇ is the vector of robot joint velocities.
Leg joint consistency cost r_gait:
[equation image in the original]
where k5 is a coefficient to be tuned and the quantities in the expression are the joint angles of the robot's left-front, left-rear, right-front and right-rear legs.
Contact point slip cost r_slip:
[equation image in the original]
where k6 is a coefficient to be tuned, Δt is the reward time interval, and the summed quantity is the sliding velocity of the contact point between each foot and the ground, taken as 0 when the foot is not in contact.
Gait periodicity cost r_contact:
[equation images in the original]
where k is a coefficient to be tuned, G(i, t) indicates whether the i-th foot is in contact with the ground at time t, P(i) is a gait template composed of 1 and -1 in which feet with the same value touch down simultaneously, and ω is the circular frequency of the desired gait.
Falling cost r_fall:
r_fall = k_fall
where k_fall is a parameter to be tuned.
7. The legged robot reinforcement learning action generation system according to claim 4, wherein the serialization scheme used during training is as follows:
(7.1) create a sequence of single-agent environment objects and a data buffer of corresponding size;
(7.2) assign a separate thread to each environment;
(7.3) read the observation vector and reward value from each environment in turn and store them in the data buffer;
(7.4) check whether each environment has finished its iteration; if so, send the observation-action-reward trajectories in the buffer to the neural network optimization program for training, empty the buffer and reinitialize the environment;
(7.5) run the policy neural network forward on the data in the buffer to compute the action command for each environment and store it in the buffer;
(7.6) send the action commands to the corresponding environments;
(7.7) advance all environments by one time step and return to (7.3) for the next iteration.
8. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network is an MLP with two hidden layers of 256 nodes each, using the hyperbolic tangent as the activation function.
9. The legged robot reinforcement learning action generation system according to claim 1, wherein the policy neural network is trained within an Actor-Critic framework and optimized with the PPO algorithm.
10. The legged robot reinforcement learning action generation system according to claim 1, wherein the system migrates the policy neural network trained in simulation to the actual robot as follows:
(10.1) save the trained policy neural network as a pkl file;
(10.2) read the pkl file, extract the policy network weights and write the weights of each layer into a comma-separated csv file;
(10.3) configure the input/output interfaces and the neural network loading and inference module on the actual robot.
CN202110924651.XA 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot Active CN113821045B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110924651.XA CN113821045B (en) 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110924651.XA CN113821045B (en) 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot

Publications (2)

Publication Number Publication Date
CN113821045A true CN113821045A (en) 2021-12-21
CN113821045B CN113821045B (en) 2023-07-07

Family

ID=78913157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110924651.XA Active CN113821045B (en) 2021-08-12 2021-08-12 Reinforced learning action generating system of leg-foot robot

Country Status (1)

Country Link
CN (1) CN113821045B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114563954A (en) * 2022-02-28 2022-05-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increment
CN114667852A (en) * 2022-03-14 2022-06-28 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN116627041A (en) * 2023-07-19 2023-08-22 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN117215204A (en) * 2023-11-09 2023-12-12 中国科学院自动化研究所 Robot gait training method and system based on reinforcement learning

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657881A (en) * 2019-01-14 2019-04-19 南京国电南自电网自动化有限公司 A kind of neural network photovoltaic power generation prediction technique and system suitable for small sample
CN110637308A (en) * 2017-05-10 2019-12-31 瑞典爱立信有限公司 Pre-training system for self-learning agents in a virtualized environment
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111360834A (en) * 2020-03-25 2020-07-03 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN113110442A (en) * 2021-04-09 2021-07-13 深圳阿米嘎嘎科技有限公司 Method, system and medium for controlling multi-skill movement of quadruped robot
CN113190029A (en) * 2021-04-06 2021-07-30 北京化工大学 Adaptive gait autonomous generation method of quadruped robot based on deep reinforcement learning

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110637308A (en) * 2017-05-10 2019-12-31 瑞典爱立信有限公司 Pre-training system for self-learning agents in a virtualized environment
CN109657881A (en) * 2019-01-14 2019-04-19 南京国电南自电网自动化有限公司 A kind of neural network photovoltaic power generation prediction technique and system suitable for small sample
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111360834A (en) * 2020-03-25 2020-07-03 中南大学 Humanoid robot motion control method and system based on deep reinforcement learning
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112904848A (en) * 2021-01-18 2021-06-04 长沙理工大学 Mobile robot path planning method based on deep reinforcement learning
CN113031528A (en) * 2021-02-25 2021-06-25 电子科技大学 Multi-legged robot motion control method based on depth certainty strategy gradient
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning
CN113190029A (en) * 2021-04-06 2021-07-30 北京化工大学 Adaptive gait autonomous generation method of quadruped robot based on deep reinforcement learning
CN113110442A (en) * 2021-04-09 2021-07-13 深圳阿米嘎嘎科技有限公司 Method, system and medium for controlling multi-skill movement of quadruped robot

Non-Patent Citations (14)

* Cited by examiner, † Cited by third party
Title
JEEVES LOPES DOS SANTOS: "Gait Synthesis of a Hybrid Legged Robot Using Reinforcement Learning", 《2015 ANNUAL IEEE SYSTEMS CONFERENCE (SYSCON) PROCEEDINGS》 *
JEEVES LOPES DOS SANTOS: "Gait Synthesis of a Hybrid Legged Robot Using Reinforcement Learning", 《2015 ANNUAL IEEE SYSTEMS CONFERENCE (SYSCON) PROCEEDINGS》, 16 April 2015 (2015-04-16) *
SEHOON HA等: "Automated Deep Reinforcement Learning Environment for Hardware of a Modular Legged Robot", 《2018 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS (UR)》 *
SEHOON HA等: "Automated Deep Reinforcement Learning Environment for Hardware of a Modular Legged Robot", 《2018 15TH INTERNATIONAL CONFERENCE ON UBIQUITOUS ROBOTS (UR)》, 30 June 2018 (2018-06-30) *
TAISUKE KOBAYASHI: "Reinforcement learning for quadrupedal locomotion with design of continual–hierarchical curriculum", 《ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE》 *
TAISUKE KOBAYASHI: "Reinforcement learning for quadrupedal locomotion with design of continual–hierarchical curriculum", 《ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE》, 31 August 2020 (2020-08-31) *
傅汇乔等: "基于深度强化学习的六足机器人运动规划", 《智能科学与技术学报》 *
傅汇乔等: "基于深度强化学习的六足机器人运动规划", 《智能科学与技术学报》, 31 December 2020 (2020-12-31) *
崔俊文: "基于分层学习的四足机器人运动自适应控制模型", 《计算机测量与控制》 *
崔俊文: "基于分层学习的四足机器人运动自适应控制模型", 《计算机测量与控制》, 31 December 2020 (2020-12-31) *
张浩昱等: "基于近端策略优化算法的四足机器人步态控制研究", 《空间控制技术与应用》 *
张浩昱等: "基于近端策略优化算法的四足机器人步态控制研究", 《空间控制技术与应用》, 30 June 2019 (2019-06-30) *
郭宪: "仿生机器人运动步态控制:强化学习方法综述", 《智能系统学报》 *
郭宪: "仿生机器人运动步态控制:强化学习方法综述", 《智能系统学报》, 31 January 2020 (2020-01-31) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114563954A (en) * 2022-02-28 2022-05-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increment
WO2023159978A1 (en) * 2022-02-28 2023-08-31 山东大学 Quadruped robot motion control method based on reinforcement learning and position increment
CN114667852A (en) * 2022-03-14 2022-06-28 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN114667852B (en) * 2022-03-14 2023-04-14 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning
CN114839884A (en) * 2022-07-05 2022-08-02 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN116627041A (en) * 2023-07-19 2023-08-22 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN116627041B (en) * 2023-07-19 2023-09-29 江西机电职业技术学院 Control method for motion of four-foot robot based on deep learning
CN117215204A (en) * 2023-11-09 2023-12-12 中国科学院自动化研究所 Robot gait training method and system based on reinforcement learning
CN117215204B (en) * 2023-11-09 2024-02-02 中国科学院自动化研究所 Robot gait training method and system based on reinforcement learning

Also Published As

Publication number Publication date
CN113821045B (en) 2023-07-07

Similar Documents

Publication Publication Date Title
CN113821045B (en) Reinforced learning action generating system of leg-foot robot
CN109343341B (en) Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN114603564B (en) Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
WO2022012265A1 (en) Robot learning from demonstration via meta-imitation learning
CN110502033B (en) Fixed-wing unmanned aerial vehicle cluster control method based on reinforcement learning
CN112936290B (en) Quadruped robot motion planning method based on layered reinforcement learning
Farchy et al. Humanoid robots learning to walk faster: From the real world to simulation and back
CN107234617A (en) A kind of obstacle-avoiding route planning method of the unrelated Artificial Potential Field guiding of avoidance task
CN113031528B (en) Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient
CN112034888A (en) Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle
CN113478486A (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
CN113190029B (en) Adaptive gait autonomous generation method of four-footed robot based on deep reinforcement learning
CN117215204B (en) Robot gait training method and system based on reinforcement learning
Hu et al. Learning a faster locomotion gait for a quadruped robot with model-free deep reinforcement learning
Gutzeit et al. The besman learning platform for automated robot skill learning
CN117606490A (en) Collaborative search path planning method for autonomous underwater vehicle
CN114779661B (en) Chemical synthesis robot system based on multi-classification generation confrontation imitation learning algorithm
Shim et al. Evolving flying creatures with path following behaviors
Jiang et al. Motion sequence learning for robot walking based on pose optimization
RU2816639C1 (en) Method for creating controllers for controlling walking robots based on reinforcement learning
Danielsen Vision-based robotic grasping in simulation using deep reinforcement learning
Chen et al. Learning hardware dynamics model from experiments for locomotion optimization
CN115097853B (en) Unmanned aerial vehicle maneuvering flight control method based on fine granularity repetition strategy
CN117572877B (en) Biped robot gait control method, biped robot gait control device, storage medium and equipment
Raja Mohamed et al. Biologically inspired design framework for robot in dynamic environments using Framsticks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240625

Address after: Building 5, 5th Floor, No. 309 Liuhe Road, Binjiang District, Hangzhou City, Zhejiang Province, 310000

Patentee after: Supcon Group Co.,Ltd.

Country or region after: China

Address before: 310058 Yuhang Tang Road, Xihu District, Hangzhou, Zhejiang 866

Patentee before: ZHEJIANG University

Country or region before: China