CN116062059B - Single-leg robot continuous jump control method based on deep reinforcement learning - Google Patents


Info

Publication number
CN116062059B
Authority
CN
China
Prior art keywords
robot
jump
leg length
real
leg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310087358.1A
Other languages
Chinese (zh)
Other versions
CN116062059A (en)
Inventor
明爱国
孙能祥
孟非
顾赛
刘伯韬
黄强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310087358.1A priority Critical patent/CN116062059B/en
Publication of CN116062059A publication Critical patent/CN116062059A/en
Application granted granted Critical
Publication of CN116062059B publication Critical patent/CN116062059B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B62LAND VEHICLES FOR TRAVELLING OTHERWISE THAN ON RAILS
    • B62DMOTOR VEHICLES; TRAILERS
    • B62D57/00Vehicles characterised by having other propulsion or other ground- engaging means than wheels or endless track, alone or in addition to wheels or endless track
    • B62D57/02Vehicles characterised by having other propulsion or other ground- engaging means than wheels or endless track, alone or in addition to wheels or endless track with ground-engaging propulsion means, e.g. walking members
    • B62D57/032Vehicles characterised by having other propulsion or other ground- engaging means than wheels or endless track, alone or in addition to wheels or endless track with ground-engaging propulsion means, e.g. walking members with alternately or sequentially lifted supporting base and legs; with alternately or sequentially lifted feet or skid

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Chemical & Material Sciences (AREA)
  • Combustion & Propulsion (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a continuous jump control method for a single-leg robot based on deep reinforcement learning. The method calculates the real-time leg length from the robot's forward kinematics, compares it with a target leg length preset in a finite state machine, and judges whether to switch the jump phase; a virtual spring-damper model calculates the virtual force at the foot end according to the current jump phase, from which the required joint torques are obtained to control the robot's motion. The stiffness and damping of the virtual spring-damper model are updated from the action-space information output by the policy network, the target leg lengths preset in the finite state machine are updated according to the corresponding settings in the algorithm, and the policy network is optimized at a fixed period according to the reward value. The method avoids the tedious and inefficient parameter-tuning process of a manually designed controller, and enables the robot to decide online and autonomously, according to its own motion state, the motion factors that must vary in real time.

Description

Single-leg robot continuous jump control method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of robot control, and in particular relates to a continuous jump control method for a single-leg robot based on deep reinforcement learning.
Background
For legged robots, the ability to jump over large gaps and onto high platforms is critical for traversing real environments. Jumping is characterized by large instantaneous forces, rapid force changes, and short duration. Biomechanical studies have shown that the maximum jump height an animal can achieve depends on three mechanical features of its lower limb: the maximum force it can generate, its maximum extension speed, and its extension range. Reasonable control and planning of these three mechanical features during a robotic jump is therefore critical to fully exploiting the robot's jumping ability. In addition, when running, an animal can convert its body's gravitational and kinetic energy into strain energy by adjusting the stiffness of its leg's musculoskeletal system, adapting to the landing impact and re-accelerating its body, which makes its movement more agile and stable. For continuous jumping, the robot's landing buffer process directly determines the height and stability of the next jump, and the animal concept of adjustable stiffness offers a new idea for better handling that process.
At present, control methods for continuous jumping of a single-leg robot either involve modeling computations too complex to run in real time on a controller, or rely on manually designed control models that require tedious and inefficient manual parameter tuning, which imposes certain limitations. Some studies have attempted to treat factors directly related to jumping motion, such as stiffness and leg length, as variables in continuous-jump research, but the related work exists only in idealized simulations and lacks corresponding experimental verification.
Disclosure of Invention
Aiming at the above deficiencies of the prior art, the invention provides a continuous jump control method for a single-leg robot based on deep reinforcement learning. Based on the theory of the virtual model control (VMC) method, it fully accounts for the motion factors that must vary in real time during continuous jumping, trains the single-leg robot's motion controller through deep reinforcement learning (DRL), and proposes a dynamic-jump adaptive learning framework; the policy obtained by simulation training can be deployed directly on a physical single-leg robot platform, realizing stable continuous jumping motion close to the ideal.
The present invention achieves this technical object by the following means.
A single-leg robot continuous jump control method based on deep reinforcement learning comprises the following steps:
modeling the robot dynamics as a virtual spring-damper model, calculating the real-time leg length from the robot's forward kinematics, comparing it with a target leg length preset in a finite state machine, and judging whether to switch the jump phase;
the virtual spring-damper model calculates the virtual force at the foot end according to the current jump phase, thereby obtaining the required joint torques and controlling the robot's motion;
the stiffness and damping of the virtual spring-damper model are updated from the stiffness and damping coefficients in the action-space information;
the target leg length comprises a jump leg length, a flight leg length and a landing buffer leg length, and the target leg lengths preset in the finite state machine are updated as follows: except for the first jump, the jump leg length and the flight leg length are updated when the real-time leg length becomes smaller than the landing buffer leg length; the landing buffer leg length is updated when the robot reaches the highest point of a jump and its absolute velocity is 0;
the action-space information is the output of a policy network whose input is the observation-space information; the policy network is optimized at a fixed period according to the reward value, which is calculated from the robot's current state information by a reward function.
In a further technical scheme, the action-space information is: the stiffness coefficient and damping coefficient of the virtual spring-damper model in the Z direction, the stiffness coefficient and damping coefficient of the virtual spring-damper model in the X direction, and the jump leg length, the flight leg length and the landing buffer leg length.
In a further technical scheme, the observation-space information is (H_sum, z', x, v_base, θ_1, θ_2, θ̇_1, θ̇_2), where H_sum is the accumulated ground clearance of the foot end, z' is the height of the foot end above the ground in the world coordinate frame, x is the actual position of the foot end in the X direction, v_base is the absolute velocity of the robot base, θ_1 is the angle between the robot thigh and the horizontal, θ_2 is the angle between the robot thigh and shank, θ̇_1 is the real-time angular velocity of the thigh joint, and θ̇_2 is the real-time angular velocity of the shank joint.
In a further technical scheme, the accumulated ground clearance of the foot end satisfies:

H_sum = Σ_t (z'(t) − z'(t−1))

z'(t) = z_base(t) − h_base − l(t)

z_base(t) = h_p − l_base(t)

Wherein: z'(t) is the height of the foot end above the ground at the current moment, z'(t−1) is the height of the foot end above the ground at the previous moment, z_base(t) is the height of the upper plane of the slide-rail connection platform above the ground in the world coordinate frame at the current moment, h_base is the distance between the upper plane of the slide-rail connection platform and the center of the thigh joint, l(t) is the real-time leg length at the current moment, h_p is the maximum effective stroke of the slide-rail frame, and l_base(t) is the straight-line distance between the position of maximum effective stroke of the slide-rail frame and the upper plane of the slide-rail connection platform at the current moment.
According to a further technical scheme, the reward function comprises reward terms and penalty terms; the reward terms comprise a reward for the accumulated ground clearance of the robot's foot end, a reward for the number of consecutive successful jumps, and a task reward for completing the set number of jumps, and the penalty terms penalize horizontal instability of the foot end in the robot's aerial jump posture, high joint energy consumption, joints exceeding their limits, and training timeout.
In a further technical scheme, the reward function is specifically:

Aerial jump posture: k_1 · x;

Accumulated ground clearance: k_2 · H_sum;

Number of successful jumps: k_3 · Jump_times · Jump_flag;

Energy loss: k_4 · (|τ_1 · θ̇_1| + |τ_2 · θ̇_2|);

Whether the goal is completed: Success_flag · k_5 + Error_flag · k_6;

Wherein: k_1, k_2, k_3, k_4, k_5 and k_6 are coefficients, x is the actual position of the foot end in the X direction, Jump_times is the number of consecutive successful jumps, Jump_flag is a flag that becomes True each time the robot reaches the highest point of a jump, τ_1 is the joint torque required by the robot's thigh joint, τ_2 is the joint torque required by the robot's shank joint, θ̇_1 is the real-time angular velocity of the thigh joint, θ̇_2 is the real-time angular velocity of the shank joint, Success_flag denotes the success flag, and Error_flag denotes the error flag.
According to a further technical scheme, the first and second fully connected hidden layers of the policy network consist of 64 and 32 nodes, respectively.
According to a further technical scheme, environmental noise is added to the observation-space information before it is input to the policy network, the environmental noise being added in the simulation environment.
In a further technical scheme, domain randomization is applied to the robot's physical parameters in the simulation environment.
The beneficial effects of the invention are as follows:
(1) The continuous jump control method for a single-leg robot based on deep reinforcement learning combines virtual model control with deep reinforcement learning, avoiding the tedious and inefficient parameter-tuning process that a manually designed controller requires when virtual model control is used alone;
(2) For the robot's continuous jump task, the invention applies feature-engineering processing to part of the observation-space information, achieving rapid convergence of the training process;
(3) The observation-space information input to the policy network comprises the accumulated ground clearance of the foot end, the height of the foot end above the ground in the world coordinate frame, the actual position of the foot end in the X direction, the absolute velocity of the robot base, the angle between the robot's thigh and the horizontal, the angle between the thigh and the shank, and the real-time angular velocities of the thigh and shank joints, so that variable stiffness, damping and target leg lengths can be output at a fixed period in real time according to the motion state; this fully exploits the robot's dynamic performance and endows the robot with autonomous decision-making capability.
Drawings
FIG. 1 is a diagram of the simplified leg model of the single-leg robot of the present invention and the corresponding virtual spring-damper model system;
FIG. 2 is a schematic diagram of the finite-state-machine switching of the single-leg robot according to the present invention;
FIG. 3 is a schematic diagram of the dynamic-jump adaptive learning framework according to the present invention;
FIG. 4 is a schematic view of the single-leg robot according to the present invention.
Detailed Description
The invention will be further described with reference to the drawings and the specific embodiments, but the scope of the invention is not limited thereto.
1. Interpretation of the terms
Virtual model control (VMC) is a motion control framework that generates the required joint torques by simulating virtual components: it produces the actual actuation torques (or forces) by imitating virtual mechanical components, which may include linear springs, dampers, masses, or any other physical element. Virtual model control is divided into an upper controller and a lower controller; the upper controller can be designed as a state machine, and virtual state transitions can be realized simply by changing how the virtual mechanical components are connected or by changing their parameters. When the virtual model control method is applied to the continuous-jump scenario of a single-leg robot, usually only the rest-length parameter of the virtual spring is preset, and the robot moves according to these preset values.
Deep reinforcement learning (DRL) applies deep neural networks (DNNs) to reinforcement learning policies. A policy approximated by DNNs can produce similar outputs for similar inputs, making it robust across a variety of situations. With deep reinforcement learning, researchers can obtain motion policies through large amounts of simulation training, without accurately modeling the robot's working environment, without any prior system information, and without risking hardware damage. In addition, methods such as domain randomization, adding environmental noise, and randomizing dynamic parameters during training can improve the controller's robustness, so that a policy trained in simulation can be transferred directly to the physical robot without further training on the physical system.
2. Establishing the two-dimensional virtual spring-damper model of the single-leg robot
From Fig. 1, the position (x, z) of the robot foot end C in the defined coordinate system and the Jacobian matrix J can be derived, where θ_1 denotes the angle between the robot thigh and the horizontal, θ_2 denotes the angle between the robot thigh and shank, L_1 denotes the length of the robot thigh, and L_2 denotes the length of the robot shank.
The foot-end force of the single-leg robot is used to emulate a two-dimensional virtual spring-damper system: a virtual spring and a damper are arranged along each of the X and Z directions to control the robot's two degrees of freedom. The Z-direction virtual force F_z and the X-direction virtual force F_x at the foot end are calculated from the foot-end position error and velocity error:

F_z = k_z · (z_t − z) − c_z · v_z

F_x = k_x · (x_t − x) − c_x · v_x

where k_z is the stiffness coefficient of the Z-direction virtual spring-damper model, k_x is the stiffness coefficient of the X-direction virtual spring-damper model, c_z is the damping coefficient of the Z-direction virtual spring-damper model, c_x is the damping coefficient of the X-direction virtual spring-damper model, z_t is the suggested target position of the foot end in the Z direction, x_t is the suggested target position of the foot end in the X direction, x is the actual position of the foot end in the X direction, z is the actual position of the foot end in the Z direction, v_x is the actual velocity of the foot end in the X direction, and v_z is the actual velocity of the foot end in the Z direction.
The joint torques required to generate the foot-end force are then calculated using the robot's Jacobian matrix J:

[τ_1, τ_2]^T = J^T · [F_x, F_z]^T

where τ_1 is the joint torque required by the robot's thigh joint and τ_2 is the joint torque required by the shank joint.
In Fig. 1, M is the mass of the slide-rail connection platform, I_1 is the inertia of the robot thigh, m_1 is the mass of the robot thigh, I_2 is the inertia of the robot shank, and m_2 is the mass of the robot shank.
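To make the mapping concrete, the following is a minimal Python sketch of the VMC computation above. The force expressions and the Jacobian-transpose mapping τ = J^T · F follow the text; the law-of-cosines leg-length helper and all function names are illustrative additions, and the Jacobian J is assumed to be supplied by the robot's forward kinematics.

```python
import numpy as np

def leg_length(theta2: float, L1: float, L2: float) -> float:
    """Real-time leg length (hip-to-foot distance) from the angle theta2
    between thigh and shank, via the law of cosines."""
    return np.sqrt(L1**2 + L2**2 - 2.0 * L1 * L2 * np.cos(theta2))

def vmc_joint_torques(J: np.ndarray,
                      x: float, z: float, vx: float, vz: float,
                      x_t: float, z_t: float,
                      k_x: float, k_z: float,
                      c_x: float, c_z: float) -> np.ndarray:
    """Virtual spring-damper forces at the foot end, mapped to joint
    torques through the Jacobian transpose."""
    F_x = k_x * (x_t - x) - c_x * vx   # X-direction virtual force
    F_z = k_z * (z_t - z) - c_z * vz   # Z-direction virtual force
    return J.T @ np.array([F_x, F_z])  # [tau_1, tau_2]
```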
3. Establishing the finite state machine for jump-phase switching
During the continuous jumping of a single-leg robot, each cycle can be divided into three states: the jump phase, the flight phase and the landing buffer phase; see Fig. 2. Under the action of the ground reaction force (GRF) f_x, f_z, the robot extends to the jump leg length l_j to take off, maintains the fixed flight leg length l_f in the air, and jumps again when the real-time leg length becomes smaller than the landing buffer leg length l_td during landing. Because the three states are sequential and irreversible, the finite state machine switches the virtual mechanical components using the relation between the real-time leg length and the target leg length of the current phase as the judgment criterion; after switching, the robot moves toward the target position (x_t, z_t) of the next phase. The horizontal target position x_t is always 0, while the vertical target position z_t switches with the three jump phases.
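A minimal sketch of this state machine is given below. The exact comparison in each branch is inferred from the description (take off once the leg reaches l_j, enter the landing buffer once the leg compresses below the flight length, jump again once it compresses below l_td); the names and thresholds are illustrative, not taken verbatim from the patent.

```python
from enum import Enum, auto

class Phase(Enum):
    JUMP = auto()     # extend toward the jump leg length l_j
    FLIGHT = auto()   # hold the fixed flight leg length l_f in the air
    LANDING = auto()  # landing buffer, judged against l_td

def switch_phase(phase: Phase, l_real: float,
                 l_j: float, l_f: float, l_td: float) -> Phase:
    """Irreversible jump -> flight -> landing -> jump cycle, judged by
    comparing the real-time leg length with the current target length."""
    if phase is Phase.JUMP and l_real >= l_j:
        return Phase.FLIGHT    # leg fully extended: lift-off
    if phase is Phase.FLIGHT and l_real < l_f:
        return Phase.LANDING   # leg compresses on touchdown (assumed test)
    if phase is Phase.LANDING and l_real < l_td:
        return Phase.JUMP      # compressed past buffer length: jump again
    return phase
```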
4. Establishing a dynamic jump adaptive learning framework
The VMC models the robot dynamics as a virtual spring-damper model and uses a manually designed controller to control the robot's motion. However, a manually designed controller entails a tedious and inefficient parameter-tuning process and cannot fully exploit the dynamic performance available from the robot's hardware configuration; moreover, for the highly dynamic, multi-stage motion of continuous jumping, the stiffness and damping of the current virtual spring-damper model and the target leg length of the next phase need to be determined online, autonomously, and in real time.
Inspired by biomechanics, the invention proposes a dynamic-jump adaptive learning framework (Fig. 3; the shaded modules appear only in simulation) that uses the PPO algorithm of deep reinforcement learning, combined with the factors that may influence the jump height and stability of a legged robot (namely the three mechanical features of jumping and the adjustable-stiffness idea from animal skeletal muscle), to control continuous jumping motion while reducing unnecessary energy consumption as much as possible, on the premise of achieving the desired number of jumps and an ideal single-jump height. On the basis of modeling the single-leg robot as a virtual spring-damper model and switching phases according to the real-time leg length, the adaptive learning framework allows the spring stiffness, damping and target leg length of each phase to vary in real time within a certain range to match the robot's current motion state; whenever a forbidden behavior occurs during the robot's motion, or the specified number of jumps is completed, the environment is reset and the next episode of exploration begins.
(1) Action space
In the research problem of the present invention, the actions (the agent's outputs) differ from the control quantities mentioned in the prior literature that directly control the motion of the legged robot body, such as joint torques or reference positions of joints or end effectors. To reduce the agent's ineffective exploration as much as possible while making the stiffness, damping and target leg lengths of the virtual spring-damper model variable in real time, the range of each action is set manually on the basis of engineering experience. The action space is defined as: the Z-direction stiffness coefficient of the virtual spring within [k_j_z, k_f_z] and its damping coefficient within [c_j_z, c_f_z], the X-direction stiffness coefficient within [k_j_x, k_f_x] and its damping coefficient within [c_j_x, c_f_x], the jump leg length l_j, the flight leg length l_f and the landing buffer leg length l_td; these parameters are updated at 50 Hz during training.
To increase the height and stability of each jump as much as possible, the robot selects [l_j, l_f] before starting each jump, deciding the jump leg length at lift-off and the flight leg length afterwards; inspired by how humans and animals think about landing while jumping down from a height, the robot is set to select l_td when each jump reaches its highest point, i.e., the landing buffer leg length after touchdown is decided at the apex. When the real-time leg length becomes smaller than the landing buffer leg length during landing, the robot is ready to jump again.
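In Gym notation this 7-dimensional action space might be declared as follows; the ordering and the numeric bounds are placeholders for illustration, since the patent states only that the ranges are chosen from engineering experience.

```python
import numpy as np
from gym import spaces

# [k_z, c_z, k_x, c_x, l_j, l_f, l_td]: Z/X stiffness and damping of the
# virtual spring-damper plus the three target leg lengths (in meters).
# The bounds below are assumed placeholders, not the patent's ranges.
low = np.array([500.0, 5.0, 200.0, 2.0, 0.25, 0.20, 0.15], dtype=np.float32)
high = np.array([5000.0, 80.0, 2000.0, 40.0, 0.38, 0.33, 0.30], dtype=np.float32)
action_space = spaces.Box(low=low, high=high, dtype=np.float32)
```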
(2) Observation space and feature-engineering processing
In data analysis, appropriate feature preprocessing of the input data reduces the fitting burden on the neural network, improving model performance and accelerating the training process.
For the continuous jump task, part of the important observation-space information is feature-engineered to further accelerate the reward convergence of the task. In the present invention, the agent's observation space is defined as (H_sum, z', x, v_base, θ_1, θ_2, θ̇_1, θ̇_2). H_sum records the accumulated ground clearance of the foot end over each episode. The height z' of the foot end above the ground in the world coordinate frame is calculated from the current leg length l and the real-time base position z_base, where z_base is measured by a laser range finder with a period of 5 ms. v_base is the absolute velocity of the robot base, obtained by differentiating the real-time base position z_base; used together with z', it helps the robot identify the highest point of its motion and the foot-end height above the ground at that moment, so that the agent can decide a more appropriate l_td. θ̇_1 is the real-time angular velocity of the thigh joint and θ̇_2 is the real-time angular velocity of the shank joint. θ_1 and θ_2 help the agent determine in real time whether the current joint angles exceed the joint limits (by comparing them with set thresholds for θ_1 and θ_2), which is strictly prohibited during training.
For a single jump, z' keeps increasing during the jump-up phase, and the robot falls after reaching the highest point. H_sum is defined as the accumulated sum of the differences between the height above ground at the current moment, z'(t), and at the previous moment, z'(t−1), i.e., the cumulative ground clearance of the foot end over the whole episode. It then suffices to reward only H_sum: the agent gradually realizes that the next jump brings more reward, and goes on to explore the possibility of continuous jumping.
z'(t) = z_base(t) − h_base − l(t)    (7)

z_base(t) = h_p − l_base(t)    (8)

Wherein: z'(t) is the height of the foot end above the ground at the current moment, z'(t−1) is the height of the foot end above the ground at the previous moment, z_base(t) is the height of the upper plane of the slide-rail connection platform 1 above the ground in the world coordinate frame at the current moment, h_base is the distance (a fixed value) between the upper plane of the slide-rail connection platform 1 and the center of the thigh joint, l(t) is the real-time leg length at the current moment, h_p is the maximum effective stroke (a fixed value) of the slide-rail frame 2, and l_base(t) is the straight-line distance, measured by the laser sensor at the current moment, between the position of maximum effective stroke of the slide-rail frame 2 and the upper plane of the slide-rail connection platform 1; the slide-rail connection platform 1 and the slide-rail frame 2 are shown in Fig. 4.
The remaining observation-space information other than H_sum, z' and v_base (namely x, θ_1, θ_2, θ̇_1, θ̇_2) is updated at the same frequency as the control cycle of the low-level controller. Furthermore, each piece of observation-space information is normalized before being used for network training.
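The height features of Eqs. (7)-(8) and the running accumulation of H_sum can be sketched as below. Accumulating only positive increments is an assumption consistent with the description: a raw sum of the differences would telescope to roughly zero over a complete jump cycle and could not reward repeated jumping.

```python
def height_features(l: float, l_base: float, h_p: float, h_base: float,
                    z_prev: float, H_sum: float):
    """Feature-engineered heights from Eqs. (7)-(8) plus the accumulated
    ground clearance H_sum of the foot end (sketch)."""
    z_base = h_p - l_base               # Eq. (8): base height from rail stroke
    z_foot = z_base - h_base - l        # Eq. (7): foot-end height above ground
    H_sum += max(z_foot - z_prev, 0.0)  # assumed: count upward motion only
    return z_base, z_foot, H_sum
```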
(3) Reward function
For the deep reinforcement learning training process, a well-designed reward function helps the agent learn more efficiently and reach a more ideal training result. In the present invention, the reward function comprises reward terms and penalty terms associated with the primary goals of the motion.
The reward terms include: a reward for the accumulated ground clearance of the robot's foot end, a reward for the number of consecutive successful jumps, and a task reward for completing the set number of jumps. The first, a real-time reward, guides the robot to jump as high as possible and to explore continuous jumping. The second is a stage-settlement reward: each time the robot reaches the highest point of a jump, Jump_flag becomes True and the current jump is recorded as the n-th jump of the episode; Jump_flag then becomes False and stays False until the next apex. The third is the final stage reward: when the robot has completed the prescribed number of jumps in the current episode and lands smoothly, Success_flag becomes True. The penalty terms include: a penalty for horizontal instability of the foot end in the robot's aerial jump posture, a penalty for high joint energy consumption, a penalty for joints exceeding their limits, and a penalty for training timeout. The first two are real-time penalties; the last two are termination penalties: whenever a joint limit is exceeded or training times out (the maximum training step count specified for an episode is exceeded), Error_flag becomes True, the agent receives a large penalty, and the current training ends immediately. In the invention the task is set to three jumps; whenever Success_flag or Error_flag becomes True during training, the current episode terminates immediately.
To achieve continuous jumping, the rewards for single-jump height and for the number of consecutive jumps must be traded off against each other: if the former is weighted too heavily, the agent blindly pursues single-jump height and ignores the greater reward that consecutive successful jumps would bring; conversely, if the latter's reward is too large, the agent takes only conservative jump actions, blindly pursuing consecutive jumps at the cost of poor jump height. Clearly both extremes are unsuitable. The specific content of the reward function is given in the following table:
TABLE 1 Reward function

Aerial jump posture: k_1 · x
Accumulated ground clearance: k_2 · H_sum
Number of successful jumps: k_3 · Jump_times · Jump_flag
Energy loss: k_4 · (|τ_1 · θ̇_1| + |τ_2 · θ̇_2|)
Whether the goal is completed: Success_flag · k_5 + Error_flag · k_6

Wherein: k_1, k_2, k_3, k_4, k_5 and k_6 are all coefficients, set empirically.
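As code, the reward of Table 1 might be computed as sketched below. The coefficient values are placeholders (the patent describes k_1 to k_6 only as empirical), and the absolute-value form of the energy term is an assumption reconstructed from the listed symbols.

```python
import numpy as np

# Placeholder coefficients: penalties negative, rewards positive (assumed).
K1, K2, K3, K4, K5, K6 = -2.0, 10.0, 5.0, -0.005, 100.0, -100.0

def step_reward(x, H_sum, jump_times, jump_flag,
                tau, dtheta, success_flag, error_flag):
    """Reward terms of Table 1: real-time posture/energy terms plus the
    apex and task terms gated by their flags."""
    r = K1 * abs(x)                          # aerial jump posture
    r += K2 * H_sum                          # accumulated ground clearance
    r += K3 * jump_times * float(jump_flag)  # apex settlement reward
    r += K4 * float(np.sum(np.abs(np.asarray(tau) * np.asarray(dtheta))))  # energy (assumed form)
    r += K5 * float(success_flag) + K6 * float(error_flag)  # terminal terms
    return r
```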
(4) Network structure and learning algorithm
The actor and critic networks in the algorithm each have two fully connected hidden layers; according to the sizes of the action and observation spaces, the first and second layers are designed with 64 and 32 nodes respectively, with ReLU activations between the hidden layers. The open-source stable-baselines implementation of the PPO algorithm is used, which supports efficient parallel training: 64 agents are trained in parallel threads, greatly shortening the training time. To complete the training task faster and better, the algorithm's hyperparameters are tuned empirically.
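For illustration, the sketch below reproduces this configuration with the maintained stable-baselines3 fork (the patent cites the original stable-baselines, whose API differs slightly); the environment id, rollout length, batch size and training budget are assumptions.

```python
import torch.nn as nn
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

# 64 parallel environments; "SingleLegHop-v0" is a hypothetical env id.
venv = make_vec_env("SingleLegHop-v0", n_envs=64)

model = PPO(
    "MlpPolicy",
    venv,
    # Two hidden layers of 64 and 32 nodes for both actor (pi) and
    # critic (vf), with ReLU activations, as described in the text.
    policy_kwargs=dict(activation_fn=nn.ReLU,
                       net_arch=dict(pi=[64, 32], vf=[64, 32])),
    n_steps=256,       # assumed rollout length per environment
    batch_size=4096,   # assumed minibatch size
    verbose=1,
)
model.learn(total_timesteps=20_000_000)  # assumed training budget
```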
(5) Reducing the reality gap
The reality gap is a major factor hindering the application of deep reinforcement learning in the robotics field. To improve the success rate of sim-to-real transfer, prior work has studied this problem extensively, analyzing the potential causes of the reality gap and narrowing the distance between simulation and reality through methods such as system identification, using finely measured physical parameters in simulation, building accurate actuator models, and simulating system latency. In the simulation environment of the invention, a motor dynamics model is built, Gaussian environmental noise is added to the observation-space information, and domain randomization is applied to the robot's physical parameters, striving to make the simulation realistic and reliable while giving the trained controller good robustness. Adding environmental Gaussian noise and setting domain randomization are both prior art.
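The two prior-art techniques named above might look as follows in code; the noise level, the randomized quantities and their ranges are illustrative assumptions, and `sim` stands for a hypothetical simulator handle.

```python
import numpy as np

rng = np.random.default_rng()

def add_observation_noise(obs: np.ndarray, sigma: float = 0.01) -> np.ndarray:
    """Gaussian environmental noise on the observation vector (assumed sigma)."""
    return obs + rng.normal(0.0, sigma, size=obs.shape)

def randomize_domain(sim) -> None:
    """Resample physical parameters at each episode reset (assumed ranges)."""
    sim.set_param("link_mass_scale", rng.uniform(0.9, 1.1))
    sim.set_param("joint_friction", rng.uniform(0.0, 0.05))
    sim.set_param("motor_delay_s", rng.uniform(0.0, 0.010))
    sim.set_param("ground_restitution", rng.uniform(0.0, 0.2))
```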
The examples are preferred embodiments of the present invention, but the present invention is not limited to the above-described embodiments, and any obvious modifications, substitutions or variations that can be made by one skilled in the art without departing from the spirit of the present invention are within the scope of the present invention.

Claims (6)

1. A single-leg robot continuous jump control method based on deep reinforcement learning is characterized in that:
modeling the robot dynamics as a virtual spring-damper model, calculating the real-time leg length from the robot's forward kinematics, comparing it with a target leg length preset in a finite state machine, and judging whether to switch the jump phase;
the virtual spring-damper model calculates the virtual force at the foot end according to the current jump phase, thereby obtaining the required joint torques and controlling the robot's motion;
the stiffness and damping of the virtual spring-damper model are updated from the stiffness and damping coefficients in the action-space information;
the target leg length comprises a jump leg length, a flight leg length and a landing buffer leg length, and the target leg lengths preset in the finite state machine are updated as follows: except for the first jump, the jump leg length and the flight leg length are updated when the real-time leg length becomes smaller than the landing buffer leg length; the landing buffer leg length is updated when the robot reaches the highest point of a jump and its absolute velocity is 0;
the action-space information is the output of a policy network whose input is the observation-space information; the policy network is optimized at a fixed period according to the reward value, which is calculated from the robot's current state information by a reward function;
the action-space information is: the stiffness coefficient and damping coefficient of the virtual spring-damper model in the Z direction, the stiffness coefficient and damping coefficient of the virtual spring-damper model in the X direction, and the jump leg length, the flight leg length and the landing buffer leg length;
the observation-space information is (H_sum, z', x, v_base, θ_1, θ_2, θ̇_1, θ̇_2), where H_sum is the accumulated ground clearance of the foot end, z' is the height of the foot end above the ground in the world coordinate frame, x is the position of the foot end in the X direction in the robot's defined coordinate system, v_base is the absolute velocity of the robot base, θ_1 is the angle between the robot thigh and the horizontal, θ_2 is the angle between the robot thigh and shank, θ̇_1 is the real-time angular velocity of the thigh joint, and θ̇_2 is the real-time angular velocity of the shank joint;
the accumulated ground clearance of the foot end satisfies:

H_sum = Σ_t (z'(t) − z'(t−1))

z'(t) = z_base(t) − h_base − l(t)

z_base(t) = h_p − l_base(t)

Wherein: z'(t) is the height of the foot end above the ground at the current moment, z'(t−1) is the height of the foot end above the ground at the previous moment, z_base(t) is the height of the upper plane of the slide-rail connection platform above the ground in the world coordinate frame at the current moment, h_base is the distance between the upper plane of the slide-rail connection platform and the center of the thigh joint, l(t) is the real-time leg length at the current moment, h_p is the maximum effective stroke of the slide-rail frame, and l_base(t) is the straight-line distance between the position of maximum effective stroke of the slide-rail frame and the upper plane of the slide-rail connection platform at the current moment.
2. The single-leg robot continuous jump control method according to claim 1, wherein the reward function comprises reward terms and penalty terms; the reward terms comprise a reward for the accumulated ground clearance of the robot's foot end, a reward for the number of consecutive successful jumps and a task reward for completing the set number of jumps, and the penalty terms penalize horizontal instability of the foot end in the robot's aerial jump posture, high joint energy consumption, joints exceeding their limits and training timeout.
3. The continuous jump control method of a single-leg robot according to claim 2, wherein the reward function is specifically:

Aerial jump posture: k_1 · x;

Accumulated ground clearance: k_2 · H_sum;

Number of successful jumps: k_3 · Jump_times · Jump_flag;

Energy loss: k_4 · (|τ_1 · θ̇_1| + |τ_2 · θ̇_2|);

Whether the goal is completed: Success_flag · k_5 + Error_flag · k_6;

Wherein: k_1, k_2, k_3, k_4, k_5 and k_6 are coefficients, x is the actual position of the foot end in the X direction, Jump_times is the number of consecutive successful jumps, Jump_flag is a flag that becomes True each time the robot reaches the highest point of a jump, τ_1 is the joint torque required by the robot's thigh joint, τ_2 is the joint torque required by the robot's shank joint, θ̇_1 is the real-time angular velocity of the thigh joint, θ̇_2 is the real-time angular velocity of the shank joint, Success_flag denotes the success flag, and Error_flag denotes the error flag.
4. The continuous jump control method of a single-leg robot according to claim 1, wherein the first and second fully connected hidden layers of the policy network consist of 64 and 32 nodes, respectively.
5. The single-leg robot continuous jump control method according to claim 1, wherein environmental noise is added to the observation-space information before it is input to the policy network, the environmental noise being added in a simulation environment.
6. The continuous jump control method of a single-leg robot according to claim 1, wherein domain randomization is applied to the robot's physical parameters in a simulation environment.
CN202310087358.1A 2023-02-09 2023-02-09 Single-leg robot continuous jump control method based on deep reinforcement learning Active CN116062059B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310087358.1A CN116062059B (en) 2023-02-09 2023-02-09 Single-leg robot continuous jump control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310087358.1A CN116062059B (en) 2023-02-09 2023-02-09 Single-leg robot continuous jump control method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN116062059A CN116062059A (en) 2023-05-05
CN116062059B (en) 2024-06-14

Family

ID=86174679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310087358.1A Active CN116062059B (en) 2023-02-09 2023-02-09 Single-leg robot continuous jump control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116062059B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110919656A (en) * 2019-12-06 2020-03-27 北京理工大学 SLIP-based parallel leg and foot robot motion control method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006064597A1 (en) * 2004-12-14 2006-06-22 Honda Motor Co., Ltd. Legged mobile robot and control program
CN113283116B (en) * 2021-06-16 2022-08-05 北京理工大学 Multi-information fusion human motion analysis method and device
CN114397810B (en) * 2022-01-17 2023-12-19 厦门大学 Motion control method of four-foot robot based on self-adaptive virtual model control
CN114578836B (en) * 2022-02-11 2024-06-25 中国北方车辆研究所 Jumping control method for double-wheel-foot robot

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110919656A (en) * 2019-12-06 2020-03-27 北京理工大学 SLIP-based parallel leg and foot robot motion control method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Stable jumping control of a kangaroo-imitating robot based on the spring-loaded inverted pendulum model; Zuo Guoyu; Liu Xu; Control Theory & Applications; 2018-06-15 (No. 08); full text *

Also Published As

Publication number Publication date
CN116062059A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
Farchy et al. Humanoid robots learning to walk faster: From the real world to simulation and back
Rodriguez et al. DeepWalk: Omnidirectional bipedal gait by deep reinforcement learning
JP5836565B2 (en) Robot tracking and balancing system and method for mimicking motion capture data
Lee et al. Modifiable walking pattern of a humanoid robot by using allowable ZMP variation
Englert et al. Model-based imitation learning by probabilistic trajectory matching
Tedrake Underactuated robotics: Learning, planning, and control for efficient and agile machines course notes for MIT 6.832
CN110764416A (en) Humanoid robot gait optimization control method based on deep Q network
CN107160398B (en) The safe and reliable control method of Rigid Robot Manipulator is limited based on the total state for determining study
Bellegarda et al. Robust high-speed running for quadruped robots via deep reinforcement learning
CN113478486B (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
CN113156892B (en) Four-footed robot simulated motion control method based on deep reinforcement learning
Hwang et al. Learning to adjust and refine gait patterns for a biped robot
Huang et al. Reward-adaptive reinforcement learning: Dynamic policy gradient optimization for bipedal locomotion
CN107578461A (en) A kind of three-dimensional virtual human body physical motion generation method based on subspace screening
Crowley et al. Optimizing bipedal locomotion for the 100m dash with comparison to human running
Palm et al. Particle swarm optimization of potential fields for obstacle avoidance
Kasaei et al. A reliable model-based walking engine with push recovery capability
CN116062059B (en) Single-leg robot continuous jump control method based on deep reinforcement learning
Mirrazavi Salehian et al. A dynamical system approach for catching softly a flying object: Theory and experiment
TW202345036A (en) Master policy training method of hierarchical reinforcement learning with asymmetrical policy architecture
Chew et al. A general control architecture for dynamic bipedal walking
Jiang et al. Motion sequence learning for robot walking based on pose optimization
Dong et al. A dmp-based online adaptive stiffness adjustment method
Wawrzyński Autonomous reinforcement learning with experience replay for humanoid gait optimization
Xiao et al. A reinforcement learning approach for robot control in an unknown environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant