CN113478486A - Robot motion parameter self-adaptive control method and system based on deep reinforcement learning - Google Patents

Robot motion parameter self-adaptive control method and system based on deep reinforcement learning Download PDF

Info

Publication number
CN113478486A
CN113478486A
Authority
CN
China
Prior art keywords
robot
neural network
value
controller
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110786283.7A
Other languages
Chinese (zh)
Other versions
CN113478486B (en)
Inventor
任亮
王春雷
杨亚
邵海存
张志鹏
马保平
彭长武
李晓强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp
Original Assignee
Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp filed Critical Shanghai Micro Motor Research Institute 21st Research Institute Of China Electronics Technology Corp
Priority to CN202110786283.7A priority Critical patent/CN113478486B/en
Publication of CN113478486A publication Critical patent/CN113478486A/en
Application granted granted Critical
Publication of CN113478486B publication Critical patent/CN113478486B/en
Priority to PCT/CN2022/104735 priority patent/WO2022223056A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/161 Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1656 Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664 Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The application provides a robot motion parameter self-adaptive control method and system based on deep reinforcement learning. The method comprises the following steps: building an agent in a simulation environment, the agent comprising: a strategy neural network, a value neural network and a task planning module; training a strategic neural network in the agent according to sample parameters based on guided reinforcement learning; based on layered reinforcement learning, sequentially and alternately carrying out strategy promotion and strategy evaluation on a strategy neural network and a value neural network in the intelligent agent according to a plurality of subtasks and reward functions corresponding to the subtasks to obtain a trained strategy neural network model; and outputting a control parameter optimization value to the controller according to the target task based on the trained strategy neural network model, so that the robot is controlled by the controller according to the control parameter optimization value.

Description

Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
Technical Field
The application relates to the technical field of robot control, in particular to a robot motion parameter self-adaptive control method and system based on deep reinforcement learning.
Background
The control parameters play an important role in the motion performance of a quadruped robot system, and parameter selection for a traditional controller depends on professional domain knowledge and engineering experience. At present, some control methods based on deep reinforcement learning aim to realize end-to-end optimization from sensor data to motor control signals, but this technical route has a long training period and converges with difficulty, the inexplicability of the neural network means that the stability and robustness of the control system cannot be guaranteed, and if the trained model performs poorly it can only be redesigned and retrained, which greatly limits the engineering application of deep reinforcement learning in robot motion control.
Therefore, there is a need to provide an improved solution to the above-mentioned deficiencies of the prior art.
Disclosure of Invention
The present application aims to provide a robot motion parameter adaptive control method and system based on deep reinforcement learning, so as to solve or alleviate the above problems in the prior art.
In order to achieve the above purpose, the present application provides the following technical solutions:
The application provides a robot motion parameter self-adaptive control method based on deep reinforcement learning, which comprises the following steps: step S101, constructing an intelligent agent in a simulation environment, wherein the intelligent agent comprises: a strategy neural network, a value neural network and a task planning module; step S102, based on guided reinforcement learning, training the strategy neural network in the agent according to the sample parameters and the formulas:

A_{l,t} = controller(S_{l,t}) with probability p; A_{l,t} = π(S_{l,t}) with probability 1 - p

p = p_0 * 0.99^(t + l*T)

wherein the sample parameters are the control parameters of a controller of the robot; A_{l,t} represents the control parameters to be optimized in the controller, l represents the trajectory number of the simulation training of the robot in the simulation environment, t is the time step of the simulation training, controller(S_{l,t}) represents the output of the controller of the robot, π(S_{l,t}) represents the output of the strategy neural network, p represents the transition probability with which the strategy neural network transitions from supervised learning to autonomous learning, and p_0 is a preset initial value of the transition probability; step S103, based on layered reinforcement learning, sequentially and alternately performing strategy promotion and strategy evaluation on the strategy neural network and the value neural network in the agent according to a plurality of subtasks and their corresponding reward functions, to obtain a trained strategy neural network model, wherein the plurality of subtasks are obtained by decomposing a target task of the robot through the task planning module, and the reward functions are constructed by the task planning module according to the subtasks; and step S104, based on the trained strategy neural network model, outputting a control parameter optimization value to the controller according to the target task, so that the robot is controlled by the controller according to the control parameter optimization value.
Preferably, in step S101, according to the first fitting function:
A_t = π(S_t)

building a strategy neural network of the agent in the simulation environment, wherein A_t is the parameter optimization value of the controller and S_t is the state observation value of the robot collected by a sensor of the robot.
Preferably, in step S101, according to the state observation value of the robot, collected by the sensor of the robot, and the parameter optimization value of the controller, a value neural network of the agent is built in the simulation environment according to a second fitting function:

Q_t = Q(S_t, A_t)

wherein Q_t represents the evaluation of the controller parameter optimization value A_t output by the strategy neural network, and S_t is the state observation value of the robot collected by a sensor of the robot.
Preferably, in step S101, in the simulation environment, a task planning module of the agent is built based on the state observation value of the robot collected by the sensor of the robot and the environmental return, according to formulas that define the subtask reward R_t as a weighted combination of three environmental return terms (the detailed expressions are given as equation images in the original publication); wherein R_t is the reward function value of a subtask, and the environmental return r_t comprises: the walking distance r_t^dis of the robot, the body stability r_t^sta of the robot, and the energy consumption r_t^eng of the robot; r_t^dis is the forward advancing distance of the robot along the x-axis in the simulation environment; r_t^sta is computed from the rotation angles of the robot around the coordinate axes x, y and z in the simulation environment and from the offsets of the robot with respect to the y-axis and z-axis; r_t^eng is computed from the torque of the motors of the robot, the motor speed of the robot, and Δt, a time period representing the time taken by the robot to walk each step during simulation training; and α, β and μ are weight coefficients determined according to the subtasks.
Preferably, in step S103, based on hierarchical reinforcement learning, the agent performs strategy promotion on the strategy neural network according to the formula:

θ^π ← argmax_{θ^π} (1/L) Σ_l Σ_t Q(S_t, π(S_t))

and performs strategy evaluation on the value neural network according to the formula:

θ^Q ← argmin_{θ^Q} (1/L) Σ_l Σ_t (R_t + γ·Q_{t+1} - Q_t)^2

wherein θ^π represents the weights and biases of the strategy neural network, L represents the total number of trajectories of the simulation training, Q(S_t, A_t) represents the value neural network of the agent, A_t is the parameter optimization value of the controller, S_t is the state observation value of the robot collected by a sensor of the robot, and π(S_t) represents the output of the strategy neural network at time t; θ^Q represents the weights and biases of the value neural network, R_t represents the reward function value of the corresponding task at time t, γ is a discount factor with a value range of (0, 1), Q_{t+1} represents the output of the value neural network at time t + 1, and Q_t represents the output of the value neural network at time t.
Preferably, in step S103, the task planning module of the agent judges the learning progress of the plurality of subtasks until the last subtask is completed, to obtain the strategy neural network model; the judgment compares statistics of the task reward values R_t^{l_n} accumulated over recent training trajectories, and of a Boolean fall indicator (written here as fall^{l_n}), against the thresholds ε and δ (the detailed expressions are given as equation images in the original publication); wherein l_n, l_m and l_i respectively denote the n-th, m-th and i-th training trajectories, n, m and i being positive integers; R_t^{l_n} denotes the task reward value corresponding to the t-th time step in trajectory l_n; when fall^{l_n} is true, the robot has fallen; and ε and δ are different preset thresholds.
Preferably, in step S104, according to the target task, a control parameter optimization value is output to the controller by the strategy neural network model π* that maximizes the expected cumulative reward (the exact objective is given as an equation image in the original publication), so that the controller controls the robot according to the control parameter optimization value; wherein R_t represents the reward function output by the task planning module.
The embodiment of the present application further provides a robot motion parameter adaptive control system based on deep reinforcement learning, including: an agent building unit configured to build an agent in a simulation environment, the agent comprising: a strategy neural network, a value neural network and a task planning module; a first learning unit configured to, based on guided reinforcement learning, train the strategy neural network in the agent according to the sample parameters and the formulas:

A_{l,t} = controller(S_{l,t}) with probability p; A_{l,t} = π(S_{l,t}) with probability 1 - p

p = p_0 * 0.99^(t + l*T)

wherein the sample parameters are the control parameters of a controller of the robot; A_{l,t} represents the control parameters to be optimized in the controller, l represents the trajectory number of the simulation training of the robot in the simulation environment, t is the time step of the simulation training, controller(S_{l,t}) represents the output of the controller of the robot, π(S_{l,t}) represents the output of the strategy neural network, p represents the transition probability with which the strategy neural network transitions from supervised learning to autonomous learning, and p_0 is the initial value of the transition probability; a second learning unit configured to, based on layered reinforcement learning, sequentially and alternately perform strategy promotion and strategy evaluation on the strategy neural network and the value neural network in the agent according to a plurality of subtasks and their corresponding reward functions, to obtain a trained strategy neural network model, the plurality of subtasks being obtained by decomposing a target task of the robot through the task planning module, and the reward functions being constructed by the task planning module according to the subtasks; and an optimization unit configured to output a control parameter optimization value to the controller according to the target task based on the trained strategy neural network model, so that the robot is controlled by the controller according to the control parameter optimization value.
Compared with the closest prior art, the technical scheme of the embodiment of the application has the following beneficial effects:
in the technical scheme provided by the embodiment of the application, an intelligent body comprising a strategy neural network, a value neural network and a task planning module is constructed in a simulation environment, control parameters of a robot controller are used as samples through guided reinforcement learning, a supervised learning label is provided for the output action of the strategy neural network of the intelligent body, the decision of the strategy neural network is effectively guided to avoid a known low-return space region, exploration time is reduced, the transition from supervised learning to autonomous learning is realized according to probability in the training process, and the generalization capability of the intelligent body can be effectively improved; through layered reinforcement learning, a plurality of subtasks are divided into a target task according to a task planning module, corresponding reward functions are constructed according to the subtasks, strategy promotion and strategy evaluation are respectively carried out on a strategy neural network and a value neural network in an intelligent body in a dynamic planning mode, the difficulty of each subtask is ensured to be matched with the decision-making capability of the intelligent body in a corresponding learning stage, an optimal strategy neural network is obtained, the robot can adaptively optimize controller parameters according to the environment condition of the robot and the state of the robot under the condition of no manual parameter adjustment, and the environment adaptability and the robustness of the robot are improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. Wherein:
fig. 1 is a schematic flowchart of a robot motion parameter adaptive control method based on deep reinforcement learning according to some embodiments of the present application;
FIG. 2 is a network architecture diagram of a strategic neural network provided in accordance with some embodiments of the present application;
FIG. 3 is a network architecture diagram of a value neural network provided in accordance with some embodiments of the present application;
FIG. 4 is a schematic diagram of a robot motion parameter adaptive control system based on deep reinforcement learning according to some embodiments of the present application;
fig. 5 is a schematic diagram of a system architecture for adaptive control of motion parameters of a quadruped robot based on deep reinforcement learning according to some embodiments of the present application.
Detailed Description
The present application will be described in detail below with reference to the embodiments with reference to the attached drawings. The various examples are provided by way of explanation of the application and are not limiting of the application. In fact, it will be apparent to those skilled in the art that modifications and variations can be made in the present application without departing from the scope or spirit of the application. For instance, features illustrated or described as part of one embodiment, can be used with another embodiment to yield a still further embodiment. It is therefore intended that the present application cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.
Preferably, in the embodiment of the present application, the robot in the simulation environment refers to a simulation model of the robot, and a simulation model of a quadruped robot is used. The controller that controls the motion of the quadruped robot adopts a layered control structure comprising high-level leg trajectory control, gait control and bottom-level leg control. The leg control addresses the stability problem, the gait control addresses the coordinated movement of the four legs, and the leg trajectory control accurately models the interaction between the robot and the ground. The parameter optimization values of the controller of the robot include: the step length, the step frequency, the leg raising height and the ground contact force of the robot.
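For illustration only, these four quantities can be grouped into a small parameter record that the layered controller consumes each control cycle; the field names, units and the set_gait interface in the sketch below are assumptions and do not come from the patent:

```python
from dataclasses import dataclass

@dataclass
class GaitParameters:
    """Controller parameters the agent optimizes (names and units assumed)."""
    step_length: float      # m, stride length of each leg
    step_frequency: float   # Hz, gait cycle frequency
    leg_lift_height: float  # m, swing-phase foot clearance
    contact_force: float    # N, desired ground contact force

def apply_to_controller(controller, params: GaitParameters) -> None:
    # Hypothetical controller interface: the layered controller reads the
    # optimized values before planning the next leg trajectories.
    controller.set_gait(params.step_length, params.step_frequency,
                        params.leg_lift_height, params.contact_force)
```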
Fig. 1 is a schematic flowchart of a robot motion parameter adaptive control method based on deep reinforcement learning according to some embodiments of the present application; as shown in fig. 1, the robot motion parameter adaptive control method based on deep reinforcement learning includes:
step S101, constructing an intelligent agent in a simulation environment, wherein the intelligent agent comprises: a strategy neural network, a value neural network and a task planning module;
in the embodiment of the application, the simulation models of the intelligent agent and the robot need to be constructed in the simulation environment, the simulation environment is built based on pybull, the simulation models of the robot can be conveniently loaded from URDF, SDF, MJCF and other file formats, and kinematics, dynamic simulation calculation, collision detection, interference query and the like of the robot can be provided.
In the embodiment of the application, the agent is constructed in the simulation environment and trained to control the robot; after the agent finishes learning, the trained strategy neural network is deployed on the real robot. This effectively avoids problems such as the robot frequently falling down or the motors exceeding their limit positions, which are caused by poor control in the agent's early learning stage, and thus avoids hardware damage to the robot.
In the embodiment of the application, in the constructed intelligent agent, the strategy neural network is used for outputting the optimized value of the control parameter of the controller of the robot, the value neural network is used for evaluating the output effect of the strategy neural network, the task planning module is used for decomposing the target task of the robot into a plurality of subtasks, and constructing the corresponding reward function according to each subtask to ensure that the difficulty of each subtask is matched with the decision-making capability of the intelligent agent in the corresponding learning stage.
In the embodiment of the application, the learning task of the robot is to realize stable walking on complex terrain and reduce energy consumption at the same time. The task planning module breaks the task into three subtasks: the first subtask is to realize that the robot walks for a sufficient distance to avoid falling or stepping in place, regardless of the quality of the finished action; the second subtask is to ensure the stability of the robot during movement and reduce the vibration and shake of the machine body; the third subtask aims at achieving the lowest energy consumption on the basis of the first two goals.
In some alternative embodiments, a strategy neural network of the agent is constructed in the simulation environment according to the first fitting function, shown in formula (1):

A_t = π(S_t)    (1)

where A_t is the parameter optimization value of the controller and S_t is the state observation value of the robot collected by a sensor of the robot. Specifically, the state observation value of the robot is collected by an inertial measurement unit and foot-end pressure sensors of the robot, and includes: the leg phase, touchdown detection, quaternion, angular velocity and linear acceleration of the robot.
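One possible way to assemble this observation vector S_t in PyBullet is sketched below; the leg-phase input and the foot link indices are assumptions, and the linear acceleration is approximated by differencing base velocities because PyBullet does not expose it directly:

```python
import numpy as np
import pybullet as p

def get_observation(robot_id, foot_links, leg_phase, prev_lin_vel, dt):
    """Concatenate leg phase, touchdown flags, base quaternion, angular
    velocity and an approximated linear acceleration into S_t."""
    touchdown = [1.0 if p.getContactPoints(bodyA=robot_id, linkIndexA=link)
                 else 0.0 for link in foot_links]
    _, quaternion = p.getBasePositionAndOrientation(robot_id)
    lin_vel, ang_vel = p.getBaseVelocity(robot_id)
    lin_acc = (np.asarray(lin_vel) - np.asarray(prev_lin_vel)) / dt
    obs = np.concatenate([leg_phase, touchdown, quaternion, ang_vel, lin_acc])
    return obs.astype(np.float32), np.asarray(lin_vel)
```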
In the embodiment of the application, the input layer of the strategy neural network is the state observation value S_t of the robot; the strategy neural network has 4 hidden layers, where the activation function of the first 3 hidden layers adopts Tanh(32) and the activation function of the 4th hidden layer adopts Tanh(4); the output layer of the strategy neural network is the parameter optimization value A_t of the controller. The network structure of the strategy neural network is shown in Fig. 2.
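Read literally (three 32-unit Tanh hidden layers followed by a 4-unit Tanh layer whose output is taken as A_t), the strategy neural network could be sketched as follows; PyTorch and the observation dimension are assumptions used only for illustration:

```python
import torch
import torch.nn as nn

class StrategyNetwork(nn.Module):
    """π(S_t) -> A_t: maps the state observation to the four controller
    parameter optimization values (step length, step frequency, leg
    raising height, ground contact force)."""
    def __init__(self, obs_dim: int = 18, act_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 32), nn.Tanh(),  # hidden layer 1, Tanh(32)
            nn.Linear(32, 32), nn.Tanh(),       # hidden layer 2, Tanh(32)
            nn.Linear(32, 32), nn.Tanh(),       # hidden layer 3, Tanh(32)
            nn.Linear(32, act_dim), nn.Tanh(),  # hidden layer 4, Tanh(4)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)
```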
In some optional embodiments, a value neural network of the agent is built in the simulation environment according to the state observation value of the robot, collected by the sensor of the robot, and the parameter optimization value of the controller, according to the second fitting function shown in formula (2):

Q_t = Q(S_t, A_t)    (2)

where Q_t represents the evaluation of the controller parameter optimization value A_t output by the strategy neural network, and is used to characterize how well the robot is controlled when the controller uses the parameter optimization value A_t; S_t is the state observation value of the robot collected by a sensor of the robot.

In the embodiment of the application, the input layer of the value neural network is (S_t, A_t); the value neural network comprises 2 hidden layers, both of which adopt the Relu(32) activation function; the output layer of the value neural network is the evaluation of the controller parameter optimization value A_t. The network structure of the value neural network is shown in Fig. 3.
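Under the same illustrative assumptions, the value neural network takes the concatenated pair (S_t, A_t) and returns a scalar evaluation Q_t:

```python
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    """Q(S_t, A_t) -> Q_t: evaluates the controller parameters chosen by π."""
    def __init__(self, obs_dim: int = 18, act_dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 32), nn.ReLU(),  # hidden layer 1, Relu(32)
            nn.Linear(32, 32), nn.ReLU(),                 # hidden layer 2, Relu(32)
            nn.Linear(32, 1),                             # scalar evaluation Q_t
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1))
```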
In some optional embodiments, in the simulation environment, a task planning module of the agent is built according to the state observation value and the environmental reward of the robot, collected by the sensor of the robot, according to formula (3), which defines the subtask reward R_t as a weighted combination of three environmental return terms (the detailed expressions of formula (3) are given as equation images in the original publication). Here, R_t is the reward function value of a subtask, and the environmental reward r_t comprises: the walking distance r_t^dis of the robot, the body stability r_t^sta of the robot, and the energy consumption r_t^eng of the robot; r_t^dis is the forward advancing distance of the robot along the x-axis in the simulation environment; r_t^sta is computed from the rotation angles of the robot around the coordinate axes x, y and z in the simulation environment and from the offsets of the robot with respect to the y-axis and z-axis; r_t^eng is computed from the torque of the motors of the robot, the motor speed of the robot, and Δt, a time period representing the time taken by the robot to walk each step during simulation training; and α, β and μ are weight coefficients determined according to the subtasks.
In the embodiment of the application, the task planning module decomposes the task into three subtasks, and α, β and μ are given different weight combinations according to the subtask. In the first subtask, the robot walks a sufficient distance to avoid falling or stepping in place, regardless of the quality of the motion, with (α, β, μ) = (0.07, 0.05, 0.03); in the second subtask, the stability of the robot during movement is ensured and the vibration and shake of the body are reduced, with (α, β, μ) = (0.07, 0.09, 0.03); in the third subtask, the goal is to achieve the lowest energy consumption on the basis of the first two goals, with (α, β, μ) = (0.07, 0.09, 0.05).
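A hedged sketch of how the per-subtask reward could be computed is shown below; only the weight values come from the patent text, while the weighted-sum form and the signs of the stability and energy terms are assumptions, since the exact expressions are given only as equation images:

```python
# Subtask-dependent weight combinations (alpha, beta, mu) listed in the patent.
SUBTASK_WEIGHTS = {
    1: (0.07, 0.05, 0.03),  # walk far enough, ignore motion quality
    2: (0.07, 0.09, 0.03),  # keep the body stable, reduce vibration and shake
    3: (0.07, 0.09, 0.05),  # additionally minimize energy consumption
}

def subtask_reward(subtask, r_distance, r_stability, r_energy):
    """R_t for the current subtask (combination form assumed): reward forward
    progress, penalize body instability and energy consumption."""
    alpha, beta, mu = SUBTASK_WEIGHTS[subtask]
    return alpha * r_distance - beta * r_stability - mu * r_energy
```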
Step S102, based on guided reinforcement learning, the strategy neural network in the agent is trained according to the sample parameters and formulas (4) and (5):

A_{l,t} = controller(S_{l,t}) with probability p; A_{l,t} = π(S_{l,t}) with probability 1 - p    (4)

p = p_0 * 0.99^(t + l*T)    (5)

wherein the sample parameters are the control parameters of the controller of the robot; A_{l,t} represents the control parameters to be optimized in the controller, l represents the trajectory number of the simulation training of the robot in the simulation environment, t is the time step of the simulation training, controller(S_{l,t}) represents the output of the controller of the robot, π(S_{l,t}) represents the output of the strategy neural network, p represents the transition probability with which the strategy neural network transitions from supervised learning to autonomous learning, and p_0 is a preset initial value of the transition probability. Here, it should be noted that the larger the value of p_0, the slower the transition of the strategy neural network from supervised learning to autonomous learning.
In the embodiment of the application, through guided reinforcement learning, the control parameters of the robot controller are used as samples, a label for supervised learning is provided for the output action of the strategy neural network of the agent, the decision of the strategy neural network is effectively guided to avoid the known low-return spatial region, the exploration time is reduced, the transition from the supervised learning to the autonomous learning is realized according to the probability in the training process, and the generalization capability of the agent can be effectively improved.
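A minimal sketch of this guided exploration step, assuming the decay law p = p_0 * 0.99^(t + l*T) and a simple random choice between the hand-tuned controller output and the strategy network output (the helper names are hypothetical):

```python
import random

def guided_action(strategy_net, controller, state, l, t, T, p0=0.9):
    """With probability p use the existing controller's parameters as a
    supervised sample; otherwise let the strategy network act autonomously."""
    p = p0 * 0.99 ** (t + l * T)  # transition probability decays over training
    if random.random() < p:
        return controller(state), p  # expert output also serves as a label
    return strategy_net(state), p
```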
Step S103, based on layered reinforcement learning, sequentially and alternately performing strategy promotion and strategy evaluation on a strategy neural network and a value neural network in the agent according to a plurality of subtasks and corresponding reward functions thereof to obtain a trained strategy neural network model; the plurality of subtasks are obtained by decomposing a target task of the robot through the task planning module, and the reward function is constructed by the task planning module according to the subtasks;
in the embodiment of the application, through layered reinforcement learning, a plurality of subtasks are divided into a target task according to a task planning module, corresponding reward functions are constructed according to the subtasks, policy promotion and policy evaluation are respectively carried out on a policy neural network and a value neural network in an intelligent body in a dynamic planning mode, the difficulty of each subtask is ensured to be matched with the decision-making capability of the intelligent body in a corresponding learning stage, an optimal policy neural network is obtained, the robot can adaptively optimize controller parameters according to the environment condition of the robot and the state of the robot under the condition of no manual parameter adjustment, and the environment adaptability and the robustness of the robot are improved.
In some optional embodiments, based on hierarchical reinforcement learning, the agent performs strategy promotion on the strategy neural network according to formula (6) and performs strategy evaluation on the value neural network according to formula (7):

θ^π ← argmax_{θ^π} (1/L) Σ_l Σ_t Q(S_t, π(S_t))    (6)

θ^Q ← argmin_{θ^Q} (1/L) Σ_l Σ_t (R_t + γ·Q_{t+1} - Q_t)^2    (7)

where θ^π represents the weights and biases of the strategy neural network, L represents the total number of trajectories of the simulation training, Q(S_t, A_t) represents the value neural network of the agent, A_t is the parameter optimization value of the controller, S_t is the state observation value of the robot collected by a sensor of the robot, and π(S_t) represents the output of the strategy neural network at time t; θ^Q represents the weights and biases of the value neural network, R_t represents the reward function value of the corresponding task at time t, γ is a discount factor with a value range of (0, 1), Q_{t+1} represents the output of the value neural network at time t + 1, and Q_t represents the output of the value neural network at time t.
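In actor-critic terms, the two alternating steps can be sketched as below; this is a simplified single-batch update under assumed optimizer settings and an assumed reconstruction of formulas (6) and (7), not the patent's exact procedure:

```python
import torch

def strategy_evaluation(value_net, value_opt, S, A, R, S_next, strategy_net, gamma=0.99):
    """Update θ^Q by minimizing the TD error (R_t + γ·Q_{t+1} - Q_t)^2."""
    with torch.no_grad():
        target = R + gamma * value_net(S_next, strategy_net(S_next)).squeeze(-1)
    td_error = target - value_net(S, A).squeeze(-1)
    loss = (td_error ** 2).mean()
    value_opt.zero_grad()
    loss.backward()
    value_opt.step()

def strategy_promotion(strategy_net, strategy_opt, value_net, S):
    """Update θ^π by ascending Q(S_t, π(S_t)); only the strategy network's
    parameters are stepped, so the value network is left unchanged."""
    loss = -value_net(S, strategy_net(S)).mean()
    strategy_opt.zero_grad()
    loss.backward()
    strategy_opt.step()
```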
In a specific example, the task planning module of the agent judges the learning progress of the plurality of subtasks according to formula (8) until the last subtask is completed, so as to obtain the strategy neural network model. Formula (8) compares statistics of the task reward values R_t^{l_n} accumulated over recent training trajectories, and of a Boolean fall indicator (written here as fall^{l_n}), against the thresholds ε and δ (the detailed expressions are given as equation images in the original publication); l_n, l_m and l_i respectively denote the n-th, m-th and i-th training trajectories, n, m and i being positive integers; R_t^{l_n} denotes the task reward value corresponding to the t-th time step in trajectory l_n; when fall^{l_n} is true, the robot has fallen; and ε and δ are different preset thresholds.
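One plausible reading of this progression test is sketched below, under the assumption that the module averages recent per-trajectory task rewards against ε and recent fall rates against δ; the exact criterion of formula (8) is available only as equation images:

```python
import numpy as np

def subtask_completed(trajectory_returns, trajectory_falls, window=20,
                      eps=1.0, delta=0.1):
    """Advance to the next subtask when the mean return of the last `window`
    trajectories exceeds eps and their fall rate stays below delta (assumed)."""
    if len(trajectory_returns) < window:
        return False
    mean_return = np.mean(trajectory_returns[-window:])
    fall_rate = np.mean(trajectory_falls[-window:])  # falls recorded as 0/1
    return mean_return > eps and fall_rate < delta
```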
Further, according to the target task, a control parameter optimization value is output to the controller by the strategy neural network model π* selected according to formula (9), so that the robot is controlled by the controller according to the control parameter optimization value; π* is the strategy neural network model that maximizes the expected cumulative reward, i.e. the trained, optimal strategy neural network model (the exact objective of formula (9) is given as an equation image in the original publication), and R_t represents the reward function output by the task planning module.
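At deployment time the trained strategy network simply runs inside the control loop; the sketch below is illustrative, with hypothetical sensor and controller interfaces:

```python
import torch

def control_step(strategy_net, sensors, controller):
    """One adaptive control cycle with the trained model: observe the state,
    output optimized controller parameters, then let the layered controller act."""
    obs = torch.as_tensor(sensors.read(), dtype=torch.float32)  # hypothetical sensor API
    with torch.no_grad():
        params = strategy_net(obs).numpy()  # step length, step frequency,
                                            # leg raising height, contact force
    controller.set_gait(*params)            # hypothetical controller API
    controller.execute()
```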
In the embodiment of the application, an agent comprising a strategy neural network, a value neural network and a task planning module is constructed in a simulation environment, control parameters of a robot controller are used as samples through guided reinforcement learning, a label of supervised learning is provided for output actions of the strategy neural network of the agent, decisions of the strategy neural network are effectively guided to avoid a known low-return space region, exploration time is reduced, transition from the supervised learning to autonomous learning is realized according to probability in a training process, and the generalization capability of the agent can be effectively improved; through layered reinforcement learning, a plurality of subtasks are divided into a target task according to a task planning module, corresponding reward functions are constructed according to the subtasks, strategy promotion and strategy evaluation are respectively carried out on a strategy neural network and a value neural network in an intelligent body in a dynamic planning mode, the difficulty of each subtask is ensured to be matched with the decision-making capability of the intelligent body in a corresponding learning stage, an optimal strategy neural network is obtained, the robot can adaptively optimize controller parameters according to the environment condition of the robot and the state of the robot under the condition of no manual parameter adjustment, and the environment adaptability and the robustness of the robot are improved.
Exemplary System
FIG. 4 is a schematic mechanism diagram of a robot motion parameter adaptive control system based on deep reinforcement learning according to some embodiments of the present application; as shown in fig. 4, the robot motion parameter adaptive control system based on deep reinforcement learning includes: an agent building unit 401, a first learning unit 402, a second learning unit 403 and an optimization unit 404.
Agent building unit 401 is configured to build agents in a simulation environment, the agents including: a strategic neural network, a value neural network and a mission planning module.
The first learning unit 402 is configured to, based on guided reinforcement learning, train the strategy neural network in the agent according to the sample parameters and the formulas:

A_{l,t} = controller(S_{l,t}) with probability p; A_{l,t} = π(S_{l,t}) with probability 1 - p

p = p_0 * 0.99^(t + l*T)

wherein the sample parameters are the control parameters of a controller of the robot; A_{l,t} represents the control parameters to be optimized in the controller, l represents the trajectory number of the simulation training of the robot in the simulation environment, t is the time step of the simulation training, controller(S_{l,t}) represents the output of the controller of the robot, π(S_{l,t}) represents the output of the strategy neural network, p represents the transition probability with which the strategy neural network transitions from supervised learning to autonomous learning, and p_0 is the initial value of the transition probability.
The second learning unit 403 is configured to perform policy promotion and policy evaluation on the policy neural network and the value neural network in the agent in turn and alternately according to a plurality of subtasks and reward functions corresponding to the subtasks based on hierarchical reinforcement learning, so as to obtain a trained policy neural network model; the plurality of subtasks are obtained by decomposing a target task of the robot through the task planning module, and the reward function is constructed by the task planning module according to the subtasks;
the optimization unit 404 is configured to output a control parameter optimization value to the controller according to the target task based on the trained strategic neural network model, so that the robot is controlled by the controller according to the control parameter optimization value.
In some optional embodiments, the agent building unit 401 is further configured to build a strategy neural network of the agent in the simulation environment according to the first fitting function:

A_t = π(S_t)

wherein A_t is the parameter optimization value of the controller and S_t is the state observation value of the robot collected by a sensor of the robot.
In some optional embodiments, the state observation of the robot is collected by an inertial measurement unit and a foot end pressure sensor of the robot, wherein the state observation comprises: leg phase, touchdown detection, quaternion, angular velocity, and linear acceleration of the robot.
In some optional embodiments, the agent building unit 401 is further configured to build a value neural network of the agent in the simulation environment according to the state observation value of the robot, collected by the sensor of the robot, and the parameter optimization value of the controller, according to the second fitting function:

Q_t = Q(S_t, A_t)

wherein Q_t represents the evaluation of the controller parameter optimization value A_t output by the strategy neural network, and S_t is the state observation value of the robot collected by a sensor of the robot.
In some optional embodiments, the agent building unit 401 is further configured to build a task planning module of the agent in the simulation environment according to the state observation value of the robot collected by the sensor of the robot and the environmental return, according to formulas that define the subtask reward R_t as a weighted combination of three environmental return terms (the detailed expressions are given as equation images in the original publication); wherein R_t is the reward function value of a subtask, and the environmental return r_t comprises: the walking distance r_t^dis of the robot, the body stability r_t^sta of the robot, and the energy consumption r_t^eng of the robot; r_t^dis is the forward advancing distance of the robot along the x-axis in the simulation environment; r_t^sta is computed from the rotation angles of the robot around the coordinate axes x, y and z in the simulation environment and from the offsets of the robot with respect to the y-axis and z-axis; r_t^eng is computed from the torque of the motors of the robot, the motor speed of the robot, and Δt, a time period representing the time taken by the robot to walk each step during simulation training; and α, β and μ are weight coefficients determined according to the subtasks.
In some optional embodiments, the second learning unit 403 is further configured so that, based on hierarchical reinforcement learning, the agent performs strategy promotion on the strategy neural network according to the formula:

θ^π ← argmax_{θ^π} (1/L) Σ_l Σ_t Q(S_t, π(S_t))

and performs strategy evaluation on the value neural network according to the formula:

θ^Q ← argmin_{θ^Q} (1/L) Σ_l Σ_t (R_t + γ·Q_{t+1} - Q_t)^2

wherein θ^π represents the weights and biases of the strategy neural network, L represents the total number of trajectories of the simulation training, Q(S_t, A_t) represents the value neural network of the agent, A_t is the parameter optimization value of the controller, S_t is the state observation value of the robot collected by a sensor of the robot, and π(S_t) represents the output of the strategy neural network at time t;

θ^Q represents the weights and biases of the value neural network, R_t represents the reward function value of the corresponding task at time t, γ is a discount factor with a value range of (0, 1), Q_{t+1} represents the output of the value neural network at time t + 1, and Q_t represents the output of the value neural network at time t.
In some optional embodiments, the second learning unit 403 is further configured so that the task planning module of the agent judges the learning progress of the plurality of subtasks until the last subtask is completed, to obtain the strategy neural network model; the judgment compares statistics of the task reward values R_t^{l_n} accumulated over recent training trajectories, and of a Boolean fall indicator (written here as fall^{l_n}), against the thresholds ε and δ (the detailed expressions are given as equation images in the original publication);

wherein l_n, l_m and l_i respectively denote the n-th, m-th and i-th training trajectories, n, m and i being positive integers; R_t^{l_n} denotes the task reward value corresponding to the t-th time step in trajectory l_n; when fall^{l_n} is true, the robot has fallen; and ε and δ are different preset thresholds.
In some optional embodiments, the optimization unit 404 is further configured to, according to the target task, output a control parameter optimization value to the controller using the strategy neural network model π* that maximizes the expected cumulative reward (the exact objective is given as an equation image in the original publication), so that the controller controls the robot according to the control parameter optimization value; R_t represents the reward function output by the task planning module.
In some optional embodiments, the parameter optimization values of the controller include: the step length, the step frequency, the leg raising height and the ground contact force of the robot.
FIG. 5 is a schematic diagram of a system architecture for adaptive control of motion parameters of a quadruped robot based on deep reinforcement learning according to some embodiments of the present application; as shown in fig. 5, in the architecture diagram of the adaptive control system for motion parameters of the quadruped robot based on deep reinforcement learning, a controller for controlling the motion of the quadruped robot adopts a layered control architecture, which includes a high-level leg trajectory control, a gait control and a bottom-level leg control, wherein the leg control is used for solving the stability problem, the gait control is used for solving the coordinated motion problem of four legs of the robot, and the interaction between the robot and the ground is accurately modeled during the leg trajectory control.
The robot motion parameter adaptive control system based on deep reinforcement learning provided by the embodiment of the application can realize the processes and steps of any robot motion parameter adaptive control method embodiment based on deep reinforcement learning, and achieve the same technical effect, and is not repeated here.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A robot motion parameter self-adaptive control method based on deep reinforcement learning is characterized by comprising the following steps:
step S101, constructing an intelligent agent in a simulation environment, wherein the intelligent agent comprises: a strategy neural network, a value neural network and a task planning module;
step S102, based on guided reinforcement learning, training the strategy neural network in the agent according to the sample parameters and the formulas:

A_{l,t} = controller(S_{l,t}) with probability p; A_{l,t} = π(S_{l,t}) with probability 1 - p

p = p_0 * 0.99^(t + l*T)

wherein the sample parameters are the control parameters of a controller of the robot; A_{l,t} represents the control parameters to be optimized in the controller, l represents the trajectory number of the simulation training of the robot in the simulation environment, t is the time step of the simulation training, controller(S_{l,t}) represents the output of the controller of the robot, π(S_{l,t}) represents the output of the strategy neural network, p represents the transition probability with which the strategy neural network transitions from supervised learning to autonomous learning, and p_0 is a preset initial value of the transition probability;
step S103, based on layered reinforcement learning, sequentially and alternately performing strategy promotion and strategy evaluation on a strategy neural network and a value neural network in the agent according to a plurality of subtasks and corresponding reward functions thereof to obtain a trained strategy neural network model; the plurality of subtasks are obtained by decomposing a target task of the robot through the task planning module, and the reward function is constructed by the task planning module according to the subtasks;
and S104, outputting a control parameter optimization value to the controller according to the target task based on the trained strategy neural network model, so that the robot is controlled by the controller according to the control parameter optimization value.
2. The adaptive control method for motion parameters of a robot based on deep reinforcement learning of claim 1, wherein in step S101,
according to the first fitting function:

A_t = π(S_t)

building a strategy neural network of the agent in the simulation environment, wherein A_t is the parameter optimization value of the controller and S_t is the state observation value of the robot collected by a sensor of the robot.
3. The adaptive control method for motion parameters of a robot based on deep reinforcement learning of claim 1, wherein in step S101,
according to the state observation value of the robot, collected by the sensor of the robot, and the parameter optimization value of the controller, according to a second fitting function:

Q_t = Q(S_t, A_t)

building a value neural network of the agent in the simulation environment, wherein Q_t represents the evaluation of the controller parameter optimization value A_t output by the strategy neural network, and S_t is the state observation value of the robot collected by a sensor of the robot.
4. The adaptive control method for motion parameters of a robot based on deep reinforcement learning of claim 1, wherein in step S101,
in the simulation environment, according to the state observation value and the environmental return of the robot, collected by a sensor of the robot, building a task planning module of the agent according to formulas that define the subtask reward R_t as a weighted combination of three environmental return terms (the detailed expressions are given as equation images in the original publication);

wherein R_t is the reward function value of a subtask, and the environmental return r_t comprises: the walking distance r_t^dis of the robot, the body stability r_t^sta of the robot, and the energy consumption r_t^eng of the robot; r_t^dis is the forward advancing distance of the robot along the x-axis in the simulation environment; r_t^sta is computed from the rotation angles of the robot around the coordinate axes x, y and z in the simulation environment and from the offsets of the robot with respect to the y-axis and z-axis; r_t^eng is computed from the torque of the motors of the robot, the motor speed of the robot, and Δt, a time period representing the time taken by the robot to walk each step during simulation training; and α, β and μ are weight coefficients determined according to the subtasks.
5. The adaptive control method for motion parameters of a robot based on deep reinforcement learning of claim 1, wherein in step S103,
based on layered reinforcement learning, the agent performs strategy promotion on the strategy neural network according to the formula:

θ^π ← argmax_{θ^π} (1/L) Σ_l Σ_t Q(S_t, π(S_t))

and performs strategy evaluation on the value neural network according to the formula:

θ^Q ← argmin_{θ^Q} (1/L) Σ_l Σ_t (R_t + γ·Q_{t+1} - Q_t)^2

wherein θ^π represents the weights and biases of the strategy neural network, L represents the total number of trajectories of the simulation training, Q(S_t, A_t) represents the value neural network of the agent, A_t is the parameter optimization value of the controller, S_t is the state observation value of the robot collected by a sensor of the robot, and π(S_t) represents the output of the strategy neural network at time t;

θ^Q represents the weights and biases of the value neural network, R_t represents the reward function value of the corresponding task at time t, γ is a discount factor with a value range of (0, 1), Q_{t+1} represents the output of the value neural network at time t + 1, and Q_t represents the output of the value neural network at time t.
6. The adaptive control method for motion parameters of a robot based on deep reinforcement learning of claim 5, wherein in step S103,
the task planning module of the agent judges the learning progress of the plurality of subtasks according to a formula until the last subtask is completed, to obtain the strategy neural network model; the formula compares statistics of the task reward values R_t^{l_n} accumulated over recent training trajectories, and of a Boolean fall indicator (written here as fall^{l_n}), against the thresholds ε and δ (the detailed expressions are given as equation images in the original publication);

wherein l_n, l_m and l_i respectively denote the n-th, m-th and i-th training trajectories, n, m and i being positive integers; R_t^{l_n} denotes the task reward value corresponding to the t-th time step in trajectory l_n; when fall^{l_n} is true, the robot has fallen; and ε and δ are different preset thresholds.
7. The adaptive control method for motion parameters of a robot based on deep reinforcement learning of claim 6, wherein in step S104,
according to the target task, outputting a control parameter optimization value to the controller by the strategy neural network model π* that maximizes the expected cumulative reward (the exact objective is given as an equation image in the original publication), so that the controller controls the robot according to the control parameter optimization value;

wherein R_t represents the reward function output by the task planning module.
8. A robot motion parameter adaptive control system based on deep reinforcement learning is characterized by comprising:
an agent building unit configured to build an agent in a simulation environment, the agent comprising: a strategy neural network, a value neural network and a task planning module;
a first learning unit configured to, based on guided reinforcement learning, train the strategy neural network in the agent according to the sample parameters and the formulas:

A_{l,t} = controller(S_{l,t}) with probability p; A_{l,t} = π(S_{l,t}) with probability 1 - p

p = p_0 * 0.99^(t + l*T)

wherein the sample parameters are the control parameters of a controller of the robot; A_{l,t} represents the control parameters to be optimized in the controller, l represents the trajectory number of the simulation training of the robot in the simulation environment, t is the time step of the simulation training, controller(S_{l,t}) represents the output of the controller of the robot, π(S_{l,t}) represents the output of the strategy neural network, p represents the transition probability with which the strategy neural network transitions from supervised learning to autonomous learning, and p_0 is the initial value of the transition probability;
the second learning unit is configured to perform strategy promotion and strategy evaluation on the strategy neural network and the value neural network in the agent alternately in sequence according to a plurality of subtasks and reward functions corresponding to the subtasks based on layered reinforcement learning to obtain a trained strategy neural network model; the plurality of subtasks are obtained by decomposing a target task of the robot through the task planning module, and the reward function is constructed by the task planning module according to the subtasks;
and the optimization unit is configured to output a control parameter optimization value to the controller according to the target task based on the trained strategy neural network model, so that the robot is controlled by the controller according to the control parameter optimization value.
CN202110786283.7A 2021-07-12 2021-07-12 Robot motion parameter self-adaptive control method and system based on deep reinforcement learning Active CN113478486B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110786283.7A CN113478486B (en) 2021-07-12 2021-07-12 Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
PCT/CN2022/104735 WO2022223056A1 (en) 2021-07-12 2022-07-08 Robot motion parameter adaptive control method and system based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110786283.7A CN113478486B (en) 2021-07-12 2021-07-12 Robot motion parameter self-adaptive control method and system based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN113478486A true CN113478486A (en) 2021-10-08
CN113478486B CN113478486B (en) 2022-05-17

Family

ID=77938821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110786283.7A Active CN113478486B (en) 2021-07-12 2021-07-12 Robot motion parameter self-adaptive control method and system based on deep reinforcement learning

Country Status (2)

Country Link
CN (1) CN113478486B (en)
WO (1) WO2022223056A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167136A (en) * 2022-07-21 2022-10-11 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
CN115238599A (en) * 2022-06-20 2022-10-25 中国电信股份有限公司 Energy-saving method for refrigerating system and model reinforcement learning training method and device
WO2022223056A1 (en) * 2021-07-12 2022-10-27 上海微电机研究所(中国电子科技集团公司第二十一研究所) Robot motion parameter adaptive control method and system based on deep reinforcement learning
CN115533905A (en) * 2022-10-09 2022-12-30 清华大学 Virtual and real transfer learning method and device of robot operation technology and storage medium
CN116713999A (en) * 2023-08-07 2023-09-08 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN116834018A (en) * 2023-08-07 2023-10-03 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN118070840A (en) * 2024-04-19 2024-05-24 中国海洋大学 Multi-foot robot static standing posture analysis method, system and application

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117086866B (en) * 2023-08-07 2024-04-12 广州中鸣数码科技有限公司 Task planning training method and device based on programming robot

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11429854B2 (en) * 2016-12-04 2022-08-30 Technion Research & Development Foundation Limited Method and device for a computerized mechanical device
US10786900B1 (en) * 2018-09-27 2020-09-29 Deepmind Technologies Limited Robot control policy determination through constrained optimization for smooth continuous control
CN110861084B (en) * 2019-11-18 2022-04-05 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN111208822A (en) * 2020-02-17 2020-05-29 清华大学深圳国际研究生院 Quadruped robot gait control method based on reinforcement learning and CPG controller
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
CN113478486B (en) * 2021-07-12 2022-05-17 上海微电机研究所(中国电子科技集团公司第二十一研究所) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0901053A1 (en) * 1997-09-04 1999-03-10 Rijksuniversiteit te Groningen Method for modelling and/or controlling a production process using a neural network and controller for a production process
US9008840B1 (en) * 2013-04-19 2015-04-14 Brain Corporation Apparatus and methods for reinforcement-guided supervised learning
CN109693239A (en) * 2018-12-29 2019-04-30 深圳市越疆科技有限公司 A robot grasping method based on deep reinforcement learning
US20210086355A1 (en) * 2019-09-19 2021-03-25 Lg Electronics Inc. Control server and method for controlling robot using artificial neural network, and robot implementing the same
US20210089891A1 (en) * 2019-09-24 2021-03-25 Hrl Laboratories, Llc Deep reinforcement learning based method for surreptitiously generating signals to fool a recurrent neural network
CN111176122A (en) * 2020-02-11 2020-05-19 哈尔滨工程大学 Underwater robot parameter self-adaptive backstepping control method based on double BP neural network Q learning technology
CN111421538A (en) * 2020-03-31 2020-07-17 西安交通大学 Deep reinforcement learning robot control method based on prioritized experience replay
CN111638646A (en) * 2020-05-29 2020-09-08 平安科技(深圳)有限公司 Four-legged robot walking controller training method and device, terminal and storage medium
CN112338921A (en) * 2020-11-16 2021-02-09 西华师范大学 Mechanical arm intelligent control rapid training method based on deep reinforcement learning
CN112621714A (en) * 2020-12-02 2021-04-09 上海微电机研究所(中国电子科技集团公司第二十一研究所) Upper limb exoskeleton robot control method and device based on LSTM neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
MA, JC et al.: "Multi-robot Target Encirclement Control with Collision Avoidance via Deep Reinforcement Learning", Journal of Intelligent & Robotic Systems *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022223056A1 (en) * 2021-07-12 2022-10-27 上海微电机研究所(中国电子科技集团公司第二十一研究所) Robot motion parameter adaptive control method and system based on deep reinforcement learning
CN115238599A (en) * 2022-06-20 2022-10-25 中国电信股份有限公司 Energy-saving method for refrigerating system and model reinforcement learning training method and device
CN115238599B (en) * 2022-06-20 2024-02-27 中国电信股份有限公司 Energy-saving method and model reinforcement learning training method and device for refrigerating system
CN115167136A (en) * 2022-07-21 2022-10-11 中国人民解放军国防科技大学 Intelligent agent control method based on deep reinforcement learning and conditional entropy bottleneck
CN115533905A (en) * 2022-10-09 2022-12-30 清华大学 Virtual and real transfer learning method and device of robot operation technology and storage medium
CN115533905B (en) * 2022-10-09 2024-06-04 清华大学 Virtual-real transfer learning method and device for robot operation skills and storage medium
CN116713999A (en) * 2023-08-07 2023-09-08 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN116834018A (en) * 2023-08-07 2023-10-03 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN116713999B (en) * 2023-08-07 2023-10-20 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN118070840A (en) * 2024-04-19 2024-05-24 中国海洋大学 Multi-foot robot static standing posture analysis method, system and application

Also Published As

Publication number Publication date
CN113478486B (en) 2022-05-17
WO2022223056A1 (en) 2022-10-27

Similar Documents

Publication Title
CN113478486A (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
JP4587738B2 (en) Robot apparatus and robot posture control method
US8996177B2 (en) Robotic training apparatus and methods
CN101916071B (en) CPG feedback control method of biomimetic robot fish movement
US20170001309A1 (en) Robotic training apparatus and methods
CN113821045B (en) Reinforcement learning action generation system for a legged robot
Wu et al. Neurally controlled steering for collision-free behavior of a snake robot
CN113093779B (en) Robot motion control method and system based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN114047697B (en) Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
CN114563954A (en) Quadruped robot motion control method based on reinforcement learning and position increment
Tieck et al. Generating pointing motions for a humanoid robot by combining motor primitives
CN117555339B (en) Policy network training method and humanoid biped robot gait control method
Hu et al. Hybrid learning architecture for fuzzy control of quadruped walking robots
CN116062059B (en) Single-leg robot continuous jump control method based on deep reinforcement learning
Jiang et al. Stable skill improvement of quadruped robot based on privileged information and curriculum guidance
CN117572877B (en) Biped robot gait control method, biped robot gait control device, storage medium and equipment
CN117140527B (en) Mechanical arm control method and system based on deep reinforcement learning algorithm
CN117311271A (en) Deep reinforcement learning robot motion control method and system based on prior knowledge
CN114771783B (en) Control method and system for submarine stratum space robot
Zhu et al. Learning of Quadruped Robot Motor Skills Based on Policy Constrained TD3
CN117518821A (en) Fault feature extraction-based spine quadruped robot fault-tolerant gait control method
Imaduddin et al. Intelligent Biped Robot Simulation Locomotion using Deep Q Network
Zhou et al. The path trajectory planning of swinging legs for humanoid robot
CN117850240A (en) Brain-computer sharing control method and system for air-ground cooperative unmanned system

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant