CN114895560B - Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition - Google Patents


Info

Publication number
CN114895560B
CN114895560B (application CN202210441986.0A)
Authority
CN
China
Prior art keywords
robot
training
action
motor
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210441986.0A
Other languages
Chinese (zh)
Other versions
CN114895560A (en)
Inventor
Wang Yue (王越)
Chen Ci (陈词)
Xiang Pingyu (向平宇)
Wang Dongqi (王东奇)
Yu Jiyu (余纪宇)
Zhang Rui (张睿)
Lu Haojian (陆豪健)
Gao Hongbo (高洪波)
Xiong Rong (熊蓉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Priority to CN202210441986.0A
Publication of CN114895560A
Application granted
Publication of CN114895560B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a self-adaptive control method for object tracking of a foot-type robot under a motor locked-rotor condition, and belongs to the technical field of intelligent robots. A dynamics simulation environment of the foot-type robot under the motor locked-rotor condition is created. An upper-layer controller is trained to output upper-layer control instructions, including forward, backward, left-turn, right-turn and stationary instructions, according to the state of the robot. The lower-layer controller builds a forward dynamics model based on a multilayer perceptron and trains it. When a motor position control instruction needs to be selected, the corresponding lower-layer reward function is obtained from the upper-layer control instruction, the predicted states corresponding to a set of motor actions sampled from a distribution are obtained from the forward dynamics model, the return value of that set of actions is computed from the predicted states and the lower-layer reward function, and the mean and variance of the motor-action distribution are iteratively optimized to increase the return value until the specified number of rounds is reached; the action mean of the final distribution is then executed as the next action of the robot motors.

Description

Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition
Technical Field
The invention relates to the technical field of intelligent robots, in particular to a self-adaptive control method for object tracking of a foot robot under a motor locked-rotor condition.
Background
During robot motion, damage such as motor stalling is unavoidable due to external or internal factors. For a foot-type robot, when a leg motor is locked, the robot will deviate from its intended trajectory if the remaining motors continue to follow the original instructions, which prevents the robot from being reliably deployed in real environments. To still reach the desired target, the robot needs to adapt its control strategy.
Conventional robot damage recovery comprises two steps: first self-diagnosis, then selection of the most suitable pre-designed replacement strategy according to the diagnosis result. However, replacement strategies are difficult to design, because the number of possible damage conditions grows exponentially with robot complexity, so it is hard to consider all cases in advance. In addition, the self-diagnosis result is not always accurate because of sensor noise and the external environment. These factors often render conventional methods ineffective.
Since AlphaGo reignited worldwide enthusiasm for artificial intelligence, deep reinforcement learning, one of its core technologies, has attracted enormous attention. It is an important research direction in machine learning and automatic control, an important means of realizing strong artificial intelligence, and has received extensive attention from academia and industry in the field of intelligent robots. In foot-type robot control, autonomous locomotion in complex and changing environments has been achieved on top of large-scale parallel simulation environments by designing reasonable reward functions, with remarkable results. However, most methods do not take robot damage into account. In field environments, motor lock-up caused by damage is unavoidable, so an adaptive control method for foot-type robots that can cope with motor locked-rotor conditions has important significance and application value.
Disclosure of Invention
Aiming at the problem that existing foot-type robot control methods do not consider motion control under the motor locked-rotor condition, the invention provides an object-tracking self-adaptive control method for a foot-type robot under the motor locked-rotor condition.
Considering the advantages and limitations of hierarchical reinforcement learning algorithms, the technical scheme of the invention is based on the idea of hierarchical reinforcement learning and improves it by combining a dual-source model-free off-policy reinforcement learning method with a model-based reinforcement learning algorithm. The upper-layer control instruction of the robot is obtained with the dual-source model-free off-policy reinforcement learning algorithm, the corresponding lower-layer reward function is selected according to that instruction, and the position control instruction of the robot motors is finally obtained by combining a forward dynamics model with a cross-entropy optimization algorithm. By decomposing the complex task into specific subtasks, the invention alleviates the problems of few positive samples and sparse return values in reinforcement learning training and greatly accelerates the training process.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an object tracking self-adaptive control method for a foot robot under a motor locked-rotor condition comprises the following steps:
(1) Constructing a dynamic simulation environment of the foot robot under the condition of motor locked rotor;
(2) Constructing a forward dynamics model of the foot robot by using the multilayer perceptron;
(3) Training an upper-layer controller based on a dual-source model-free off-policy reinforcement learning method, the upper-layer controller being used to output upper-layer control instructions according to the interaction data between the robot and the environment;
(4) When a motor position control instruction needs to be selected, the trained upper-layer controller is first used to obtain an upper-layer control instruction, and the corresponding lower-layer reward function is selected according to that instruction. The forward dynamics model obtained in step (2) is then used to predict the state increments produced by executing, in the current state, a set of actions sampled from a random Gaussian distribution; the return value of each action is obtained from the state increment and the lower-layer reward function; with the goal of increasing the return value, the mean and variance of the action distribution are iteratively optimized until the specified number of rounds is reached, and the action mean of the final distribution is executed as the next action of the robot motors (see the control-loop sketch after this list).
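For orientation, the four steps can be summarized as the control loop sketched below. The names env, upper_controller, dynamics_model, reward_bank and cem_plan are hypothetical placeholders for the components built in steps (1)-(4); none of them come from the patent itself.

```python
def control_episode(env, upper_controller, dynamics_model, reward_bank, cem_plan,
                    max_steps=500):
    """One tracking episode: the upper layer picks an instruction, the lower layer plans motor actions."""
    state = env.reset()                              # also re-samples the locked motor(s)
    for _ in range(max_steps):
        d, theta = env.target_observation()          # relative distance/angle to the target (hypothetical accessor)
        instruction = upper_controller.act((d, theta))   # forward / backward / left / right / stationary
        if instruction == "stationary":
            action = env.last_motor_positions()      # keep the motor positions of the previous step
        else:
            lower_reward = reward_bank[instruction]  # lower-layer reward selected by the instruction
            action = cem_plan(dynamics_model, state, lower_reward)
        state, _, done, _ = env.step(action)
        if done:
            break
```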
As a preferred aspect of the present invention, the step (1) specifically includes:
constructing a three-dimensional model of a base, a left forelimb, a left forecross connecting piece, a right forelimb, a right forecross connecting piece, a left hindlimb, a left rear cross connecting piece, a right hindlimb, a right rear cross connecting piece and a camera component of the foot robot, and exporting the three-dimensional model into corresponding STL files;
the method comprises the steps of constructing an XML file of a foot robot, importing an STL file constructed in the last step under a < asset > tag, connecting all components by using a limiting rotating joint under a < world > tag, and defining all parameters of the limiting rotating joint under a < actuator > tag; after the XML file is built, loading the XML file into a physical simulation environment, checking whether the pose of each component meets the expected requirement, and if not, adjusting the figure attribute value in the < body > tag under the < world > tag to adjust the pose, and adjusting the position of the pos attribute value until each component reaches the expected pose;
an interactive interface between the simulation environment and the algorithm is built: reset(·) is used to initialize the state of the simulation environment, close(·) is used to close the simulation environment model, step(·) is used to execute a one-step action, change_env(·) is used to realize the motor locked-rotor condition, and change_env(·) is called inside the reset(·) function.
Preferably, the input of the forward dynamics model is the current state s_t of the foot-type robot and the action a_t performed by the robot in the current state; the output of the forward dynamics model is the predicted state change Δ̂_t after the robot performs the current action a_t, and the next state s_{t+1} of the robot is obtained from the predicted state change.
The training process of the forward dynamics model is as follows:
firstly, the current state s_t of the robot is acquired and a random action a_t is given; the robot interacts with the environment in the dynamics simulation environment to obtain the next state s_{t+1} after the interaction; this operation is run in a loop to obtain a series of interaction samples (s_t, a_t, s_{t+1}) used as the training data set for the first round of training of the forward dynamics model;
the data set for the (i+1)-th round of training is acquired by using the forward dynamics model obtained after the i-th round of training and the cross-entropy optimization algorithm, which comprises: acquiring the current state s_t of the robot and giving a set of action values sampled from a random Gaussian distribution; obtaining, with the forward dynamics model after the i-th round of training, the predicted state s_{t+1} corresponding to each action, and obtaining the value of the lower-layer reward function from the predicted state; with the goal of increasing the lower-layer reward, continuously optimizing the mean and variance of the action distribution; after the specified number of rounds is reached, taking the action mean of the final distribution as the optimal action, and forming an interaction sample from the current state of the robot, the optimal action and the next state after the robot executes the optimal action as the training data set for the (i+1)-th round of training of the forward dynamics model, where i ≥ 1; repeating the process until the preset number of training rounds or convergence is reached;
the loss function of the training process is the predicted state change quantityAnd the actual state change amount delta t Square sum of the mean square error value of (c) and the network weight parameter.
Preferably, the current state s_t of the robot comprises the velocity of the robot base centroid, its Euler angles and rotation matrix, and the positions and velocities of all motors; the action a_t is the position control instruction of all motors of the robot.
Preferably, the upper layer control instruction includes: forward instructions, backward instructions, left turn instructions, right turn instructions, and stationary instructions.
Preferably, the dual-source model-free off-policy reinforcement learning method is realized with a SAC-discrete model, and the training process of the SAC-discrete model comprises:
the first stage: acquiring a front image shot by a camera installed on a robot in a dynamic simulation environment, obtaining a relative distance d and a relative angle theta between the robot and a tracking target, and constructing training data based on a finite state machine principle, wherein the method comprises the following steps:
when d ≤ 20 cm, executing a backward instruction if |θ| ≤ 15°, executing a left-turn instruction if θ > 15°, and executing a right-turn instruction if θ < -15°;
when d ≥ 25 cm, executing a forward instruction if |θ| ≤ 18°, executing a left-turn instruction if θ > 18°, and executing a right-turn instruction if θ < -18°;
when 20 cm < d < 25 cm, executing a left-turn instruction if θ > 10°, executing a right-turn instruction if θ < -10°, and executing a stationary instruction if |θ| ≤ 10°;
the relative distance d and the relative angle θ between the robot and the tracking target are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u; the upper-layer control instruction is used as the action a_t^u in reinforcement learning; an upper-layer reward function r_t^u is designed according to the relation between the robot and the tracking target; training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are constructed for training the SAC-discrete model, and the SAC-discrete model is trained in the first stage with these samples;
stage i+1: the SAC-discrete model after the i-th stage training is used for collecting training samples for the i+1-th stage training, and the training samples are specifically as follows:
the image in front of the robot is acquired from the camera mounted on the robot in the dynamics simulation environment, and the relative distance d and relative angle θ between the robot and the tracking target are obtained; d and θ are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u. The state s_t^u is fed as input to the SAC-discrete model trained in the i-th stage, which outputs the action a_t^u; the upper-layer reward function r_t^u is designed according to the relation between the robot and the tracking target, and training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are constructed for the (i+1)-th stage of training. The SAC-discrete model continues to be trained in the (i+1)-th stage with these samples, where i ≥ 1. The process is repeated until the preset number of training rounds or convergence is reached.
As a preferred embodiment of the present invention, the upper-layer reward function is designed as follows:
the optimal distance between the robot and the tracking target is set to d_opt; when the robot collides with the tracking target, the upper-layer reward function r_t^u = -500; when the tracking target leaves the robot's field of view, the upper-layer reward function r_t^u = -100; in other cases, the upper-layer reward function r_t^u is determined by the relation between the robot and the tracking target. Preferably, in step (3), the upper-layer control instruction is obtained with the dual-source model-free off-policy reinforcement learning method, and when the upper-layer control instruction is a stationary instruction, the robot directly keeps the motor positions of the previous step;
when the upper-layer control instruction is a forward or backward instruction, the lower-layer reward function r_t^l is a weighted combination with weight coefficients w_i (i = 1, 2, 3, 4), where Δx_base denotes the distance the robot base centroid advances along the x-axis direction, (r, p, y) denotes the Euler angles of the robot base centroid, and the remaining term is the sum of all joint velocities;
when the upper-layer control instruction is a left-turn or right-turn instruction, the lower-layer reward function is designed in terms of the target steering angle y_des, where y_des denotes the target steering angle.
Preferably, when motor action selection is performed with the cross-entropy optimization method, a set of action values sampled from a random Gaussian distribution is first given; for each action, the current state s_t is combined with the forward dynamics model to predict the next state s_{t+1}.
The lower-layer reward function is determined from the upper-layer control instruction, and its value is calculated from the predicted next state s_{t+1}.
After the lower-layer reward value corresponding to each set of motor actions is obtained, the actions corresponding to the top 20% of reward values are selected, their mean and variance are calculated, a Gaussian distribution is fitted to them, and a new set of actions is sampled from that Gaussian distribution.
The above steps are repeated until the preset number of steps is reached, and the mean of the latest Gaussian distribution is taken as the action to be executed by the motors in the next step.
Compared with the prior art, the object tracking self-adaptive control method for the foot robot under the motor locked-rotor condition has the following advantages:
(1) The invention adopts a hierarchical reinforcement learning algorithm (the upper layer is based on a dual-source model-free off-policy reinforcement learning method, the lower layer on model-based reinforcement learning), realizes the target tracking task under the motor locked-rotor condition of the foot-type robot, and alleviates the problems of excessive negative samples and sparse rewards in reinforcement learning tasks.
(2) The invention learns the upper-layer control instruction with the dual-source model-free off-policy reinforcement learning method; the data in the experience replay pool come partly from a finite state machine and partly from the interaction of the reinforcement learning algorithm with the environment, which improves sample diversity and helps train a more robust upper-layer controller.
(3) The invention trains the lower-layer motion instructions of the robot with a model-based reinforcement learning method; the trained dynamics model generalizes well, so good predictions are obtained even for motor locked-rotor scenarios not contained in the training set. In addition, the trained dynamics model is applicable to various upper-layer control instructions, only the corresponding reward function needs to be changed, so the method is highly extensible and easy to transfer.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is the network structure of the dual-source model-free off-policy reinforcement learning algorithm, where the solid lines represent the computation flow of the Actor network loss function and the dotted lines represent the computation flow of the Critic network loss functions;
FIG. 3 is the network structure of the forward dynamics model.
Detailed Description
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are meant only to illustrate the invention and not to limit its scope, and that equivalent modifications made by those skilled in the art after reading the invention fall within the scope defined by the appended claims.
The object-tracking self-adaptive control method for the foot-type robot under the motor locked-rotor condition is based on a hierarchical reinforcement learning algorithm: the lower layer uses a model-based reinforcement learning method so that the robot can still walk according to the desired instruction under the motor locked-rotor condition, and the upper layer uses a dual-source model-free off-policy reinforcement learning method, with finite-state-machine interaction data and the algorithm's own interaction data as training data. The method comprises the following steps:
step one:
and constructing a dynamic simulation environment of the foot robot under the motor locked-rotor condition based on Mujoco, constructing a three-dimensional model of 10 components such as a base, a left forelimb, a left forecross connector, a right forelimb, a right forecross connector, a left hindlimb, a left hind cross connector, a right hind limb, a right hind cross connector, a camera and the like of the foot robot by using SolidWorks software, and exporting the three-dimensional model into an STL file.
An XML file of the foot-type robot is constructed: the 10 STL files built in the previous step are first imported under the <asset> tag, the components are then connected with limited-range revolute joints under the <worldbody> tag, and the parameters of the limited-range revolute joints are finally defined under the <actuator> tag. Specifically, when connecting the components, A_motor_1 connects the base and the left front cross connector, with index value 0 (the index values of the subsequent motors increase by one in turn and are not repeated here); A_motor_2 connects the left front cross connector and the left forelimb; B_motor_1 connects the base and the right front cross connector; B_motor_2 connects the right front cross connector and the right forelimb; C_motor_1 connects the base and the left rear cross connector; C_motor_2 connects the left rear cross connector and the left hindlimb; D_motor_1 connects the base and the right rear cross connector; D_motor_2 connects the right rear cross connector and the right hindlimb. The motion range of A_motor_1, B_motor_1, C_motor_1 and D_motor_1 is -90° to 90°, and that of A_motor_2, B_motor_2, C_motor_2 and D_motor_2 is -90° to 20°; these range limits ensure that the robot does not interfere with surrounding components during motion.
After the XML file is built, it is loaded into the MuJoCo physics simulation environment and the pose of each component is checked; if a pose does not meet the requirement, the orientation attribute values in the corresponding <body> tag under the <worldbody> tag are adjusted, and the pos attribute values are adjusted, until each component reaches the expected pose.
An interface for interaction between the simulation environment and the algorithm is built based on Gym, where reset(·) is used to initialize the state of the simulation environment, close(·) to close the simulation environment model, step(·) to execute a one-step action, and change_env(·) to realize the motor locked-rotor condition; change_env(·) is called inside the reset(·) function. When the algorithm interacts with the environment, the reset(·) function is executed first: the robot's centroid position in the simulation environment is restored to the motion starting point and each joint is reset; the change_env(·) function is then called, one (or two) numbers are selected from 0-7 as motor index values, and the action instruction corresponding to each selected motor is reset to 0 to indicate that the motor is locked and cannot execute the original action. To visualize the locked motor, the color of the next-stage component connected to it is changed through the geom_rgba interface provided by Gym (for example, the left front cross connector or the right forelimb turns red when its motor is locked). The reset(·) function returns the current state of the simulation environment to the control algorithm, the control algorithm outputs an action value according to the current state, the step(·) function executes that action value and returns the newly obtained state to the control algorithm, and this loop continues until the preset maximum episode length is reached; state collection then ends and the simulation environment model is closed with the close(·) function.
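A minimal sketch of such a Gym-style wrapper is shown below, assuming the simulation is driven through a mujoco-py MjSim handle passed in by the caller. The class name, the qpos/qvel-based state stand-in and the recoloring comment are illustrative assumptions, not identifiers from the patent.

```python
import numpy as np
import gym

class StalledLeggedRobotEnv(gym.Env):
    """Sketch of the step-one interface: reset / step / close / change_env."""

    NUM_MOTORS = 8

    def __init__(self, sim):
        self.sim = sim               # e.g. a mujoco_py.MjSim built from the robot XML
        self.stalled = []            # indices of locked motors

    def change_env(self):
        """Pick 1-2 motor indices in 0..7 whose commands will be forced to 0."""
        n_stalled = int(np.random.choice([1, 2]))
        self.stalled = list(np.random.choice(self.NUM_MOTORS, n_stalled, replace=False))
        # optionally recolor the geom attached to each stalled motor, e.g.
        # self.sim.model.geom_rgba[geom_id] = (1, 0, 0, 1), so the fault is visible

    def reset(self):
        self.sim.reset()             # restore the base to the start pose, reset the joints
        self.change_env()            # called inside reset(), as described above
        return self._get_state()

    def step(self, action):
        action = np.asarray(action, dtype=float).copy()
        action[self.stalled] = 0.0   # a locked motor cannot execute its position command
        self.sim.data.ctrl[:] = action
        self.sim.step()
        # rewards are computed outside by the upper/lower controllers, so 0.0 is returned here
        return self._get_state(), 0.0, False, {}

    def close(self):
        del self.sim

    def _get_state(self):
        # stand-in for the full state (base-centroid velocity, Euler angles,
        # rotation matrix, motor positions and velocities)
        return np.concatenate([self.sim.data.qvel.flat, self.sim.data.qpos.flat])
```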
Step two:
and (3) collecting data of robot and environment interaction by using the foot-type robot dynamics simulation environment under the motor locked-rotor condition constructed in the step one, wherein the collected data are used as a training set and a verification set of a forward dynamics model, at the moment, the index value of the motor for limiting the locked-rotor of the robot is in a range of 0,3,4,7 four values, and the rest index values correspond to a test set.
In the first round of collection there is not yet a trained dynamics model, so the cross-entropy method cannot be used to obtain robot motion instructions; random actions are therefore used to collect interaction data. For example: the current state s_t of the robot is first acquired and a random action a_t is given; the robot interacts with the environment in the dynamics simulation environment to obtain the next state s_{t+1} after the interaction; this operation is run in a loop to obtain a series of interaction samples (s_t, a_t, s_{t+1}) used as the training data set for the first round of training of the forward dynamics model. Note that when random actions are executed the robot joints must be safely limited, to prevent the robot from rolling over during motion and producing too many failed samples, which would cause a sparse-reward problem.
From the second round of collection onward, because a trained forward dynamics model is available, the cross-entropy optimization method is used to obtain optimized robot motion instructions for collecting interaction data. The interaction data comprise the current state s_t, the action a_t performed in that state, and the next state s_{t+1} reached after the action is executed, where the state s_t comprises the velocity, Euler angles and rotation matrix of the robot base centroid and the positions and velocities of the eight motors, and a_t is the position control instruction of the eight robot motors. When the amount of collected interaction data reaches the preset maximum trajectory length, the episode is terminated and the collected data are processed; specifically, subtracting the values of the corresponding entries of s_t from s_{t+1} yields the state change Δ_t of the same dimension, which is used for the subsequent dynamics modeling.
In one embodiment of the present invention, from the second round of collection onward the forward dynamics model can be trained iteratively. The data set for the (i+1)-th round of training is acquired with the forward dynamics model obtained after the i-th round of training and the cross-entropy optimization algorithm, as follows: the current state s_t of the robot is acquired and a set of action values sampled from a random Gaussian distribution is given; the forward dynamics model after the i-th round of training is used to obtain the predicted state s_{t+1} corresponding to each action, and the value of the lower-layer reward function is obtained from the predicted state. With the goal of increasing the lower-layer reward, the mean and variance of the action distribution are iteratively optimized; after the specified number of rounds is reached, the action mean of the final distribution is taken as the optimal action, and the current state of the robot, the optimal action, and the next state after the robot executes the optimal action form an interaction sample used as the training data set for the (i+1)-th round of training of the forward dynamics model, where i ≥ 1. The process is repeated until the preset number of training rounds or convergence is reached;
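A compact sketch of this alternation between data collection and model training, assuming hypothetical helpers train_dynamics_model (the supervised update described next) and cem_plan (the cross-entropy planner described in step four); the action-clipping range and the round/horizon counts are assumed values.

```python
import numpy as np

def collect_and_train(env, model, cem_plan, train_dynamics_model,
                      lower_reward, n_rounds=10, horizon=200):
    """Alternate between (re)training the dynamics model and collecting new data with it."""
    dataset = []                                    # tuples (s_t, a_t, s_{t+1})
    for rnd in range(n_rounds):
        state = env.reset()
        for _ in range(horizon):
            if rnd == 0:
                # round 1: no trained model yet, so use safely-clipped random actions
                action = np.random.uniform(-0.3, 0.3, size=env.NUM_MOTORS)
            else:
                # round i+1: plan with the model trained after round i
                action = cem_plan(model, state, lower_reward)
            next_state, _, done, _ = env.step(action)
            dataset.append((state, action, next_state))
            state = next_state
            if done:
                break
        # supervised update on the state changes delta_t = s_{t+1} - s_t
        train_dynamics_model(model, dataset)
    return model, dataset
```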
based on the thought of supervised learning, a forward dynamics model of the robot is built by using a multi-layer perceptron, and the input is thatAnd->Outputting the result after splicing to the robot to execute the current action +.>Post predicted state change +.>In this embodiment, the hidden layer dimension is (200,200,200,200). The loss function is defined as the predicted output of the model +.>And the actual state change amount delta t The sum of squares of the mean square error and the network weight parameter of the system, the optimizer adopts an Adam optimizer, the learning rate adopts 0.001, and the forward dynamics network parameter is updated until the maximum iteration number or convergence is reached.
Step three:
based on a double-source model-free reinforcement learning algorithm, the upper control instruction of the robot is trained. The upper layer control instruction comprises: forward instructions, backward instructions, left turn instructions, right turn instructions, and stationary instructions. Firstly, acquiring a front image by using a camera, obtaining a relative distance d and a relative angle theta between a robot and a tracking target by an image processing method, executing a backward instruction when the absolute value of the relative distance theta is less than or equal to 15 degrees under the condition that d is less than or equal to 20cm based on a finite state machine principle, executing a left turn instruction when the relative distance theta is more than 15 degrees, and executing a right turn instruction when the relative distance theta is less than-15 degrees; under the condition that d is more than or equal to 25cm, executing a forward instruction when the angle of theta is less than or equal to 18 degrees, executing a left turn instruction when the angle of theta is more than 18 degrees, and executing a right turn instruction when the angle of theta is less than-18 degrees; under the condition of 20cm < d < 25cm, when theta is more than 10 degrees, executing left turning instruction, when theta is less than-10 degrees, executing right turning instruction, and when theta is less than or equal to 10 degrees, executing static instruction.
The relative distance d and relative angle θ between the robot and the target are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the reinforcement-learning state s_t^u, and the upper-layer tracking control instruction of the robot is used as the reinforcement-learning action a_t^u. The optimal distance between the robot and the target object is set to d_opt; when the robot collides with the target object, the reward function r_t^u = -500; when the object leaves the robot's field of view, the reward function r_t^u = -100; in other cases, the reward function r_t^u is determined by the relation between the robot and the target. Actions are obtained from the finite state machine, and the collected interaction data (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are stored in the experience replay pool of the off-policy reinforcement learning algorithm SAC-discrete.
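A sketch of the upper-layer reward and the transition tuple pushed into the SAC-discrete replay pool. The -500 and -100 penalties are the values given in the claims; the shaping term for the "otherwise" case and the default d_opt are placeholder assumptions, since the patent text does not spell out their exact form.

```python
def upper_reward(d, theta, collided, target_visible, d_opt=22.5):
    """Upper-layer reward r_t^u; the 'otherwise' branch and d_opt are assumed."""
    if collided:
        return -500.0                 # collision with the tracking target
    if not target_visible:
        return -100.0                 # target left the robot's field of view
    return -abs(d - d_opt)            # placeholder: penalize deviation from the optimal distance

# transition stored in the SAC-discrete experience pool:
# (state=(d, theta), action=instruction_index, reward=r_u, next_state=(d2, theta2))
```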
When the reinforcement learning algorithm is trained, the first-stage training is performed with the interaction data obtained from the finite-state-machine principle and the network parameters are initialized; when the number of training rounds reaches the preset number, the actions produced by the SAC-discrete algorithm are used to interact with the environment and collect interaction data, and the newly collected data are then used to update the parameters of the networks in the SAC-discrete algorithm.
In one implementation of the present invention, a SAC-discrete model after the i-th stage training is used to collect training samples for the i+1-th stage training, specifically:
The image in front of the robot is acquired from the camera mounted on the robot in the dynamics simulation environment, and the relative distance d and relative angle θ between the robot and the tracking target are obtained; d and θ are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u. The state s_t^u is fed as input to the SAC-discrete model trained in the i-th stage, which outputs the action a_t^u; the reinforcement-learning reward function is designed according to the relation between the robot and the tracking target, and training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are constructed for the (i+1)-th stage of training. The SAC-discrete model continues to be trained in the (i+1)-th stage with these samples, where i ≥ 1. The process is repeated until the preset number of training rounds or convergence is reached.
As shown in FIG. 2, SAC-discrete adopts an Actor-Critic architecture comprising one Actor network, two Critic networks and two target Critic networks. The input of the Actor network is the state s_t^u and its output is the probability of selecting each action; its hidden-layer dimensions are (256, 256). The input of the two Critic networks (Critic network 1 and Critic network 2) is the state s_t^u and their output is the Q value corresponding to each discrete action; their hidden-layer dimensions are (256, 256). When computing the loss function of the Actor network, the current state s_t^u is first fed into Critic network 1 and Critic network 2 to obtain Q_1 and Q_2; the smaller of the two sets of Q values is selected and denoted Q_min. The current state s_t^u is then fed into the Actor network to obtain the probability of selecting each action, and the Actor loss, denoted actor_loss, is computed from these probabilities and Q_min. When computing the loss functions of the Critic networks, the next state s_{t+1}^u is first fed into target Critic network 1 and target Critic network 2, the smaller of the two outputs is taken as the target Q_min, and it is combined with the reward value to compute the target Q value, denoted Q_target. The current state s_t^u is then fed into Critic network 1 and Critic network 2 to obtain Q_1 and Q_2; the mean-square error between Q_1 and Q_target is the loss function of Critic network 1, and the mean-square error between Q_2 and Q_target is the loss function of Critic network 2. The parameters of target Critic network 1 are copied from the parameters of Critic network 1 in a certain proportion at regular intervals, and likewise the parameters of target Critic network 2 are copied from those of Critic network 2, so the two target Critic networks have no loss function of their own. Training uses mini-batches with a batch size of 64 and a learning rate of 0.0005.
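A PyTorch sketch of the loss computation just described. The actor is assumed to return per-action probabilities; the entropy temperature alpha and discount gamma are assumed values (the text specifies only the batch size of 64 and the learning rate of 0.0005), and the entropy term follows the standard SAC-discrete formulation.

```python
import torch
import torch.nn.functional as F

def sac_discrete_losses(actor, critic1, critic2, target1, target2,
                        batch, gamma=0.99, alpha=0.2):
    """Compute actor_loss and the two critic losses for one mini-batch."""
    s, a, r, s_next, done = batch                  # a: LongTensor of chosen action indices

    # critic losses: bootstrap from the smaller target-Q of the next state
    with torch.no_grad():
        p_next = actor(s_next)                                 # action probabilities at s_{t+1}
        q_next = torch.min(target1(s_next), target2(s_next))   # element-wise min of the two target critics
        v_next = (p_next * (q_next - alpha * torch.log(p_next + 1e-8))).sum(dim=-1)
        q_target = r + gamma * (1.0 - done) * v_next
    q1 = critic1(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q2 = critic2(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic1_loss = F.mse_loss(q1, q_target)
    critic2_loss = F.mse_loss(q2, q_target)

    # actor loss: push probability mass toward actions with high min-Q
    p = actor(s)
    q_min = torch.min(critic1(s), critic2(s)).detach()
    actor_loss = (p * (alpha * torch.log(p + 1e-8) - q_min)).sum(dim=-1).mean()

    return actor_loss, critic1_loss, critic2_loss
```

As stated above, the target critics are updated only by periodically copying a proportion of the corresponding critic weights, so no loss is defined for them.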
Step four:
designing a lower layer rewarding function of the robot based on the upper layer control instruction, wherein when the robot needs to execute a forward instruction or a backward instruction, the rewarding function is designed as followsWherein w is i,i=1,2,3,4 Representing the weight coefficient, deltax base Representing the distance that the robot base centroid advances in the x-axis direction, (r, p, y) represents the euler angle of the robot base centroid, +.>Representing the sum of all joint speeds, this reward function has the meaning of encouraging the robot to move along the x-axis, minimizing rotation along the xyz-axis, and minimizing the speed of the joint movement, w when executing the advance command 1 Taking the positive number, w when executing the back instruction 1 Taking the negative number. When the robot needs to execute a left turn or right turn instruction, the bonus function is designed as +.>Wherein the steering angle y when a left turn instruction needs to be executed des = -30 °, steering angle y when right turn command needs to be executed des =30°。
When the cross-entropy optimization algorithm is used for action selection, the trained upper-layer controller obtained in step three is first combined with the current state to obtain the upper-layer control instruction that the robot needs to execute, and the corresponding lower-layer reward function is selected. A set of random actions is then initialized; for each action, the current state s_t is combined with the forward dynamics model obtained in step two to predict the next state s_{t+1}, and the value of the lower-layer reward function is calculated. After the value of every action is obtained, the actions corresponding to the top 20% of reward values are selected, their mean and variance are calculated, a Gaussian distribution is fitted to them, and a new set of actions is sampled from that Gaussian distribution; repeating these steps yields a new Gaussian distribution, and the process continues until the preset number of steps is reached, after which the mean of the latest Gaussian distribution is used as the position control instruction of the robot motors.
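A sketch of the cross-entropy planner: sample candidate actions from a Gaussian, score them with the forward dynamics model and the selected lower-layer reward, refit the Gaussian to the top 20%, and return the final mean. The population size, iteration count and initial spread are assumed values, and the model is assumed to expose a predict(state, action) helper returning the predicted state change.

```python
import numpy as np

def cem_plan(model, state, lower_reward, action_dim=8,
             n_samples=200, n_iters=5, elite_frac=0.2):
    """Cross-entropy action selection over one-step predictions of the dynamics model."""
    mean = np.zeros(action_dim)
    std = 0.5 * np.ones(action_dim)                   # initial spread is an assumed value

    for _ in range(n_iters):
        actions = np.random.normal(mean, std, size=(n_samples, action_dim))
        returns = []
        for a in actions:
            delta = model.predict(state, a)           # predicted state change
            next_state = state + delta                # s_{t+1} = s_t + predicted change
            returns.append(lower_reward(next_state))
        elites = actions[np.argsort(returns)[-int(elite_frac * n_samples):]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # refit the Gaussian

    return mean                                       # mean of the final distribution = motor instruction
```

Returning the mean of the final distribution, rather than the single best sample, matches the description above.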
The foregoing list is only illustrative of specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, but many variations are possible. All modifications directly derived or suggested to one skilled in the art from the present disclosure should be considered as being within the scope of the present invention.

Claims (7)

1. The self-adaptive control method for object tracking of the foot robot under the condition of motor locked rotation is characterized by comprising the following steps:
(1) Constructing a dynamic simulation environment of the foot robot under the condition of motor locked rotor;
(2) Constructing a forward dynamics model of the foot robot by using a multilayer perceptron; the input of the forward dynamics model is the current state s_t of the foot robot and the action a_t performed by the robot in the current state; the output of the forward dynamics model is the predicted state change Δ̂_t after the robot performs the current action a_t, and the next state s_{t+1} of the robot is obtained from the predicted state change;
The training process of the forward dynamics model is as follows:
firstly, acquiring the current state s_t of the robot and giving a random action a_t; the robot interacts with the environment in the dynamics simulation environment to obtain the next state s_{t+1} after the interaction; running this operation in a loop to obtain a series of interaction samples (s_t, a_t, s_{t+1}) as the training data set for the first round of training of the forward dynamics model;
the data set for the (i+1)-th round of training is acquired by using the forward dynamics model obtained after the i-th round of training and the cross-entropy optimization algorithm, which comprises: acquiring the current state s_t of the robot and giving a set of action values sampled from a random Gaussian distribution; obtaining, with the forward dynamics model after the i-th round of training, the predicted state s_{t+1} corresponding to each action, and obtaining the value of the lower-layer reward function from the predicted state; with the goal of increasing the lower-layer reward, continuously optimizing the mean and variance of the action distribution; after the specified number of rounds is reached, taking the action mean of the final distribution as the optimal action, and forming an interaction sample from the current state of the robot, the optimal action and the next state after the robot executes the optimal action as the training data set for the (i+1)-th round of training of the forward dynamics model, where i ≥ 1; repeating the process until the preset number of training rounds or convergence is reached;
the loss function of the training process is the sum of the mean-square error between the predicted state change Δ̂_t and the actual state change Δ_t and the sum of squares of the network weight parameters;
(3) Training an upper-layer controller based on a dual-source model-free off-policy reinforcement learning method, the upper-layer controller being used to output upper-layer control instructions according to the interaction data between the robot and the environment; the dual-source model-free off-policy reinforcement learning method is realized with a SAC-discrete model, and the training process of the SAC-discrete model comprises:
the first stage: acquiring a front image shot by a camera installed on a robot in a dynamic simulation environment, obtaining a relative distance d and a relative angle theta between the robot and a tracking target, and constructing training data based on a finite state machine principle, wherein the method comprises the following steps:
when d ≤ 20 cm, executing a backward instruction if |θ| ≤ 15°, executing a left-turn instruction if θ > 15°, and executing a right-turn instruction if θ < -15°;
when d ≥ 25 cm, executing a forward instruction if |θ| ≤ 18°, executing a left-turn instruction if θ > 18°, and executing a right-turn instruction if θ < -18°;
when 20 cm < d < 25 cm, executing a left-turn instruction if θ > 10°, executing a right-turn instruction if θ < -10°, and executing a stationary instruction if |θ| ≤ 10°;
stitching together the relative distance d and the relative angle θ between the robot and the tracking target as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u; using the upper-layer control instruction as the action a_t^u in reinforcement learning; designing an upper-layer reward function r_t^u according to the relation between the robot and the tracking target; constructing training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) for training the SAC-discrete model, and training the SAC-discrete model in the first stage with the training samples;
stage i+1: the SAC-discrete model after the i-th stage training is used for collecting training samples for the i+1-th stage training, and the training samples are specifically as follows:
acquiring the image in front of the robot from the camera mounted on the robot in the dynamics simulation environment, obtaining the relative distance d and the relative angle θ between the robot and the tracking target, and stitching d and θ together as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u; using the state s_t^u as the input of the SAC-discrete model trained in the i-th stage, the SAC-discrete model outputting the action a_t^u; designing the upper-layer reward function r_t^u according to the relation between the robot and the tracking target, and constructing training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) for the (i+1)-th stage of training; continuing to train the SAC-discrete model in the (i+1)-th stage with the training samples, where i ≥ 1; repeating the process until the preset number of training rounds or convergence is reached;
(4) When a motor position control instruction needs to be selected, firstly obtaining an upper-layer control instruction with the trained upper-layer controller and selecting the corresponding lower-layer reward function according to that instruction; predicting, with the forward dynamics model obtained in step (2), the state increments produced by executing, in the current state, a set of actions sampled from a random Gaussian distribution; obtaining the return value of each action from the state increment and the lower-layer reward function; with the goal of increasing the return value, iteratively optimizing the mean and variance of the action distribution until the specified number of rounds is reached, and executing the action mean of the final distribution as the next action of the robot motors.
2. The self-adaptive control method for object tracking of foot robot under motor locked-rotor condition of claim 1, wherein the step (1) specifically comprises:
constructing a three-dimensional model of a base, a left forelimb, a left forecross connecting piece, a right forelimb, a right forecross connecting piece, a left hindlimb, a left rear cross connecting piece, a right hindlimb, a right rear cross connecting piece and a camera component of the foot robot, and exporting the three-dimensional model into corresponding STL files;
the method comprises the steps of constructing an XML file of a foot robot, importing an STL file constructed in the last step under a < asset > tag, connecting all components by using a limiting rotating joint under a < world > tag, and defining all parameters of the limiting rotating joint under a < actuator > tag; after the XML file is built, loading the XML file into a physical simulation environment, checking whether the pose of each component meets the expected requirement, and if not, adjusting the figure attribute value in the < body > tag under the < world > tag to adjust the pose, and adjusting the position of the pos attribute value until each component reaches the expected pose;
an interactive interface of the simulation environment and an algorithm is built, reset (& gt) is used for initializing the state of the simulation environment, close (& gt) is used for closing a simulation environment model, step (& gt) is used for executing one-step actions, change_env (& gt) is used for realizing motor locked-rotor conditions, and change_env (& gt) is called in a reset (& gt) function.
3. The adaptive control method for object tracking of a foot robot under a motor stall condition of claim 1, wherein the current state s_t of the robot comprises the velocity of the robot base centroid, its Euler angles and rotation matrix, and the positions and velocities of all motors; and the action a_t is the position control instruction of all motors of the robot.
4. The adaptive control method for object tracking of foot robot under motor stall condition of claim 1, wherein said upper control command comprises: forward instructions, backward instructions, left turn instructions, right turn instructions, and stationary instructions.
5. The self-adaptive control method for object tracking of a foot robot under a motor locked-rotor condition of claim 1, wherein the design method of the upper-layer reward function is as follows:
setting the optimal distance between the robot and the tracking target to d_opt; when the robot collides with the tracking target, the upper-layer reward function r_t^u = -500; when the tracking target leaves the robot's field of view, the upper-layer reward function r_t^u = -100; in other cases, the upper-layer reward function r_t^u is determined by the relation between the robot and the tracking target.
6. The method for adaptively controlling object tracking of a foot robot under a motor locked-rotor condition according to claim 1, wherein in step (3) the upper-layer control instruction is obtained based on the dual-source model-free off-policy reinforcement learning method, and when the upper-layer control instruction is a stationary instruction, the robot is directly caused to keep the motor positions of the previous step;
when the upper-layer control instruction is a forward or backward instruction, the lower-layer reward function is a weighted combination with weight coefficients w_i, i = 1, 2, 3, 4, where Δx_base denotes the distance the robot base centroid advances along the x-axis direction, (r, p, y) denotes the Euler angles of the robot base centroid, and the remaining term is the sum of all joint velocities;
when the upper-layer control instruction is a left-turn or right-turn instruction, the lower-layer reward function is designed in terms of the target steering angle y_des, where y_des denotes the target steering angle.
7. The method for adaptively controlling object tracking of a foot robot under a motor stall condition according to claim 1, wherein when motor action selection is performed with the cross-entropy optimization method, a set of action values sampled from a random Gaussian distribution is first given, and for each action the current state s_t is combined with the forward dynamics model to predict the next state s_{t+1};
the lower-layer reward function is determined from the upper-layer control instruction, and its value is calculated from the predicted next state s_{t+1};
after the lower-layer reward value corresponding to each set of motor actions is obtained, the actions corresponding to the top 20% of reward values are selected, their mean and variance are calculated, a Gaussian distribution is fitted to them, and a new set of actions is sampled from that Gaussian distribution;
the above steps are repeated until the preset number of steps is reached, and the mean of the latest Gaussian distribution is taken as the action executed by the motors in the next step.
CN202210441986.0A 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition Active CN114895560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210441986.0A CN114895560B (en) 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210441986.0A CN114895560B (en) 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition

Publications (2)

Publication Number Publication Date
CN114895560A CN114895560A (en) 2022-08-12
CN114895560B true CN114895560B (en) 2024-03-19

Family

ID=82717641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210441986.0A Active CN114895560B (en) 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition

Country Status (1)

Country Link
CN (1) CN114895560B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102607552A (en) * 2012-01-11 2012-07-25 南京航空航天大学 Industrial robot space grid precision compensation method based on neural network
CN105599821A (en) * 2016-01-06 2016-05-25 山东优宝特智能机器人有限公司 Electric driven biomimetic four-leg robot with environment sensing ability and control method
CN112297005A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN112508164A (en) * 2020-07-24 2021-03-16 北京航空航天大学 End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102607552A (en) * 2012-01-11 2012-07-25 南京航空航天大学 Industrial robot space grid precision compensation method based on neural network
CN105599821A (en) * 2016-01-06 2016-05-25 山东优宝特智能机器人有限公司 Electric driven biomimetic four-leg robot with environment sensing ability and control method
CN112508164A (en) * 2020-07-24 2021-03-16 北京航空航天大学 End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN112297005A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Autonomous Vehicle Cut-In Algorithm for Lane-Merging Scenarios via Policy-Based Reinforcement Learning Nested Within Finite-State Machine; Seulbin Hwang et al.; IEEE Transactions on Intelligent Transportation Systems; 2022-03-08; Vol. 23 (No. 10); 17594-17606 *
Nonlinear MPC-Based Control Framework for Quadruped Robots: Touch-Down in Complex Terrain; Jiyu Yu et al.; 2023 42nd Chinese Control Conference (CCC); 2023-07-18; 4375-4381 *
Application of an improved BP algorithm to manipulator inverse kinematics; Wu Aiguo, Hao Runsheng; Engineering Science (中国工程科学); 2005-07-30 (07); 38-42 *
Research on gait planning of biped robots and its application; Chen Bailiang; China Master's Theses Full-text Database, Information Science and Technology; 2018-12-15; Chapters 2-3 *
Position tracking control of space flexible robots based on multilayer perceptron neurons; Zhang Wenhui et al.; Aerospace Control and Application (空间控制技术与应用); 2011-02-15; Vol. 37 (No. 1); 59-62 *
Research on efficient walking control of underactuated biped robots; Tang Chong; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2015-07-15; Chapters 4-5 *

Also Published As

Publication number Publication date
CN114895560A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
Shankar et al. Learning robot skills with temporal variational inference
CN114888801B (en) Mechanical arm control method and system based on offline strategy reinforcement learning
CN111240356A (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN113580129B (en) Multi-target cooperative capturing method, device and medium based on robot
Wu et al. Learning foresightful dense visual affordance for deformable object manipulation
CN115618716A (en) Discrete SAC algorithm-based gazebo underwater vehicle path planning algorithm
Gu et al. GA-based learning in behaviour based robotics
CN114895560B (en) Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition
Osa et al. Deep reinforcement learning with adversarial training for automated excavation using depth images
CN113419424B (en) Modeling reinforcement learning robot control method and system for reducing overestimation
Xu et al. Learning strategy for continuous robot visual control: A multi-objective perspective
Liang et al. Learning sensorimotor primitives of sequential manipulation tasks from visual demonstrations
Bellegarda et al. An online training method for augmenting mpc with deep reinforcement learning
CN114170454A (en) Intelligent voxel action learning method based on joint grouping strategy
CN112733301B (en) Six-dimensional moment sensor gravity compensation method and system based on neural network
Gromniak et al. Deep reinforcement learning for mobile robot navigation
Zuo et al. Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations
CN116933441A (en) Digital twin data driving model modeling method for autonomous shoveling process of loader
CN115730521A (en) Intelligent design method of multi-legged robot based on hyper-multitask evolution optimization algorithm
Li et al. A function approximation method for model-based high-dimensional inverse reinforcement learning
Lee et al. Combining GRN modeling and demonstration-based programming for robot control
US11720069B2 (en) Method for the computer-aided control of a technical system, more particularly a power generation installation
Ilboudo et al. Adaptive t-momentum-based optimization for unknown ratio of outliers in amateur data in imitation learning
RU2816639C1 (en) Method for creating controllers for controlling walking robots based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Yue

Inventor after: Chen Ci

Inventor after: Xiang Pingyu

Inventor after: Wang Dongqi

Inventor after: Yu Jiyu

Inventor after: Zhang Rui

Inventor after: Lu Haojian

Inventor after: Gao Hongbo

Inventor after: Xiong Rong

Inventor before: Wang Yue

Inventor before: Chen Ci

Inventor before: Xiang Pingyu

Inventor before: Wang Dongqi

Inventor before: Yu Jiyu

Inventor before: Zhang Rui

Inventor before: Lu Haojian

Inventor before: Xiong Rong

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant