CN114895560B - Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition - Google Patents


Info

Publication number
CN114895560B
CN114895560B (application CN202210441986.0A)
Authority
CN
China
Prior art keywords
robot
training
action
motor
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210441986.0A
Other languages
Chinese (zh)
Other versions
CN114895560A (en)
Inventor
Wang Yue (王越)
Chen Ci (陈词)
Xiang Pingyu (向平宇)
Wang Dongqi (王东奇)
Yu Jiyu (余纪宇)
Zhang Rui (张睿)
Lu Haojian (陆豪健)
Gao Hongbo (高洪波)
Xiong Rong (熊蓉)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University (ZJU)
Priority to CN202210441986.0A
Publication of CN114895560A
Application granted
Publication of CN114895560B
Legal status: Active (current)

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, electric, involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a self-adaptive control method for object tracking of a foot-type robot under a motor locked-rotor condition, and belongs to the technical field of intelligent robots. A dynamics simulation environment of the foot-type robot under the motor locked-rotor condition is created. An upper-layer controller is trained to output upper-layer control instructions, including forward, backward, left-turn, right-turn and stationary instructions, according to the state of the robot. The lower-layer controller builds a forward dynamics model based on a multilayer perceptron and trains it. When a motor position control instruction needs to be selected, the corresponding lower-layer reward function is obtained from the upper-layer control instruction, the predicted states corresponding to a set of motor actions sampled from a distribution are obtained from the forward dynamics model, the return value of that set of actions is computed from the predicted states and the lower-layer reward function, and the mean and variance of the motor-action distribution are iteratively optimized to increase the return value until the specified number of rounds is reached; the action mean of the final distribution is then executed as the next action of the robot motors.

Description

Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition
Technical Field
The invention relates to the technical field of intelligent robots, in particular to a self-adaptive control method for object tracking of a foot robot under a motor locked-rotor condition.
Background
During robot motion, damage such as motor stalling is unavoidable due to external or internal factors. For a foot-type robot, when a leg motor is locked, the robot will deviate from its intended trajectory if the remaining motors continue to follow the original instructions, which prevents the robot from being reliably deployed in real environments. To still reach the desired target, the robot needs to adapt its control strategy.
Conventional robot damage recovery comprises two steps: first self-diagnosis, then selection of the most suitable pre-designed replacement strategy according to the diagnosis result. However, replacement strategies are difficult to design, because the number of possible damage conditions grows exponentially with robot complexity, so it is hard to consider all cases in advance. In addition, the self-diagnosis result is not always accurate because of sensor noise and the external environment. These factors often render conventional methods ineffective.
Since AlphaGo reignited worldwide enthusiasm for artificial intelligence, deep reinforcement learning, one of its core technologies, has attracted enormous attention. It is an important research direction in machine learning and automatic control, an important means of realizing strong artificial intelligence, and has received extensive attention from academia and industry in the field of intelligent robots. In foot-type robot control, autonomous locomotion in complex and changing environments has been achieved on top of large-scale parallel simulation environments by designing reasonable reward functions, with remarkable results. However, most methods do not take robot damage into account. In field environments, motor lock-up caused by damage is unavoidable, so an adaptive control method for foot-type robots that can cope with motor locked-rotor conditions has important significance and application value.
Disclosure of Invention
Aiming at the problem that existing foot-type robot control methods do not consider motion control under the motor locked-rotor condition, the invention provides an object-tracking self-adaptive control method for a foot-type robot under the motor locked-rotor condition.
Considering the advantages and limitations of hierarchical reinforcement learning algorithms, the technical scheme of the invention is based on the idea of hierarchical reinforcement learning and improves it by combining a dual-source model-free off-policy reinforcement learning method with a model-based reinforcement learning algorithm. The upper-layer control instruction of the robot is obtained with the dual-source model-free off-policy reinforcement learning algorithm, the corresponding lower-layer reward function is selected according to that instruction, and the position control instruction of the robot motors is finally obtained by combining a forward dynamics model with a cross-entropy optimization algorithm. By decomposing the complex task into specific subtasks, the invention alleviates the problems of few positive samples and sparse return values in reinforcement learning training and greatly accelerates the training process.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
an object tracking self-adaptive control method for a foot robot under a motor locked-rotor condition comprises the following steps:
(1) Constructing a dynamic simulation environment of the foot robot under the condition of motor locked rotor;
(2) Constructing a forward dynamics model of the foot robot by using the multilayer perceptron;
(3) Training an upper-layer controller based on a dual-source model-free off-policy reinforcement learning method, the upper-layer controller being used to output upper-layer control instructions according to the interaction data between the robot and the environment;
(4) When a motor position control instruction needs to be selected, the trained upper-layer controller is first used to obtain an upper-layer control instruction, and the corresponding lower-layer reward function is selected according to that instruction. The forward dynamics model obtained in step (2) is then used to predict the state increments produced by executing, in the current state, a set of actions sampled from a random Gaussian distribution; the return value of each action is obtained from the state increment and the lower-layer reward function; with the goal of increasing the return value, the mean and variance of the action distribution are iteratively optimized until the specified number of rounds is reached, and the action mean of the final distribution is executed as the next action of the robot motors (see the control-loop sketch after this list).
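For orientation, the four steps can be summarized as the control loop sketched below. The names env, upper_controller, dynamics_model, reward_bank and cem_plan are hypothetical placeholders for the components built in steps (1)-(4); none of them come from the patent itself.

```python
def control_episode(env, upper_controller, dynamics_model, reward_bank, cem_plan,
                    max_steps=500):
    """One tracking episode: the upper layer picks an instruction, the lower layer plans motor actions."""
    state = env.reset()                              # also re-samples the locked motor(s)
    for _ in range(max_steps):
        d, theta = env.target_observation()          # relative distance/angle to the target (hypothetical accessor)
        instruction = upper_controller.act((d, theta))   # forward / backward / left / right / stationary
        if instruction == "stationary":
            action = env.last_motor_positions()      # keep the motor positions of the previous step
        else:
            lower_reward = reward_bank[instruction]  # lower-layer reward selected by the instruction
            action = cem_plan(dynamics_model, state, lower_reward)
        state, _, done, _ = env.step(action)
        if done:
            break
```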
As a preferred aspect of the present invention, the step (1) specifically includes:
constructing a three-dimensional model of a base, a left forelimb, a left forecross connecting piece, a right forelimb, a right forecross connecting piece, a left hindlimb, a left rear cross connecting piece, a right hindlimb, a right rear cross connecting piece and a camera component of the foot robot, and exporting the three-dimensional model into corresponding STL files;
the method comprises the steps of constructing an XML file of a foot robot, importing an STL file constructed in the last step under a < asset > tag, connecting all components by using a limiting rotating joint under a < world > tag, and defining all parameters of the limiting rotating joint under a < actuator > tag; after the XML file is built, loading the XML file into a physical simulation environment, checking whether the pose of each component meets the expected requirement, and if not, adjusting the figure attribute value in the < body > tag under the < world > tag to adjust the pose, and adjusting the position of the pos attribute value until each component reaches the expected pose;
an interactive interface between the simulation environment and the algorithm is built: reset(·) is used to initialize the state of the simulation environment, close(·) is used to close the simulation environment model, step(·) is used to execute a one-step action, change_env(·) is used to realize the motor locked-rotor condition, and change_env(·) is called inside the reset(·) function.
Preferably, the input of the forward dynamics model is the current state s_t of the foot-type robot and the action a_t performed by the robot in the current state; the output of the forward dynamics model is the predicted state change Δ̂_t after the robot performs the current action a_t, and the next state s_{t+1} of the robot is obtained from the predicted state change.
The training process of the forward dynamics model is as follows:
firstly, the current state s_t of the robot is acquired and a random action a_t is given; the robot interacts with the environment in the dynamics simulation environment to obtain the next state s_{t+1} after the interaction; this operation is run in a loop to obtain a series of interaction samples (s_t, a_t, s_{t+1}) used as the training data set for the first round of training of the forward dynamics model;
the data set for the (i+1)-th round of training is acquired by using the forward dynamics model obtained after the i-th round of training and the cross-entropy optimization algorithm, which comprises: acquiring the current state s_t of the robot and giving a set of action values sampled from a random Gaussian distribution; obtaining, with the forward dynamics model after the i-th round of training, the predicted state s_{t+1} corresponding to each action, and obtaining the value of the lower-layer reward function from the predicted state; with the goal of increasing the lower-layer reward, continuously optimizing the mean and variance of the action distribution; after the specified number of rounds is reached, taking the action mean of the final distribution as the optimal action, and forming an interaction sample from the current state of the robot, the optimal action and the next state after the robot executes the optimal action as the training data set for the (i+1)-th round of training of the forward dynamics model, where i ≥ 1; repeating the process until the preset number of training rounds or convergence is reached;
the loss function of the training process is the predicted state change quantityAnd the actual state change amount delta t Square sum of the mean square error value of (c) and the network weight parameter.
Preferably, the current state s_t of the robot comprises the velocity of the robot base centroid, its Euler angles and rotation matrix, and the positions and velocities of all motors; the action a_t is the position control instruction of all motors of the robot.
Preferably, the upper layer control instruction includes: forward instructions, backward instructions, left turn instructions, right turn instructions, and stationary instructions.
Preferably, the dual-source model-free off-policy reinforcement learning method is realized with a SAC-discrete model, and the training process of the SAC-discrete model comprises:
the first stage: acquiring a front image shot by a camera installed on a robot in a dynamic simulation environment, obtaining a relative distance d and a relative angle theta between the robot and a tracking target, and constructing training data based on a finite state machine principle, wherein the method comprises the following steps:
when d ≤ 20 cm, executing a backward instruction if |θ| ≤ 15°, executing a left-turn instruction if θ > 15°, and executing a right-turn instruction if θ < -15°;
when d ≥ 25 cm, executing a forward instruction if |θ| ≤ 18°, executing a left-turn instruction if θ > 18°, and executing a right-turn instruction if θ < -18°;
when 20 cm < d < 25 cm, executing a left-turn instruction if θ > 10°, executing a right-turn instruction if θ < -10°, and executing a stationary instruction if |θ| ≤ 10°;
the relative distance d and the relative angle θ between the robot and the tracking target are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u; the upper-layer control instruction is used as the action a_t^u in reinforcement learning; an upper-layer reward function r_t^u is designed according to the relation between the robot and the tracking target; training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are constructed for training the SAC-discrete model, and the SAC-discrete model is trained in the first stage with these samples;
stage i+1: the SAC-discrete model after the i-th stage training is used for collecting training samples for the i+1-th stage training, and the training samples are specifically as follows:
the image in front of the robot is acquired from the camera mounted on the robot in the dynamics simulation environment, and the relative distance d and relative angle θ between the robot and the tracking target are obtained; d and θ are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u. The state s_t^u is fed as input to the SAC-discrete model trained in the i-th stage, which outputs the action a_t^u; the upper-layer reward function r_t^u is designed according to the relation between the robot and the tracking target, and training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are constructed for the (i+1)-th stage of training. The SAC-discrete model continues to be trained in the (i+1)-th stage with these samples, where i ≥ 1. The process is repeated until the preset number of training rounds or convergence is reached.
As a preferred embodiment of the present invention, the upper-layer reward function is designed as follows:
the optimal distance between the robot and the tracking target is set to d_opt; when the robot collides with the tracking target, the upper-layer reward function r_t^u = -500; when the tracking target leaves the robot's field of view, the upper-layer reward function r_t^u = -100; in other cases, the upper-layer reward function r_t^u is determined by the relation between the robot and the tracking target. Preferably, in step (3), the upper-layer control instruction is obtained with the dual-source model-free off-policy reinforcement learning method, and when the upper-layer control instruction is a stationary instruction, the robot directly keeps the motor positions of the previous step;
when the upper-layer control instruction is a forward or backward instruction, the lower-layer reward function r_t^l is a weighted combination with weight coefficients w_i (i = 1, 2, 3, 4), where Δx_base denotes the distance the robot base centroid advances along the x-axis direction, (r, p, y) denotes the Euler angles of the robot base centroid, and the remaining term is the sum of all joint velocities;
when the upper-layer control instruction is a left-turn or right-turn instruction, the lower-layer reward function is designed in terms of the target steering angle y_des, where y_des denotes the target steering angle.
Preferably, when motor action selection is performed with the cross-entropy optimization method, a set of action values sampled from a random Gaussian distribution is first given; for each action, the current state s_t is combined with the forward dynamics model to predict the next state s_{t+1}.
The lower-layer reward function is determined from the upper-layer control instruction, and its value is calculated from the predicted next state s_{t+1}.
After the lower-layer reward value corresponding to each set of motor actions is obtained, the actions corresponding to the top 20% of reward values are selected, their mean and variance are calculated, a Gaussian distribution is fitted to them, and a new set of actions is sampled from that Gaussian distribution.
The above steps are repeated until the preset number of steps is reached, and the mean of the latest Gaussian distribution is taken as the action to be executed by the motors in the next step.
Compared with the prior art, the object tracking self-adaptive control method for the foot robot under the motor locked-rotor condition has the following advantages:
(1) The invention adopts a hierarchical reinforcement learning algorithm (the upper layer is based on a dual-source model-free off-policy reinforcement learning method, the lower layer on model-based reinforcement learning), realizes the target tracking task under the motor locked-rotor condition of the foot-type robot, and alleviates the problems of excessive negative samples and sparse rewards in reinforcement learning tasks.
(2) The invention learns the upper-layer control instruction with the dual-source model-free off-policy reinforcement learning method; the data in the experience replay pool come partly from a finite state machine and partly from the interaction of the reinforcement learning algorithm with the environment, which improves sample diversity and helps train a more robust upper-layer controller.
(3) The invention trains the lower-layer motion instructions of the robot with a model-based reinforcement learning method; the trained dynamics model generalizes well, so good predictions are obtained even for motor locked-rotor scenarios not contained in the training set. In addition, the trained dynamics model is applicable to various upper-layer control instructions, only the corresponding reward function needs to be changed, so the method is highly extensible and easy to transfer.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is the network structure of the dual-source model-free off-policy reinforcement learning algorithm, where the solid lines represent the computation flow of the Actor network loss function and the dotted lines represent the computation flow of the Critic network loss functions;
FIG. 3 is the network structure of the forward dynamics model.
Detailed Description
The present invention is further illustrated below with reference to specific embodiments. It should be understood that these embodiments are meant only to illustrate the invention and not to limit its scope, and that equivalent modifications made by those skilled in the art after reading the invention fall within the scope defined by the appended claims.
The object-tracking self-adaptive control method for the foot-type robot under the motor locked-rotor condition is based on a hierarchical reinforcement learning algorithm: the lower layer uses a model-based reinforcement learning method so that the robot can still walk according to the desired instruction under the motor locked-rotor condition, and the upper layer uses a dual-source model-free off-policy reinforcement learning method, with finite-state-machine interaction data and the algorithm's own interaction data as training data. The method comprises the following steps:
step one:
and constructing a dynamic simulation environment of the foot robot under the motor locked-rotor condition based on Mujoco, constructing a three-dimensional model of 10 components such as a base, a left forelimb, a left forecross connector, a right forelimb, a right forecross connector, a left hindlimb, a left hind cross connector, a right hind limb, a right hind cross connector, a camera and the like of the foot robot by using SolidWorks software, and exporting the three-dimensional model into an STL file.
An XML file of the foot-type robot is constructed: the 10 STL files built in the previous step are first imported under the <asset> tag, the components are then connected with limited-range revolute joints under the <worldbody> tag, and the parameters of the limited-range revolute joints are finally defined under the <actuator> tag. Specifically, when connecting the components, A_motor_1 connects the base and the left front cross connector, with index value 0 (the index values of the subsequent motors increase by one in turn and are not repeated here); A_motor_2 connects the left front cross connector and the left forelimb; B_motor_1 connects the base and the right front cross connector; B_motor_2 connects the right front cross connector and the right forelimb; C_motor_1 connects the base and the left rear cross connector; C_motor_2 connects the left rear cross connector and the left hindlimb; D_motor_1 connects the base and the right rear cross connector; D_motor_2 connects the right rear cross connector and the right hindlimb. The motion range of A_motor_1, B_motor_1, C_motor_1 and D_motor_1 is -90° to 90°, and that of A_motor_2, B_motor_2, C_motor_2 and D_motor_2 is -90° to 20°; these range limits ensure that the robot does not interfere with surrounding components during motion.
After the XML file is built, it is loaded into the MuJoCo physics simulation environment and the pose of each component is checked; if a pose does not meet the requirement, the orientation attribute values in the corresponding <body> tag under the <worldbody> tag are adjusted, and the pos attribute values are adjusted, until each component reaches the expected pose.
An interface for interaction between the simulation environment and the algorithm is built based on Gym, where reset(·) is used to initialize the state of the simulation environment, close(·) to close the simulation environment model, step(·) to execute a one-step action, and change_env(·) to realize the motor locked-rotor condition; change_env(·) is called inside the reset(·) function. When the algorithm interacts with the environment, the reset(·) function is executed first: the robot's centroid position in the simulation environment is restored to the motion starting point and each joint is reset; the change_env(·) function is then called, one (or two) numbers are selected from 0-7 as motor index values, and the action instruction corresponding to each selected motor is reset to 0 to indicate that the motor is locked and cannot execute the original action. To visualize the locked motor, the color of the next-stage component connected to it is changed through the geom_rgba interface provided by Gym (for example, the left front cross connector or the right forelimb turns red when its motor is locked). The reset(·) function returns the current state of the simulation environment to the control algorithm, the control algorithm outputs an action value according to the current state, the step(·) function executes that action value and returns the newly obtained state to the control algorithm, and this loop continues until the preset maximum episode length is reached; state collection then ends and the simulation environment model is closed with the close(·) function.
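A minimal sketch of such a Gym-style wrapper is shown below, assuming the simulation is driven through a mujoco-py MjSim handle passed in by the caller. The class name, the qpos/qvel-based state stand-in and the recoloring comment are illustrative assumptions, not identifiers from the patent.

```python
import numpy as np
import gym

class StalledLeggedRobotEnv(gym.Env):
    """Sketch of the step-one interface: reset / step / close / change_env."""

    NUM_MOTORS = 8

    def __init__(self, sim):
        self.sim = sim               # e.g. a mujoco_py.MjSim built from the robot XML
        self.stalled = []            # indices of locked motors

    def change_env(self):
        """Pick 1-2 motor indices in 0..7 whose commands will be forced to 0."""
        n_stalled = int(np.random.choice([1, 2]))
        self.stalled = list(np.random.choice(self.NUM_MOTORS, n_stalled, replace=False))
        # optionally recolor the geom attached to each stalled motor, e.g.
        # self.sim.model.geom_rgba[geom_id] = (1, 0, 0, 1), so the fault is visible

    def reset(self):
        self.sim.reset()             # restore the base to the start pose, reset the joints
        self.change_env()            # called inside reset(), as described above
        return self._get_state()

    def step(self, action):
        action = np.asarray(action, dtype=float).copy()
        action[self.stalled] = 0.0   # a locked motor cannot execute its position command
        self.sim.data.ctrl[:] = action
        self.sim.step()
        # rewards are computed outside by the upper/lower controllers, so 0.0 is returned here
        return self._get_state(), 0.0, False, {}

    def close(self):
        del self.sim

    def _get_state(self):
        # stand-in for the full state (base-centroid velocity, Euler angles,
        # rotation matrix, motor positions and velocities)
        return np.concatenate([self.sim.data.qvel.flat, self.sim.data.qpos.flat])
```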
Step two:
and (3) collecting data of robot and environment interaction by using the foot-type robot dynamics simulation environment under the motor locked-rotor condition constructed in the step one, wherein the collected data are used as a training set and a verification set of a forward dynamics model, at the moment, the index value of the motor for limiting the locked-rotor of the robot is in a range of 0,3,4,7 four values, and the rest index values correspond to a test set.
In the first round of collection there is not yet a trained dynamics model, so the cross-entropy method cannot be used to obtain robot motion instructions; random actions are therefore used to collect interaction data. For example: the current state s_t of the robot is first acquired and a random action a_t is given; the robot interacts with the environment in the dynamics simulation environment to obtain the next state s_{t+1} after the interaction; this operation is run in a loop to obtain a series of interaction samples (s_t, a_t, s_{t+1}) used as the training data set for the first round of training of the forward dynamics model. Note that when random actions are executed the robot joints must be safely limited, to prevent the robot from rolling over during motion and producing too many failed samples, which would cause a sparse-reward problem.
From the second round of collection onward, because a trained forward dynamics model is available, the cross-entropy optimization method is used to obtain optimized robot motion instructions for collecting interaction data. The interaction data comprise the current state s_t, the action a_t performed in that state, and the next state s_{t+1} reached after the action is executed, where the state s_t comprises the velocity, Euler angles and rotation matrix of the robot base centroid and the positions and velocities of the eight motors, and a_t is the position control instruction of the eight robot motors. When the amount of collected interaction data reaches the preset maximum trajectory length, the episode is terminated and the collected data are processed; specifically, subtracting the values of the corresponding entries of s_t from s_{t+1} yields the state change Δ_t of the same dimension, which is used for the subsequent dynamics modeling.
In one embodiment of the present invention, from the second round of collection onward the forward dynamics model can be trained iteratively. The data set for the (i+1)-th round of training is acquired with the forward dynamics model obtained after the i-th round of training and the cross-entropy optimization algorithm, as follows: the current state s_t of the robot is acquired and a set of action values sampled from a random Gaussian distribution is given; the forward dynamics model after the i-th round of training is used to obtain the predicted state s_{t+1} corresponding to each action, and the value of the lower-layer reward function is obtained from the predicted state. With the goal of increasing the lower-layer reward, the mean and variance of the action distribution are iteratively optimized; after the specified number of rounds is reached, the action mean of the final distribution is taken as the optimal action, and the current state of the robot, the optimal action, and the next state after the robot executes the optimal action form an interaction sample used as the training data set for the (i+1)-th round of training of the forward dynamics model, where i ≥ 1. The process is repeated until the preset number of training rounds or convergence is reached;
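A compact sketch of this alternation between data collection and model training, assuming hypothetical helpers train_dynamics_model (the supervised update described next) and cem_plan (the cross-entropy planner described in step four); the action-clipping range and the round/horizon counts are assumed values.

```python
import numpy as np

def collect_and_train(env, model, cem_plan, train_dynamics_model,
                      lower_reward, n_rounds=10, horizon=200):
    """Alternate between (re)training the dynamics model and collecting new data with it."""
    dataset = []                                    # tuples (s_t, a_t, s_{t+1})
    for rnd in range(n_rounds):
        state = env.reset()
        for _ in range(horizon):
            if rnd == 0:
                # round 1: no trained model yet, so use safely-clipped random actions
                action = np.random.uniform(-0.3, 0.3, size=env.NUM_MOTORS)
            else:
                # round i+1: plan with the model trained after round i
                action = cem_plan(model, state, lower_reward)
            next_state, _, done, _ = env.step(action)
            dataset.append((state, action, next_state))
            state = next_state
            if done:
                break
        # supervised update on the state changes delta_t = s_{t+1} - s_t
        train_dynamics_model(model, dataset)
    return model, dataset
```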
based on the thought of supervised learning, a forward dynamics model of the robot is built by using a multi-layer perceptron, and the input is thatAnd->Outputting the result after splicing to the robot to execute the current action +.>Post predicted state change +.>In this embodiment, the hidden layer dimension is (200,200,200,200). The loss function is defined as the predicted output of the model +.>And the actual state change amount delta t The sum of squares of the mean square error and the network weight parameter of the system, the optimizer adopts an Adam optimizer, the learning rate adopts 0.001, and the forward dynamics network parameter is updated until the maximum iteration number or convergence is reached.
Step three:
based on a double-source model-free reinforcement learning algorithm, the upper control instruction of the robot is trained. The upper layer control instruction comprises: forward instructions, backward instructions, left turn instructions, right turn instructions, and stationary instructions. Firstly, acquiring a front image by using a camera, obtaining a relative distance d and a relative angle theta between a robot and a tracking target by an image processing method, executing a backward instruction when the absolute value of the relative distance theta is less than or equal to 15 degrees under the condition that d is less than or equal to 20cm based on a finite state machine principle, executing a left turn instruction when the relative distance theta is more than 15 degrees, and executing a right turn instruction when the relative distance theta is less than-15 degrees; under the condition that d is more than or equal to 25cm, executing a forward instruction when the angle of theta is less than or equal to 18 degrees, executing a left turn instruction when the angle of theta is more than 18 degrees, and executing a right turn instruction when the angle of theta is less than-18 degrees; under the condition of 20cm < d < 25cm, when theta is more than 10 degrees, executing left turning instruction, when theta is less than-10 degrees, executing right turning instruction, and when theta is less than or equal to 10 degrees, executing static instruction.
The relative distance d and relative angle θ between the robot and the target are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the reinforcement-learning state s_t^u, and the upper-layer tracking control instruction of the robot is used as the reinforcement-learning action a_t^u. The optimal distance between the robot and the target object is set to d_opt; when the robot collides with the target object, the reward function r_t^u = -500; when the object leaves the robot's field of view, the reward function r_t^u = -100; in other cases, the reward function r_t^u is determined by the relation between the robot and the target. Actions are obtained from the finite state machine, and the collected interaction data (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are stored in the experience replay pool of the off-policy reinforcement learning algorithm SAC-discrete.
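A sketch of the upper-layer reward and the transition tuple pushed into the SAC-discrete replay pool. The -500 and -100 penalties are the values given in the claims; the shaping term for the "otherwise" case and the default d_opt are placeholder assumptions, since the patent text does not spell out their exact form.

```python
def upper_reward(d, theta, collided, target_visible, d_opt=22.5):
    """Upper-layer reward r_t^u; the 'otherwise' branch and d_opt are assumed."""
    if collided:
        return -500.0                 # collision with the tracking target
    if not target_visible:
        return -100.0                 # target left the robot's field of view
    return -abs(d - d_opt)            # placeholder: penalize deviation from the optimal distance

# transition stored in the SAC-discrete experience pool:
# (state=(d, theta), action=instruction_index, reward=r_u, next_state=(d2, theta2))
```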
When the reinforcement learning algorithm is trained, the first-stage training is performed with the interaction data obtained from the finite-state-machine principle and the network parameters are initialized; when the number of training rounds reaches the preset number, the actions produced by the SAC-discrete algorithm are used to interact with the environment and collect interaction data, and the newly collected data are then used to update the parameters of the networks in the SAC-discrete algorithm.
In one implementation of the present invention, a SAC-discrete model after the i-th stage training is used to collect training samples for the i+1-th stage training, specifically:
The image in front of the robot is acquired from the camera mounted on the robot in the dynamics simulation environment, and the relative distance d and relative angle θ between the robot and the tracking target are obtained; d and θ are concatenated as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u. The state s_t^u is fed as input to the SAC-discrete model trained in the i-th stage, which outputs the action a_t^u; the reinforcement-learning reward function is designed according to the relation between the robot and the tracking target, and training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) are constructed for the (i+1)-th stage of training. The SAC-discrete model continues to be trained in the (i+1)-th stage with these samples, where i ≥ 1. The process is repeated until the preset number of training rounds or convergence is reached.
As shown in FIG. 2, SAC-discrete adopts an Actor-Critic architecture comprising one Actor network, two Critic networks and two target Critic networks. The input of the Actor network is the state s_t^u and its output is the probability of selecting each action; its hidden-layer dimensions are (256, 256). The input of the two Critic networks (Critic network 1 and Critic network 2) is the state s_t^u and their output is the Q value corresponding to each discrete action; their hidden-layer dimensions are (256, 256). When computing the loss function of the Actor network, the current state s_t^u is first fed into Critic network 1 and Critic network 2 to obtain Q_1 and Q_2; the smaller of the two sets of Q values is selected and denoted Q_min. The current state s_t^u is then fed into the Actor network to obtain the probability of selecting each action, and the Actor loss, denoted actor_loss, is computed from these probabilities and Q_min. When computing the loss functions of the Critic networks, the next state s_{t+1}^u is first fed into target Critic network 1 and target Critic network 2, the smaller of the two outputs is taken as the target Q_min, and it is combined with the reward value to compute the target Q value, denoted Q_target. The current state s_t^u is then fed into Critic network 1 and Critic network 2 to obtain Q_1 and Q_2; the mean-square error between Q_1 and Q_target is the loss function of Critic network 1, and the mean-square error between Q_2 and Q_target is the loss function of Critic network 2. The parameters of target Critic network 1 are copied from the parameters of Critic network 1 in a certain proportion at regular intervals, and likewise the parameters of target Critic network 2 are copied from those of Critic network 2, so the two target Critic networks have no loss function of their own. Training uses mini-batches with a batch size of 64 and a learning rate of 0.0005.
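A PyTorch sketch of the loss computation just described. The actor is assumed to return per-action probabilities; the entropy temperature alpha and discount gamma are assumed values (the text specifies only the batch size of 64 and the learning rate of 0.0005), and the entropy term follows the standard SAC-discrete formulation.

```python
import torch
import torch.nn.functional as F

def sac_discrete_losses(actor, critic1, critic2, target1, target2,
                        batch, gamma=0.99, alpha=0.2):
    """Compute actor_loss and the two critic losses for one mini-batch."""
    s, a, r, s_next, done = batch                  # a: LongTensor of chosen action indices

    # critic losses: bootstrap from the smaller target-Q of the next state
    with torch.no_grad():
        p_next = actor(s_next)                                 # action probabilities at s_{t+1}
        q_next = torch.min(target1(s_next), target2(s_next))   # element-wise min of the two target critics
        v_next = (p_next * (q_next - alpha * torch.log(p_next + 1e-8))).sum(dim=-1)
        q_target = r + gamma * (1.0 - done) * v_next
    q1 = critic1(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q2 = critic2(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic1_loss = F.mse_loss(q1, q_target)
    critic2_loss = F.mse_loss(q2, q_target)

    # actor loss: push probability mass toward actions with high min-Q
    p = actor(s)
    q_min = torch.min(critic1(s), critic2(s)).detach()
    actor_loss = (p * (alpha * torch.log(p + 1e-8) - q_min)).sum(dim=-1).mean()

    return actor_loss, critic1_loss, critic2_loss
```

As stated above, the target critics are updated only by periodically copying a proportion of the corresponding critic weights, so no loss is defined for them.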
Step four:
designing a lower layer rewarding function of the robot based on the upper layer control instruction, wherein when the robot needs to execute a forward instruction or a backward instruction, the rewarding function is designed as followsWherein w is i,i=1,2,3,4 Representing the weight coefficient, deltax base Representing the distance that the robot base centroid advances in the x-axis direction, (r, p, y) represents the euler angle of the robot base centroid, +.>Representing the sum of all joint speeds, this reward function has the meaning of encouraging the robot to move along the x-axis, minimizing rotation along the xyz-axis, and minimizing the speed of the joint movement, w when executing the advance command 1 Taking the positive number, w when executing the back instruction 1 Taking the negative number. When the robot needs to execute a left turn or right turn instruction, the bonus function is designed as +.>Wherein the steering angle y when a left turn instruction needs to be executed des = -30 °, steering angle y when right turn command needs to be executed des =30°。
When the cross-entropy optimization algorithm is used for action selection, the trained upper-layer controller obtained in step three is first combined with the current state to obtain the upper-layer control instruction that the robot needs to execute, and the corresponding lower-layer reward function is selected. A set of random actions is then initialized; for each action, the current state s_t is combined with the forward dynamics model obtained in step two to predict the next state s_{t+1}, and the value of the lower-layer reward function is calculated. After the value of every action is obtained, the actions corresponding to the top 20% of reward values are selected, their mean and variance are calculated, a Gaussian distribution is fitted to them, and a new set of actions is sampled from that Gaussian distribution; repeating these steps yields a new Gaussian distribution, and the process continues until the preset number of steps is reached, after which the mean of the latest Gaussian distribution is used as the position control instruction of the robot motors.
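A sketch of the cross-entropy planner: sample candidate actions from a Gaussian, score them with the forward dynamics model and the selected lower-layer reward, refit the Gaussian to the top 20%, and return the final mean. The population size, iteration count and initial spread are assumed values, and the model is assumed to expose a predict(state, action) helper returning the predicted state change.

```python
import numpy as np

def cem_plan(model, state, lower_reward, action_dim=8,
             n_samples=200, n_iters=5, elite_frac=0.2):
    """Cross-entropy action selection over one-step predictions of the dynamics model."""
    mean = np.zeros(action_dim)
    std = 0.5 * np.ones(action_dim)                   # initial spread is an assumed value

    for _ in range(n_iters):
        actions = np.random.normal(mean, std, size=(n_samples, action_dim))
        returns = []
        for a in actions:
            delta = model.predict(state, a)           # predicted state change
            next_state = state + delta                # s_{t+1} = s_t + predicted change
            returns.append(lower_reward(next_state))
        elites = actions[np.argsort(returns)[-int(elite_frac * n_samples):]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6   # refit the Gaussian

    return mean                                       # mean of the final distribution = motor instruction
```

Returning the mean of the final distribution, rather than the single best sample, matches the description above.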
The foregoing list is only illustrative of specific embodiments of the invention. Obviously, the invention is not limited to the above embodiments, but many variations are possible. All modifications directly derived or suggested to one skilled in the art from the present disclosure should be considered as being within the scope of the present invention.

Claims (7)

1. The self-adaptive control method for object tracking of the foot robot under the condition of motor locked rotation is characterized by comprising the following steps:
(1) Constructing a dynamic simulation environment of the foot robot under the condition of motor locked rotor;
(2) Constructing a forward dynamics model of the foot robot by using a multilayer perceptron; the input of the forward dynamics model is the current state s_t of the foot robot and the action a_t performed by the robot in the current state; the output of the forward dynamics model is the predicted state change Δ̂_t after the robot performs the current action a_t, and the next state s_{t+1} of the robot is obtained from the predicted state change;
The training process of the forward dynamics model is as follows:
firstly, acquiring the current state s_t of the robot and giving a random action a_t; the robot interacts with the environment in the dynamics simulation environment to obtain the next state s_{t+1} after the interaction; running this operation in a loop to obtain a series of interaction samples (s_t, a_t, s_{t+1}) as the training data set for the first round of training of the forward dynamics model;
the data set for the (i+1)-th round of training is acquired by using the forward dynamics model obtained after the i-th round of training and the cross-entropy optimization algorithm, which comprises: acquiring the current state s_t of the robot and giving a set of action values sampled from a random Gaussian distribution; obtaining, with the forward dynamics model after the i-th round of training, the predicted state s_{t+1} corresponding to each action, and obtaining the value of the lower-layer reward function from the predicted state; with the goal of increasing the lower-layer reward, continuously optimizing the mean and variance of the action distribution; after the specified number of rounds is reached, taking the action mean of the final distribution as the optimal action, and forming an interaction sample from the current state of the robot, the optimal action and the next state after the robot executes the optimal action as the training data set for the (i+1)-th round of training of the forward dynamics model, where i ≥ 1; repeating the process until the preset number of training rounds or convergence is reached;
the loss function of the training process is the sum of the mean-square error between the predicted state change Δ̂_t and the actual state change Δ_t and the sum of squares of the network weight parameters;
(3) Training an upper-layer controller based on a dual-source model-free off-policy reinforcement learning method, the upper-layer controller being used to output upper-layer control instructions according to the interaction data between the robot and the environment; the dual-source model-free off-policy reinforcement learning method is realized with a SAC-discrete model, and the training process of the SAC-discrete model comprises:
the first stage: acquiring a front image shot by a camera installed on a robot in a dynamic simulation environment, obtaining a relative distance d and a relative angle theta between the robot and a tracking target, and constructing training data based on a finite state machine principle, wherein the method comprises the following steps:
when d ≤ 20 cm, executing a backward instruction if |θ| ≤ 15°, executing a left-turn instruction if θ > 15°, and executing a right-turn instruction if θ < -15°;
when d ≥ 25 cm, executing a forward instruction if |θ| ≤ 18°, executing a left-turn instruction if θ > 18°, and executing a right-turn instruction if θ < -18°;
when 20 cm < d < 25 cm, executing a left-turn instruction if θ > 10°, executing a right-turn instruction if θ < -10°, and executing a stationary instruction if |θ| ≤ 10°;
stitching together the relative distance d and the relative angle θ between the robot and the tracking target as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u; using the upper-layer control instruction as the action a_t^u in reinforcement learning; designing an upper-layer reward function r_t^u according to the relation between the robot and the tracking target; constructing training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) for training the SAC-discrete model, and training the SAC-discrete model in the first stage with the training samples;
stage i+1: the SAC-discrete model after the i-th stage training is used for collecting training samples for the i+1-th stage training, and the training samples are specifically as follows:
acquiring the image in front of the robot from the camera mounted on the robot in the dynamics simulation environment, obtaining the relative distance d and the relative angle θ between the robot and the tracking target, and stitching d and θ together as the interaction data between the robot and the environment in reinforcement learning, called the state s_t^u; using the state s_t^u as the input of the SAC-discrete model trained in the i-th stage, the SAC-discrete model outputting the action a_t^u; designing the upper-layer reward function r_t^u according to the relation between the robot and the tracking target, and constructing training samples (s_t^u, a_t^u, r_t^u, s_{t+1}^u) for the (i+1)-th stage of training; continuing to train the SAC-discrete model in the (i+1)-th stage with the training samples, where i ≥ 1; repeating the process until the preset number of training rounds or convergence is reached;
(4) When a motor position control instruction needs to be selected, firstly obtaining an upper-layer control instruction with the trained upper-layer controller and selecting the corresponding lower-layer reward function according to that instruction; predicting, with the forward dynamics model obtained in step (2), the state increments produced by executing, in the current state, a set of actions sampled from a random Gaussian distribution; obtaining the return value of each action from the state increment and the lower-layer reward function; with the goal of increasing the return value, iteratively optimizing the mean and variance of the action distribution until the specified number of rounds is reached, and executing the action mean of the final distribution as the next action of the robot motors.
2. The self-adaptive control method for object tracking of foot robot under motor locked-rotor condition of claim 1, wherein the step (1) specifically comprises:
constructing a three-dimensional model of a base, a left forelimb, a left forecross connecting piece, a right forelimb, a right forecross connecting piece, a left hindlimb, a left rear cross connecting piece, a right hindlimb, a right rear cross connecting piece and a camera component of the foot robot, and exporting the three-dimensional model into corresponding STL files;
the method comprises the steps of constructing an XML file of a foot robot, importing an STL file constructed in the last step under a < asset > tag, connecting all components by using a limiting rotating joint under a < world > tag, and defining all parameters of the limiting rotating joint under a < actuator > tag; after the XML file is built, loading the XML file into a physical simulation environment, checking whether the pose of each component meets the expected requirement, and if not, adjusting the figure attribute value in the < body > tag under the < world > tag to adjust the pose, and adjusting the position of the pos attribute value until each component reaches the expected pose;
an interactive interface of the simulation environment and an algorithm is built, reset (& gt) is used for initializing the state of the simulation environment, close (& gt) is used for closing a simulation environment model, step (& gt) is used for executing one-step actions, change_env (& gt) is used for realizing motor locked-rotor conditions, and change_env (& gt) is called in a reset (& gt) function.
3. The adaptive control method for object tracking of a foot robot under a motor stall condition of claim 1, wherein the current state s_t of the robot comprises the velocity of the robot base centroid, its Euler angles and rotation matrix, and the positions and velocities of all motors; and the action a_t is the position control instruction of all motors of the robot.
4. The adaptive control method for object tracking of foot robot under motor stall condition of claim 1, wherein said upper control command comprises: forward instructions, backward instructions, left turn instructions, right turn instructions, and stationary instructions.
5. The self-adaptive control method for object tracking of a foot robot under a motor locked-rotor condition of claim 1, wherein the design method of the upper-layer reward function is as follows:
setting the optimal distance between the robot and the tracking target to d_opt; when the robot collides with the tracking target, the upper-layer reward function r_t^u = -500; when the tracking target leaves the robot's field of view, the upper-layer reward function r_t^u = -100; in other cases, the upper-layer reward function r_t^u is determined by the relation between the robot and the tracking target.
6. The method for adaptively controlling object tracking of a foot robot under a motor locked-rotor condition according to claim 1, wherein in step (3) the upper-layer control instruction is obtained based on the dual-source model-free off-policy reinforcement learning method, and when the upper-layer control instruction is a stationary instruction, the robot is directly caused to keep the motor positions of the previous step;
when the upper-layer control instruction is a forward or backward instruction, the lower-layer reward function is a weighted combination with weight coefficients w_i, i = 1, 2, 3, 4, where Δx_base denotes the distance the robot base centroid advances along the x-axis direction, (r, p, y) denotes the Euler angles of the robot base centroid, and the remaining term is the sum of all joint velocities;
when the upper-layer control instruction is a left-turn or right-turn instruction, the lower-layer reward function is designed in terms of the target steering angle y_des, where y_des denotes the target steering angle.
7. The method for adaptively controlling object tracking of a foot robot under a motor stall condition according to claim 1, wherein when motor action selection is performed with the cross-entropy optimization method, a set of action values sampled from a random Gaussian distribution is first given, and for each action the current state s_t is combined with the forward dynamics model to predict the next state s_{t+1};
the lower-layer reward function is determined from the upper-layer control instruction, and its value is calculated from the predicted next state s_{t+1};
after the lower-layer reward value corresponding to each set of motor actions is obtained, the actions corresponding to the top 20% of reward values are selected, their mean and variance are calculated, a Gaussian distribution is fitted to them, and a new set of actions is sampled from that Gaussian distribution;
the above steps are repeated until the preset number of steps is reached, and the mean of the latest Gaussian distribution is taken as the action executed by the motors in the next step.
CN202210441986.0A 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition Active CN114895560B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210441986.0A CN114895560B (en) 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210441986.0A CN114895560B (en) 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition

Publications (2)

Publication Number Publication Date
CN114895560A CN114895560A (en) 2022-08-12
CN114895560B true CN114895560B (en) 2024-03-19

Family

ID=82717641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210441986.0A Active CN114895560B (en) 2022-04-25 2022-04-25 Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition

Country Status (1)

Country Link
CN (1) CN114895560B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102607552A (en) * 2012-01-11 2012-07-25 南京航空航天大学 Industrial robot space grid precision compensation method based on neural network
CN105599821A (en) * 2016-01-06 2016-05-25 山东优宝特智能机器人有限公司 Electric driven biomimetic four-leg robot with environment sensing ability and control method
CN112297005A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN112508164A (en) * 2020-07-24 2021-03-16 北京航空航天大学 End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102607552A (en) * 2012-01-11 2012-07-25 南京航空航天大学 Industrial robot space grid precision compensation method based on neural network
CN105599821A (en) * 2016-01-06 2016-05-25 山东优宝特智能机器人有限公司 Electric driven biomimetic four-leg robot with environment sensing ability and control method
CN112508164A (en) * 2020-07-24 2021-03-16 北京航空航天大学 End-to-end automatic driving model pre-training method based on asynchronous supervised learning
CN112297005A (en) * 2020-10-10 2021-02-02 杭州电子科技大学 Robot autonomous control method based on graph neural network reinforcement learning
CN112631131A (en) * 2020-12-19 2021-04-09 北京化工大学 Motion control self-generation and physical migration method for quadruped robot
CN112936290A (en) * 2021-03-25 2021-06-11 西湖大学 Quadruped robot motion planning method based on layered reinforcement learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Autonomous Vehicle Cut-In Algorithm for Lane-Merging Scenarios via Policy-Based Reinforcement Learning Nested Within Finite-State Machine; Seulbin Hwang et al.; IEEE Transactions on Intelligent Transportation Systems; 2022-03-08; Vol. 23 (No. 10); 17594-17606 *
Nonlinear MPC-Based Control Framework for Quadruped Robots: Touch-Down in Complex Terrain; Jiyu Yu et al.; 2023 42nd Chinese Control Conference (CCC); 2023-07-18; 4375-4381 *
Application of an improved BP algorithm to manipulator inverse kinematics; Wu Aiguo, Hao Runsheng; Engineering Science (中国工程科学); 2005-07-30 (07); 38-42 *
Research on gait planning of biped robots and its application; Chen Bailiang; China Master's Theses Full-text Database, Information Science and Technology; 2018-12-15; Chapters 2-3 *
Position tracking control of space flexible robots based on multilayer perceptron neurons; Zhang Wenhui et al.; Aerospace Control and Application (空间控制技术与应用); 2011-02-15; Vol. 37 (No. 1); 59-62 *
Research on efficient walking control of underactuated biped robots; Tang Chong; China Doctoral Dissertations Full-text Database, Information Science and Technology; 2015-07-15; Chapters 4-5 *

Also Published As

Publication number Publication date
CN114895560A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
Shankar et al. Learning robot skills with temporal variational inference
CN114888801B (en) Mechanical arm control method and system based on offline strategy reinforcement learning
CN111240356A (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN111783994A (en) Training method and device for reinforcement learning
CN113580129B (en) Multi-target cooperative capturing method, device and medium based on robot
Wu et al. Learning foresightful dense visual affordance for deformable object manipulation
CN115618716A (en) Discrete SAC algorithm-based gazebo underwater vehicle path planning algorithm
Gu et al. GA-based learning in behaviour based robotics
CN114895560B (en) Object tracking self-adaptive control method for foot-type robot under motor locked-rotor condition
Osa et al. Deep reinforcement learning with adversarial training for automated excavation using depth images
CN113419424B (en) Modeling reinforcement learning robot control method and system for reducing overestimation
Xu et al. Learning strategy for continuous robot visual control: A multi-objective perspective
Liang et al. Learning sensorimotor primitives of sequential manipulation tasks from visual demonstrations
Bellegarda et al. An online training method for augmenting mpc with deep reinforcement learning
CN114170454A (en) Intelligent voxel action learning method based on joint grouping strategy
CN112733301B (en) Six-dimensional moment sensor gravity compensation method and system based on neural network
Gromniak et al. Deep reinforcement learning for mobile robot navigation
Zuo et al. Off-policy adversarial imitation learning for robotic tasks with low-quality demonstrations
CN116933441A (en) Digital twin data driving model modeling method for autonomous shoveling process of loader
CN115730521A (en) Intelligent design method of multi-legged robot based on hyper-multitask evolution optimization algorithm
Li et al. A function approximation method for model-based high-dimensional inverse reinforcement learning
Lee et al. Combining GRN modeling and demonstration-based programming for robot control
US11720069B2 (en) Method for the computer-aided control of a technical system, more particularly a power generation installation
Ilboudo et al. Adaptive t-momentum-based optimization for unknown ratio of outliers in amateur data in imitation learning
RU2816639C1 (en) Method for creating controllers for controlling walking robots based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Yue

Inventor after: Chen Ci

Inventor after: Xiang Pingyu

Inventor after: Wang Dongqi

Inventor after: Yu Jiyu

Inventor after: Zhang Rui

Inventor after: Lu Haojian

Inventor after: Gao Hongbo

Inventor after: Xiong Rong

Inventor before: Wang Yue

Inventor before: Chen Ci

Inventor before: Xiang Pingyu

Inventor before: Wang Dongqi

Inventor before: Yu Jiyu

Inventor before: Zhang Rui

Inventor before: Lu Haojian

Inventor before: Xiong Rong

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant