CN114378820A - Robot impedance learning method based on safety reinforcement learning - Google Patents

Robot impedance learning method based on safety reinforcement learning Download PDF

Info

Publication number
CN114378820A
Authority
CN
China
Prior art keywords
robot
information
impedance
learning
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210055753.7A
Other languages
Chinese (zh)
Other versions
CN114378820B (en)
Inventor
Pan Yongping
Feng Xiaoxin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210055753.7A priority Critical patent/CN114378820B/en
Publication of CN114378820A publication Critical patent/CN114378820A/en
Application granted granted Critical
Publication of CN114378820B publication Critical patent/CN114378820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot impedance learning method based on safety reinforcement learning, which comprises the following steps: the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation; constructing an input item according to the position information, the speed information and the return information of the learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model. The invention has high stability, improves the feasibility of variable admittance control, and can be widely applied to the technical field of artificial intelligence.

Description

Robot impedance learning method based on safety reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a robot impedance learning method based on safety reinforcement learning.
Background
Impedance control is an effective method for a robot to realize interaction force control: a desired impedance parameter is given, and an impedance controller is then designed to control the robot so as to achieve the desired interaction force. However, because the external environment is unknown and uncertain, it is usually necessary to introduce virtual compliance into the system to ensure the safety of the interaction process.
The existing robot impedance learning method has the following defects:
1. Conventional optimization methods (e.g., gradient descent optimization) require the environment and robot models to be known in order to learn the impedance parameters.
2. The existing reinforcement learning algorithms applied to impedance learning, such as the Deep Deterministic Policy Gradient (DDPG) algorithm, have certain defects: the Actor network in DDPG is updated without delay and no noise is added to the output action, which easily causes instability, so DDPG is not the most suitable algorithm for impedance learning in a robot axis-hole assembly task.
3. The conventional Probabilistic Inference for Learning Control (PILCO) algorithm for impedance learning is a model-based reinforcement learning algorithm: it requires a Gaussian Process (GP) to model and predict future states, and the modeling introduces a certain model error.
4. General reinforcement learning adds random exploration during solving, and exploration without safety limits is likely to bring serious risks. If a reinforcement learning method is applied directly to a real-world task, the agent explores and learns by trial and error, and its decisions may drive the system into a dangerous state.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a robot impedance learning method based on safety reinforcement learning, which has high stability and improves the feasibility of variable admittance control.
One aspect of the present invention provides a robot impedance learning method based on safety reinforcement learning, including:
the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation;
constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Optionally, the calculating the position information and the velocity information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
wherein the robot dynamics equation has the expression:
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
wherein τ is the joint moment; M(q) is the joint-space inertia matrix; C(q,q̇) is the Coriolis and centripetal force coupling matrix; G(q) is the gravity term; F(q̇) is the non-rigid-body dynamics term such as friction; τe is the moment applied to the environment in joint space; q and q̇ are the robot joint-space position and speed information;
the position information is Cartesian-space position information, and the calculation formula of the Cartesian-space position information is as follows:
x=ψ(q);
the calculation formula of the speed information is as follows:
ẋ=J(q)q̇;
wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics mapping; J(q) is the Jacobian matrix.
Optionally, the determining a decision action according to the input item to determine an impedance parameter includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action;
taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
wherein Fe represents the environment interaction force; Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; Fd is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot; xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
Optionally, the expression of the admittance model is:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
wherein xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through a Critic network and an Actor network to obtain a decision action, including:
respectively inputting the state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises position information and speed information of a current mechanical arm end effector;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
Optionally, the method further comprises the steps of: and safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the steps specifically comprise:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
optimizing and adjusting the loss function according to the actual task;
and carrying out safety reinforcement learning according to the adjusted loss function.
In another aspect, an embodiment of the present invention further provides a robot impedance learning apparatus based on safety reinforcement learning, including:
the first module is used for outputting a control torque by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to a robot dynamic equation;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
a third module, configured to determine a decision action according to the input item, and further determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module and calculating to obtain the environment interaction force;
the fifth module is used for taking the impedance parameters and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Another aspect of the embodiments of the present invention further provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The controller of the embodiment of the invention outputs the control torque, and calculates the position information and the speed information of the Cartesian space of the robot according to the kinetic equation of the robot; constructing an input item according to the position information, the speed information and the return information of the learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model. The invention has high stability and improves the feasibility of variable admittance control.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a diagram of a learning control framework provided by an embodiment of the present invention;
fig. 2 is a schematic view of an axis hole assembly of a robot provided in an embodiment of the present invention;
fig. 3 is a network architecture diagram of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm according to an embodiment of the present invention;
FIG. 4 is a block diagram of a constrained Markov decision process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
First, the related terms related to the present invention will be explained:
admittance control: the compliance control is a robot interaction control method, the compliance control does not try to independently control the position and the interaction force of the robot, but aims to realize the specified dynamic relation between the interaction force and the position error, namely, the robot is subjected to impedance or admittance shaping, and virtual compliance is introduced into the system in a control mode so as to ensure the safety of the interaction process. The compliance control is divided into impedance control based on force control and impedance control based on position control, the latter is called admittance control for short, and the idea is that the robot in the interaction process is modulated into a second-order admittance model in a control mode, wherein the second-order admittance model comprises three impedance parameters of inertia, damping and rigidity.
Variable admittance control: the fixed admittance model guides the inertia, damping and rigidity parameters in the admittance model to be fixed and unchanged. Under many conditions, the interaction control of the fixed admittance model cannot achieve the expected effect of adapting to the environment and the task, so that a variable admittance control concept is introduced, and the impedance parameters are adjusted according to the specific environment and the task, so that the robot can be more compliant with the environmental force, and the compliant operation under the unknown dynamic environment is realized.
Impedance learning: the process of adjusting the impedance parameters is commonly referred to as impedance learning. Common impedance adjustment methods include imitation learning, iterative learning, gradient descent optimization, neural networks, reinforcement learning, and the like.
Reinforcement learning: the method can overcome the limitation caused by the fact that the traditional optimal control algorithm cannot completely model the environment, and finds the optimal solution through interaction with the environment. In robotic applications, one of the main objectives of reinforcement learning is to make the robot interact with the environment completely autonomously, and its important feature is to learn the best behavior without human involvement, without knowing the model of the robot and the environmental system. In the robot variable impedance learning task, the main purpose of reinforcement learning is to autonomously learn and adjust the impedance parameters of the robot to show more appropriate flexibility.
Constrained Markov Decision Process (CMDP): a Markov Decision Process (MDP) is a mathematical model of sequential decision making that models the stochastic policies and returns achievable by an agent in an environment whose state has the Markov property; almost all reinforcement learning problems can be formulated as MDPs, which are used to model reinforcement learning problems. An MDP can be solved for the agent policy that maximizes the return using methods such as dynamic programming and random sampling. A Constrained Markov Decision Process (CMDP) additionally introduces a loss function and constraints; the objective of the CMDP problem is to maximize the long-term return while satisfying all constraints.
Aiming at the problems in the prior art, the embodiment of the invention provides a robot impedance learning method based on safety reinforcement learning, which comprises the following steps:
the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation;
constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Optionally, the calculating the position information and the velocity information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
wherein the robot dynamics equation has the expression:
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
wherein τ is the joint moment; M(q) is the joint-space inertia matrix; C(q,q̇) is the Coriolis and centripetal force coupling matrix; G(q) is the gravity term; F(q̇) is the non-rigid-body dynamics term such as friction; τe is the moment applied to the environment in joint space; q and q̇ are the robot joint-space position and speed information;
the position information is Cartesian-space position information, and the calculation formula of the Cartesian-space position information is as follows:
x=ψ(q);
the calculation formula of the speed information is as follows:
ẋ=J(q)q̇;
wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics mapping; J(q) is the Jacobian matrix.
Optionally, the determining a decision action according to the input item to determine an impedance parameter includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action;
taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
wherein Fe represents the environment interaction force; Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; Fd is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot; xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
Optionally, the expression of the admittance model is:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
wherein xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through a Critic network and an Actor network to obtain a decision action, including:
respectively inputting the state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises position information and speed information of a current mechanical arm end effector;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
Optionally, the method further comprises the steps of: and safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the steps specifically comprise:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
optimizing and adjusting the loss function according to the actual task;
and carrying out safety reinforcement learning according to the adjusted loss function.
In another aspect, an embodiment of the present invention further provides a robot impedance learning apparatus based on safety reinforcement learning, including:
the first module is used for outputting a control torque by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to a robot dynamic equation;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
a third module, configured to determine a decision action according to the input item, and further determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module and calculating to obtain the environment interaction force;
the fifth module is used for taking the impedance parameters and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Another aspect of the embodiments of the present invention further provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The following detailed description of the embodiments of the present invention is made with reference to the accompanying drawings:
as shown in fig. 1, fig. 1 is a learning control framework diagram according to an embodiment of the present invention.
Specifically, the ensemble learning control flow is as follows:
1) The inner loop is the control loop, which makes the robot system with unknown dynamics exhibit the behavior of the specified admittance model: the controller outputs a control moment τ, and the actual Cartesian-space position x and velocity ẋ of the robot are calculated according to the robot dynamics equation
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
where τ is the joint moment, M(q) is the joint-space inertia matrix, C(q,q̇) is the Coriolis and centripetal force coupling matrix, G(q) is the gravity term, F(q̇) is the non-rigid-body dynamics term such as friction, τe is the moment applied to the environment in joint space, and q, q̇ are the robot joint-space position and speed information. The corresponding Cartesian-space position and velocity of the robot are obtained through the transformation model, with the specific formulas:
x=ψ(q),
ẋ=J(q)q̇,
where ψ(·) is the robot forward kinematics mapping and J(q) is the Jacobian matrix.
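A minimal sketch of this joint-to-Cartesian mapping is given below, assuming user-supplied kinematics routines for ψ(q) and J(q) (the function and variable names are illustrative, not part of the original disclosure):

```python
import numpy as np

def cartesian_state(q, dq, forward_kinematics, jacobian):
    """Map the joint-space state (q, dq) to the Cartesian position x and
    velocity dx, via x = psi(q) and dx = J(q) * dq.

    forward_kinematics and jacobian are placeholders for the robot-specific
    kinematics psi(q) and Jacobian J(q), e.g. of the Panda arm.
    """
    x = np.asarray(forward_kinematics(q))           # x = psi(q)
    dx = np.asarray(jacobian(q)) @ np.asarray(dq)   # dx = J(q) * dq
    return x, dx
```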
2) The actual position x and velocity ẋ, together with the reward r, are used as the input items of the reinforcement learning algorithm. The input state information is processed, evaluated by the Critic networks, and the Actor network outputs the decision action, namely the impedance parameter K; the specific algorithm implementation process is shown in fig. 3.
Fig. 3 shows the network architecture of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm.
This embodiment learns the impedance parameter using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, where the state S includes the current position x and velocity ẋ of the robot arm end effector and the action K output at the previous moment; A represents the policy output by the Actor network; A' represents the policy output by the Actor target network; Q1 represents the value function computed by Critic network 1; Q2 represents the value function computed by Critic network 2; Q' represents the target value function; R represents the instant reward; TD_error1 represents the error between Q1 and the weighted sum of R and Q'; TD_error2 represents the error between Q2 and the weighted sum of R and Q'; Target represents the weighted sum of R and Q'. The difference between the Actor network and the Actor target network is that the Actor network is updated from the experience pool at every step, whereas the Actor target network copies the parameters of the Actor network at regular intervals. Critic network 1 and Critic network 2 independently update their network parameters using the same target value function. Likewise, Critic network 1 is updated from the experience pool at every step, and Critic target network 1 copies the parameters of Critic network 1 at regular intervals. The reward function defined in this embodiment is:
r = -a*||(Fe - Fd)²|| - b*||(x - xd)²|| - c*||(x - xobj)²|| + rfinal,
where xobj denotes the target position and rfinal is a positive integer.
The reward function contains four terms. The first three represent the instant reward at each step and respectively penalize generating a large interaction force, deviating from the desired trajectory, and staying far from the target position; the last term rewards completing the task within the specified time, i.e., a bonus is given when the target position is reached. The purpose of this reward function is to encourage motion towards the hole while suppressing behavior that creates a large interaction force.
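A minimal sketch of this reward computation is shown below; the weights a, b, c, the bonus r_final and the argument names are illustrative and would be tuned per task:

```python
import numpy as np

def reward(F_e, F_d, x, x_d, x_obj, a, b, c, r_final, reached_target):
    """Instant reward: penalise a large interaction force, deviation from the
    desired trajectory and distance to the target position; add the bonus
    r_final when the target position is reached within the allotted time."""
    r = (-a * np.linalg.norm((np.asarray(F_e) - np.asarray(F_d)) ** 2)
         - b * np.linalg.norm((np.asarray(x) - np.asarray(x_d)) ** 2)
         - c * np.linalg.norm((np.asarray(x) - np.asarray(x_obj)) ** 2))
    return r + (r_final if reached_target else 0.0)
```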
3) In addition, the actual position x and velocity ẋ of the robot arm end effector are used as the input of the environment module, and the environment interaction force Fe is calculated. The environment interaction force is designed as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
where Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model, Fd is the desired interaction force, x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot, and xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
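Under the interaction-force model reconstructed above, a minimal sketch of this computation is as follows (the diagonal matrices and the desired trajectory are assumed to be supplied by the task; the names are illustrative):

```python
import numpy as np

def environment_force(x, dx, ddx, x_d, dx_d, ddx_d, F_d, M_d, C_d, K_d):
    """Environment interaction force from the admittance-model matrices:
    Fe = Md*(ddx - ddx_d) + Cd*(dx - dx_d) + Kd*(x - x_d) + Fd."""
    return (M_d @ (ddx - ddx_d)
            + C_d @ (dx - dx_d)
            + K_d @ (x - x_d)
            + F_d)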
4) The impedance parameter K and the environment interaction force Fe are used as the input of the admittance model; the reference position xr and the reference velocity ẋr are calculated from the admittance model and serve as the input of the controller. The admittance model is as follows:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
where xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force. The reference position xr and the reference velocity ẋr are calculated from the admittance model by integration: the reference velocity ẋr(t-τs) and the reference position xr(t-τs) at the previous time t-τs are used to calculate the reference acceleration ẍr(t) at the current time t, and the reference velocity ẋr(t) and the reference position xr(t) at the current time are then obtained by integration, with the formulas:
ẍr(t) = ẍd + Md⁻¹[Fe - Fd - Cd(ẋr(t-τs) - ẋd) - Kd(xr(t-τs) - xd)],
ẋr(t) = ẋr(t-τs) + ẍr(t)·τs,
xr(t) = xr(t-τs) + ẋr(t)·τs.
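The integration step above can be sketched as a simple forward-Euler update; the variable names are illustrative, and it is assumed that the learned impedance parameter K has already been written into the stiffness matrix Kd before the call:

```python
import numpy as np

def admittance_step(x_r_prev, dx_r_prev, x_d, dx_d, ddx_d,
                    F_e, F_d, M_d, C_d, K_d, tau_s):
    """One integration step of the admittance model
    Md*(ddx_r - ddx_d) + Cd*(dx_r - dx_d) + Kd*(x_r - x_d) = Fe - Fd:
    solve for the reference acceleration from the previous reference state,
    then integrate once for the reference velocity and again for the
    reference position fed to the inner-loop controller."""
    ddx_r = ddx_d + np.linalg.solve(
        M_d,
        (F_e - F_d)
        - C_d @ (dx_r_prev - dx_d)
        - K_d @ (x_r_prev - x_d))
    dx_r = dx_r_prev + ddx_r * tau_s
    x_r = x_r_prev + dx_r * tau_s
    return x_r, dx_r, ddx_r
```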
the final purpose of the scheme is to learn the impedance parameters of the robot in the task execution process in a reinforcement learning mode, enhance the compliance effect of the robot and guarantee the safety of the robot and the environment. After hundreds of rounds of training, the learning curve is converged, and the optimal motion trajectory and environment interaction force change curve of the robot can be obtained through simulation, so that the whole task is completed.
Fig. 2 is a schematic diagram of the robot axis-hole assembly; the axis-hole (peg-in-hole) assembly task of the robot is taken as the task background. The initial position of the end of the robot arm is x0; the desired trajectory goes first from x0 to x1 and then from x1 to x2. When the robot arm deviates from the desired trajectory, it is subjected to the environment interaction force generated by collision with the wall; by adjusting the impedance parameter the robot arm is made compliant and the environment interaction force is reduced.
FIG. 4 is a schematic diagram of the constrained Markov decision process framework; the constrained Markov decision process algorithm is combined with the above learning scheme to implement safety reinforcement learning.
Compared with the general Markov decision process, the invention additionally introduces a loss function c in the constrained Markov decision process and sets a constraint threshold d. Let JC(π) be the long-term loss of the policy π; the set of feasible solutions is then defined as ΠC = {π ∈ Π : JC(π) ≤ d}. Then, under the condition of satisfying the constraint, the policy that maximizes the long-term return is sought: π* = argmax_{π∈ΠC} J(π). According to the specific task, the loss function is designed as c = w*||(Fe - Fd)²||, where w is a parameter for adjusting the weight. The purpose of designing this loss function is to constrain the environment interaction force within a safe range, and then realize the learning of the impedance parameters within that safe range.
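A minimal sketch of this constraint bookkeeping is given below; the discount factor, the rollout-based estimate of JC(π) and the names are assumptions for illustration (how the constraint is enforced during training, e.g. via a Lagrange multiplier, is not spelled out here):

```python
import numpy as np

def step_loss(F_e, F_d, w):
    """Per-step loss c = w * ||(Fe - Fd)^2|| penalising unsafe interaction force."""
    return w * np.linalg.norm((np.asarray(F_e) - np.asarray(F_d)) ** 2)

def satisfies_constraint(losses, d, gamma=0.99):
    """Check JC(pi) <= d for one rollout, with JC estimated as a discounted sum
    of the per-step losses c collected along the trajectory."""
    J_C = sum((gamma ** t) * c for t, c in enumerate(losses))
    return J_C <= d
```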
In conclusion, the TD3 algorithm is applied to the impedance learning of the Panda robot for the first time, verifying the feasibility of impedance learning, and hence variable admittance control, of the Panda robot based on reinforcement learning; the invention also applies the safety reinforcement learning idea to impedance learning, combining the CMDP with the TD3 algorithm and applying the combination to the impedance learning task in the robot axis-hole assembly process, thereby ensuring the safety of the axis-hole assembly task.
Compared with the prior art, the invention has the following advantages:
Firstly, impedance learning based on the TD3 algorithm is applied to the Panda robot simulation platform for the first time, verifying the feasibility of impedance learning, and hence variable admittance control, of the Panda robot based on reinforcement learning. Secondly, the deep reinforcement learning (TD3) algorithm has higher performance and better stability, learns the optimal impedance parameters faster, and is more suitable for the impedance learning task in the robot axis-hole assembly process. Thirdly, the safety reinforcement learning idea is introduced and combined with the CMDP, which provides a safety guarantee and makes the robot axis-hole assembly process safer.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A robot impedance learning method based on safety reinforcement learning is characterized by comprising the following steps:
the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation;
constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
2. The robot impedance learning method based on safety reinforcement learning of claim 1, wherein the position information and the speed information of the cartesian space of the robot are calculated according to a robot dynamics equation, specifically:
calculating the position information and the speed information through a robot dynamics equation;
wherein the robot dynamics equation has the expression:
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
wherein τ is the joint moment; M(q) is the joint-space inertia matrix; C(q,q̇) is the Coriolis and centripetal force coupling matrix; G(q) is the gravity term; F(q̇) is the non-rigid-body dynamics term such as friction; τe is the moment applied to the environment in joint space; q and q̇ are the robot joint-space position and speed information;
the position information is Cartesian-space position information, and the calculation formula of the Cartesian-space position information is as follows:
x=ψ(q);
the calculation formula of the speed information is as follows:
ẋ=J(q)q̇;
wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics mapping; J(q) is the Jacobian matrix.
3. The robot impedance learning method based on safety reinforcement learning of claim 1, wherein the determining a decision action according to the input item and further determining an impedance parameter comprises:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action;
taking the decision action as an impedance parameter.
4. The robot impedance learning method based on safety reinforcement learning of claim 1, wherein the calculation formula of the environment interaction force is as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
wherein Fe represents the environment interaction force; Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; Fd is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot; xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
5. The robot impedance learning method based on safety reinforcement learning of claim 4, wherein the expression of the admittance model is as follows:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
wherein xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force.
6. The robot impedance learning method based on safety reinforcement learning of claim 3, wherein the evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action comprises:
respectively inputting the state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises position information and speed information of a current mechanical arm end effector;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
7. The robot impedance learning method based on safety reinforcement learning as claimed in claim 6, further comprising the following steps: and safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the steps specifically comprise:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
optimizing and adjusting the loss function according to the actual task;
and carrying out safety reinforcement learning according to the adjusted loss function.
8. A robot impedance learning device based on safety reinforcement learning is characterized by comprising:
the first module is used for outputting a control torque by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to a robot dynamic equation;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
a third module, configured to determine a decision action according to the input item, and further determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module and calculating to obtain the environment interaction force;
the fifth module is used for taking the impedance parameters and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method according to any one of claims 1 to 7.
CN202210055753.7A 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning Active CN114378820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Publications (2)

Publication Number Publication Date
CN114378820A true CN114378820A (en) 2022-04-22
CN114378820B CN114378820B (en) 2023-06-06

Family

ID=81203767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055753.7A Active CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Country Status (1)

Country Link
CN (1) CN114378820B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387A (en) * 2022-09-22 2022-12-02 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153A (en) * 2017-12-19 2018-06-12 哈尔滨工程大学 A kind of study impedance control system and control method
US20200101603A1 (en) * 2018-10-02 2020-04-02 Fanuc Corporation Controller and control system
CN112757344A (en) * 2021-01-20 2021-05-07 清华大学 Robot interference shaft hole assembling method and device based on force position state mapping model
CN112847235A (en) * 2020-12-25 2021-05-28 山东大学 Robot step force guiding assembly method and system based on deep reinforcement learning
CN113341706A (en) * 2021-05-06 2021-09-03 东华大学 Man-machine cooperation assembly line system based on deep reinforcement learning
CN113352322A (en) * 2021-05-19 2021-09-07 浙江工业大学 Adaptive man-machine cooperation control method based on optimal admittance parameters
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153A (en) * 2017-12-19 2018-06-12 哈尔滨工程大学 A kind of study impedance control system and control method
US20200101603A1 (en) * 2018-10-02 2020-04-02 Fanuc Corporation Controller and control system
CN112847235A (en) * 2020-12-25 2021-05-28 山东大学 Robot step force guiding assembly method and system based on deep reinforcement learning
CN112757344A (en) * 2021-01-20 2021-05-07 清华大学 Robot interference shaft hole assembling method and device based on force position state mapping model
CN113341706A (en) * 2021-05-06 2021-09-03 东华大学 Man-machine cooperation assembly line system based on deep reinforcement learning
CN113352322A (en) * 2021-05-19 2021-09-07 浙江工业大学 Adaptive man-machine cooperation control method based on optimal admittance parameters
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIXU DENG: "Sparse Online Gaussian Process Impedance Learning for Multi-DoF Robotic Arms", INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS *
DU, ZHIJIANG; WANG, WEI; YAN, ZHIYUAN; DONG, WEI; WANG, WEIDONG: "Human-machine interaction method for a minimally invasive surgical robot arm based on fuzzy reinforcement learning", ROBOT

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387A (en) * 2022-09-22 2022-12-02 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning

Also Published As

Publication number Publication date
CN114378820B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
US11584008B1 (en) Simulation-real world feedback loop for learning robotic control policies
US11707838B1 (en) Artificial intelligence system for efficiently learning robotic control policies
Wang et al. A survey of learning‐based robot motion planning
JP2009288934A (en) Data processing apparatus, data processing method, and computer program
Kilinc et al. Reinforcement learning for robotic manipulation using simulated locomotion demonstrations
US9747543B1 (en) System and method for controller adaptation
CN115812180A (en) Robot-controlled offline learning using reward prediction model
CN110716575A (en) UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN114521262A (en) Controlling an agent using a causal correct environment model
Dai et al. Robust control of underwater vehicle‐manipulator system using grey wolf optimizer‐based nonlinear disturbance observer and H‐infinity controller
CN114378820A (en) Robot impedance learning method based on safety reinforcement learning
Ranjbar et al. Residual feedback learning for contact-rich manipulation tasks with uncertainty
CN114779792B (en) Medical robot autonomous obstacle avoidance method and system based on simulation and reinforcement learning
Li et al. Research on the agricultural machinery path tracking method based on deep reinforcement learning
CN114967472A (en) Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method
Toner et al. Opportunities and challenges in applying reinforcement learning to robotic manipulation: An industrial case study
Solovyeva et al. Controlling system based on neural networks with reinforcement learning for robotic manipulator
Fernandez et al. Deep reinforcement learning with linear quadratic regulator regions
Yu et al. Deep Q‐Network with Predictive State Models in Partially Observable Domains
dos Santos et al. A stochastic learning approach for construction of brick structures with a ground robot
D’Silva et al. Learning dynamic obstacle avoidance for a robot arm using neuroevolution
Ali et al. Tree-select Trial and Error Algorithm for Adaptation to Failures of Redundant Manipulators
Huang et al. Trade-off on Sim2Real Learning: Real-world Learning Faster than Simulations
Gillen Improving Reinforcement Learning for Robotics with Control and Dynamical Systems Theory
Sun et al. Introduction to Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant