CN114378820A - Robot impedance learning method based on safety reinforcement learning - Google Patents

Robot impedance learning method based on safety reinforcement learning Download PDF

Info

Publication number
CN114378820A
Authority
CN
China
Prior art keywords
robot
information
impedance
learning
input
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210055753.7A
Other languages
Chinese (zh)
Other versions
CN114378820B (en)
Inventor
Pan Yongping
Feng Xiaoxin
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210055753.7A priority Critical patent/CN114378820B/en
Publication of CN114378820A publication Critical patent/CN114378820A/en
Application granted granted Critical
Publication of CN114378820B publication Critical patent/CN114378820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • B25J9/163Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/161Hardware, e.g. neural networks, fuzzy logic, interfaces, processor

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot impedance learning method based on safety reinforcement learning, which comprises the following steps: the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation; constructing an input item according to the position information, the speed information and the return information of the learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model. The invention has high stability, improves the feasibility of variable admittance control, and can be widely applied to the technical field of artificial intelligence.

Description

Robot impedance learning method based on safety reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a robot impedance learning method based on safety reinforcement learning.
Background
Impedance control is an effective method for a robot to realize interaction force control: a desired impedance parameter is given, and an impedance controller is then designed to control the robot so as to achieve the desired interaction force. However, because the external environment is unknown and uncertain, it is usually necessary to introduce virtual compliance into the system to ensure the safety of the interaction process.
The existing robot impedance learning method has the following defects:
1. Conventional optimization methods (e.g., gradient descent optimization) require the environment and robot models to be known in order to learn the impedance parameters.
2. The existing reinforcement learning algorithms applied to impedance learning, such as the Deep Deterministic Policy Gradient (DDPG) algorithm, have certain defects: the Actor network in DDPG is updated without delay and no noise is added to the output action, which easily causes instability, so DDPG is not the most suitable algorithm for impedance learning in a robot axis-hole assembly task.
3. The conventional Probabilistic Inference for Learning Control (PILCO) algorithm for impedance learning is a model-based reinforcement learning algorithm: it requires a Gaussian Process (GP) to model and predict future states, and the modeling introduces a certain model error.
4. General reinforcement learning adds random exploration during solving, and exploration without safety limits is likely to bring serious risks. If a reinforcement learning method is applied directly to a real-world task, the agent explores and learns by trial and error, and its decisions may drive the system into a dangerous state.
Disclosure of Invention
In view of this, the embodiment of the present invention provides a robot impedance learning method based on safety reinforcement learning, which has high stability and improves the feasibility of variable admittance control.
One aspect of the present invention provides a robot impedance learning method based on safety reinforcement learning, including:
the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation;
constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Optionally, the calculating the position information and the velocity information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
wherein the robot dynamics equation has the expression:
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
wherein τ is the joint moment; M(q) is the joint-space inertia matrix; C(q,q̇) is the Coriolis and centripetal force coupling matrix; G(q) is the gravity term; F(q̇) is the non-rigid-body dynamics term such as friction; τe is the moment applied to the environment in joint space; q and q̇ are the robot joint-space position and speed information;
the position information is Cartesian-space position information, and the calculation formula of the Cartesian-space position information is as follows:
x=ψ(q);
the calculation formula of the speed information is as follows:
ẋ=J(q)q̇;
wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics mapping; J(q) is the Jacobian matrix.
Optionally, the determining a decision action according to the input item to determine an impedance parameter includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action;
taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
wherein Fe represents the environment interaction force; Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; Fd is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot; xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
Optionally, the expression of the admittance model is:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
wherein xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through a Critic network and an Actor network to obtain a decision action, including:
respectively inputting the state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises position information and speed information of a current mechanical arm end effector;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
Optionally, the method further comprises the steps of: and safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the steps specifically comprise:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
optimizing and adjusting the loss function according to the actual task;
and carrying out safety reinforcement learning according to the adjusted loss function.
In another aspect, an embodiment of the present invention further provides a robot impedance learning apparatus based on safety reinforcement learning, including:
the first module is used for outputting a control torque by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to a robot dynamic equation;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
a third module, configured to determine a decision action according to the input item, and further determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module and calculating to obtain the environment interaction force;
the fifth module is used for taking the impedance parameters and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Another aspect of the embodiments of the present invention further provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The controller of the embodiment of the invention outputs the control torque, and calculates the position information and the speed information of the Cartesian space of the robot according to the kinetic equation of the robot; constructing an input item according to the position information, the speed information and the return information of the learning algorithm; determining a decision action according to the input item, and further determining an impedance parameter; taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force; and taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model. The invention has high stability and improves the feasibility of variable admittance control.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a diagram of a learning control framework provided by an embodiment of the present invention;
fig. 2 is a schematic view of an axis hole assembly of a robot provided in an embodiment of the present invention;
fig. 3 is a network architecture diagram of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm according to an embodiment of the present invention;
FIG. 4 is a block diagram of a constrained Markov decision process according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
First, the related terms related to the present invention will be explained:
admittance control: the compliance control is a robot interaction control method, the compliance control does not try to independently control the position and the interaction force of the robot, but aims to realize the specified dynamic relation between the interaction force and the position error, namely, the robot is subjected to impedance or admittance shaping, and virtual compliance is introduced into the system in a control mode so as to ensure the safety of the interaction process. The compliance control is divided into impedance control based on force control and impedance control based on position control, the latter is called admittance control for short, and the idea is that the robot in the interaction process is modulated into a second-order admittance model in a control mode, wherein the second-order admittance model comprises three impedance parameters of inertia, damping and rigidity.
Variable admittance control: the fixed admittance model guides the inertia, damping and rigidity parameters in the admittance model to be fixed and unchanged. Under many conditions, the interaction control of the fixed admittance model cannot achieve the expected effect of adapting to the environment and the task, so that a variable admittance control concept is introduced, and the impedance parameters are adjusted according to the specific environment and the task, so that the robot can be more compliant with the environmental force, and the compliant operation under the unknown dynamic environment is realized.
Impedance learning: the process of adjusting the impedance parameters is commonly referred to as impedance learning. Common impedance adjustment methods include imitation learning, iterative learning, gradient descent optimization, neural networks, reinforcement learning, and the like.
Reinforcement learning: the method can overcome the limitation caused by the fact that the traditional optimal control algorithm cannot completely model the environment, and finds the optimal solution through interaction with the environment. In robotic applications, one of the main objectives of reinforcement learning is to make the robot interact with the environment completely autonomously, and its important feature is to learn the best behavior without human involvement, without knowing the model of the robot and the environmental system. In the robot variable impedance learning task, the main purpose of reinforcement learning is to autonomously learn and adjust the impedance parameters of the robot to show more appropriate flexibility.
Constrained Markov Decision Process (CMDP): a Markov Decision Process (MDP) is a mathematical model of sequential decision making that models the stochastic policies and returns achievable by an agent in an environment whose state has the Markov property; almost all reinforcement learning problems can be formulated as MDPs, which are used to model reinforcement learning problems. An MDP can be solved for the agent policy that maximizes the return using methods such as dynamic programming and random sampling. A Constrained Markov Decision Process (CMDP) additionally introduces a loss function and constraints; the objective of the CMDP problem is to maximize the long-term return while satisfying all constraints.
Aiming at the problems in the prior art, the embodiment of the invention provides a robot impedance learning method based on safety reinforcement learning, which comprises the following steps:
the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation;
constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Optionally, the calculating the position information and the velocity information of the cartesian space of the robot according to the robot dynamics equation specifically includes:
calculating the position information and the speed information through a robot dynamics equation;
wherein the robot dynamics equation has the expression:
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
wherein τ is the joint moment; M(q) is the joint-space inertia matrix; C(q,q̇) is the Coriolis and centripetal force coupling matrix; G(q) is the gravity term; F(q̇) is the non-rigid-body dynamics term such as friction; τe is the moment applied to the environment in joint space; q and q̇ are the robot joint-space position and speed information;
the position information is Cartesian-space position information, and the calculation formula of the Cartesian-space position information is as follows:
x=ψ(q);
the calculation formula of the speed information is as follows:
ẋ=J(q)q̇;
wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics mapping; J(q) is the Jacobian matrix.
Optionally, the determining a decision action according to the input item to determine an impedance parameter includes:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action;
taking the decision action as an impedance parameter.
Optionally, the calculation formula of the environment interaction force is as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
wherein Fe represents the environment interaction force; Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; Fd is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot; xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
Optionally, the expression of the admittance model is:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
wherein xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force.
Optionally, the evaluating and optimizing the processed input item through a Critic network and an Actor network to obtain a decision action, including:
respectively inputting the state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises position information and speed information of a current mechanical arm end effector;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
Optionally, the method further comprises the steps of: and safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the steps specifically comprise:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
optimizing and adjusting the loss function according to the actual task;
and carrying out safety reinforcement learning according to the adjusted loss function.
In another aspect, an embodiment of the present invention further provides a robot impedance learning apparatus based on safety reinforcement learning, including:
the first module is used for outputting a control torque by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to a robot dynamic equation;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
a third module, configured to determine a decision action according to the input item, and further determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module and calculating to obtain the environment interaction force;
the fifth module is used for taking the impedance parameters and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
Another aspect of the embodiments of the present invention further provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Yet another aspect of the embodiments of the present invention provides a computer-readable storage medium, which stores a program, which is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
The following detailed description of the embodiments of the present invention is made with reference to the accompanying drawings:
as shown in fig. 1, fig. 1 is a learning control framework diagram according to an embodiment of the present invention.
Specifically, the ensemble learning control flow is as follows:
1) The inner loop is the control loop, which makes the robot system with unknown dynamics exhibit the behavior of the specified admittance model: the controller outputs a control moment τ, and the actual Cartesian-space position x and velocity ẋ of the robot are calculated according to the robot dynamics equation
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
where τ is the joint moment, M(q) is the joint-space inertia matrix, C(q,q̇) is the Coriolis and centripetal force coupling matrix, G(q) is the gravity term, F(q̇) is the non-rigid-body dynamics term such as friction, τe is the moment applied to the environment in joint space, and q, q̇ are the robot joint-space position and speed information. The corresponding Cartesian-space position and velocity of the robot are obtained through the transformation model, with the specific formulas:
x=ψ(q),
ẋ=J(q)q̇,
where ψ(·) is the robot forward kinematics mapping and J(q) is the Jacobian matrix.
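A minimal sketch of this joint-to-Cartesian mapping is given below, assuming user-supplied kinematics routines for ψ(q) and J(q) (the function and variable names are illustrative, not part of the original disclosure):

```python
import numpy as np

def cartesian_state(q, dq, forward_kinematics, jacobian):
    """Map the joint-space state (q, dq) to the Cartesian position x and
    velocity dx, via x = psi(q) and dx = J(q) * dq.

    forward_kinematics and jacobian are placeholders for the robot-specific
    kinematics psi(q) and Jacobian J(q), e.g. of the Panda arm.
    """
    x = np.asarray(forward_kinematics(q))           # x = psi(q)
    dx = np.asarray(jacobian(q)) @ np.asarray(dq)   # dx = J(q) * dq
    return x, dx
```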
2) The actual position x and velocity ẋ, together with the reward r, are used as the input items of the reinforcement learning algorithm. The input state information is processed, evaluated by the Critic networks, and the Actor network outputs the decision action, namely the impedance parameter K; the specific algorithm implementation process is shown in fig. 3.
Fig. 3 shows the network architecture of the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm.
This embodiment learns the impedance parameter using the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm, where the state S includes the current position x and velocity ẋ of the robot arm end effector and the action K output at the previous moment; A represents the policy output by the Actor network; A' represents the policy output by the Actor target network; Q1 represents the value function computed by Critic network 1; Q2 represents the value function computed by Critic network 2; Q' represents the target value function; R represents the instant reward; TD_error1 represents the error between Q1 and the weighted sum of R and Q'; TD_error2 represents the error between Q2 and the weighted sum of R and Q'; Target represents the weighted sum of R and Q'. The difference between the Actor network and the Actor target network is that the Actor network is updated from the experience pool at every step, whereas the Actor target network copies the parameters of the Actor network at regular intervals. Critic network 1 and Critic network 2 independently update their network parameters using the same target value function. Likewise, Critic network 1 is updated from the experience pool at every step, and Critic target network 1 copies the parameters of Critic network 1 at regular intervals. The reward function defined in this embodiment is:
r = -a*||(Fe - Fd)²|| - b*||(x - xd)²|| - c*||(x - xobj)²|| + rfinal,
where xobj denotes the target position and rfinal is a positive integer.
The reward function contains four terms. The first three represent the instant reward at each step and respectively penalize generating a large interaction force, deviating from the desired trajectory, and staying far from the target position; the last term rewards completing the task within the specified time, i.e., a bonus is given when the target position is reached. The purpose of this reward function is to encourage motion towards the hole while suppressing behavior that creates a large interaction force.
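A minimal sketch of this reward computation is shown below; the weights a, b, c, the bonus r_final and the argument names are illustrative and would be tuned per task:

```python
import numpy as np

def reward(F_e, F_d, x, x_d, x_obj, a, b, c, r_final, reached_target):
    """Instant reward: penalise a large interaction force, deviation from the
    desired trajectory and distance to the target position; add the bonus
    r_final when the target position is reached within the allotted time."""
    r = (-a * np.linalg.norm((np.asarray(F_e) - np.asarray(F_d)) ** 2)
         - b * np.linalg.norm((np.asarray(x) - np.asarray(x_d)) ** 2)
         - c * np.linalg.norm((np.asarray(x) - np.asarray(x_obj)) ** 2))
    return r + (r_final if reached_target else 0.0)
```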
3) In addition, the actual position x and velocity ẋ of the robot arm end effector are used as the input of the environment module, and the environment interaction force Fe is calculated. The environment interaction force is designed as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
where Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model, Fd is the desired interaction force, x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot, and xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
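Under the interaction-force model reconstructed above, a minimal sketch of this computation is as follows (the diagonal matrices and the desired trajectory are assumed to be supplied by the task; the names are illustrative):

```python
import numpy as np

def environment_force(x, dx, ddx, x_d, dx_d, ddx_d, F_d, M_d, C_d, K_d):
    """Environment interaction force from the admittance-model matrices:
    Fe = Md*(ddx - ddx_d) + Cd*(dx - dx_d) + Kd*(x - x_d) + Fd."""
    return (M_d @ (ddx - ddx_d)
            + C_d @ (dx - dx_d)
            + K_d @ (x - x_d)
            + F_d)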
4) The impedance parameter K and the environment interaction force Fe are used as the input of the admittance model; the reference position xr and the reference velocity ẋr are calculated from the admittance model and serve as the input of the controller. The admittance model is as follows:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
where xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force. The reference position xr and the reference velocity ẋr are calculated from the admittance model by integration: the reference velocity ẋr(t-τs) and the reference position xr(t-τs) at the previous time t-τs are used to calculate the reference acceleration ẍr(t) at the current time t, and the reference velocity ẋr(t) and the reference position xr(t) at the current time are then obtained by integration, with the formulas:
ẍr(t) = ẍd + Md⁻¹[Fe - Fd - Cd(ẋr(t-τs) - ẋd) - Kd(xr(t-τs) - xd)],
ẋr(t) = ẋr(t-τs) + ẍr(t)·τs,
xr(t) = xr(t-τs) + ẋr(t)·τs.
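The integration step above can be sketched as a simple forward-Euler update; the variable names are illustrative, and it is assumed that the learned impedance parameter K has already been written into the stiffness matrix Kd before the call:

```python
import numpy as np

def admittance_step(x_r_prev, dx_r_prev, x_d, dx_d, ddx_d,
                    F_e, F_d, M_d, C_d, K_d, tau_s):
    """One integration step of the admittance model
    Md*(ddx_r - ddx_d) + Cd*(dx_r - dx_d) + Kd*(x_r - x_d) = Fe - Fd:
    solve for the reference acceleration from the previous reference state,
    then integrate once for the reference velocity and again for the
    reference position fed to the inner-loop controller."""
    ddx_r = ddx_d + np.linalg.solve(
        M_d,
        (F_e - F_d)
        - C_d @ (dx_r_prev - dx_d)
        - K_d @ (x_r_prev - x_d))
    dx_r = dx_r_prev + ddx_r * tau_s
    x_r = x_r_prev + dx_r * tau_s
    return x_r, dx_r, ddx_r
```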
the final purpose of the scheme is to learn the impedance parameters of the robot in the task execution process in a reinforcement learning mode, enhance the compliance effect of the robot and guarantee the safety of the robot and the environment. After hundreds of rounds of training, the learning curve is converged, and the optimal motion trajectory and environment interaction force change curve of the robot can be obtained through simulation, so that the whole task is completed.
Fig. 2 is a schematic diagram of the robot axis-hole assembly; the axis-hole (peg-in-hole) assembly task of the robot is taken as the task background. The initial position of the end of the robot arm is x0; the desired trajectory goes first from x0 to x1 and then from x1 to x2. When the robot arm deviates from the desired trajectory, it is subjected to the environment interaction force generated by collision with the wall; by adjusting the impedance parameter the robot arm is made compliant and the environment interaction force is reduced.
FIG. 4 is a schematic diagram of the constrained Markov decision process framework; the constrained Markov decision process algorithm is combined with the above learning scheme to implement safety reinforcement learning.
Compared with the general Markov decision process, the invention additionally introduces a loss function c in the constrained Markov decision process and sets a constraint threshold d. Let JC(π) be the long-term loss of the policy π; the set of feasible solutions is then defined as ΠC = {π ∈ Π : JC(π) ≤ d}. Then, under the condition of satisfying the constraint, the policy that maximizes the long-term return is sought: π* = argmax_{π∈ΠC} J(π). According to the specific task, the loss function is designed as c = w*||(Fe - Fd)²||, where w is a parameter for adjusting the weight. The purpose of designing this loss function is to constrain the environment interaction force within a safe range, and then realize the learning of the impedance parameters within that safe range.
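A minimal sketch of this constraint bookkeeping is given below; the discount factor, the rollout-based estimate of JC(π) and the names are assumptions for illustration (how the constraint is enforced during training, e.g. via a Lagrange multiplier, is not spelled out here):

```python
import numpy as np

def step_loss(F_e, F_d, w):
    """Per-step loss c = w * ||(Fe - Fd)^2|| penalising unsafe interaction force."""
    return w * np.linalg.norm((np.asarray(F_e) - np.asarray(F_d)) ** 2)

def satisfies_constraint(losses, d, gamma=0.99):
    """Check JC(pi) <= d for one rollout, with JC estimated as a discounted sum
    of the per-step losses c collected along the trajectory."""
    J_C = sum((gamma ** t) * c for t, c in enumerate(losses))
    return J_C <= d
```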
In conclusion, the TD3 algorithm is applied to the impedance learning of the Panda robot for the first time, verifying the feasibility of impedance learning, and hence variable admittance control, of the Panda robot based on reinforcement learning; the invention also applies the safety reinforcement learning idea to impedance learning, combining the CMDP with the TD3 algorithm and applying the combination to the impedance learning task in the robot axis-hole assembly process, thereby ensuring the safety of the axis-hole assembly task.
Compared with the prior art, the invention has the following advantages:
Firstly, impedance learning based on the TD3 algorithm is applied to the Panda robot simulation platform for the first time, verifying the feasibility of impedance learning, and hence variable admittance control, of the Panda robot based on reinforcement learning. Secondly, the deep reinforcement learning (TD3) algorithm has higher performance and better stability, learns the optimal impedance parameters faster, and is more suitable for the impedance learning task in the robot axis-hole assembly process. Thirdly, the safety reinforcement learning idea is introduced and combined with the CMDP, which provides a safety guarantee and makes the robot axis-hole assembly process safer.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A robot impedance learning method based on safety reinforcement learning is characterized by comprising the following steps:
the controller outputs control torque, and position information and speed information of a Cartesian space of the robot are calculated according to a robot dynamic equation;
constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
determining a decision action according to the input item, and further determining an impedance parameter;
taking the position information and the speed information of the mechanical arm end effector as the input of an environment module, and calculating to obtain environment interaction force;
taking the impedance parameter and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of a controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
2. The robot impedance learning method based on safety reinforcement learning of claim 1, wherein the position information and the speed information of the cartesian space of the robot are calculated according to a robot dynamics equation, specifically:
calculating the position information and the speed information through a robot dynamics equation;
wherein the robot dynamics equation has the expression:
M(q)q̈ + C(q,q̇)q̇ + G(q) + F(q̇) = τ - τe,
wherein τ is the joint moment; M(q) is the joint-space inertia matrix; C(q,q̇) is the Coriolis and centripetal force coupling matrix; G(q) is the gravity term; F(q̇) is the non-rigid-body dynamics term such as friction; τe is the moment applied to the environment in joint space; q and q̇ are the robot joint-space position and speed information;
the position information is Cartesian-space position information, and the calculation formula of the Cartesian-space position information is as follows:
x=ψ(q);
the calculation formula of the speed information is as follows:
ẋ=J(q)q̇;
wherein x represents the position information; ẋ represents the speed information; ψ(·) is the robot forward kinematics mapping; J(q) is the Jacobian matrix.
3. The robot impedance learning method based on safety reinforcement learning of claim 1, wherein the determining a decision action according to the input item and further determining an impedance parameter comprises:
processing the state information of the input item;
evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action;
taking the decision action as an impedance parameter.
4. The robot impedance learning method based on safety reinforcement learning of claim 1, wherein the calculation formula of the environment interaction force is as follows:
Fe = Md(ẍ - ẍd) + Cd(ẋ - ẋd) + Kd(x - xd) + Fd,
wherein Fe represents the environment interaction force; Cd, Kd and Md are respectively the damping, stiffness and inertia diagonal matrices of the robot admittance model; Fd is the desired interaction force; x, ẋ and ẍ are respectively the Cartesian-space position, velocity and acceleration of the robot; xd, ẋd and ẍd are respectively the desired position, desired velocity and desired acceleration of the robot in Cartesian space.
5. The robot impedance learning method based on safety reinforcement learning of claim 4, wherein the expression of the admittance model is as follows:
Md(ẍr - ẍd) + Cd(ẋr - ẋd) + Kd(xr - xd) = Fe - Fd,
wherein xr, ẋr and ẍr are respectively the reference position, reference velocity and reference acceleration in Cartesian space to which the robot is constrained after being subjected to the external force.
6. The robot impedance learning method based on safety reinforcement learning of claim 3, wherein the evaluating and optimizing the processed input items through a Critic network and an Actor network to obtain a decision action comprises:
respectively inputting the state information of the robot into a first Actor network and a second Actor network to respectively obtain a first processing result and a second processing result; the state information of the robot comprises position information and speed information of a current mechanical arm end effector;
inputting the first processing result into a first group of Critic networks to obtain a third processing result;
inputting the second processing result into a second group of Critic networks to obtain a fourth processing result;
and adjusting the third processing result according to the fourth processing result to obtain a final decision action.
7. The robot impedance learning method based on safety reinforcement learning as claimed in claim 6, further comprising the following steps: and safety reinforcement learning is carried out by combining a constraint Markov decision process algorithm, and the steps specifically comprise:
introducing a loss function in a constraint Markov decision process, and configuring a constraint threshold of the loss function;
defining a set of feasible solutions according to the constraint threshold;
searching an optimal strategy according to the set of feasible solutions;
optimizing and adjusting the loss function according to the actual task;
and carrying out safety reinforcement learning according to the adjusted loss function.
8. A robot impedance learning device based on safety reinforcement learning is characterized by comprising:
the first module is used for outputting a control torque by the controller and calculating the position information and the speed information of the Cartesian space of the robot according to a robot dynamic equation;
the second module is used for constructing an input item according to the position information, the speed information and the return information of the learning algorithm;
a third module, configured to determine a decision action according to the input item, and further determine an impedance parameter;
the fourth module is used for taking the position information and the speed information of the mechanical arm end effector as the input of the environment module and calculating to obtain the environment interaction force;
the fifth module is used for taking the impedance parameters and the environmental interaction force as the input of an admittance model, and determining a reference position and a reference speed as the target input of the controller according to the admittance model;
wherein the target input is for causing the controller to control the robot movement.
9. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method according to any one of claims 1 to 7.
CN202210055753.7A 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning Active CN114378820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210055753.7A CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Publications (2)

Publication Number Publication Date
CN114378820A true CN114378820A (en) 2022-04-22
CN114378820B CN114378820B (en) 2023-06-06

Family

ID=81203767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210055753.7A Active CN114378820B (en) 2022-01-18 2022-01-18 Robot impedance learning method based on safety reinforcement learning

Country Status (1)

Country Link
CN (1) CN114378820B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387A (en) * 2022-09-22 2022-12-02 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153A (en) * 2017-12-19 2018-06-12 哈尔滨工程大学 A kind of study impedance control system and control method
US20200101603A1 (en) * 2018-10-02 2020-04-02 Fanuc Corporation Controller and control system
CN112757344A (en) * 2021-01-20 2021-05-07 清华大学 Robot interference shaft hole assembling method and device based on force position state mapping model
CN112847235A (en) * 2020-12-25 2021-05-28 山东大学 Robot step force guiding assembly method and system based on deep reinforcement learning
CN113341706A (en) * 2021-05-06 2021-09-03 东华大学 Man-machine cooperation assembly line system based on deep reinforcement learning
CN113352322A (en) * 2021-05-19 2021-09-07 浙江工业大学 Adaptive man-machine cooperation control method based on optimal admittance parameters
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108153153A (en) * 2017-12-19 2018-06-12 哈尔滨工程大学 A kind of study impedance control system and control method
US20200101603A1 (en) * 2018-10-02 2020-04-02 Fanuc Corporation Controller and control system
CN112847235A (en) * 2020-12-25 2021-05-28 山东大学 Robot step force guiding assembly method and system based on deep reinforcement learning
CN112757344A (en) * 2021-01-20 2021-05-07 清华大学 Robot interference shaft hole assembling method and device based on force position state mapping model
CN113341706A (en) * 2021-05-06 2021-09-03 东华大学 Man-machine cooperation assembly line system based on deep reinforcement learning
CN113352322A (en) * 2021-05-19 2021-09-07 浙江工业大学 Adaptive man-machine cooperation control method based on optimal admittance parameters
CN113510704A (en) * 2021-06-25 2021-10-19 青岛博晟优控智能科技有限公司 Industrial mechanical arm motion planning method based on reinforcement learning algorithm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIXU DENG: "Sparse Online Gaussian Process Impedance Learning for Multi-DoF Robotic Arms", INTERNATIONAL CONFERENCE ON ADVANCED ROBOTICS AND MECHATRONICS *
DU, ZHIJIANG; WANG, WEI; YAN, ZHIYUAN; DONG, WEI; WANG, WEIDONG: "Human-machine interaction method for a minimally invasive surgical robot arm based on fuzzy reinforcement learning", ROBOT

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421387A (en) * 2022-09-22 2022-12-02 中国科学院自动化研究所 Variable impedance control system and control method based on inverse reinforcement learning

Also Published As

Publication number Publication date
CN114378820B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
US11584008B1 (en) Simulation-real world feedback loop for learning robotic control policies
US11707838B1 (en) Artificial intelligence system for efficiently learning robotic control policies
Wang et al. A survey of learning‐based robot motion planning
JP2009288934A (en) Data processing apparatus, data processing method, and computer program
Kilinc et al. Reinforcement learning for robotic manipulation using simulated locomotion demonstrations
US9747543B1 (en) System and method for controller adaptation
CN115812180A (en) Robot-controlled offline learning using reward prediction model
CN110716575A (en) UUV real-time collision avoidance planning method based on deep double-Q network reinforcement learning
CN114521262A (en) Controlling an agent using a causal correct environment model
Dai et al. Robust control of underwater vehicle‐manipulator system using grey wolf optimizer‐based nonlinear disturbance observer and H‐infinity controller
CN114378820A (en) Robot impedance learning method based on safety reinforcement learning
Ranjbar et al. Residual feedback learning for contact-rich manipulation tasks with uncertainty
CN114779792B (en) Medical robot autonomous obstacle avoidance method and system based on simulation and reinforcement learning
Li et al. Research on the agricultural machinery path tracking method based on deep reinforcement learning
CN114967472A (en) Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method
Toner et al. Opportunities and challenges in applying reinforcement learning to robotic manipulation: An industrial case study
Solovyeva et al. Controlling system based on neural networks with reinforcement learning for robotic manipulator
Fernandez et al. Deep reinforcement learning with linear quadratic regulator regions
Yu et al. Deep Q‐Network with Predictive State Models in Partially Observable Domains
dos Santos et al. A stochastic learning approach for construction of brick structures with a ground robot
D’Silva et al. Learning dynamic obstacle avoidance for a robot arm using neuroevolution
Ali et al. Tree-select Trial and Error Algorithm for Adaptation to Failures of Redundant Manipulators
Huang et al. Trade-off on Sim2Real Learning: Real-world Learning Faster than Simulations
Gillen Improving Reinforcement Learning for Robotics with Control and Dynamical Systems Theory
Sun et al. Introduction to Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant