CN115674204A - Robot shaft hole assembling method based on deep reinforcement learning and admittance control - Google Patents

Robot shaft hole assembling method based on deep reinforcement learning and admittance control

Info

Publication number
CN115674204A
CN115674204A (Application CN202211369853.3A)
Authority
CN
China
Prior art keywords
axis
robot
reinforcement learning
deep reinforcement
hole
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211369853.3A
Other languages
Chinese (zh)
Inventor
李明富
邓旭康
谭雅斌
姜瑞森
张黎明
刘振宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN202211369853.3A priority Critical patent/CN115674204A/en
Publication of CN115674204A publication Critical patent/CN115674204A/en
Pending legal-status Critical Current

Landscapes

  • Manipulator (AREA)

Abstract

The invention provides a robot shaft hole assembling method based on deep reinforcement learning and admittance control, belonging to the field of intelligent manufacturing; aiming at the single-shaft hole-searching stage and the multi-shaft hole-searching stage, a deep reinforcement learning network I and a deep reinforcement learning network II are respectively designed and trained to output hole-searching actions, and hole-search prior knowledge is added in the process, so that the exploration and training efficiency of the deep reinforcement learning networks is improved; aiming at the single-shaft hole-insertion stage and the multi-shaft hole-insertion stage, a variable admittance control method based on a deep reinforcement learning network III is designed to realize compliant insertion by the robot, and insertion prior knowledge is added: if the state is a prior state, the robot posture adjustment is output directly, and if it is not a prior state, the state is input to the deep reinforcement learning network III for training, so that the insertion efficiency of the robot is improved; by adopting the technical scheme of the invention, the robot can be controlled to complete the assembly of a single shaft hole and multiple shaft holes by training the deep reinforcement learning networks and outputting the adjustment strategy.

Description

Robot shaft hole assembling method based on deep reinforcement learning and admittance control
Technical Field
The invention relates to the field of intelligent manufacturing, in particular to a robot assembly method based on deep reinforcement learning and admittance control.
Background
With the development of intelligent manufacturing technology, flexible robot assembly has always been a research focus and hot spot in the field of industrial robots, but high-performance assembly still faces many difficulties, and research on robot compliant shaft-hole assembly methods is of great significance for extending robots across the whole assembly field. In the field of robot compliant shaft-hole assembly, there are three main types of methods: assembly methods based on passive compliance control, compliant assembly methods based on traditional control algorithms, and robot compliant assembly methods based on learning algorithms. Both the assembly methods based on passive compliance control and the compliant assembly methods based on traditional control algorithms require complex analysis or dynamic modeling of the assembly contact model, and the dynamic models of some parts are often complex, which brings great difficulty to the design of the assembly algorithm; robot compliant assembly methods based on learning algorithms usually need a large amount of training data, the learning efficiency is low, and the assembly robustness is not high. In addition, for multi-shaft hole assembly, traditional control algorithms in the hole-searching and hole-insertion stages of the assembly process adapt poorly to different assembly environments and have low efficiency; to address these shortcomings, the robot assembly method based on deep reinforcement learning and admittance control is provided.
Disclosure of Invention
In order to solve the above problems, the present invention aims to disclose a robot assembly method based on deep reinforcement learning and admittance control, and in particular a method for realizing compliant robot assembly by exploring and learning the hole-searching and hole-insertion adjustment strategy actions in the shaft-hole assembly process based on a deep reinforcement learning algorithm.
In order to achieve the above purpose, the technical scheme adopted by the invention mainly comprises the following processes:
Step 1, dividing robot shaft-hole assembly into three stages: a free hole-approaching stage outside the hole, a hole-searching stage at the hole edge and a hole-insertion stage inside the hole; meanwhile, distinguishing the methods for single-shaft hole assembly and multi-shaft hole assembly of the robot;
step 2, roughly positioning the hole in a visual positioning mode, controlling the robot to move, enabling the tail end workpiece of the robot to reach the position of the hole edge and contact the hole, and enabling the robot to enter a hole searching stage of the assembled hole edge;
in the visual positioning method, a robot world coordinate system, a tool coordinate system and a workpiece coordinate system are constructed, wherein the origin of the world coordinate system is the origin of a robot base coordinate system; secondly, placing markers around the assembly holes, using a depth camera to obtain RGB images and depth information of the markers, and realizing coarse positioning of the assembly holes by calibrating the camera and the depth images, and combining a mechanical arm to perform operations such as hand-eye calibration, marker identification and the like; coordinates obtained by rough positioning are the position and the posture of the assembly hole in a robot world coordinate system;
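As a concrete illustration of this coarse-positioning step, the following sketch shows how a marker pose detected by the depth camera could be chained with a hand-eye calibration result to obtain the hole pose in the robot base (world) coordinate system; it assumes OpenCV ArUco markers, and all frame names, transforms and the marker size are illustrative rather than taken from the patent.

```python
# Hedged sketch of marker-based coarse positioning of the assembly hole.
# Assumes an ArUco marker placed near the hole and a known hand-eye transform.
import cv2
import numpy as np

def locate_hole_in_base_frame(rgb, camera_matrix, dist_coeffs,
                              T_base_cam, T_marker_hole, marker_len=0.04):
    """Return a 4x4 homogeneous pose of the assembly hole in the robot base frame."""
    gray = cv2.cvtColor(rgb, cv2.COLOR_BGR2GRAY)
    aruco_dict = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, aruco_dict)
    if ids is None:
        raise RuntimeError("marker not found")
    rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
        corners, marker_len, camera_matrix, dist_coeffs)
    R, _ = cv2.Rodrigues(rvecs[0])            # marker orientation in the camera frame
    T_cam_marker = np.eye(4)
    T_cam_marker[:3, :3] = R
    T_cam_marker[:3, 3] = tvecs[0].ravel()
    # base <- camera <- marker <- hole (hand-eye and marker-to-hole offsets assumed known)
    return T_base_cam @ T_cam_marker @ T_marker_hole
```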
preferably, the control of the robot movement means controlling the robot movement by a program;
Step 3, in the hole-edge hole-searching stage, firstly designing hole-search prior knowledge based on human hole-searching experience and presetting prior states and prior actions to form a prior knowledge memory bank; secondly, collecting force/moment information and pose information of the robot as input of a deep reinforcement learning network, judging through a prior state classification model whether the input corresponds to the prior knowledge, and if so, directly outputting the corresponding action, otherwise inputting it to the deep reinforcement learning network for exploration and training; outputting the position or posture adjustment strategy action of the robot, so that the workpiece at the tail end of the robot can smoothly search the hole; respectively constructing a deep reinforcement learning network I for single-shaft hole searching and a deep reinforcement learning network II for multi-shaft hole assembly;
The prior actions of the hole-search prior knowledge based on human hole-searching experience comprise actions in four directions A = [+ΔX, −ΔX, +ΔY, −ΔY], wherein ΔX and ΔY are respectively the step displacements of the workpiece at the tail end of the robot along the X-axis and Y-axis directions in the Cartesian world coordinate system; after classification by the prior state classification model, one of the four prior actions is output and the position of the tail end of the robot is adjusted, so that the robot can finally search the hole smoothly;
The input of the prior state classification model of the hole-search prior knowledge based on human hole-searching experience is S = [Px, Py, Rx, Ry, Rz], wherein Px, Py are the positions of the robot tail end workpiece along the X-axis and Y-axis directions in the Cartesian world coordinate system, and Rx, Ry, Rz are the rotation angles of the robot tail end workpiece about the X-axis, Y-axis and Z-axis directions in the world coordinate system; the meaning of adding the hole-search prior knowledge based on human hole-searching experience is that when the motion to be adjusted for the robot tail end workpiece is a translation along the X-axis and Y-axis directions on the hole plane, the action does not need to be explored and learned through the deep reinforcement learning network, but is adjusted according to actions preset from human experience, which accelerates the exploration efficiency of deep reinforcement learning in the hole-searching process;
preferably, the prior state classification model adopts a Support Vector Machine (SVM) model;
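A minimal sketch of such an SVM-based prior-state classifier is given below, assuming scikit-learn; the training data files, the label convention and the prior-action table are assumptions used only for illustration.

```python
# Illustrative sketch of the SVM prior-state classifier: it decides whether the
# current state matches a stored prior state, in which case the preset prior
# action is executed instead of querying the DQN.
import numpy as np
from sklearn.svm import SVC

# state S = [Px, Py, Rx, Ry, Rz]; label = index of a prior action, -1 = "not a prior state"
prior_states = np.load("prior_states.npy")      # shape (N, 5), collected offline (assumed file)
prior_labels = np.load("prior_labels.npy")      # shape (N,)

clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(prior_states, prior_labels)

PRIOR_ACTIONS = {0: (+0.10, 0.0), 1: (-0.10, 0.0),   # step along +/-X (units assumed)
                 2: (0.0, +0.10), 3: (0.0, -0.10)}   # step along +/-Y

def select_action(state, dqn_policy):
    label = int(clf.predict(state.reshape(1, -1))[0])
    if label in PRIOR_ACTIONS:          # prior state: output the preset prior action directly
        return PRIOR_ACTIONS[label]
    return dqn_policy(state)            # otherwise: explore/train with the DQN
```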
the force signals are acquired through a robot tail end force sensor and comprise contact force signals F of a robot tail end workpiece in the directions of an X axis, a Y axis and a Z axis of three coordinate axes X ,F Y ,F Z Assembly torque signal T around the end piece of the robot X ,T Y ,T Z
The input of the deep reinforcement learning network I for single-shaft hole searching is S = [Px, Py, Rx, Ry, Rz, Fx, Fy, Tx, Ty], and the input of the deep reinforcement learning network II for multi-shaft hole assembly is S = [Px, Py, Rx, Ry, Rz, Fx, Fy, Tx, Ty, Tz], wherein Px, Py, Rx, Ry, Rz are the position information of the tail end workpiece on the X and Y axes in Cartesian world coordinates and the rotation amounts about the X, Y and Z directions, and Fx, Fy, Tx, Ty, Tz are the force information of the robot tail end workpiece on the X and Y axes and the moment information of the robot tail end workpiece about the X, Y and Z axes; the output of the deep reinforcement learning network I is A = [+ΔRx, −ΔRx, +ΔRy, −ΔRy] and the output of the deep reinforcement learning network II is A = [+ΔRx, −ΔRx, +ΔRy, −ΔRy, +ΔRz, −ΔRz], wherein ΔRx, ΔRy, ΔRz are respectively the step adjustment amounts of the robot tail end workpiece in the positive and negative directions about the X axis, about the Y axis and about the Z axis;
Preferably, the deep reinforcement learning network I and the deep reinforcement learning network II both adopt the DQN (Deep Q Network) structure; the input of the DQN model undergoes nonlinear transformation through three convolution layers and two fully connected layers, the Q value of each action is finally obtained at the output layer, and the optimal action is selected according to the Q value; the update rule of the DQN algorithm is as follows:
y = r + γ max_{a'} Q(s', a'; ω⁻)

Q(s, a) ← Q(s, a) + α [y − Q(s, a)]
wherein s' and a' represent the state and action at the next moment, respectively, r and γ represent the reward function and the discount factor, respectively, α is the learning rate, and ω⁻ denotes the parameters of the target network; the loss function of the DQN algorithm is
L(ω) = E[(r + γ max_{a'} Q(s', a'; ω⁻) − Q(s, a; ω))²]
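The following hedged PyTorch sketch illustrates the DQN target and squared TD-error loss written above; the fully connected layers and their sizes are simplifications and do not reproduce the convolutional structure described in the patent.

```python
# Sketch of a DQN value network and one gradient step on the loss
# L(w) = E[(r + gamma * max_a' Q(s', a'; w-) - Q(s, a; w))^2].
import torch
import torch.nn as nn

class QNet(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions))

    def forward(self, s):
        return self.net(s)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch                      # tensors sampled from a replay buffer
    with torch.no_grad():                              # target y = r + gamma * max_a' Q(s',a'; w-)
        y = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; w) for the taken actions
    loss = nn.functional.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```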
Step 4, designing the reward and punishment function, the number of training rounds and the maximum number of steps per training round for the deep reinforcement learning networks of the hole-edge hole-searching stage; using the deep reinforcement learning network I and the deep reinforcement learning network II to explore and train single-shaft hole searching and multi-shaft hole searching respectively, until the number of steps in each round converges and the reward value also converges;
The reward function is
r = 1 − k_n/k_max, if the hole search succeeds in the current round;
r = −1, if the robot tail end rotates m consecutive times in the same direction
wherein k_n is the number of steps in training round n and k_max is the set maximum number of steps per round; the first category in the reward function is the positive reward, which is determined by the ratio of the number of steps in the current round to the set maximum number of steps, so the more steps are taken to complete the assembly in a round, the lower the positive reward obtained in that round; the second category is the negative reward: when the robot tail end rotates m consecutive times in the same direction, this behavior is regarded as deviating from the hole center and a reward of −1 is given;
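A small sketch of this two-part hole-searching reward, under the assumption that the positive reward is taken as 1 − k_n/k_max, could look as follows.

```python
# Hedged sketch of the hole-searching reward: a step-efficiency bonus on success
# and a -1 penalty when the end rotates m consecutive times in the same direction
# (m = 6 in the embodiment). The exact scaling is an assumption.
def search_reward(step_count, max_steps, same_direction_rotations, m=6,
                  hole_found=False):
    if same_direction_rotations >= m:      # behavior judged to deviate from the hole center
        return -1.0
    if hole_found:                         # fewer steps -> higher positive reward
        return 1.0 - step_count / max_steps
    return 0.0
```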
The exploration and training process comprises: firstly controlling the tail end of the robot to move towards the hole center according to the coarsely positioned hole pose; after the initial movement, collecting the force/moment information and pose information of the workpiece at the tail end of the robot and judging from the parameters in the state whether it is a prior state; if it is a prior state, executing the next action according to the prior action, and if not, inputting the state information into the deep reinforcement learning network for training to obtain the next motion; repeating the above until the hole search succeeds or the number of training steps in the round reaches the set maximum, finishing the training of the round, calculating the reward/punishment value, and then carrying out the training of the next round; when the acquired Z-axis force drops suddenly, the hole search in that round has succeeded; when the optimal action strategy is trained, the reward function and the number of training steps converge, indicating that training is successful, and the workpiece at the tail end of the robot enters the hole-insertion stage;
Step 5, in the hole-insertion stage, setting a constant expected force in the Z-axis direction at the tail end of the mechanical arm with the forces and moments in the other directions set to 0, and moving the tail end of the robot towards the set expected position; acquiring the tail end force/moment information of the robot during the insertion process and comparing it with the set expected force to obtain the real-time force deviation; inputting the force deviation into an admittance controller, calculating through the admittance control algorithm, and outputting the adjusted posture of the robot tail end workpiece; meanwhile, the admittance control parameters are optimized in real time through a deep reinforcement learning algorithm, realizing variable admittance control;
The expected force is Fd = [0, 0, fd, 0, 0, 0], the contact force acquired by the sensor is F = [fx, fy, fz, Tx, Ty, Tz], and the force deviation is ΔF = Fd − F;
The admittance control algorithm is as follows:
M(Ẍd − Ẍ) + B(Ẋd − Ẋ) + K(Xd − X) = ΔF
wherein M, B, K respectively represent the robot mass (inertia) matrix, damping matrix and stiffness matrix, Xd is the expected pose of the robot tail end workpiece in the Cartesian world coordinate system, and X is the real-time pose of the robot tail end workpiece in the Cartesian world coordinate system;
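The admittance law above can be integrated numerically to turn the force deviation ΔF into a pose correction around the desired pose Xd; the following discrete-time sketch with a simple Euler step is an illustration, and the sampling time and matrix values are assumptions.

```python
# Hedged discrete-time sketch of the admittance update: with e = Xd - X,
# the law M*e_dd + B*e_d + K*e = dF is integrated forward in time.
import numpy as np

class AdmittanceController:
    def __init__(self, M, B, K, dt=0.002):
        self.M, self.B, self.K, self.dt = M, B, K, dt   # 6x6 matrices (assumed values)
        self.e = np.zeros(6)                            # pose deviation  e = Xd - X
        self.e_dot = np.zeros(6)

    def step(self, delta_F):
        # e_ddot = M^-1 (dF - B*e_dot - K*e)
        e_ddot = np.linalg.solve(self.M,
                                 delta_F - self.B @ self.e_dot - self.K @ self.e)
        self.e_dot += e_ddot * self.dt
        self.e += self.e_dot * self.dt
        return self.e            # current deviation e = Xd - X; the commanded pose is X = Xd - e
```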
Step 6, designing an insertion prior knowledge memory bank based on human insertion experience, the memory bank comprising prior states and prior experience values; discriminating the input state of the deep reinforcement learning network by constructing a classifier, and if the input state is a prior state, directly outputting the prior experience value and then directly controlling the motion of the robot tail end workpiece; if the input state is not a prior state, sending it into the deep reinforcement learning network for training and then controlling the motion of the robot tail end through the admittance controller; defining the admittance control parameters to be trained, simulating the admittance control algorithm, selecting different admittance control parameters for simulation, determining the admittance control parameters to be trained, and predetermining the variation range of the admittance control parameters;
The input of the insertion prior state based on human insertion experience is the pose of the robot tail end workpiece in the Cartesian world coordinate system together with the force/moment acting on it, S = [Px, Py, Pz, Rx, Ry, Rz, Fx, Fy, Fz, Tx, Ty, Tz], wherein Px, Py, Pz are respectively the positions of the robot tail end workpiece on the X, Y and Z axes in the Cartesian world coordinate system, Rx, Ry, Rz are respectively the rotation amounts of the robot tail end workpiece about the X, Y and Z axes in the Cartesian world coordinate system, and Fx, Fy, Fz and Tx, Ty, Tz are respectively the forces and moments at the robot tail end along the X-axis, Y-axis and Z-axis directions; the prior experience value of the insertion based on human insertion experience is A = [+ΔRx, −ΔRx, +ΔRy, −ΔRy], wherein ΔRx and ΔRy respectively represent the positive and negative rotation amounts of the robot tail end workpiece about the X axis and about the Y axis;
preferably, the prior state classification model adopts a Support Vector Machine (SVM) model;
The admittance control parameters comprise the inertia matrix, the damping matrix and the stiffness matrix, wherein the inertia matrix is a unit positive definite matrix; the training mode is to select one coefficient component of the damping matrix and one coefficient component of the stiffness matrix for training, and then obtain the other coefficients in the damping matrix and the stiffness matrix according to the proportional relation;
Step 7, constructing a deep reinforcement learning network III for variable admittance control that can output continuous actions, collecting the force/moment signals and pose signals during the insertion process as the state input of the deep reinforcement learning, the output of the deep reinforcement learning network being the admittance control parameters; designing the reward and punishment function of the deep reinforcement learning network III, the number of training rounds and the maximum number of steps per training round;
The input and output of the deep reinforcement learning network III are as follows: for the single-shaft hole-insertion stage, the input of the deep reinforcement learning network III is set to S = [Px, Py, Pz, Rx, Ry, Fx, Fy, Fz, Tx, Ty], and for the multi-shaft hole-insertion stage the input is S = [Px, Py, Pz, Rx, Ry, Rz, Fx, Fy, Fz, Tx, Ty, Tz]; wherein Px, Py, Pz are respectively the positions of the robot tail end workpiece on the X, Y and Z axes in the Cartesian world coordinate system, Rx, Ry, Rz are respectively the rotation values of the robot tail end workpiece about the X, Y and Z axes in the Cartesian world coordinate system, and Fx, Fy, Fz and Tx, Ty, Tz are respectively the forces and moments at the robot tail end along the X-axis, Y-axis and Z-axis directions; after training through the reinforcement learning network III, continuous stiffness coefficients and damping coefficients are output, the stiffness coefficient and damping coefficient taking values within certain intervals;
Preferably, the deep reinforcement learning network III adopts the DDPG (Deep Deterministic Policy Gradient) network structure, which is divided into a critic module and an actor module and comprises four neural networks; the critic module updates the network parameter ω by means of the temporal-difference error and periodically copies ω to the target network; the actor module updates the network parameter θ by means of the DPG algorithm, and the action strategy selects an action to act on the environment according to the output of the policy network;
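A hedged PyTorch sketch of this critic/actor update pattern is shown below; the soft target update with rate tau is an assumption (the patent only states that ω is copied to the target network periodically), and the critic interface taking (state, action) pairs is illustrative.

```python
# Sketch of one DDPG update: critic trained on a TD target from the target
# networks, actor trained with the deterministic policy gradient.
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, actor_t, critic_t, opt_a, opt_c,
                batch, gamma=0.99, tau=0.005):
    s, a, r, s_next, done = batch
    with torch.no_grad():                      # TD target y from the target networks
        y = r + gamma * (1 - done) * critic_t(s_next, actor_t(s_next)).squeeze(1)
    critic_loss = F.mse_loss(critic(s, a).squeeze(1), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(s, actor(s)).mean()   # deterministic policy gradient on theta
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    # soft update of the target networks (assumed; a periodic hard copy also works)
    for t_p, p in zip(list(critic_t.parameters()) + list(actor_t.parameters()),
                      list(critic.parameters()) + list(actor.parameters())):
        t_p.data.mul_(1 - tau).add_(tau * p.data)
```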
In the above, the reward function is
r = 1 − k_n/k_max, if the assembly of the round is completed;
r = ΔH/H, if the insertion continues within the reasonable threshold;
r = −1, if the contact force exceeds the set threshold during insertion
wherein k_n is the number of steps at the end of the nth round, k_max is the set maximum number of training steps, ΔH is the insertion depth at the end of each round, and H is the set depth at which assembly is completed; the whole reward function is divided into three parts: the first part gives a positive reward according to the ratio of the number of steps in each assembly round to the set maximum number of steps, and the more steps are used when the assembly round is completed, the smaller the positive reward; the second part is determined by the insertion depth of each round, and within a reasonable threshold, the deeper the insertion, the higher the positive reward; finally, a reward of −1 is given when the force exceeds the set threshold during the insertion process;
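Under the assumption that the three parts combine as the piecewise reward written above, a minimal sketch of the insertion reward could be:

```python
# Hedged sketch of the three-part insertion reward: step-efficiency bonus on
# completion, depth-proportional shaping, and a -1 penalty on excessive force.
# The exact weighting is an assumption.
def insertion_reward(step_count, max_steps, delta_h, target_depth,
                     force_exceeded, assembled):
    if force_exceeded:                       # contact force above the set threshold
        return -1.0
    if assembled:                            # fewer steps -> larger positive reward
        return 1.0 - step_count / max_steps
    return delta_h / target_depth            # deeper insertion -> higher reward
```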
Step 8, in the insertion stage, when the tail end of the robot reaches a certain insertion depth and each axial force/moment is within a certain threshold range, the insertion is considered successful.
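As an illustration of this success criterion, a simple check could be written as follows; the threshold names are placeholders.

```python
# Hedged sketch of the insertion-success check of step 8.
def insertion_succeeded(depth, forces, torques, depth_goal, f_max, t_max):
    return (depth >= depth_goal
            and all(abs(f) <= f_max for f in forces)     # axial forces within threshold
            and all(abs(t) <= t_max for t in torques))   # axial moments within threshold
```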
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clear, the following detailed description of the embodiments of the present invention:
Step 1, dividing robot shaft-hole assembly into three stages: a free hole-approaching stage outside the hole, a hole-searching stage at the hole edge and a hole-insertion stage inside the hole; meanwhile, distinguishing the methods for single-shaft hole assembly and multi-shaft hole assembly of the robot, and carrying out experiments on both;
step 2, roughly positioning the hole in a visual positioning mode, controlling the robot to move, enabling the tail end workpiece of the robot to reach the position of the hole edge and contact the hole, and enabling the robot to enter a hole searching stage of the assembled hole edge;
in the visual positioning method, a robot world coordinate system, a tool coordinate system and a workpiece coordinate system are constructed, wherein the origin of the world coordinate system is the origin of a robot base coordinate system; secondly, placing markers around the assembly holes, using a depth camera to obtain RGB images and depth information of the markers, and realizing coarse positioning of the assembly holes by calibrating the camera and the depth images, and combining a mechanical arm to perform operations such as hand-eye calibration, marker identification and the like; coordinates obtained by rough positioning are the positions and postures of the assembly holes in a robot world coordinate system;
The control of the robot movement means that the robot movement is controlled by a program, and the movement of the workpiece at the tail end of the robot is realized by using MoveIt trajectory planning in ROS;
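A minimal sketch of commanding this approach motion with the MoveIt Python interface is given below; the planning group name and the target pose values are placeholders, with the actual target coming from the visual coarse positioning of step 2.

```python
# Hedged sketch of a MoveIt approach motion in ROS.
import sys
import rospy
import moveit_commander
import geometry_msgs.msg

moveit_commander.roscpp_initialize(sys.argv)
rospy.init_node("axis_hole_approach")
group = moveit_commander.MoveGroupCommander("manipulator")   # group name is an assumption

target = geometry_msgs.msg.Pose()
target.position.x, target.position.y, target.position.z = 0.45, 0.10, 0.20  # placeholder pose
target.orientation.w = 1.0             # in practice, the hole pose from visual coarse positioning
group.set_pose_target(target)
group.go(wait=True)                    # plan and execute the approach trajectory
group.stop()
group.clear_pose_targets()
```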
Step 3, in the hole-edge hole-searching stage, firstly designing hole-search prior knowledge based on human hole-searching experience and presetting prior states and prior actions to form a prior knowledge memory bank; secondly, collecting force/moment information and pose information of the robot as input of a deep reinforcement learning network, judging through a prior state classification model whether the input corresponds to the prior knowledge, and if so, directly outputting the corresponding action, otherwise inputting it to the deep reinforcement learning network for exploration and training; outputting the position or posture adjustment strategy action of the robot, so that the workpiece at the tail end of the robot can smoothly search the hole; respectively constructing a deep reinforcement learning network I for single-shaft hole searching and a deep reinforcement learning network II for multi-shaft hole assembly;
The prior actions of the hole-search prior knowledge based on human hole-searching experience comprise actions in four directions A = [+0.10, −0.10, +0.10, −0.10], which respectively represent the step displacements of the workpiece at the tail end of the robot along the X-axis and Y-axis directions in the Cartesian world coordinate system; after classification by the prior state classification model, one of the four prior actions is output and the position of the tail end of the robot is adjusted, so that the robot can finally search the hole smoothly;
The input of the prior state classification model of the hole-search prior knowledge based on human hole-searching experience is S = [Px, Py, Rx, Ry, Rz], wherein Px, Py are the positions of the robot tail end workpiece along the X-axis and Y-axis directions in the Cartesian world coordinate system, and Rx, Ry, Rz are the rotation angles of the robot tail end workpiece about the X-axis, Y-axis and Z-axis directions in the world coordinate system; the meaning of adding the hole-search prior knowledge based on human hole-searching experience is that when the motion to be adjusted for the robot tail end workpiece is a translation along the X-axis and Y-axis directions on the hole plane, the action does not need to be explored and learned through the deep reinforcement learning network, but is adjusted according to actions preset from human experience, which accelerates the exploration efficiency of deep reinforcement learning in the hole-searching process;
the prior state classification model adopts a Support Vector Machine (SVM) model;
The force signals are acquired through a force sensor at the tail end of the robot and comprise the contact force signals Fx, Fy, Fz of the robot tail end workpiece along the three coordinate axes X, Y and Z, and the assembly torque signals Tx, Ty, Tz acting on the robot tail end workpiece;
The input of the deep reinforcement learning network I for single-shaft hole searching is S = [Px, Py, Rx, Ry, Rz, Fx, Fy, Tx, Ty], and the input of the deep reinforcement learning network II for multi-shaft hole assembly is S = [Px, Py, Rx, Ry, Rz, Fx, Fy, Tx, Ty, Tz], wherein Px, Py, Rx, Ry, Rz are the position information of the tail end workpiece on the X and Y axes in Cartesian world coordinates and the rotation amounts about the X, Y and Z directions, and Fx, Fy, Tx, Ty, Tz are the force information of the robot tail end workpiece on the X and Y axes and the moment information of the robot tail end workpiece about the X, Y and Z axes; the output of the deep reinforcement learning network I is A = [+0.05, −0.05, +0.05, −0.05] and the output of the deep reinforcement learning network II is A = [+0.05, −0.05, +0.05, −0.05], representing the step adjustment amounts of the robot tail end workpiece in the positive and negative directions about the X axis, about the Y axis and about the Z axis respectively;
The deep reinforcement learning network I and the deep reinforcement learning network II both adopt the DQN (Deep Q Network) structure; the input of the DQN model undergoes nonlinear transformation through three convolution layers and two fully connected layers, the Q value of each action is finally obtained at the output layer, and the optimal action is selected according to the Q value; the update rule of the DQN algorithm is as follows:
y = r + γ max_{a'} Q(s', a'; ω⁻)

Q(s, a) ← Q(s, a) + α [y − Q(s, a)]
wherein s' and a' represent the state and action at the next moment, respectively, r and γ represent the reward function and the discount factor, respectively, α is the learning rate, and ω⁻ denotes the parameters of the target network; the loss function of the DQN algorithm is
L(ω) = E[(r + γ max_{a'} Q(s', a'; ω⁻) − Q(s, a; ω))²]
Step 4, designing the reward and punishment function, the number of training rounds and the maximum number of steps per training round for the deep reinforcement learning networks of the hole-edge hole-searching stage; using the deep reinforcement learning network I and the deep reinforcement learning network II to explore and train single-shaft hole searching and multi-shaft hole searching respectively, until the number of steps in each round converges and the reward value also converges;
The reward function is
r = 1 − k_n/k_max, if the hole search succeeds in the current round;
r = −1, if the robot tail end rotates 6 consecutive times in the same direction
wherein k_n is the number of steps in training round n, the maximum number of steps per round is set to k_max = 40 and the number of rounds is Episode = 100; if the training steps and the reward function have not converged after this number of rounds, the training is judged to have failed; the first category in the reward function is the positive reward, which is determined by the ratio of the number of steps in the current round to the set maximum number of steps, so the more steps are taken to complete the assembly in a round, the lower the positive reward obtained in that round; the second category is the negative reward: when the robot tail end rotates 6 consecutive times in the same direction, this behavior is regarded as deviating from the hole center and a reward of −1 is given;
The exploration and training process comprises: firstly controlling the tail end of the robot to move towards the hole center according to the coarsely positioned hole pose; after the initial movement, collecting the force/moment information and pose information of the workpiece at the tail end of the robot and judging from the parameters in the state whether it is a prior state; if it is a prior state, executing the next action according to the prior action, and if not, inputting the state information into the deep reinforcement learning network for training to obtain the next motion; repeating the above until the hole search succeeds or the number of training steps in the round reaches the set maximum, finishing the training of the round, calculating the reward/punishment value, and then carrying out the training of the next round; when the acquired Z-axis force drops suddenly, the hole search in that round has succeeded; when the optimal action strategy is trained, the reward function and the number of training steps converge, indicating that training is successful, and the workpiece at the tail end of the robot enters the hole-insertion stage;
Step 5, in the hole-insertion stage, setting a constant expected force in the Z-axis direction at the tail end of the mechanical arm with the forces and moments in the other directions set to 0, and moving the tail end of the robot to the set expected position; acquiring the tail end force/moment information of the robot during the insertion process and comparing it with the set expected force to obtain the real-time force deviation; inputting the force deviation into an admittance controller, calculating through the admittance control algorithm, and outputting the adjusted posture of the robot tail end workpiece; meanwhile, the admittance control parameters are optimized in real time through a deep reinforcement learning algorithm, realizing variable admittance control;
The expected force is Fd = [0, 0, fd, 0, 0, 0], the contact force acquired by the sensor is F = [fx, fy, fz, Tx, Ty, Tz], and the force deviation is ΔF = Fd − F;
The admittance control algorithm is as follows:
M(Ẍd − Ẍ) + B(Ẋd − Ẋ) + K(Xd − X) = ΔF
wherein M, B, K respectively represent the robot mass (inertia) matrix, damping matrix and stiffness matrix, Xd is the expected pose of the robot tail end workpiece in the Cartesian world coordinate system, and X is the real-time pose of the robot tail end workpiece in the Cartesian world coordinate system;
Step 6, designing an insertion prior knowledge memory bank based on human insertion experience, the memory bank comprising prior states and prior experience values; discriminating the input state of the deep reinforcement learning network by constructing a classifier, and if the input state is a prior state, directly outputting the prior experience value and then directly controlling the motion of the robot tail end workpiece; if the input state is not a prior state, sending it into the deep reinforcement learning network for training and then controlling the motion of the robot tail end through the admittance controller; defining the admittance control parameters to be trained, simulating the admittance control algorithm, selecting different admittance control parameters for simulation, determining the admittance control parameters to be trained, and predetermining the variation range of the admittance control parameters;
The input of the insertion prior state based on human insertion experience is the pose of the robot tail end workpiece in the Cartesian world coordinate system together with the force/moment acting on it, S = [Px, Py, Pz, Rx, Ry, Rz, Fx, Fy, Fz, Tx, Ty, Tz], wherein Px, Py, Pz are respectively the positions of the robot tail end workpiece on the X, Y and Z axes in the Cartesian world coordinate system, Rx, Ry, Rz are respectively the rotation amounts of the robot tail end workpiece about the X, Y and Z axes in the Cartesian world coordinate system, and Fx, Fy, Fz and Tx, Ty, Tz are respectively the forces and moments at the robot tail end along the X-axis, Y-axis and Z-axis directions; the prior experience value of the insertion based on human insertion experience is A = [+0.05, −0.05, +0.05, −0.05], wherein the first two and the last two parameters respectively represent the positive and negative rotation amounts of the robot tail end workpiece about the X axis and about the Y axis in the Cartesian coordinate system;
the prior state classification model adopts a Support Vector Machine (SVM) model;
The admittance control parameters comprise the inertia matrix, the damping matrix and the stiffness matrix, wherein the inertia matrix is a unit positive definite matrix; the training mode is to select one coefficient component of the damping matrix and one coefficient component of the stiffness matrix for training, and then obtain the other coefficients in the damping matrix and the stiffness matrix according to the proportional relation;
Step 7, constructing a deep reinforcement learning network III for variable admittance control that can output continuous actions, collecting the force/moment signals and pose signals during the insertion process as the state input of the deep reinforcement learning, the output of the deep reinforcement learning network being the admittance control parameters; designing the reward and punishment function of the deep reinforcement learning network III, the number of training rounds and the maximum number of steps per training round;
The input and output of the deep reinforcement learning network III are as follows: for the single-shaft hole-insertion stage, the input of the deep reinforcement learning network III is set to S = [Px, Py, Pz, Rx, Ry, Fx, Fy, Fz, Tx, Ty], and for the multi-shaft hole-insertion stage the input is S = [Px, Py, Pz, Rx, Ry, Rz, Fx, Fy, Fz, Tx, Ty, Tz]; wherein Px, Py, Pz are respectively the positions of the robot tail end workpiece on the X, Y and Z axes in the Cartesian world coordinate system, Rx, Ry, Rz are respectively the rotation values of the robot tail end workpiece about the X, Y and Z axes in the Cartesian world coordinate system, and Fx, Fy, Fz and Tx, Ty, Tz are respectively the forces and moments at the robot tail end along the X-axis, Y-axis and Z-axis directions; after training through the reinforcement learning network III, continuous stiffness coefficients and damping coefficients are output, the stiffness coefficient and damping coefficient taking values within certain intervals, with the stiffness coefficient training action range b ∈ (0.5, 5) and the damping coefficient training action range k ∈ (5, 75);
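One way to obtain continuous parameters inside these intervals is to rescale a tanh-bounded actor output into b ∈ (0.5, 5) and k ∈ (5, 75); the following sketch shows such a mapping, with the layer sizes being assumptions.

```python
# Hedged sketch of a DDPG actor head whose tanh outputs are rescaled into the
# admissible parameter intervals (b for the stiffness coefficient and k for the
# damping coefficient, following the naming used in the text above).
import torch
import torch.nn as nn

class AdmittanceActor(nn.Module):
    def __init__(self, state_dim, low=(0.5, 5.0), high=(5.0, 75.0)):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, 2), nn.Tanh())
        self.low = torch.tensor(low)
        self.high = torch.tensor(high)

    def forward(self, state):
        u = self.net(state)                                  # values in (-1, 1)
        # affine map into (low, high): continuous stiffness and damping coefficients
        return self.low + (u + 1.0) * 0.5 * (self.high - self.low)
```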
The deep reinforcement learning network III adopts the DDPG (Deep Deterministic Policy Gradient) network structure, which is divided into a critic module and an actor module and comprises four neural networks; the critic module updates the network parameter ω by means of the temporal-difference error and periodically copies ω to the target network; the actor module updates the network parameter θ by means of the DPG algorithm, and the action strategy selects an action to act on the environment according to the output of the policy network;
wherein the reward function is
r = 1 − k_n/k_max, if the assembly of the round is completed;
r = ΔH/H, if the insertion continues within the reasonable threshold;
r = −1, if the contact force exceeds the set threshold during insertion
wherein k_n is the number of steps at the end of the nth round, k_max = 60 is the set maximum number of training steps and the number of rounds is Episode = 120; ΔH is the insertion depth at the end of each round, and H is the set depth at which assembly is completed; the whole reward function is divided into three parts: the first part gives a positive reward according to the ratio of the number of steps in each assembly round to the set maximum number of steps, and the more steps are used when the assembly round is completed, the smaller the positive reward; the second part is determined by the insertion depth of each round, and within a reasonable threshold, the deeper the insertion, the higher the positive reward; finally, a reward of −1 is given when the force exceeds the set threshold during the insertion process;
Step 8, in the insertion stage, when the tail end of the robot reaches a certain insertion depth and each axial force/moment is within a certain threshold range, the insertion is considered successful.

Claims (9)

1. A robot shaft hole assembling method based on deep reinforcement learning and admittance control is characterized by comprising the following steps:
Step 1, dividing robot shaft-hole assembly into three stages: a free hole-approaching stage outside the hole, a hole-searching stage at the hole edge and a hole-insertion stage inside the hole; meanwhile, distinguishing the methods for single-shaft hole assembly and multi-shaft hole assembly of the robot;
step 2, roughly positioning the assembly hole in a visual positioning mode, and controlling the robot to move so that a workpiece at the tail end of the robot reaches the position of the hole edge, and the robot enters the hole searching stage of the assembly hole edge;
Step 3, in the hole-edge hole-searching stage, firstly designing hole-search prior knowledge based on human hole-searching experience and presetting prior states and prior actions to form a prior knowledge memory bank; secondly, collecting force/moment information and pose information of the robot as input of a deep reinforcement learning network, judging through a prior state classification model whether the input corresponds to the prior knowledge, and if so, directly outputting the corresponding action, otherwise inputting it to the deep reinforcement learning network for exploration and training, and outputting the position or posture adjustment strategy action of the robot, so that the workpiece at the tail end of the robot can smoothly search the hole; respectively constructing a deep reinforcement learning network I for single-shaft hole searching and a deep reinforcement learning network II for multi-shaft hole assembly;
Step 4, designing the reward and punishment function, the number of training rounds and the maximum number of steps per training round for the deep reinforcement learning networks of the hole-edge hole-searching stage; using the deep reinforcement learning network I and the deep reinforcement learning network II to explore and train single-shaft hole searching and multi-shaft hole searching respectively, until the number of steps in each round converges and the reward value also converges;
Step 5, in the hole-insertion stage, setting a constant expected force in the Z-axis direction at the tail end of the mechanical arm with the forces and moments in the other directions set to 0, and moving the tail end of the robot towards the set expected position; acquiring the tail end force/moment information of the robot during the insertion process and comparing it with the set expected force to obtain the real-time force deviation; inputting the force deviation into an admittance controller, calculating through the admittance control algorithm, and outputting the adjusted posture of the robot tail end workpiece; meanwhile, the admittance control parameters are optimized in real time through a deep reinforcement learning algorithm, realizing variable admittance control;
Step 6, designing an insertion prior knowledge memory bank based on human insertion experience, the memory bank comprising prior states and prior experience values; discriminating the input state of the deep reinforcement learning network by constructing a classifier, and if the input state is a prior state, directly outputting the corresponding prior experience value and then directly controlling the motion of the robot tail end workpiece; if the input state is not a prior state, sending it into the deep reinforcement learning network for training to obtain the admittance control parameters; defining the admittance control parameters to be trained, simulating the admittance control algorithm, selecting different admittance control parameters for simulation, determining the admittance control parameters to be trained, and predetermining the variation range of the admittance control parameters;
Step 7, constructing a deep reinforcement learning network III for variable admittance control that can output continuous actions, collecting the force/moment signals and pose signals during the insertion process as the state input of the deep reinforcement learning, the output of the deep reinforcement learning network being the admittance control parameters; designing the reward and punishment function of the deep reinforcement learning network III, the number of training rounds and the maximum number of steps per training round;
Step 8, in the insertion stage, when the tail end of the robot reaches a certain insertion depth and each axial force/moment is within a certain threshold range, the insertion is considered successful.
2. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 3, the prior action based on human hole-searching experience is the step displacement of the robot end workpiece along the X-axis and Y-axis directions in the Cartesian world coordinate system, and the input of the prior state classification model comprises the positions of the robot end workpiece along the X-axis and Y-axis directions in the Cartesian world coordinate system and the rotation angles about the X-axis, Y-axis and Z-axis directions.
3. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 3, the input of the deep reinforcement learning network I is the position information of the robot end workpiece in the X and Y axes under the Cartesian world coordinate, the rotation amount around the X axis, the Y axis and the Z axis, and the force and moment information of the robot end workpiece in the X axis and the Y axis; the output comprises four actions, namely the stepping adjustment quantity of the workpiece at the tail end of the robot in the positive and negative directions around the X axis and the positive and negative directions around the Y axis.
4. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 3, the input of the deep reinforcement learning network II is the positions of the robot tail end workpiece on the X axis and the Y axis under the Cartesian world coordinate, the rotation angles around the X axis, the Y axis and the Z axis, the forces of the robot tail end workpiece on the X axis and the Y axis and the moments of the robot tail end workpiece on the X axis, the Y axis and the Z axis; the output of the deep reinforcement learning network II is six actions which are respectively the positive and negative stepping adjustment quantity of the rotation of the workpiece at the tail end of the robot around the X axis, the Y axis and the Z axis.
5. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 4, the reward and punishment function is totally divided into two types of rewards, the first type is positive reward, namely the positive reward is determined by the ratio of the step number of the training in the current round to the set maximum step number, and the more the step number is taken when the assembly is completed in each round, the lower the positive reward value obtained in the round is; the second type is negative value reward, when the robot end rotates for m times along the same direction continuously, the behavior is set to be the behavior deviating from the hole center, and-1 reward is given; when the collected Z-axis force is suddenly reduced, the hole searching in the round is successful, and the hole searching in the round is finished.
6. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 6, the input of the prior state classifier is the position and the posture of the robot tail end workpiece under a Cartesian world coordinate system and the force/moment information acting on the robot tail end workpiece; the prior experience value is positive and negative rotation quantity of the workpiece at the tail end of the robot around an X axis and positive and negative rotation quantity around a Y axis.
7. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 6, one coefficient component of the damping matrix and one coefficient component of the stiffness matrix are selected for training, and the other coefficients in the damping matrix and the stiffness matrix are then obtained according to the proportional relation.
8. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 7, the input of the deep reinforcement learning network III is set to S = [Px, Py, Pz, Rx, Ry, Fx, Fy, Fz, Tx, Ty] for the single-shaft hole-insertion stage, and the input of the deep reinforcement learning network III for the multi-shaft hole-insertion stage is set to S = [Px, Py, Pz, Rx, Ry, Rz, Fx, Fy, Fz, Tx, Ty, Tz]; wherein Px, Py, Pz are respectively the positions of the robot end workpiece on the X, Y and Z axes in the Cartesian world coordinate system, Rx, Ry, Rz are respectively the rotation values of the robot end workpiece about the X, Y and Z axes in the Cartesian world coordinate system, and Fx, Fy, Fz and Tx, Ty, Tz are respectively the forces and moments at the robot end along the X-axis, Y-axis and Z-axis directions; after training through the reinforcement learning network III, continuous stiffness coefficients and damping coefficients are output, the stiffness coefficient and the damping coefficient taking values within certain intervals.
9. The robot axis hole assembling method based on the deep reinforcement learning and admittance control as recited in claim 1, wherein: in the step 7, the reward function is
r = 1 − k_n/k_max, if the assembly of the round is completed;
r = ΔH/H, if the insertion continues within the reasonable threshold;
r = −1, if the contact force exceeds the set threshold during insertion
wherein k_n is the number of steps at the end of each round, k_max is the set maximum number of training steps, ΔH is the insertion depth at the end of each round, and H is the set depth at which assembly is completed; the whole reward function is divided into three parts: the first part gives a positive reward according to the ratio of the number of steps in each assembly round to the set maximum number of steps, and the more steps are used when the assembly round is completed, the smaller the positive reward; the second part is determined by the insertion depth of each round, and within a reasonable threshold, the deeper the insertion, the higher the positive reward; finally, a reward of −1 is given when the force exceeds the set threshold during the insertion process.
CN202211369853.3A 2022-11-03 2022-11-03 Robot shaft hole assembling method based on deep reinforcement learning and admittance control Pending CN115674204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211369853.3A CN115674204A (en) 2022-11-03 2022-11-03 Robot shaft hole assembling method based on deep reinforcement learning and admittance control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211369853.3A CN115674204A (en) 2022-11-03 2022-11-03 Robot shaft hole assembling method based on deep reinforcement learning and admittance control

Publications (1)

Publication Number Publication Date
CN115674204A true CN115674204A (en) 2023-02-03

Family

ID=85048879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211369853.3A Pending CN115674204A (en) 2022-11-03 2022-11-03 Robot shaft hole assembling method based on deep reinforcement learning and admittance control

Country Status (1)

Country Link
CN (1) CN115674204A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116619393A (en) * 2023-07-24 2023-08-22 杭州键嘉医疗科技股份有限公司 Mechanical arm admittance variation control method, device and equipment based on SVM
CN116619393B (en) * 2023-07-24 2023-11-14 杭州键嘉医疗科技股份有限公司 Mechanical arm admittance variation control method, device and equipment based on SVM

Similar Documents

Publication Publication Date Title
CN111881772B (en) Multi-mechanical arm cooperative assembly method and system based on deep reinforcement learning
CN110039542B (en) Visual servo tracking control method with speed and direction control function and robot system
CN110666793B (en) Method for realizing robot square part assembly based on deep reinforcement learning
CN109960880B (en) Industrial robot obstacle avoidance path planning method based on machine learning
CN109240091B (en) Underwater robot control method based on reinforcement learning and tracking control method thereof
CN111260649B (en) Close-range mechanical arm sensing and calibrating method
CN113510704A (en) Industrial mechanical arm motion planning method based on reinforcement learning algorithm
CN108427282A (en) A kind of solution of Inverse Kinematics method based on learning from instruction
CN110722533B (en) External parameter calibration-free visual servo tracking of wheeled mobile robot
CN112536797A (en) Comprehensive compensation method for position and attitude errors of industrial robot
CN115674204A (en) Robot shaft hole assembling method based on deep reinforcement learning and admittance control
CN112497208A (en) Mobile operation robot general control method based on full-state impedance controller
CN117103282B (en) Double-arm robot cooperative motion control method based on MATD3 algorithm
CN113043275A (en) Micro-part assembling method based on expert demonstration and reinforcement learning
Leyendecker et al. Deep Reinforcement Learning for Robotic Control in High-Dexterity Assembly Tasks—A Reward Curriculum Approach
CN115464659A (en) Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
CN114935943A (en) Unmanned aerial vehicle and unmanned vehicle cluster formation tracking control method and system
CN113043278B (en) Mechanical arm track planning method based on improved whale searching method
CN109352649A (en) A kind of method for controlling robot and system based on deep learning
CN113910218A (en) Robot calibration method and device based on kinematics and deep neural network fusion
CN110434854A (en) A kind of redundancy mechanical arm Visual servoing control method and apparatus based on data-driven
CN115918377B (en) Control method and control device of automatic tree fruit picking machine and automatic tree fruit picking machine
Liu et al. Visual servoing with deep learning and data augmentation for robotic manipulation
CN112264995B (en) Robot double-shaft hole assembling method based on hierarchical reinforcement learning
CN115933374A (en) Industrial robot load parameter static identification and pose identification optimization method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination