CN116533249A - Mechanical arm control method based on deep reinforcement learning - Google Patents

Mechanical arm control method based on deep reinforcement learning

Info

Publication number
CN116533249A
CN116533249A (application CN202310657299.7A)
Authority
CN
China
Prior art keywords
mechanical arm
servo
training
ddpg
function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310657299.7A
Other languages
Chinese (zh)
Inventor
袁庆霓
齐建友
高清扬
陈启鹏
吕健
杜飞龙
吴杨东
蓝伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202310657299.7A priority Critical patent/CN116533249A/en
Publication of CN116533249A publication Critical patent/CN116533249A/en
Pending legal-status Critical Current

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02PCLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a mechanical arm control method based on deep reinforcement learning, comprising the following steps: designing a servo controller and building a mechanical arm visual servo system; training an adaptive servo gain DDPG model, i.e. building the DDPG training environment, setting the DDPG training parameters and completing DDPG model training; setting and acquiring a desired image of the mechanical arm grasping the target object and extracting the desired features; acquiring the current target image and extracting the current features; estimating the depth value with an extended Kalman filter, substituting it into the computation of the generalized inverse of the image Jacobian matrix, and computing the image feature error e from the desired and current features; determining the servo gain value from the current state of the mechanical arm visual servo system with the trained adaptive servo gain algorithm based on the deep deterministic policy gradient (DDPG); and performing servo control of the mechanical arm through the designed servo controller. The method offers stronger stability, higher robustness and faster convergence.

Description

Mechanical arm control method based on deep reinforcement learning
Technical Field
The invention relates to the technical fields of robots, computer vision and artificial intelligence, in particular to a mechanical arm control method based on deep reinforcement learning.
Background
Visual servoing is a servo control method that uses visual information to quickly control the pose of a robot or camera to a desired position; combined with a robot, it enables complex tasks in unstructured environments. Accordingly, visual servoing is receiving increasing attention in many fields that demand high precision and robustness, such as medical surgery, plant seedling cutting, lunar sample container clamping, automatic positioning of lifting cranes, and rotorcraft robotic arm systems. A system that uses a camera as a vision sensor and places that sensor in the feedback loop is called a visual servo system. According to how the control system is constituted, visual servoing falls into three categories: 1) position-based visual servoing (PBVS), which extracts visual features from the image to estimate the three-dimensional pose of the camera relative to the target, i.e., reconstructs a three-dimensional model from two-dimensional data; 2) image-based visual servoing (IBVS), which determines the control input directly on the two-dimensional image plane from image feature coordinate errors; 3) hybrid visual servoing (HVS), a combination of three-dimensional PBVS and two-dimensional IBVS.
In practice, feature loss occurs because of the limited field of view (FOV) of the camera. Whether IBVS or PBVS is used, the target can be lost, and visual servoing fails if the tracked object leaves the camera's field of view. In the prior art there are mainly three solutions. The first avoids feature loss through long-field trajectory planning or optimization: researchers optimize feature trajectories in the image frame with MPC-based methods that take visibility constraints into account; instead of optimizing feature trajectories in the image, other researchers plan the camera path in three-dimensional Cartesian space to improve servo efficiency while avoiding feature loss. Although these methods are very robust thanks to the long-field trajectory optimization, they require heavy computation, which limits their application scenarios, especially real-time tasks. The second approach focuses on guaranteeing the visibility constraint of the next state: a control law based on a potential function pushes features back into the safe area when they enter the dangerous area; while not computationally burdensome, it requires careful manual design of the magnitude of the potential field and the shape of the potential function. The third strategy uses a reinforcement learning algorithm to autonomously learn a policy that accomplishes the servo task while avoiding feature loss. As the capabilities of reinforcement learning are increasingly exploited, learning-based approaches have become popular in recent years. Researchers have used the Dyna-Q learning algorithm to design an adaptive law that adjusts the IBVS gain, which not only guarantees the visibility constraint but also improves servo efficiency. Although RL algorithms have been tried for avoiding target feature loss, the most popular remains Q-learning, which suffers from a major drawback: the action space it considers is discrete. In contrast, the deep deterministic policy gradient (DDPG) algorithm can directly handle continuous action spaces, and the application of DRL to robotic visual servo control remains to be explored. In short, existing visual servo control systems suffer from slow servo convergence, poor precision, and easy feature loss.
Disclosure of Invention
The invention aims to overcome these defects and provide a mechanical arm control method based on deep reinforcement learning with stronger stability, higher robustness and faster convergence.
The invention discloses a mechanical arm control method based on deep reinforcement learning, which comprises the following steps:
s1: building a visual servo system of the mechanical arm, setting and acquiring an expected image shot by a camera when the mechanical arm grabs a target object, and extracting expected characteristics;
the visual servo system of the mechanical arm comprises a seven-degree-of-freedom mechanical arm and a controller, wherein a camera is integrated on an end effector of the mechanical arm:
s2: initializing the state of the mechanical arm and ensuring that the target object is in the field of view of the camera at the moment;
s3: acquiring a current target image and extracting current characteristics;
s4: estimating the depth value with an extended Kalman filter, substituting it into the computation of the generalized inverse of the image Jacobian matrix, and computing the image feature error e from the desired features and the current features;
s5: according to the current state of the mechanical arm visual servo system, an adaptive servo gain algorithm based on the deep deterministic policy gradient (DDPG) is adopted to determine the servo gain value, as follows:
s51: defining parameters of a self-adaptive servo gain DDPG model;
the defining adaptive servo gain DDPG model parameters comprises the following steps:
(1) State-space selection: the raw speed of the camera is selected as the state space,
which can be expressed as
S = {s | s = ê = L_p^+ e},
wherein S is the state space, s is a state-space element, ê is the camera raw speed, e is the image feature error, and L_p^+ is the generalized inverse of the interaction matrix L_p;
(2) Action-space selection: the servo gain is directly selected as the action space;
(3) Reward-function design: a neural network is designed, based on human experience, to model the correction term of the manually designed reward function; the inputs of the neural network are the current action and state, and its output is the correction term r_c of the reward function. The reward function r is expressed as r = r_e + r_c,
wherein r_e is the manually designed reward function, expressed as:
wherein R is the maximum artificial reward value and is a positive number; κ is the number of image feature points; rn is the height of the image plane and cn is the width of the image plane; p_i is the current feature point and p_{id} is the desired feature point;
the neural-network parameters are updated in a supervised manner according to the feature-error change rate and the average radial distance between the feature points and the center of the field of view; the correction term mainly serves to buffer empirical errors. A sketch of the state and composite reward computation is given below.
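For illustration only, the following Python sketch shows how the state and composite reward defined above could be computed; it is not code from the patent, and it assumes that the interaction matrix L_p, the stacked feature error e, and the two reward components are provided by the caller.

```python
import numpy as np

def ddpg_state(L_p: np.ndarray, e: np.ndarray) -> np.ndarray:
    """State s = camera raw speed, i.e. the generalized inverse of the
    interaction matrix applied to the image feature error (six-dimensional)."""
    return np.linalg.pinv(L_p) @ e

def composite_reward(r_e: float, r_c: float) -> float:
    """Reward r = r_e + r_c: manually designed reward plus the learned correction term."""
    return r_e + r_c

# Example with dummy values: four point features give an 8x6 interaction matrix.
L_p = np.random.randn(8, 6)      # interaction (image Jacobian) matrix, assumed known
e = np.random.randn(8)           # stacked feature error p_i - p_id
s = ddpg_state(L_p, e)           # six-dimensional state fed to the DDPG actor
```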
S52: the adaptive servo gain algorithm based on DDPG is designed, and the specific steps are as follows:
input: randomly initialize the servo gain network parameters θ^Q (Critic) and θ^μ (Actor); initialize the target networks by copying the current networks: θ^{Q'} ← θ^Q, θ^{μ'} ← θ^μ; initialize the noise generator OU_t; set the number of repetitions C and the total number of training rounds n;
output: the optimal servo gain network parameters and the optimal policy;
step 1, randomly initialize the pose of the mechanical arm, and generate the initial state s_0 with the state generator;
step 2, select the servo gain K_t = a_t = β(s_t) = μ(s_t|θ^μ) + OU_t, wherein s_t is the current state, a_t is the current action, β is the exploration (behavior) strategy, and μ is the action policy function;
step 3, compute the camera speed V_c with the servo controller SMCC-IBVS using the servo gain K_t, execute and observe the camera speed V_c with the mechanical arm controller to obtain the manually designed current reward r_{e,t}, and let the state generator generate the next state value s_{t+1};
The servo controller SMCC-IBVS: a proportional controller is introduced into the image-based visual servoing C-IBVS so that the servo error decays exponentially, and sliding-mode control SMC is adopted to compensate the instability of the system and improve its stability; this enhancement combining proportional control and sliding-mode control SMC yields the image-based visual servo controller designed according to the control law formula below, called the servo controller SMCC-IBVS;
the control law formula:
wherein V_c is the camera speed; K_p is the product of a fixed non-negative number and the identity matrix, i.e. the proportional servo gain value; K_sm is the product of a fixed non-negative number and the identity matrix, i.e. the sliding-mode control gain value; L_p^+ denotes the generalized inverse of the interaction matrix L_p; sm denotes the sliding surface; and sat(·) is the saturation function;
the saturation function formula:
wherein sgn(·) denotes the sign function;
Step 4, obtaining the current rewarding value r by combining the correction term output by the neural network t Updating the neural network parameters in a supervision mode;
step 5, four-element group(s) t ,a t ,r t ,s t+1 ) Stored into the replay buffer M;
step 6, if the next stateState s t+1 If the training is in the termination state, the training is terminated, and the step 8 is executed;
step 7, if the buffer M is full, randomly selecting N sample quadruplets (s t ,a t ,r t ,s t+1 ) Training the critics and actor neural networks respectively, and updating target network parameters;
and 8, returning to the step 2, and repeatedly executing C times.
Step 9, detecting the number of training rounds, and returning to the step 1 if the total number of training rounds does not reach n; otherwise, exiting;
s53: training an adaptive servo gain DDPG model: building a DDPG training environment; setting DDPG training parameters;
the DDPG training environment is built as follows: the adaptive servo gain DDPG IBVS controller is designed using Peter Corke's visual servoing toolbox, the reinforcement learning algorithm is built with MATLAB's Reinforcement Learning Toolbox, and on this basis the correction-term neural network is added; the camera in simulation has the same intrinsic parameters as the camera used in the physical world.
The setting of DDPG training parameters: the training process DDPG comprises two hidden layers, wherein the numbers of the first layer neural network units and the second layer neural network units of the hidden layers are respectively 60 and 40; the current network of the Actor and the target network of the Actor receive the six-dimensional state s as input and act a as output; the Critic current network and the Critic target network receive the state action pair (s, a) as input and the one-dimensional action cost function Q (s, a) as output. The activation function of the neural network in the hidden layer is set as a Relu function; while the active function of the Actor-Actor current network and the Actor-target network output layer adopts the modified Tanh function to ensure the output range k p ∈(0,2.0],k sm ∈(0,3.0]The method comprises the steps of carrying out a first treatment on the surface of the Randomly selecting n=64 samples from the retransmission buffer M for training the actor and commenting on the home neural network, respectively, each round of training; the learning rate is set to 0.0001 when the Actor and Critic networks are trained, and the updating rate of the two target networks is set to 0.001; setting the artificial maximum reward value R to be 1.5, and correcting the hidden layer of the term neural networkThe number of neural network elements is h=20, and the activation function of the output layer also uses the Tanh function to ensure the output range r c ∈[-1.5,1.5]The learning rate during training was set to 0.001.
S6: servo control is performed on the mechanical arm: inputting the image characteristic error, the generalized inverse matrix of the image jacobian matrix and the servo gain value into a servo controller SMCC-IBVS, generating a mechanical arm control signal to control the mechanical arm to move, and returning to the step S3 if the target pose is not reached yet.
Compared with the prior art, the invention has obvious beneficial effects. As can be seen from the above scheme, the invention introduces a proportional controller into image-based visual servoing (C-IBVS) so that the servo error decays exponentially, adopts sliding-mode control SMC to compensate the instability of the system, and proposes an SMCC-IBVS visual servo controller that integrates proportional control and sliding-mode control. The servo gain selection process is formulated as a Markov decision process (MDP) model, on which an adaptive servo gain algorithm based on the deep deterministic policy gradient (DDPG) is designed; the servo gain of the controller (SMCC-IBVS) is adjusted adaptively through deep reinforcement learning, which reduces the servo error and improves stability. In short, adaptively adjusting the servo gain of the SMCC-IBVS controller through deep reinforcement learning accelerates convergence, improves servo precision and stability, and avoids feature loss, providing a control method with stronger stability, higher robustness and faster convergence for intelligent control of the mechanical arm.
The advantageous effects of the present invention are further described below by means of the specific embodiments.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a neural network diagram of the present invention;
FIG. 3 is a flow chart of a DDPG-based adaptive servo gain algorithm of the present invention;
FIG. 4 is a representation of training data of the present invention;
FIG. 5 is a diagram of an embodiment of the present invention.
Detailed Description
The following detailed description of specific embodiments, features and effects of a method for controlling a mechanical arm based on deep reinforcement learning according to the present invention is given with reference to the accompanying drawings and preferred embodiments.
Referring to fig. 1, the mechanical arm control method based on deep reinforcement learning of the invention comprises the following steps:
s1: building a mechanical arm visual servo system of eye-in-hand configuration, wherein the visual servo system comprises: the system consists of a seven-degree-of-freedom mechanical arm, wherein a camera is integrated on an end effector, a trained depth deterministic strategy gradient (DDPG) algorithm is responsible for generating proper servo gain required under each state in the whole system, then the servo gain is transmitted to a hybrid controller to generate the running speed of the end camera of the mechanical arm at the next moment, namely the end effector, the running state of each joint is determined by the robot according to the end speed to control the movement of the robot, then the next characteristic image is collected by cyclic execution, the generalized inverse matrix of an image jacobian matrix and the current characteristic error are generated, the generalized inverse matrix of the image jacobian matrix and the current characteristic error are transmitted to a state generator to generate the state at the moment, the servo gain is determined for preparing input for the next DDPG algorithm, and the servo gain is also used for generating the final camera speed transmitted into the joint controller in the hybrid controller. The system also optimizes the self-adaptive servo gain DDPG algorithm trained by simulation, and in view of a certain gap between simulation and reality, the DDPG trained by simulation is transplanted into a real object and is further trained in an actual scene by combining the system so as to make up the difference between simulation and reality, so that the aim of optimizing the DDPG algorithm is fulfilled.
S2: acquiring a desired image shot when the mechanical arm grabs a target object, and extracting desired characteristics;
s3: initializing the state of the robotic arm and ensuring that the target is within the camera's field of view at this time,
s4: acquiring a target image at the moment, and extracting current target characteristics;
s5: estimating a depth value Z by adopting an extended Kalman filtering method, substituting the depth value Z into a generalized inverse matrix of a calculated image jacobian matrix, and simultaneously calculating expected characteristics and current characteristics to obtain characteristic errors;
s6: according to the current state, determining a current servo gain value by adopting a self-adaptive servo gain algorithm based on DDPG;
s61: defining adaptive servo gain DDPG model parameters, comprising:
s611: selection of a state space: for IBVS, the servo gain is related to the feature error e and the spatial position of the feature in the camera coordinate system. Further, since the dimension of the camera speed is fixed even with many feature points and the states of the other joints are relatively independent for any type of robot arm end camera, the camera speed is selected as the state space, which can then be expressed as:
s612: and (3) action space selection: directly selecting servo gain as an action space;
s613: and (3) bonus function design: the bonus function is designed based on achieving the goal of moving the features to the desired position, i.e., avoiding feature loss, improving servo efficiency. The bonus function is mainly dependent on three conditions: reaching the desired location, the feature being outside the field of view and reducing feature errors as quickly as possible. The invention designs a neural network based on artificial experience to simulate the correction term of an artificial rewarding function, wherein the input term is the current action and state space, and the correction term r of the rewarding function is output c As shown in fig. 2. The neural network parameters of the correction term are updated in a supervised manner according to the target distance change rate (characteristic error change rate) and the average radiation distance between the characteristic point and the center of the visual field, and the correction term mainly plays the role of buffering the empirical error, so that the reward function is expressed as r=r e +r c
The artificially designed reward function is expressed as:
wherein R represents the maximum artificial reward value and is a positive number; κ represents the number of feature points; rn represents the height of the image plane and cn represents the width of the image plane. A sketch of the correction-term network follows.
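A possible realization of the correction-term network is shown below, using the sizes stated in S64 (h = 20 hidden units, Tanh output scaled to [-1.5, 1.5], learning rate 0.001); the hidden activation and the PyTorch framework are assumptions, not choices stated in the patent.

```python
import torch
import torch.nn as nn

class CorrectionNet(nn.Module):
    """Correction term r_c: input (state, action), output bounded to [-R, R] with R = 1.5."""
    def __init__(self, state_dim: int = 6, action_dim: int = 2, hidden: int = 20, R: float = 1.5):
        super().__init__()
        self.R = R
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),                 # hidden activation assumed; the patent only fixes h = 20
            nn.Linear(hidden, 1),
            nn.Tanh(),                 # Tanh output layer, scaled below to [-R, R]
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.R * self.net(torch.cat([state, action], dim=-1))

correction_net = CorrectionNet()
optimizer = torch.optim.Adam(correction_net.parameters(), lr=1e-3)   # learning rate 0.001 as in S64
```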
S62: designing a self-adaptive servo gain algorithm based on DDPG, wherein a DDPG algorithm flow chart is shown in FIG. 3;
the adaptive servo gain algorithm based on DDPG is described as follows:
Input: randomly initialize θ^Q and θ^μ; initialize the target networks by copying the original networks: θ^{Q'} ← θ^Q, θ^{μ'} ← θ^μ; initialize the noise generator OU_t.
Output: the optimal servo gain network parameters θ^μ and θ^Q, and the optimal policy.
1. Randomly initialize the pose of the mechanical arm, acquire the related data, and generate the initial state s_0 with the state generator.
2. Select the servo gain K_t = a_t = β(s_t) = μ(s_t|θ^μ) + OU_t.
3. Compute the camera speed V_c with the SMCC-IBVS controller using the servo gain K_t, execute the camera speed with the mechanical arm controller and observe to obtain the manually designed reward r_{e,t}, and let the state generator generate the next state value s_{t+1}.
4. Combine it with the correction term r_{c,t} output by the correction-term neural network to obtain the reward value r_t, and update the parameters of the correction-term neural network in a supervised manner.
5. Store the quadruple (s_t, a_t, r_t, s_{t+1}) into the replay buffer M.
6. If the next state s_{t+1} is a termination state, terminate this training episode and go to step 8.
7. If the buffer M is full, randomly select N quadruples (s_t, a_t, r_t, s_{t+1}); train the Critic and Actor neural networks respectively according to the loss function (3) and the policy gradient (1), and finally soft-update the target network parameters.
The adaptive law of the servo gain is governed by a parameterized policy μ(·|θ^μ): S → A, which is computed with the DDPG algorithm. The gradient with respect to the parameter vector θ^μ is given by the deterministic policy gradient theorem:
∇_{θ^μ} J ≈ E_β[ ∇_a Q(s, a|θ^Q)|_{a=μ(s|θ^μ)} ∇_{θ^μ} μ(s|θ^μ) ]   (1)
where β represents the stochastic behavior strategy used to interact with the environment, introduced to explore potentially better policies. Q(s, a) is the action-value function evaluating the long-term reward of a state-action pair, computed as
Q^μ(s_t, a_t) = E[ r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})) ]   (2)
DDPG is a policy-based deep reinforcement learning algorithm that can effectively use neural networks to solve continuous MDP problems. The whole DDPG involves four neural networks: a Critic target network Q' and a Critic current (online) network Q; an Actor target network μ' and an Actor current (online) network μ.
The Critic current (online) network Q updates its parameters θ^Q in the TD-error manner of DQN, and the loss function is the mean squared error:
L(θ^Q) = E_{β,P}[ (Q(s_t, a_t|θ^Q) − y_{i,t})^2 ]   (3)
wherein y_{i,t} is the target action value at time t, computed with the Critic target network Q' and the Actor target network μ'. With this loss function L, the gradient of L with respect to θ^Q can be computed by standard backpropagation and used to optimize and update θ^Q:
y_{i,t} = r(s_t, a_t) + γ Q'(s_{t+1}, μ'(s_{t+1}|θ^{μ'})|θ^{Q'})   (4)
The update of the Actor current (online) network parameters θ^μ is determined by equation (1), following the deterministic policy gradient theorem; the update formula approximated by the neural network is expressed as follows:
the update of the network parameters of the target networks Q' and μ adopts a moving average manner, i.e., soft update:
where τ is a super parameter that determines the update rate.
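A minimal PyTorch-style sketch of one such update step is given below, assuming `actor`, `critic`, `actor_target`, `critic_target` are networks of the kind described in S64 and `batch` is a sampled set of quadruples; the discount factor γ is not specified in the patent and appears only as a placeholder.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One DDPG gradient step: critic TD-error loss (3)-(4), deterministic policy
    gradient for the actor (1), and soft update of the target networks."""
    s, a, r, s_next = batch                              # tensors of shape (N, ...)

    # Critic update: y = r + gamma * Q'(s', mu'(s')), gamma is an assumed discount factor
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)            # mean squared TD error
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: maximize Q(s, mu(s)) by minimizing its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of target networks: theta' <- tau*theta + (1-tau)*theta'
    for net, target in ((critic, critic_target), (actor, actor_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```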
8. Return to step 2 and repeat C times.
9. Check the number of training rounds; if the total number of training rounds has reached n, exit; otherwise return to step 1. The complete training loop is sketched below.
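Steps 1–9 can be stitched together as in the following sketch, reusing `ddpg_update` from above; `env`, `ou_noise` and the buffer policy are illustrative placeholders rather than elements defined by the patent.

```python
import random
import numpy as np
import torch

def collate(batch):
    """Stack a list of (s, a, r, s_next) quadruples into batched float tensors."""
    s, a, r, s_next = zip(*batch)
    as_t = lambda x: torch.as_tensor(np.asarray(x), dtype=torch.float32)
    return as_t(s), as_t(a), as_t(r).unsqueeze(-1), as_t(s_next)

def train(env, actor, critic, actor_target, critic_target, actor_opt, critic_opt,
          ou_noise, n_rounds, C, batch_size=64, capacity=10_000):
    """Outer DDPG loop of Steps 1-9: explore with OU noise, store quadruples in the
    replay buffer M, and update the networks once the buffer is full."""
    M = []                                                       # replay buffer
    for _ in range(n_rounds):                                    # Step 9: n training rounds
        s = env.reset()                                          # Step 1: random pose -> s_0
        for _ in range(C):                                       # Step 8: repeat C times
            with torch.no_grad():                                # Step 2: gain = actor + OU noise
                a = actor(torch.as_tensor(s, dtype=torch.float32)).numpy() + ou_noise()
            s_next, r, done = env.step(a)                        # Steps 3-4: SMCC-IBVS + reward r_t
            M.append((s, a, r, s_next)); M = M[-capacity:]       # Step 5: store the quadruple
            if len(M) >= capacity:                               # Step 7: buffer full -> sample N
                ddpg_update(collate(random.sample(M, batch_size)),
                            actor, critic, actor_target, critic_target,
                            actor_opt, critic_opt)
            s = s_next
            if done:                                             # Step 6: termination state
                break
```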
S63: The IBVS controller was designed using Peter Corke's visual servoing toolbox, the reinforcement learning algorithm was built with MATLAB's Reinforcement Learning Toolbox, and on this basis a correction-term neural network was added, with the camera in the simulation having the same intrinsic parameters as the camera used in the physical world.
S64: Set the parameters for the DDPG simulation training and complete the training of the adaptive servo gain DDPG. Part of the training data set is shown in fig. 4: the initial camera poses used for training are marked by circles, the orientation of the camera is represented by an x-, y-, z-axis vector frame, and the pose the camera is expected to reach is represented by the three thick axes without circles;
the DDPG training parameter setting: the training process DDPG comprises two hidden layers, wherein the number of the neural network units of the first layer and the second layer of the hidden layers is 60 and 40 respectively. The Actor (Actor) current network and Actor target network receive as input the six-dimensional state s and as output the action a. The criticizer (Critic) current network and the criticizer target network receive the state action pair (s, a) as input and the one-dimensional action cost function Q (s, a) as output. The activation function of the neural network in the hidden layer is set to the Relu function. While the active function of the Actor (Actor) current network and Actor target network output layer adopts the modified Tanh function to ensure output range k p ∈(0,2.0],k sm ∈(0,3.0]. At each round of training, n=64 samples were randomly selected from the retransmission buffer M to train the actor and comment on the home network, respectively. The learning rate was set to 0.0001 for both Actor (Actor) and Critic (Critic) networks training, while the update rate was set to 0.001 for both target networks. The artificial maximum reward value R is set to be 1.5, the number of hidden layer neural network elements of the correction term neural network is n=20, and the activation function of the output layer also adopts the Tanh function to ensure the output range R c ∈[-1.5,1.5]The learning rate during training was set to 0.001.
S65: Train the model in simulation according to the DDPG adaptive servo gain algorithm flow and save it; then load the model parameters and perform transfer training on the real mechanical arm, where the procedure for training the network in real-world arm control is the same as the training flow of the network in simulation, until the model re-converges in real arm control, and save the model, as sketched below.
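Saving the simulation-trained parameters and resuming training on the real arm could look like this minimal sketch (assuming the PyTorch networks above; the patent itself works with MATLAB toolboxes, and the file names are invented):

```python
import torch

# After simulation training (S62-S64): save the converged parameters.
torch.save({"actor": actor.state_dict(), "critic": critic.state_dict()}, "ddpg_sim.pt")

# On the real robot (S65): load the parameters and keep training with the same loop
# until the model re-converges, then save again.
ckpt = torch.load("ddpg_sim.pt")
actor.load_state_dict(ckpt["actor"])
critic.load_state_dict(ckpt["critic"])
# train(real_env, actor, critic, ...)   # same training flow as in simulation
torch.save({"actor": actor.state_dict(), "critic": critic.state_dict()}, "ddpg_real.pt")
```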
S7: The feature error, the generalized inverse of the image Jacobian matrix and the servo gain values are transmitted to the servo controller, which generates a mechanical arm control signal to drive the arm; if the target pose has not yet been reached, return to S4 and continue;
the servo controller design: the proportion controller is introduced into the visual servo (C-IBVS) based on the image, the servo error is reduced in an exponential mode through the proportion controller, and in order to improve the stability of the system, the sliding mode control SMC is adopted to compensate the instability of the system, an enhancement method combining the proportion control and the sliding mode control SMC is provided, and the design of the visual servo controller based on the image is carried out according to a control rate formula, which is called as an SMCC-IBVS controller.
The control law formula:
wherein K_p, the product of a fixed non-negative number and the identity matrix, is the proportional servo gain value; K_sm, the product of a fixed non-negative number and the identity matrix, is the sliding-mode control gain value; L_p^+ represents the generalized inverse of the interaction matrix L_p; sm represents the sliding surface; and sat(·) is the saturation function.
The saturation function formula:
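The control law and saturation function themselves appear in the source only as images. A form consistent with the symbol list above — a proportional term plus a sliding-mode compensation term acting through the pseudo-inverse of the interaction matrix, with a conventional unit saturation — would be the following; this is an assumed reconstruction, not the formula as filed.

```latex
% Assumed form of the SMCC-IBVS control law and of the saturation function.
\[
V_c \;=\; -\,L_p^{+}\bigl(K_p\,e \;+\; K_{sm}\,\operatorname{sat}(sm)\bigr),
\qquad
\operatorname{sat}(x) \;=\;
\begin{cases}
\operatorname{sgn}(x), & |x| > 1,\\
x, & |x| \le 1.
\end{cases}
\]
```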
performance analysis:
to demonstrate the effectiveness of the IBVS system proposed by the present invention, experiments were performed using a seven degree of freedom Baxter robot equipped with an integrated camera for viewing planar objects, as shown in fig. 5. The round target object is stationary relative to the mechanical arm base, four feature points on the round target object are taken as small rectangular center positions in fig. 5 respectively, and when the image feature error is smaller than a given pixel threshold value, the task is completed, wherein the threshold value is 2 pixels in size. The initial pose and the expected pose are:
firstly, training in a physical scene further based on a simulated training DDPG network, then carrying out the scene experiment by utilizing a trained self-adaptive gain network, and comparing with Dyna-Q learning IBVS, a fixed gain SMCC-IBVS and other systems, wherein the fixed gain value is set to K p =0.08I,K sm 200 sets of experiments were repeated, each with =2i, resulting in experimental data of table 1. In the experiment, the servo system provided by the invention can converge and the image track is smooth only by about 4.3s, the fixed gain SMCC-IBVS consumes 18.6s to reach the expected pose, and the fixed gain SMCC-IBVS can shake at the expected pose at some time and cannot be positioned accurately. Because the mechanical arm has noise, the operation environment is considered to have noise, namely the algorithm has an anti-interference effect on the noise. In addition, the IBVS system provided by the invention does not need camera calibration, which shows that the system has robustness to calibration errors.
Table 1  Baxter mechanical arm visual servo control experiment results
According to the experimental results of Table 1, for precise positioning of the mechanical arm under visual servo control, the proposed IBVS system improves the convergence speed by 76.9% and the accuracy by 6.3 mm compared with the fixed-gain SMCC-IBVS, and no feature loss occurs during control. The results also show that, by contrast, the Dyna-Q learning servo gain system still exhibits a feature loss rate of 8%.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any simple modification or equivalent variation of the above embodiment made according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (9)

1. A mechanical arm control method based on deep reinforcement learning is characterized in that: the method comprises the following steps:
s1: building a visual servo system of the mechanical arm, setting and acquiring an expected image shot by a camera when the mechanical arm grabs a target object, and extracting expected characteristics;
s2: initializing the state of the mechanical arm and ensuring that the target object is in the field of view of the camera at the moment;
s3: acquiring a current target image and extracting current characteristics;
s4: estimating the depth value with an extended Kalman filter, substituting it into the computation of the generalized inverse of the image Jacobian matrix, and computing the image feature error e from the desired features and the current features;
s5: according to the current state of the mechanical arm visual servo system, an adaptive servo gain algorithm based on the deep deterministic policy gradient DDPG is adopted to determine the servo gain value, as follows:
s51: defining adaptive servo gain DDPG model parameters, comprising:
(1) State-space selection: the raw speed of the camera is selected as the state space;
(2) Action-space selection: the servo gain is directly selected as the action space;
(3) Reward-function design: a neural network is designed, based on human experience, to model the correction term of the manually designed reward function; the inputs of the neural network are the current action and state, and its output is the correction term r_c of the reward function; the reward function r is expressed as r = r_e + r_c, wherein r_e is the manually designed reward function, expressed as:
wherein R is the maximum artificial reward value and is a positive number; κ is the number of image feature points; rn is the height of the image plane and cn is the width of the image plane; p_i is the current feature point and p_{id} is the desired feature point;
s52: the adaptive servo gain algorithm based on DDPG is designed, and the specific steps are as follows:
input: randomly initialize the servo gain network parameters θ^Q (Critic) and θ^μ (Actor); initialize the target networks by copying the current networks: θ^{Q'} ← θ^Q, θ^{μ'} ← θ^μ; initialize the noise generator OU_t; set the number of repetitions C and the total number of training rounds n;
output: the optimal servo gain network parameters and the optimal policy;
step 1, randomly initialize the pose of the mechanical arm, and generate the initial state s_0 with the state generator;
step 2, select the servo gain K_t = a_t = β(s_t) = μ(s_t|θ^μ) + OU_t, wherein s_t is the current state, a_t is the current action, β is the exploration (behavior) strategy, and μ is the action policy function;
step 3, compute the camera speed V_c with the servo controller SMCC-IBVS using the servo gain K_t, execute and observe the camera speed V_c with the mechanical arm controller to obtain the manually designed current reward r_{e,t}, and let the state generator generate the next state value s_{t+1};
step 4, obtain the current reward value r_t by combining the correction term output by the neural network, and update the neural-network parameters in a supervised manner;
step 5, store the quadruple (s_t, a_t, r_t, s_{t+1}) into the replay buffer M;
step 6, if the next state s_{t+1} is a termination state, terminate this training episode and go to step 8;
step 7, if the buffer M is full, randomly select N sample quadruples (s_t, a_t, r_t, s_{t+1}), train the Critic and Actor neural networks respectively, and update the target network parameters;
step 8, return to step 2 and repeat C times;
step 9, check the number of training rounds; if the total number of training rounds has not reached n, return to step 1; otherwise, exit;
s53: training an adaptive servo gain DDPG model: building a DDPG training environment; setting DDPG training parameters and finishing training;
s6: servo control is performed on the mechanical arm: inputting the image characteristic error, the generalized inverse matrix of the image jacobian matrix and the servo gain value into a servo controller SMCC-IBVS, generating a mechanical arm control signal to control the mechanical arm to move, and returning to S3 if the target pose is not reached yet.
2. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 1, wherein: in step S1, the robot vision servo system includes a seven-degree-of-freedom robot and a controller, and a camera is integrated on the robot end effector.
3. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 1, wherein: in step S51 (1), the selection of the camera raw speed as the state space may be expressed as:
wherein S is the state space, s is a state-space element, ê is the camera raw speed, e is the image feature error, and L_p^+ is the generalized inverse of the interaction matrix L_p.
4. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 1, wherein: in step S51 (3), the parameters of the neural network are updated in a supervised manner according to the feature-error change rate and the average radial distance between the feature points and the center of the field of view, and the correction term mainly serves to buffer the empirical error.
5. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 1, wherein: in Step 3 of step S52, the servo controller SMCC-IBVS is obtained by introducing a proportional controller into the image-based visual servoing C-IBVS so that the servo error decays exponentially, and by adopting sliding-mode control SMC to compensate the instability of the system on top of the exponential error reduction provided by the proportional controller.
6. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 5, wherein:
the control law formula:
wherein V_c is the camera speed; K_p is the product of a fixed non-negative number and the identity matrix, i.e. the proportional servo gain value; K_sm is the product of a fixed non-negative number and the identity matrix, i.e. the sliding-mode control gain value; L_p^+ represents the generalized inverse of the interaction matrix L_p; sm represents the sliding surface; and sat(·) is the saturation function;
the saturation function formula:
where sgn(·) represents the sign function.
7. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 1, wherein: in step S53, the DDPG training environment is built as follows: the adaptive servo gain DDPG IBVS controller is designed using Peter Corke's visual servoing toolbox, the reinforcement learning algorithm is built with MATLAB's Reinforcement Learning Toolbox, and on this basis the correction-term neural network is added; the camera in simulation has the same intrinsic parameters as the camera used in the physical world.
8. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 1, wherein: in step S53, the DDPG training parameters are set as follows: the DDPG used in training comprises two hidden layers; the Actor current network and the Actor target network receive the six-dimensional state s as input and output the action a; the Critic current network and the Critic target network receive the state-action pair (s, a) as input and output the one-dimensional action-value function Q(s, a); the activation function of the hidden-layer neurons is set to the ReLU function, while the activation function of the output layer of the Actor current and target networks adopts a modified Tanh function to guarantee the output ranges k_p ∈ (0, 2.0] and k_sm ∈ (0, 3.0]; in each training round, N = 64 samples are randomly selected from the replay buffer M to train the Actor and Critic neural networks respectively; the learning rate is set to 0.0001 when training the Actor and Critic networks, and the update rate of the target networks is set to 0.001; the artificial maximum reward value R is set to 1.5, the number of hidden-layer units of the correction-term neural network is h = 20, the activation function of its output layer also adopts the Tanh function to guarantee the output range r_c ∈ [−1.5, 1.5], and the learning rate during its training is set to 0.001.
9. The method for controlling a mechanical arm based on deep reinforcement learning according to claim 8, wherein: the first and second of the two hidden layers have 60 and 40 neural-network units respectively.
CN202310657299.7A 2023-06-05 2023-06-05 Mechanical arm control method based on deep reinforcement learning Pending CN116533249A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310657299.7A CN116533249A (en) 2023-06-05 2023-06-05 Mechanical arm control method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310657299.7A CN116533249A (en) 2023-06-05 2023-06-05 Mechanical arm control method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116533249A true CN116533249A (en) 2023-08-04

Family

ID=87457823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310657299.7A Pending CN116533249A (en) 2023-06-05 2023-06-05 Mechanical arm control method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116533249A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117162086A (en) * 2023-08-07 2023-12-05 南京云创大数据科技股份有限公司 Training method, method and training system for mechanical arm target searching
CN117140527A (en) * 2023-09-27 2023-12-01 中山大学·深圳 Mechanical arm control method and system based on deep reinforcement learning algorithm
CN117140527B (en) * 2023-09-27 2024-04-26 中山大学·深圳 Mechanical arm control method and system based on deep reinforcement learning algorithm
CN117283565A (en) * 2023-11-03 2023-12-26 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback
CN117283565B (en) * 2023-11-03 2024-03-22 安徽大学 Flexible joint mechanical arm control method based on Actor-Critic network full-state feedback
CN117697769A (en) * 2024-02-06 2024-03-15 成都威世通智能科技有限公司 Robot control system and method based on deep learning
CN117697769B (en) * 2024-02-06 2024-04-30 成都威世通智能科技有限公司 Robot control system and method based on deep learning
CN117850244A (en) * 2024-03-04 2024-04-09 海克斯康制造智能技术(青岛)有限公司 Visual measurement control system and method
CN117850244B (en) * 2024-03-04 2024-05-07 海克斯康制造智能技术(青岛)有限公司 Visual measurement control system and method
CN118114746A (en) * 2024-04-26 2024-05-31 南京邮电大学 Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error

Similar Documents

Publication Publication Date Title
CN116533249A (en) Mechanical arm control method based on deep reinforcement learning
Hersch et al. Dynamical system modulation for robot learning via kinesthetic demonstrations
Zhou et al. Learning via-point movement primitives with inter-and extrapolation capabilities
CN115464659B (en) Mechanical arm grabbing control method based on visual information deep reinforcement learning DDPG algorithm
CN114995479A (en) Parameter control method of quadruped robot virtual model controller based on reinforcement learning
Qiu et al. Model predictive control for uncalibrated and constrained image-based visual servoing without joint velocity measurements
CN110722533A (en) External parameter calibration-free visual servo tracking of wheeled mobile robot
CN111673733A (en) Intelligent self-adaptive compliance control method of robot in unknown environment
Jauhri et al. Interactive imitation learning in state-space
CN112847235A (en) Robot step force guiding assembly method and system based on deep reinforcement learning
CN115446867A (en) Industrial mechanical arm control method and system based on digital twinning technology
CN116834014A (en) Intelligent cooperative control method and system for capturing non-cooperative targets by space dobby robot
CN109542094B (en) Mobile robot vision stabilization control without desired images
CN115488881A (en) Man-machine sharing autonomous teleoperation method and system based on multi-motor skill prior
JPH0635525A (en) Robot arm control method
Angel et al. Vision-based control of the robotenis system
Jiang et al. Motion sequence learning for robot walking based on pose optimization
van Veldhuizen Autotuning PID control using Actor-Critic Deep Reinforcement Learning
Man et al. Intelligent Motion Control Method Based on Directional Drive for 3-DOF Robotic Arm
Xiao et al. One-shot sim-to-real transfer policy for robotic assembly via reinforcement learning with visual demonstration
El-Fakdi et al. Two steps natural actor critic learning for underwater cable tracking
CN117140527B (en) Mechanical arm control method and system based on deep reinforcement learning algorithm
Yüksel Ibvs with fuzzy sliding mode for robot manipulators
Feng et al. Reinforcement Learning-Based Impedance Learning for Robot Admittance Control in Industrial Assembly
US11921492B2 (en) Transfer between tasks in different domains

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination