CN115890744A - TD3-based manipulator 6-DOF object manipulation training method and system - Google Patents

TD3-based manipulator 6-DOF object manipulation training method and system

Info

Publication number
CN115890744A
Authority
CN
China
Prior art keywords
robot
state
strategy
pose
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211612997.7A
Other languages
Chinese (zh)
Inventor
周勇 (Zhou Yong)
贺辉腾 (He Huiteng)
李卫东 (Li Weidong)
胡楷雄 (Hu Kaixiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202211612997.7A priority Critical patent/CN115890744A/en
Publication of CN115890744A publication Critical patent/CN115890744A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Manipulator (AREA)

Abstract

The invention relates to a TD3-based mechanical arm 6-DOF object manipulation training method and system, which comprises the following steps: establishing a Markov decision process model of robot grasping, and determining the state, action and return of the grasping process; obtaining depth information and force feedback in the scene; collecting data in a simulation environment based on a designed imitation learning algorithm, and pre-training a deep reinforcement learning strategy and value network based on the TD3 algorithm; the pre-trained deep reinforcement learning strategy network outputs a robot motion strategy according to the acquired environment state parameters, the mechanical arm executes grasping actions according to the motion strategy, and the strategy is continuously updated during interaction, so that target objects of different shapes are grasped with more suitable poses and trajectories. The invention effectively avoids the loss of environmental information caused by occlusion of the object by the mechanical arm and other parts at a single viewing angle, and can select suitable grasping postures for objects of different sizes, shapes and placement poses.

Description

TD3-based manipulator 6-DOF object manipulation training method and system
Technical Field
The invention relates to a model training system for object manipulation with a collaborative mechanical arm, and in particular to a training system for generating a model for six-degree-of-freedom manipulation and path generation by a mechanical arm in a complex environment.
Background
Since the proposal of "Industry 4.0", modern manufacturing has gradually shifted from mass production to mass customization, which increases product variety and shortens product life cycles. Collaborative robots, characterized by light weight, small size and safety, are widely applied to tasks such as assembly, disassembly, light machining (3C, automotive, packaging, etc.) and inspection. These tasks place higher demands on the flexibility and intelligence of the mechanical arm, and flexible manipulation of objects by the mechanical arm is an indispensable part of realizing robot intelligence.
Autonomous grasping requires that the mechanical arm select a grasping strategy on its own according to the environment in which it is located. To achieve this goal, the mechanical arm needs to interact with the environment and make decisions and manipulate objects according to the information acquired by its sensors during that interaction, so autonomous grasping is a challenging problem involving sensing, planning and control. Most previous mechanical arm applications adopt machine vision: objects of relatively fixed shape and size without obvious occlusion are manipulated, the gripper keeps a posture perpendicular to the working plane, and the target object position is detected by extracting features from image information at a single viewing angle.
In the past, most manipulator applications have concentrated on the handling and sorting of industrial parts. When manipulating objects with more complex shapes, traditional manipulation methods clearly cannot meet the requirements, and neural network learning methods need to collect a large amount of data in the actual environment, which is costly. When facing complex manipulation tasks, the top-down manipulation mode limits the variety of manipulations to some extent, and when there is an obstruction above the object or the task imposes additional constraints, it becomes almost impossible to manipulate the object.
Traditional industrial part manipulation usually adopts manually designed features to detect the target object, and such geometry-driven methods perform poorly on household objects with complicated shapes. Strategies based on deep learning and machine learning, meanwhile, require a large amount of manually provided data or samples, and are difficult, time-consuming and inflexible to extend to new objects.
Disclosure of Invention
In view of the defects of existing methods, the invention aims to provide a TD3-based method and system for robot arm 6-DOF object manipulation training.
In order to solve the technical problem, the invention adopts the following technical scheme:
the TD 3-based robot arm 6-DOF object manipulation training method comprises the following steps of:
Step S1, establishing a Markov decision process model of the manipulator control task, and determining the environmental state, action and return of the control process;
Step S2, the mechanical arm data acquisition and preprocessing module preprocesses the point clouds acquired by the cameras on both sides of the workbench, and fuses the force data acquired by the force sensor at the end of the mechanical arm and the joint angles of the mechanical arm to obtain the environment state S;
Step S3, the manipulation pose evaluation module uses a strategy and value network based on the PointNet architecture to output a suitable manipulation pose X and waypoint parameter ω according to the environment state S obtained by the data acquisition and preprocessing module, and estimates the expected return;
Step S4, the path optimization module uses a task-parameterized Gaussian mixture model to generate manipulation waypoints according to the manipulation pose X and the waypoint parameter ω output by the manipulation pose evaluation module;
Step S5, the training module controls the mechanical arm to manipulate the target object with a suitable pose and trajectory according to the manipulation waypoints generated by the path optimization module, stores the data obtained during interaction and updates the networks, thereby training a neural network model to manipulate objects of different shapes with suitable poses according to the input environment state.
Further, in step S1 a Markov decision process is used to model the grabbing process, as follows:
Step 1.1, assuming the target grasping process satisfies the Markov property, the Markov decision process is represented by the following formula:
M=(S,A,P,R,γ)
where S represents the set of all possible states in the environment; in the grasping model the state is the fusion of the three-dimensional point cloud collected by the cameras and the end pose of the robot; A represents the set of all possible actions executed by the agent, and in the grasping model the action is the six-degree-of-freedom pose X of the end effector; P is the state transition probability, i.e. the probability of reaching state s' after taking action A in state s, where s' is represented in the grasping model by the environmental state parameters observed after taking action A; R is the reward, for which a sparse reward function is adopted: after executing grasping action A in state s, if the grasp succeeds a reward is given according to the grasping pose and stability, and if the grasp is not completed the reward is 0; γ is the discount factor, γ ∈ [0,1];
Step 1.2, the robot and the environment interact over a series of discrete time steps (t = 0, 1, 2, 3, ...), represented by a trajectory:
τ = (S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, ..., S_t, A_t, R_t)
where S_t, A_t and R_t respectively represent the state, action and reward of the Markov decision process at time t;
Step 1.3, in the grasping reinforcement learning task, find a control strategy μ: S → A that maximizes the expected grasping return G_t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}
The action value function Q_μ(s_t, a_t) is used to describe the expected grasping return under strategy μ:
Q_μ(s_t, a_t) = E_μ[G_t | S_t = s_t, A_t = a_t]
where Q_μ(s_t, a_t) represents the action value function under strategy μ, and E_μ[G_t | S_t = s_t, A_t = a_t] represents the mathematical expectation of the return G_t under strategy μ when action a_t is taken in state s_t;
Step 1.4, to find the optimal strategy μ*, the optimal action value function Q*(s_t, a_t) must be found, and the optimal strategy μ* is then obtained by solving the optimal action value function, which is handled with the Bellman equation:
Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, μ(s_{t+1}))]
where s_{t+1} ~ P is the observation obtained from the environment at time t+1, μ(s_{t+1}) is the action a_{t+1} mapped from state s_{t+1} under strategy μ at time t+1, r(s_t, a_t) is the reward for taking action a_t in state s_t, and E(·) denotes the mathematical expectation of the return G_t under strategy μ.
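To make the two formulas above concrete, the following minimal Python sketch computes a discounted return and a one-step Bellman target from recorded rewards; the function names and the finite-episode handling are illustrative assumptions, not part of the patent.

```python
# Illustrative only: discounted return G_t and a one-step Bellman target,
# mirroring the formulas above (finite-horizon episode assumed).
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k}, evaluated at t = 0 for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def bellman_target(r, s_next, mu, q_mu, gamma):
    """One-step target r(s_t, a_t) + gamma * Q_mu(s_{t+1}, mu(s_{t+1}))."""
    return r + gamma * q_mu(s_next, mu(s_next))
```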
Further, the set S of all possible states in the environment includes the environment state point cloud C, the robot end pose X, the joint angles θ = (θ_1, θ_2, ..., θ_n) and the gripper feedback information d, where n denotes the number of robot joints and θ_i denotes the angle of the i-th joint of the robot.
Further, in step S2 the data acquisition and preprocessing module is used to stitch, denoise and filter the point clouds acquired by the multiple depth cameras and to fuse the multi-sensor information to obtain the environment state parameter S, as follows:
Step 2.1, obtain the focal length (f_x, f_y) and optical center (c_x, c_y) of camera C_i from the intrinsic parameters provided by the camera manufacturer, and calibrate camera C_i against the robot base coordinate system with Zhang Zhengyou's calibration method to obtain the extrinsic matrix P relative to the robot base coordinate system. For any pixel (u, v, 1)^T in a depth image I_i acquired by camera C_i, there is a mapping to the position (x_c, y_c, z_c)^T of that point relative to the robot base coordinate system:
z_c · (u, v, 1)^T = K · P · (x_c, y_c, z_c, 1)^T,  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is the camera intrinsic matrix and P is the extrinsic matrix. With this formula, the frame images I_i acquired by the N cameras C_i (i ∈ N) in the working scene are converted into the corresponding three-dimensional point cloud data P_i.
Step 2.2, stitch the three-dimensional point cloud data P_i acquired by the different depth cameras to complete a dense map of the scene, filter it with a voxel-grid downsampling algorithm, and denoise it with Gaussian filtering to obtain the environment state point cloud C.
Step 2.3, acquire the robot end pose X, the joint angles θ = (θ_1, ..., θ_n) and the gripper feedback information d, and fuse them with the environment point cloud C to form the environment state S.
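As an illustration of steps 2.1 to 2.3, the sketch below uses the open-source Open3D library to back-project, stitch and downsample the depth images and bundle them with the robot state; the camera parameter layout, voxel size and the statistical outlier filter (used here in place of the Gaussian filtering mentioned above) are assumptions, not values from the patent.

```python
# Hedged sketch of steps 2.1-2.3 with Open3D. All parameters are illustrative.
import numpy as np
import open3d as o3d

def fuse_environment_state(depth_images, intrinsics, extrinsics,
                           end_pose, joint_angles, gripper_feedback,
                           voxel_size=0.005):
    """Back-project each depth image, stitch and clean the cloud, then bundle
    it with the robot state into the environment state S."""
    merged = o3d.geometry.PointCloud()
    for depth, K, P in zip(depth_images, intrinsics, extrinsics):
        h, w = depth.shape
        intrinsic = o3d.camera.PinholeCameraIntrinsic(
            w, h, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
        cloud = o3d.geometry.PointCloud.create_from_depth_image(
            o3d.geometry.Image(depth.astype(np.uint16)), intrinsic, extrinsic=P)
        merged += cloud                                  # stitch into one dense map
    merged = merged.voxel_down_sample(voxel_size)        # voxel-grid downsampling
    merged, _ = merged.remove_statistical_outlier(       # stand-in for Gaussian denoising
        nb_neighbors=20, std_ratio=2.0)
    return {"point_cloud": np.asarray(merged.points),
            "end_pose": end_pose,
            "joint_angles": joint_angles,
            "gripper": gripper_feedback}
```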
Furthermore, the strategy and value network described in step S3 is based on the PointNet architecture. The strategy network consists of a point cloud input layer, two affine transformation modules and weight-sharing multilayer perceptrons: the input state point cloud C is first rotated by an affine transformation module, features are then extracted point by point by the multilayer perceptrons and aligned by a second affine transformation module before a max-pooling operation, the resulting vector is concatenated with the current pose of the mechanical arm and the gripper state information, and the motion command for the end of the mechanical arm and the estimate of the expected return are output through several fully connected layers.
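A simplified PyTorch sketch of such a PointNet-style policy head is given below; the two affine (T-Net) alignment modules are omitted for brevity, and the layer widths, the 14-dimensional robot state vector and the 7-dimensional action output are illustrative assumptions.

```python
# Simplified PointNet-style policy network sketch (PyTorch). Illustrative only:
# the affine alignment modules are omitted and all dimensions are assumptions.
import torch
import torch.nn as nn

class PointNetPolicy(nn.Module):
    def __init__(self, state_dim=14, action_dim=7):
        super().__init__()
        # Weight-shared per-point MLP, implemented as 1x1 convolutions.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        # Fully connected head after concatenating the pooled point feature with
        # the arm pose / joint angle / gripper state vector.
        self.head = nn.Sequential(
            nn.Linear(1024 + state_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU())
        self.action_out = nn.Linear(256, action_dim)  # end-effector motion command
        self.value_out = nn.Linear(256, 1)            # estimate of the expected return

    def forward(self, points, robot_state):
        # points: (B, N, 3) state point cloud C; robot_state: (B, state_dim)
        feat = self.point_mlp(points.transpose(1, 2))   # (B, 1024, N)
        feat = torch.max(feat, dim=2).values            # symmetric max pooling
        h = self.head(torch.cat([feat, robot_state], dim=1))
        return torch.tanh(self.action_out(h)), self.value_out(h)
```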
Further, the generation of the manipulation waypoints in step S4 is specifically as follows:
First, a human teaches n grasping trajectories {ξ_j}, j = 1, ..., n. Each data point contains the time step t and the instantaneous three-dimensional Cartesian pose of the robot gripper, ξ = [t, X^T]^T, together with the reference frame of the object to be grasped, P = {A_p, b_p}. A Gaussian mixture model μ_0 is trained to obtain the mapping μ_0: t → X^T from the time step t under the current object reference frame P to the robot end pose X, and the manipulation path is generated according to the reference frame P corresponding to the manipulation pose X and the waypoint parameter ω, where A_p and b_p are the 3×3 rotation matrix and the 3×1 translation vector of the grasped-object reference frame relative to the robot base coordinate system, and {P_p} denotes the set of p object reference frames.
Further, the TD3-based deep reinforcement learning grasping strategy in step S5 is specifically as follows: a policy network μ(·|θ^μ) is used to fit the grasping strategy μ: S → A, and two value networks Q_1(·|θ^{Q1}) and Q_2(·|θ^{Q2}) are used to fit the action value function Q_μ(s_t, a_t); each of these networks also has a target network μ′(·|θ^{μ′}), Q′_1(·|θ^{Q1′}), Q′_2(·|θ^{Q2′}) to alleviate the bootstrapping problem, where μ(·|θ^μ) denotes the policy network with parameters θ^μ.
Further, the learning process is divided into two stages: a simulation stage and an actual manipulation stage. In the simulation stage, a manipulation simulation platform is built in robot simulation software and the grasping model is trained with the training process described above. In the actual manipulation stage, because the simulation environment differs somewhat from the real environment, the domain migration module migrates the model learned in the simulation stage to the real environment using a GAN network.
The invention also provides a TD3-based mechanical arm 6-DOF object manipulation training system, which comprises a cooperative mechanical arm, a six-axis force sensor fixed on the flange at the end of the cooperative mechanical arm, and an electric gripper at the end; a first camera and a second camera capable of acquiring three-dimensional information are mounted on both sides of the workbench, with an industrial CCD camera in the middle. In the manipulation task, the cooperative mechanical arm manipulates an object from a starting point area and conveys it to an end point area according to the information acquired by the six-axis force sensor and the feedback information of the robot control box. The system also comprises a data acquisition and preprocessing module, a manipulation pose evaluation module, a path optimization module and a training module.
Further, the mechanical arm data acquisition and preprocessing module is used for preprocessing point clouds acquired by a plurality of cameras positioned on two sides of the workbench and fusing stress data acquired by a mechanical arm terminal force sensor and joint angles of the mechanical arm to obtain an environment state S;
the manipulation pose evaluation module outputs a proper manipulation pose X and a proper waypoint parameter omega by utilizing a strategy and value network based on a PointNet framework according to the environmental state S obtained by the data acquisition and preprocessing module, and estimates an expected return;
the path optimization module generates an operation path point by utilizing a task parameterized Gaussian mixture model according to the operation pose X and the path point parameter omega output by the operation pose evaluation module;
the training module controls the mechanical arm to operate the target object in a proper direction and track according to the operation waypoint generated by the path optimization module, stores the obtained data in interaction and updates the network, so that the neural network model is trained to operate objects in different shapes in a proper direction according to the input environment state. Compared with the prior art, the method has the following beneficial effects:
1) The method adopts various types of information as the states in the manipulation process, simultaneously considers the influence of the pose and the moving path on the manipulation success rate of the object, and effectively improves the manipulation quality.
2) Compared with traditional data-driven object manipulation algorithms (imitation learning, deep learning, etc.), the method adopts a reinforcement learning algorithm based on TD3 and the PointNet architecture, can learn different manipulation modes by adjusting the reward function, and is a self-supervised algorithm that does not require manual collection of labeled data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a diagram of hardware components;
FIG. 2 is an overall structural view;
FIG. 3 is a schematic overall flow diagram;
FIG. 4 is a model diagram of an algorithm Markov decision process;
FIG. 5 is a flow diagram of a data acquisition and preprocessing module;
FIG. 6 is a schematic diagram of a maneuver evaluation module;
FIG. 7 is a block diagram of a path optimization module based on a task parameterized Gaussian mixture model;
FIG. 8 is a block diagram of a training module of the TD3 based robot 6-DOF object manipulation training system.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The present invention will be described in detail below with reference to the attached drawings, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without any inventive work belong to the protection scope of the present invention.
Example (b): a TD3-based mechanical arm object manipulation simulation training method. As shown in FIG. 1, the hardware of the adopted mechanical arm simulation experiment platform comprises a cooperative mechanical arm 1, a six-axis force sensor 2 fixed on the flange at the end of the cooperative mechanical arm, and an electric gripper 3 at the end; cameras 4 and 5 capable of acquiring three-dimensional information are mounted on both sides of the workbench, with an industrial CCD camera 6 in the middle. In the manipulation task, the cooperative mechanical arm 1 manipulates an object from area 7 and conveys it to area 8 based on the information obtained from the six-axis force sensor 2 and the feedback information of the robot control box.
The overall system of the TD 3-based robot arm 6-DOF object manipulation training system is shown in FIG. 2 and comprises a data acquisition and preprocessing module, a manipulation pose evaluation module, a path optimization module and a training module.
As shown in fig. 3, the TD 3-based robot arm 6-DOF object manipulation training system comprises the following steps:
step S1, establishing a Markov decision process model of a manipulator task, and determining an environmental state, an action and a return in the manipulation process.
And S2, preprocessing the point cloud acquired by the cameras on the two sides of the workbench by the data acquisition and preprocessing module of the mechanical arm, and fusing the stress data acquired by the force sensor at the tail end of the mechanical arm and the joint angle of the mechanical arm to obtain an environment state S.
And S3, outputting a proper control pose X and a proper waypoint parameter omega by the control pose evaluation module according to the environmental state S obtained by the data acquisition and preprocessing module by utilizing a strategy and value network based on a PointNet framework, and estimating an expected return.
And S4, the path optimization module utilizes the task parameterization Gaussian mixture model to generate control road points according to the control pose X and the road point parameter omega output by the control pose evaluation module.
And S5, the training module controls the mechanical arm to manipulate the target object with a suitable pose and trajectory according to the manipulation waypoints generated by the path optimization module, stores the data obtained during interaction and updates the networks, thereby training the neural network model to manipulate objects of different shapes with suitable poses according to the input environment state; an illustrative end-to-end sketch of this pipeline is given below.
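The following minimal Python sketch shows how steps S1 to S5 could be chained within one training episode; the module interfaces (get_state, act, generate, execute, store, update) are hypothetical placeholders, not identifiers taken from the patent.

```python
# Illustrative end-to-end sketch of one training episode following steps S1-S5.
# All module interfaces below are hypothetical placeholders.
def run_episode(env, sensing, pose_evaluator, path_optimizer, trainer):
    s = sensing.get_state(env)                         # S2: fused point cloud + force + joint state
    done = False
    while not done:
        X, omega = pose_evaluator.act(s)               # S3: manipulation pose X, waypoint parameter omega
        waypoints = path_optimizer.generate(X, omega)  # S4: task-parameterized GMM waypoints
        s_next, reward, done = env.execute(waypoints)  # S5: execute the manipulation attempt
        trainer.store(s, (X, omega), reward, s_next, done)
        trainer.update()                               # TD3 policy / value network update
        s = s_next
```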
In the present embodiment, the Markov decision process is used to model the manipulation process in step S1; its overall architecture is shown in fig. 4, and the steps are as follows:
Step S11, assuming the target manipulation process satisfies the Markov property, equation (1) is used to represent the Markov decision process:
M=(S,A,P,R,γ) (1)
In equation (1), S represents the set of all possible states in the environment, which in this manipulation model includes the point cloud C, the end pose X of the mechanical arm, the joint angles θ = (θ_1, θ_2, ..., θ_n) and the feedback information d of the gripper; the information fed back by the gripper is whether the jaws are closed: after a grasping action, if the jaws are not fully closed the grasp is considered successful, while full closure indicates that the grasp has failed;
a represents the set of all possible actions performed by the agent, where the behavior in this manipulation model is the estimated manipulation pose x and waypoint parameters ω of the actuator (in this embodiment the waypoint parameters are the manipulation intermediate poses); in this embodiment, in order to prevent the robot arm from colliding with the platform during the movement process, thereby causing damage, the movement range of the robot arm is limited.
P is the state transition probability, i.e. the probability of reaching state S' after taking action A in state S; in this manipulation model, S' is represented by the environmental state parameters observed after taking action A;
R is the reward; the invention adopts a sparse reward function: after executing manipulation action A in state s, if the manipulation task succeeds a reward is given according to the manipulation pose and stability, and if the task is not completed the reward is 0. The reward simultaneously takes into account the manipulability index of the mechanical arm and the manipulation stability index, so that the mechanical arm manipulates the object with a better pose and higher manipulation quality.
γ is the discount factor, γ ∈ [0,1].
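The exact reward formula is not specified in the text; the short sketch below only illustrates the sparse structure described here, with the manipulability/stability weighting entirely assumed.

```python
# Illustrative sparse reward: a pose/stability-shaped reward on success, 0 otherwise.
# The weights and the two score functions are assumptions, not values from the patent.
def sparse_reward(success, manipulability_score, stability_score,
                  w_manip=0.5, w_stab=0.5):
    if not success:
        return 0.0
    return w_manip * manipulability_score + w_stab * stability_score
```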
At step S12, the robotic arm and the environment interact over a series of discrete time steps (t = 0, 1, 2, 3, ...), represented by a trajectory:
τ = (S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, ...)    (2)
Step S13, the task of the Markov decision process is to find a control strategy μ: S → A that maximizes the expected manipulation return G_t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}    (3)
The action value function Q_μ(s_t, a_t) is used to describe the expected manipulation return under strategy μ:
Q_μ(s_t, a_t) = E_μ[G_t | S_t = s_t, A_t = a_t]    (4)
Step S14, to find the optimal strategy μ*, the optimal action value function Q*(s_t, a_t) must be found, and the optimal strategy μ* is then obtained by solving the optimal action value function; processing equation (4) with the Bellman equation gives:
Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, μ(s_{t+1}))]    (5)
In equation (5), s_{t+1} ~ P is the observation obtained from the environment at time t+1, μ(s_{t+1}) is the action a_{t+1} mapped from state s_{t+1} under strategy μ at time t+1, r(s_t, a_t) is the reward for taking action a_t in state s_t, and E(·) denotes the mathematical expectation of the return G_t under strategy μ.
In an embodiment, the data acquisition and preprocessing module comprises a depth camera, a mechanical arm control cabinet and a six-axis force sensor which are positioned on two sides of a workbench in the simulation. The process of obtaining the environmental status through the data collection and pre-processing module is shown in fig. 5:
Step S21, obtain the focal length (f_x, f_y) and optical center (c_x, c_y) of camera C_i from the intrinsic parameters provided by the camera manufacturer, and calibrate camera C_i against the base coordinate system of the mechanical arm with Zhang Zhengyou's calibration method to obtain the extrinsic matrix P relative to the base coordinate system. For any pixel (u, v, 1)^T in a depth image I_i acquired by camera C_i, there is a mapping to the position (x_c, y_c, z_c)^T of that point relative to the robot base coordinate system:
z_c · (u, v, 1)^T = K · P · (x_c, y_c, z_c, 1)^T,  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]    (6)
In equation (6), K is the camera intrinsic matrix and P is the extrinsic matrix. With equation (6), the frame images I_i acquired by the N cameras C_i (i ∈ N) in the working scene are converted into the corresponding three-dimensional point clouds P_i.
Step S22, the point cloud data P_i acquired by the different depth cameras are stitched to complete a dense map of the scene, filtered with a voxel-grid downsampling algorithm, and denoised with Gaussian filtering to obtain the state point cloud C; here the depth cameras comprise the first camera 4, the second camera 5 and the intermediate industrial CCD camera 6.
Step S23, the robot end pose X, the joint angles θ = (θ_1, ..., θ_n) and the gripper feedback information d are acquired and fused with the point cloud C as the environment state S, where n denotes the number of robot joints and θ_i denotes the angle of the i-th joint.
In the embodiment, the depth cameras and the force sensor are connected to the computer through serial ports; the mechanical arm control cabinet is connected to the computer through TCP/IP communication, the control cabinet acting as a server that sends signals such as joint angles and end pose to the computer acting as the client at a fixed frequency. To facilitate management and information interaction, the different sensors and the mechanical arm can be created and controlled as nodes through ROS.
In this example, the purpose of voxel downsampling the stitched point cloud is to obtain more uniform data, because point clouds collected from different viewing angles overlap after stitching and uneven point density affects the learning of the subsequent network parameters. Gaussian filtering is then applied to the point cloud to remove outliers, and the environment state parameters are formed by combining the robot joint angles, end pose and forces with the gripper feedback information.
In this embodiment, the manipulation evaluation module in step S3 implements a manipulation strategy based on the TD3 algorithm, which includes a policy network μ(·|θ^μ) to fit the manipulation strategy μ: S → A, and two value networks Q_1(·|θ^{Q1}) and Q_2(·|θ^{Q2}) to fit the action value function Q_μ(s_t, a_t); each of these networks also has a target network μ′(·|θ^{μ′}), Q′_1(·|θ^{Q1′}), Q′_2(·|θ^{Q2′}) to alleviate the bootstrapping problem.
Further, the network structure of the policy and value network described in step S3 is shown in fig. 6, and the structure is based on the PointNet architecture. The policy network comprises a point cloud input layer, two affine transformation modules and weight-sharing multilayer perceptrons: the input state point cloud C is rotated by one affine transformation module, features are then extracted point by point by the multilayer perceptrons, and the features are aligned by the second affine transformation module before a max-pooling operation. The resulting vector is concatenated with information such as the current pose of the mechanical arm and the gripper state, and after several fully connected layers the motion command for the end of the mechanical arm and the estimate of the expected return are output.
In this embodiment, the path optimization module in step S4 uses a designed imitation learning algorithm based on a task-parameterized Gaussian mixture model to generate manipulation waypoints according to the manipulation pose X and waypoint parameter ω acquired in step S3, and controls the mechanical arm to perform the manipulation task along the specified waypoints. Compared with other waypoint generation algorithms, its advantage is that it can learn, from human teaching, a motion pattern closer to human manipulation habits, and different waypoints can be flexibly optimized by changing the waypoint parameter ω and the manipulation pose X. The implementation flow is shown in fig. 7:
Step S41, first, a human teaches n manipulation trajectories {ξ_j}, j = 1, ..., n. Each data point contains a time step t and the instantaneous three-dimensional Cartesian pose of the manipulator gripper, ξ = [t, X^T]^T. The task parameters P can be expressed as the reference frames corresponding to the manipulation pose X and the intermediate pose ω, P = {A_p, b_p}, where A_p is the rotation matrix of reference frame p relative to the robot base frame and b_p is the translation vector of reference frame p relative to the robot base frame.
Step S42, the collected manipulation trajectory data are decoded into the reference frame P corresponding to each trajectory to obtain the decoded trajectories ξ^(p), with the time step left unchanged and the pose component transformed as:
X^(p) = A_p^(-1) (X - b_p)    (7), (8)
Step S43, a Gaussian mixture model μ_0 is learned with the EM algorithm, fitting the decoded mapping μ_0: t → X′^T from the time step t to the Cartesian pose of the mechanical arm with several normal distributions; the expression of the Gaussian mixture model is as follows:
ξ^(p) ~ Σ_{m=1}^{M} π_m N(μ_m^(p), Σ_m^(p))    (9)
where N(μ_m^(p), Σ_m^(p)) represents the m-th normal distribution relative to reference frame p, and μ_m^(p) and Σ_m^(p) denote the mean and covariance of that normal distribution.
Step S44, the learned Gaussian mixture model is transformed according to the manipulation pose X and waypoint parameter ω output by the manipulation evaluation module, obtaining the mapping μ_p: t → X^T from the time step t to the Cartesian pose of the mechanical arm under the reference frame P corresponding to the manipulation pose X and waypoint parameter ω; the transformation can be represented as:
μ_m^(p)' = A_p μ_m^(p) + b_p,  Σ_m^(p)' = A_p Σ_m^(p) A_p^T    (10)
Σ_m = ( sum_p (Σ_m^(p)')^(-1) )^(-1),  μ_m = Σ_m · sum_p (Σ_m^(p)')^(-1) μ_m^(p)'    (11)
step S45, performing Gaussian mixture regression on the time step t by using the decoded Gaussian mixture model to obtain a regression waypoint
Figure BDA0004000838030000121
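A compact sketch of steps S42 to S45 follows: per-frame Gaussian components are fitted with scikit-learn, transformed by the task parameters (A_p, b_p) in the spirit of equations (10) and (11), and a waypoint is regressed by Gaussian mixture regression over the time step. The use of scikit-learn, the single-reference-frame simplification and all dimensions are illustrative assumptions.

```python
# Hedged sketch of steps S42-S45 (single reference frame, illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(xi, n_components=5):
    """xi: (T, 1 + D) data points [t, pose] decoded into one reference frame."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(xi)
    return gmm.weights_, gmm.means_, gmm.covariances_

def transform_components(means, covs, A, b):
    """Transform the pose block of each component by frame (A, b), as in eq. (10);
    the time dimension (index 0) is left unchanged."""
    means_t, covs_t = means.copy(), covs.copy()
    means_t[:, 1:] = means[:, 1:] @ A.T + b          # mu' = A mu + b
    covs_t[:, 1:, 1:] = A @ covs[:, 1:, 1:] @ A.T    # Sigma' = A Sigma A^T
    return means_t, covs_t

def gmr_waypoint(t, weights, means, covs):
    """Condition the joint GMM on time step t to regress a pose waypoint (step S45)."""
    h, mu_cond = [], []
    for k in range(len(weights)):
        mu_t, mu_x = means[k, 0], means[k, 1:]
        s_tt, s_xt = covs[k, 0, 0], covs[k, 1:, 0]
        h.append(weights[k] * np.exp(-0.5 * (t - mu_t) ** 2 / s_tt) / np.sqrt(2 * np.pi * s_tt))
        mu_cond.append(mu_x + s_xt / s_tt * (t - mu_t))
    h = np.asarray(h) / np.sum(h)
    return np.sum(h[:, None] * np.asarray(mu_cond), axis=0)
```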
In step S5 of this embodiment, the training module drives the mechanical arm to move along the optimized manipulation waypoints, performs a manipulation attempt when the arm reaches the final waypoint, records the transitions (s_t, a_t, r_t, s_{t+1}) of the trajectory in equation (2) acquired during this motion, and trains the policy network and value networks of the TD3-based manipulation reinforcement learning model by stochastic gradient descent, as follows:
Step S51, first, transitions (s, a, r, s′) are sampled from the trajectories τ acquired in interaction with the simulation environment, and the action at state s′ is computed with the target policy network μ′(·|θ^{μ′}):
a′ = μ′(s′|θ^{μ′})    (12)
Step S52, the target value y is calculated following the idea of the Bellman equation in equation (5):
y = r + γ min_{i=1,2} Q′_i(s′, a′|θ^{Qi′})    (13)
Step S53, a gradient descent algorithm is used to minimize the error between the value networks' estimates Q_i(s, a|θ^{Qi}) of the expected return and the target value y, so as to update the value network parameters θ^{Qi}:
L(θ^{Qi}) = E[ (y - Q_i(s, a|θ^{Qi}))^2 ]    (14)
Step S54, the policy μ(·|θ^μ) is used to compute the action a_new taken in state s, the value network is used to compute Q_1(s, a_new|θ^{Q1}), and this value is maximized with a gradient ascent algorithm, thereby updating the policy network parameters θ^μ.
Step S55, the target networks are updated in a soft manner: a learning rate τ is introduced, the old target network parameters and the new network parameters are weighted-averaged, and the result is assigned to the target networks:
θ^{Qi′} = τ θ^{Qi} + (1 - τ) θ^{Qi′}    (15)
θ^{μ′} = τ θ^μ + (1 - τ) θ^{μ′}    (16)
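Steps S51 to S55 correspond to a clipped double-Q TD3 update; a condensed PyTorch sketch is shown below. The replay-buffer interface, tensor shapes and hyperparameters are assumptions, and TD3's target-policy smoothing noise and delayed actor updates are omitted so that the code mirrors the steps exactly as written above.

```python
# Condensed PyTorch sketch of update steps S51-S55. Buffer interface, shapes and
# hyperparameters are assumptions, not values from the patent.
import torch
import torch.nn.functional as F

def td3_update(buffer, actor, actor_targ, critics, critic_targs,
               actor_opt, critic_opts, gamma=0.99, tau=0.005, batch_size=256):
    s, a, r, s_next = buffer.sample(batch_size)        # step S51: sample (s, a, r, s')

    with torch.no_grad():
        a_next = actor_targ(s_next)                     # equation (12)
        q_next = torch.min(critic_targs[0](s_next, a_next),
                           critic_targs[1](s_next, a_next))
        y = r + gamma * q_next                          # equation (13)

    for critic, opt in zip(critics, critic_opts):       # step S53: critic regression, eq. (14)
        loss = F.mse_loss(critic(s, a), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    a_new = actor(s)                                    # step S54: gradient ascent on Q_1
    actor_loss = -critics[0](s, a_new).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    for net, targ in zip(list(critics) + [actor], list(critic_targs) + [actor_targ]):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)   # step S55: soft update, eq. (15)-(16)
```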
the learning process of the 6-DOF object manipulation training system based on the TD3 algorithm is divided into two stages: the simulation phase and the actual manipulation phase, the implementation flow of which is shown in fig. 8:
in the simulation stage, a manipulation simulation platform is built in the mechanical arm simulation software, the arrangement form of the manipulation simulation platform is shown in figure 1, and the acquisition of images acquired by a visual sensor, the direction and the joint angle of the mechanical arm, and feedback data of the force sensor and the clamping jaw in the simulation environment is realized by coding through python. In a simulation environment, the manipulator estimates a manipulation direction X and a waypoint parameter omega by using the pose estimation module, generates a manipulator motion waypoint for manipulation attempt by using the path optimization module, collects a large amount of interaction data in the interaction process with the environment, and finishes the training of an object manipulation reinforcement learning model.
The simulation environment in this embodiment is built in the V-REP. One important reason for collecting data in a simulated environment is that the policy and value network of the TD3 algorithm requires a lot of interaction with the environment to achieve good results, whereas the time cost for interaction in a real environment is high, compared to the efficiency of collecting and learning data in a simulated environment.
In the actual manipulation stage, because the simulation environment differs somewhat from the real environment, a domain migration module is required to migrate the model learned in the simulation stage to the real environment. First, the model learned in the simulation environment is used to execute manipulation tasks in the real environment, and a small amount of real-environment training data is collected; then a GAN network is used to perform domain migration from the simulation data to the real-environment data, obtaining a large amount of generated data.
It should be noted that the above examples are only for illustrating the present invention and are not intended to limit the scope of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention without making creative efforts, fall within the protection scope of the invention.
The above is a specific embodiment of the technical solution of the present invention and the technical principle applied. Any modification, supplement or equivalent replacement by a person skilled in the art on the basis of this embodiment is within the scope of the invention as claimed in the claims.

Claims (10)

1. The TD3-based robot arm 6-DOF object manipulation training method is characterized by comprising the following steps:
s1, establishing a Markov decision process model of a manipulator task, and determining an environmental state, an action and a return in a manipulation process;
s2, preprocessing the point cloud acquired by the cameras on the two sides of the workbench by a mechanical arm data acquisition and preprocessing module, and fusing stress data acquired by a mechanical arm terminal force sensor and joint angles of the mechanical arm to obtain an environment state S;
s3, outputting a proper control pose X and a proper waypoint parameter omega by a control pose evaluation module according to the environmental state S obtained by the data acquisition and preprocessing module by utilizing a strategy and value network based on a PointNet framework, and estimating an expected return;
s4, the path optimization module utilizes a task parameterized Gaussian mixture model to generate operation path points according to the operation pose X and the path point parameter omega output by the operation pose evaluation module;
and S5, controlling the mechanical arm to operate the target object in a proper direction and track by the training module according to the operation waypoint generated by the path optimization module, storing the obtained data in interaction and updating the network, thereby training the operation of the neural network model on objects with different shapes in a proper direction according to the input environment state.
2. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the grabbing process is modeled by a Markov decision process in step S1, as follows:
Step 1.1, the target grabbing process is assumed to satisfy the Markov property, and the Markov decision process is represented by the following formula:
M=(S,A,P,R,γ)
where S represents the set of all possible states in the environment, and in the grasping model the state is the fusion of the three-dimensional point cloud collected by the cameras and the end pose of the robot; A represents the set of all possible actions executed by the agent, and in the grasping model the action is the six-degree-of-freedom pose X of the end effector; P is the state transition probability, i.e. the probability of reaching state s' after taking action A in state s, where s' is represented in the grasping model by the environmental state parameters observed after taking action A; R is the reward, for which a sparse reward function is adopted: after executing grasping action A in state s, if the grasp succeeds a reward is given according to the grasping pose and stability, and if the grasp is not completed the reward is 0; γ is the discount factor, γ ∈ [0,1];
Step 1.2, the robot and the environment interact over a series of discrete time steps (t = 0, 1, 2, 3, ...), represented by a trajectory:
τ = (S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, ..., S_t, A_t, R_t)
where S_t, A_t and R_t respectively represent the state, action and reward of the Markov decision process at time t;
Step 1.3, in the grasping reinforcement learning task, find a control strategy μ: S → A that maximizes the expected grasping return G_t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}
The action value function Q_μ(s_t, a_t) is used to describe the expected grasping return under strategy μ:
Q_μ(s_t, a_t) = E_μ[G_t | S_t = s_t, A_t = a_t]
where Q_μ(s_t, a_t) represents the action value function under strategy μ, and E_μ[G_t | S_t = s_t, A_t = a_t] represents the mathematical expectation of the return G_t under strategy μ when action a_t is taken in state s_t;
Step 1.4, to find the optimal strategy μ*, the optimal action value function Q*(s_t, a_t) must be found, and the optimal strategy μ* is then obtained by solving the optimal action value function, which is handled with the Bellman equation:
Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, μ(s_{t+1}))]
where s_{t+1} ~ P is the observation obtained from the environment at time t+1, μ(s_{t+1}) is the action a_{t+1} mapped from state s_{t+1} under strategy μ at time t+1, r(s_t, a_t) is the reward for taking action a_t in state s_t, and E(·) denotes the mathematical expectation of the return G_t under strategy μ.
3. The TD3-based robot arm 6-DOF object manipulation training method according to claim 2, wherein the set S of all possible states in the environment comprises the environment state point cloud C, the robot end pose X, the joint angles θ = (θ_1, θ_2, ..., θ_n) and the gripper feedback information d, where n denotes the number of robot joints and θ_i denotes the angle of the i-th joint of the robot.
4. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein in step S2 the data acquisition and preprocessing module is adopted to stitch, denoise and filter the point clouds acquired by the plurality of depth cameras and to fuse the multi-sensor information to obtain the environment state parameter S, as follows:
Step 2.1, obtain the focal length (f_x, f_y) and optical center (c_x, c_y) of camera C_i from the intrinsic parameters provided by the camera manufacturer, and calibrate camera C_i against the robot base coordinate system with Zhang Zhengyou's calibration method to obtain the extrinsic matrix P relative to the robot base coordinate system; for any pixel (u, v, 1)^T in a depth image I_i acquired by camera C_i, there is a mapping to the position (x_c, y_c, z_c)^T of that point relative to the robot base coordinate system:
z_c · (u, v, 1)^T = K · P · (x_c, y_c, z_c, 1)^T,  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is the camera intrinsic matrix and P is the extrinsic matrix; with this formula, the frame images I_i acquired by the N cameras C_i (i ∈ N) in the working scene are converted into the corresponding three-dimensional point cloud data P_i;
Step 2.2, the three-dimensional point cloud data P_i acquired by the different depth cameras are stitched to complete a dense map of the scene, filtered with a voxel-grid downsampling algorithm, and denoised with Gaussian filtering to obtain the environment state point cloud C;
Step 2.3, the robot end pose X, the joint angles θ = (θ_1, ..., θ_n) and the gripper feedback information d are acquired and fused with the environment point cloud C as the environment state S.
5. The TD3-based manipulator 6-DOF object manipulation training method according to claim 1, wherein the strategy and value network in step S3 is based on the PointNet architecture, and the strategy network comprises a point cloud input layer, two affine transformation modules and weight-sharing multilayer perceptrons, wherein the input state point cloud C is rotated by one affine transformation module, features are then extracted by the multilayer perceptrons and aligned by the second affine transformation module before a max-pooling operation, the resulting vector is concatenated with the current pose of the manipulator and the gripper state information, and the motion command for the end of the manipulator and the estimate of the expected return are output through several fully connected layers.
6. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the manipulation waypoint generation in step S4 is specifically:
first, a human teaches n grasping trajectories {ξ_j}, j = 1, ..., n, each data point containing the time step t and the instantaneous three-dimensional Cartesian pose of the robot gripper, ξ = [t, X^T]^T, together with the reference frame of the object to be grasped, P = {A_p, b_p}; a Gaussian mixture model μ_0 is trained to obtain the mapping μ_0: t → X^T from the time step t under the current object reference frame P to the robot end pose X, and the manipulation path is generated according to the reference frame P corresponding to the manipulation pose X and the waypoint parameter ω, where A_p and b_p are the 3×3 rotation matrix and the 3×1 translation vector of the grasped-object reference frame relative to the robot base coordinate system, and {P_p} denotes the set of p object reference frames.
7. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the TD3-based deep reinforcement learning grasping strategy of step S5 specifically comprises: a policy network μ(·|θ^μ) is used to fit the grasping strategy μ: S → A, and two value networks Q_1(·|θ^{Q1}) and Q_2(·|θ^{Q2}) are used to fit the action value function Q_μ(s_t, a_t); each of these networks also has a target network μ′(·|θ^{μ′}), Q′_1(·|θ^{Q1′}), Q′_2(·|θ^{Q2′}) to alleviate the bootstrapping problem, where μ(·|θ^μ) denotes the policy network with parameters θ^μ.
8. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the learning process is divided into two stages: a simulation stage and an actual manipulation stage; in the simulation stage, a manipulation simulation platform is built in robot simulation software and the grasping model is trained with the training process of claim 1; in the actual manipulation stage, because the simulation environment differs from the real environment, the domain migration module migrates the model learned in the simulation stage to the real environment using a GAN network.
9. The mechanical arm 6-DOF object manipulation training system based on TD3 is characterized by comprising a cooperative mechanical arm (1), a six-axis force sensor (2) fixed on a flange plate at the tail end of the cooperative mechanical arm and a power-driven clamping jaw (3) at the tail end of the cooperative mechanical arm, wherein a first camera (4), a second camera (5) and an industrial CCD camera (6) in the middle are carried on two sides of a workbench, the first camera and the second camera can acquire three-dimensional information, and the cooperative mechanical arm (1) manipulates an object from a starting point area (7) and carries the object to an end point area (8) according to information acquired by the six-axis force sensor (2) and feedback information of a robot control box in a manipulation task; the system also comprises a data acquisition and preprocessing module, an operation pose evaluation module, a path optimization module and a training module.
10. The TD3 based robotic 6-DOF object manipulation training system of claim 9, wherein: the mechanical arm data acquisition and preprocessing module is used for preprocessing point clouds acquired by a plurality of cameras positioned on two sides of the workbench and fusing stress data acquired by a mechanical arm terminal force sensor and joint angles of the mechanical arm to obtain an environment state S;
the manipulation pose evaluation module outputs a proper manipulation pose X and a proper waypoint parameter omega by utilizing a strategy and value network based on a PointNet framework according to the environmental state S obtained by the data acquisition and preprocessing module, and estimates an expected return;
the path optimization module utilizes a task parameterization Gaussian mixture model to generate control path points according to the control pose X and the path point parameter omega output by the control pose evaluation module;
the training module controls the mechanical arm to operate the target object in a proper direction and track according to the operation waypoint generated by the path optimization module, stores the obtained data in interaction and updates the network, so that the neural network model is trained to operate objects in different shapes in a proper direction according to the input environment state.
CN202211612997.7A 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system Pending CN115890744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211612997.7A CN115890744A (en) 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211612997.7A CN115890744A (en) 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system

Publications (1)

Publication Number Publication Date
CN115890744A true CN115890744A (en) 2023-04-04

Family

ID=86494711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211612997.7A Pending CN115890744A (en) 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system

Country Status (1)

Country Link
CN (1) CN115890744A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117313546A (en) * 2023-10-26 2023-12-29 北京大学 Trusted smart hand system simulation method and simulation system


Similar Documents

Publication Publication Date Title
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN111251294A (en) Robot grabbing method based on visual pose perception and deep reinforcement learning
CN111421554B (en) Mechanical arm intelligent control system, method and device based on edge calculation
JPWO2003019475A1 (en) Robot device, face recognition method, and face recognition device
CN111331607B (en) Automatic grabbing and stacking method and system based on mechanical arm
CN114912287A (en) Robot autonomous grabbing simulation system and method based on target 6D pose estimation
CN113172629A (en) Object grabbing method based on time sequence tactile data processing
Tang et al. Learning collaborative pushing and grasping policies in dense clutter
CN112734823B (en) Image-based visual servo jacobian matrix depth estimation method
CN115890744A (en) TD 3-based manipulator 6-DOF object manipulation training method and system
CN115464659A (en) Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
Iqbal et al. Toward sim-to-real directional semantic grasping
CN116276998A (en) Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
CN116460843A (en) Multi-robot collaborative grabbing method and system based on meta heuristic algorithm
Eppner et al. Imitation learning with generalized task descriptions
CN113681552B (en) Five-dimensional grabbing method for robot hybrid object based on cascade neural network
CN117001675B (en) Double-arm cooperative control non-cooperative target obstacle avoidance trajectory planning method
Zhao et al. Real-time object detection and robotic manipulation for agriculture using a yolo-based learning approach
Leite et al. Adaptive 3D visual servoing without image velocity measurement for uncertain manipulators
CN109542094B (en) Mobile robot vision stabilization control without desired images
CN115194774A (en) Binocular vision-based control method for double-mechanical-arm gripping system
CN114998573A (en) Grabbing pose detection method based on RGB-D feature depth fusion
CN112857373B (en) Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
CN114888768A (en) Mobile duplex robot cooperative grabbing system and method based on multi-sensor fusion
Su et al. Nonlinear visual mapping model for 3-D visual tracking with uncalibrated eye-in-hand robotic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination