CN115890744A - TD3-based manipulator 6-DOF object manipulation training method and system - Google Patents

TD3-based manipulator 6-DOF object manipulation training method and system

Info

Publication number
CN115890744A
Authority
CN
China
Prior art keywords
robot
state
strategy
pose
mechanical arm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211612997.7A
Other languages
Chinese (zh)
Inventor
周勇 (Zhou Yong)
贺辉腾 (He Huiteng)
李卫东 (Li Weidong)
胡楷雄 (Hu Kaixiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology WUT filed Critical Wuhan University of Technology WUT
Priority to CN202211612997.7A priority Critical patent/CN115890744A/en
Publication of CN115890744A publication Critical patent/CN115890744A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Manipulator (AREA)

Abstract

The invention relates to a TD3-based mechanical arm 6-DOF object manipulation training method and system, which comprises the following steps: establishing a Markov decision process model of robot grasping, and determining the state, action and return of the grasping process; obtaining depth information and force feedback in the scene; collecting data in a simulation environment based on a designed imitation learning algorithm, and pre-training a deep reinforcement learning strategy and value network based on the TD3 algorithm; the pre-trained deep reinforcement learning strategy network outputs a robot motion strategy according to the acquired environment state parameters, the mechanical arm executes grasping actions according to the motion strategy, and the strategy is continuously updated during interaction, so that target objects of different shapes are grasped with more suitable poses and trajectories. The invention effectively avoids the loss of environmental information caused by occlusion of the object by the mechanical arm and other parts at a single viewing angle, and can select suitable grasping postures for objects of different sizes, shapes and placement poses.

Description

TD3-based manipulator 6-DOF object manipulation training method and system
Technical Field
The invention relates to a model training system for object manipulation with a collaborative mechanical arm, and in particular to a training system for generating a model for six-degree-of-freedom manipulation and path generation by a mechanical arm in a complex environment.
Background
Since the proposal of "Industry 4.0", modern manufacturing has gradually shifted from mass production to mass customization, which increases product variety and shortens product life cycles. Collaborative robots, characterized by light weight, small size and safety, are widely applied to tasks such as assembly, disassembly, light machining (3C, automotive, packaging, etc.) and inspection. These tasks place higher demands on the flexibility and intelligence of the mechanical arm, and flexible manipulation of objects by the mechanical arm is an indispensable part of realizing robot intelligence.
Autonomous grasping requires that the mechanical arm select a grasping strategy on its own according to the environment in which it is located. To achieve this goal, the mechanical arm needs to interact with the environment and make decisions and manipulate objects according to the information acquired by its sensors during that interaction, so autonomous grasping is a challenging problem involving sensing, planning and control. Most previous mechanical arm applications adopt machine vision: objects of relatively fixed shape and size without obvious occlusion are manipulated, the gripper keeps a posture perpendicular to the working plane, and the target object position is detected by extracting features from image information at a single viewing angle.
In the past, most manipulator applications have concentrated on the handling and sorting of industrial parts. When manipulating objects with more complex shapes, traditional manipulation methods clearly cannot meet the requirements, and neural network learning methods need to collect a large amount of data in the actual environment, which is costly. When facing complex manipulation tasks, the top-down manipulation mode limits the variety of manipulations to some extent, and when there is an obstruction above the object or the task imposes additional constraints, it becomes almost impossible to manipulate the object.
Traditional industrial part manipulation usually adopts manually designed features to detect the target object, and such geometry-driven methods perform poorly on household objects with complicated shapes. Strategies based on deep learning and machine learning, meanwhile, require a large amount of manually provided data or samples, and are difficult, time-consuming and inflexible to extend to new objects.
Disclosure of Invention
In view of the defects of existing methods, the invention aims to provide a TD3-based method and system for robot arm 6-DOF object manipulation training.
In order to solve the technical problem, the invention adopts the following technical scheme:
the TD 3-based robot arm 6-DOF object manipulation training method comprises the following steps of:
Step S1, establishing a Markov decision process model of the manipulator control task, and determining the environmental state, action and return of the control process;
Step S2, the mechanical arm data acquisition and preprocessing module preprocesses the point clouds acquired by the cameras on both sides of the workbench, and fuses the force data acquired by the force sensor at the end of the mechanical arm and the joint angles of the mechanical arm to obtain the environment state S;
Step S3, the manipulation pose evaluation module uses a strategy and value network based on the PointNet architecture to output a suitable manipulation pose X and waypoint parameter ω according to the environment state S obtained by the data acquisition and preprocessing module, and estimates the expected return;
Step S4, the path optimization module uses a task-parameterized Gaussian mixture model to generate manipulation waypoints according to the manipulation pose X and the waypoint parameter ω output by the manipulation pose evaluation module;
Step S5, the training module controls the mechanical arm to manipulate the target object with a suitable pose and trajectory according to the manipulation waypoints generated by the path optimization module, stores the data obtained during interaction and updates the networks, thereby training a neural network model to manipulate objects of different shapes with suitable poses according to the input environment state.
Further, in step S1 a Markov decision process is used to model the grabbing process, as follows:
Step 1.1, assuming the target grasping process satisfies the Markov property, the Markov decision process is represented by the following formula:
M=(S,A,P,R,γ)
where S represents the set of all possible states in the environment; in the grasping model the state is the fusion of the three-dimensional point cloud collected by the cameras and the end pose of the robot; A represents the set of all possible actions executed by the agent, and in the grasping model the action is the six-degree-of-freedom pose X of the end effector; P is the state transition probability, i.e. the probability of reaching state s' after taking action A in state s, where s' is represented in the grasping model by the environmental state parameters observed after taking action A; R is the reward, for which a sparse reward function is adopted: after executing grasping action A in state s, if the grasp succeeds a reward is given according to the grasping pose and stability, and if the grasp is not completed the reward is 0; γ is the discount factor, γ ∈ [0,1];
Step 1.2, the robot and the environment interact over a series of discrete time steps (t = 0, 1, 2, 3, ...), represented by a trajectory:
τ = (S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, ..., S_t, A_t, R_t)
where S_t, A_t and R_t respectively represent the state, action and reward of the Markov decision process at time t;
Step 1.3, in the grasping reinforcement learning task, find a control strategy μ: S → A that maximizes the expected grasping return G_t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}
The action value function Q_μ(s_t, a_t) is used to describe the expected grasping return under strategy μ:
Q_μ(s_t, a_t) = E_μ[G_t | S_t = s_t, A_t = a_t]
where Q_μ(s_t, a_t) represents the action value function under strategy μ, and E_μ[G_t | S_t = s_t, A_t = a_t] represents the mathematical expectation of the return G_t under strategy μ when action a_t is taken in state s_t;
Step 1.4, to find the optimal strategy μ*, the optimal action value function Q*(s_t, a_t) must be found, and the optimal strategy μ* is then obtained by solving the optimal action value function, which is handled with the Bellman equation:
Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, μ(s_{t+1}))]
where s_{t+1} ~ P is the observation obtained from the environment at time t+1, μ(s_{t+1}) is the action a_{t+1} mapped from state s_{t+1} under strategy μ at time t+1, r(s_t, a_t) is the reward for taking action a_t in state s_t, and E(·) denotes the mathematical expectation of the return G_t under strategy μ.
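To make the two formulas above concrete, the following minimal Python sketch computes a discounted return and a one-step Bellman target from recorded rewards; the function names and the finite-episode handling are illustrative assumptions, not part of the patent.

```python
# Illustrative only: discounted return G_t and a one-step Bellman target,
# mirroring the formulas above (finite-horizon episode assumed).
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k}, evaluated at t = 0 for a finite episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def bellman_target(r, s_next, mu, q_mu, gamma):
    """One-step target r(s_t, a_t) + gamma * Q_mu(s_{t+1}, mu(s_{t+1}))."""
    return r + gamma * q_mu(s_next, mu(s_next))
```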
Further, the set S of all possible states in the environment includes the environment state point cloud C, the robot end pose X, the joint angles θ = (θ_1, θ_2, ..., θ_n) and the gripper feedback information d, where n denotes the number of robot joints and θ_i denotes the angle of the i-th joint of the robot.
Further, in step S2 the data acquisition and preprocessing module is used to stitch, denoise and filter the point clouds acquired by the multiple depth cameras and to fuse the multi-sensor information to obtain the environment state parameter S, as follows:
Step 2.1, obtain the focal length (f_x, f_y) and optical center (c_x, c_y) of camera C_i from the intrinsic parameters provided by the camera manufacturer, and calibrate camera C_i against the robot base coordinate system with Zhang Zhengyou's calibration method to obtain the extrinsic matrix P relative to the robot base coordinate system. For any pixel (u, v, 1)^T in a depth image I_i acquired by camera C_i, there is a mapping to the position (x_c, y_c, z_c)^T of that point relative to the robot base coordinate system:
z_c · (u, v, 1)^T = K · P · (x_c, y_c, z_c, 1)^T,  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is the camera intrinsic matrix and P is the extrinsic matrix. With this formula, the frame images I_i acquired by the N cameras C_i (i ∈ N) in the working scene are converted into the corresponding three-dimensional point cloud data P_i.
Step 2.2, stitch the three-dimensional point cloud data P_i acquired by the different depth cameras to complete a dense map of the scene, filter it with a voxel-grid downsampling algorithm, and denoise it with Gaussian filtering to obtain the environment state point cloud C.
Step 2.3, acquire the robot end pose X, the joint angles θ = (θ_1, ..., θ_n) and the gripper feedback information d, and fuse them with the environment point cloud C to form the environment state S.
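As an illustration of steps 2.1 to 2.3, the sketch below uses the open-source Open3D library to back-project, stitch and downsample the depth images and bundle them with the robot state; the camera parameter layout, voxel size and the statistical outlier filter (used here in place of the Gaussian filtering mentioned above) are assumptions, not values from the patent.

```python
# Hedged sketch of steps 2.1-2.3 with Open3D. All parameters are illustrative.
import numpy as np
import open3d as o3d

def fuse_environment_state(depth_images, intrinsics, extrinsics,
                           end_pose, joint_angles, gripper_feedback,
                           voxel_size=0.005):
    """Back-project each depth image, stitch and clean the cloud, then bundle
    it with the robot state into the environment state S."""
    merged = o3d.geometry.PointCloud()
    for depth, K, P in zip(depth_images, intrinsics, extrinsics):
        h, w = depth.shape
        intrinsic = o3d.camera.PinholeCameraIntrinsic(
            w, h, K[0, 0], K[1, 1], K[0, 2], K[1, 2])
        cloud = o3d.geometry.PointCloud.create_from_depth_image(
            o3d.geometry.Image(depth.astype(np.uint16)), intrinsic, extrinsic=P)
        merged += cloud                                  # stitch into one dense map
    merged = merged.voxel_down_sample(voxel_size)        # voxel-grid downsampling
    merged, _ = merged.remove_statistical_outlier(       # stand-in for Gaussian denoising
        nb_neighbors=20, std_ratio=2.0)
    return {"point_cloud": np.asarray(merged.points),
            "end_pose": end_pose,
            "joint_angles": joint_angles,
            "gripper": gripper_feedback}
```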
Furthermore, the strategy and value network described in step S3 is based on the PointNet architecture. The strategy network consists of a point cloud input layer, two affine transformation modules and weight-sharing multilayer perceptrons: the input state point cloud C is first rotated by an affine transformation module, features are then extracted point by point by the multilayer perceptrons and aligned by a second affine transformation module before a max-pooling operation, the resulting vector is concatenated with the current pose of the mechanical arm and the gripper state information, and the motion command for the end of the mechanical arm and the estimate of the expected return are output through several fully connected layers.
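A simplified PyTorch sketch of such a PointNet-style policy head is given below; the two affine (T-Net) alignment modules are omitted for brevity, and the layer widths, the 14-dimensional robot state vector and the 7-dimensional action output are illustrative assumptions.

```python
# Simplified PointNet-style policy network sketch (PyTorch). Illustrative only:
# the affine alignment modules are omitted and all dimensions are assumptions.
import torch
import torch.nn as nn

class PointNetPolicy(nn.Module):
    def __init__(self, state_dim=14, action_dim=7):
        super().__init__()
        # Weight-shared per-point MLP, implemented as 1x1 convolutions.
        self.point_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU())
        # Fully connected head after concatenating the pooled point feature with
        # the arm pose / joint angle / gripper state vector.
        self.head = nn.Sequential(
            nn.Linear(1024 + state_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU())
        self.action_out = nn.Linear(256, action_dim)  # end-effector motion command
        self.value_out = nn.Linear(256, 1)            # estimate of the expected return

    def forward(self, points, robot_state):
        # points: (B, N, 3) state point cloud C; robot_state: (B, state_dim)
        feat = self.point_mlp(points.transpose(1, 2))   # (B, 1024, N)
        feat = torch.max(feat, dim=2).values            # symmetric max pooling
        h = self.head(torch.cat([feat, robot_state], dim=1))
        return torch.tanh(self.action_out(h)), self.value_out(h)
```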
Further, the generation of the manipulation waypoints in step S4 is specifically as follows:
First, a human teaches n grasping trajectories {ξ_j}, j = 1, ..., n. Each data point contains the time step t and the instantaneous three-dimensional Cartesian pose of the robot gripper, ξ = [t, X^T]^T, together with the reference frame of the object to be grasped, P = {A_p, b_p}. A Gaussian mixture model μ_0 is trained to obtain the mapping μ_0: t → X^T from the time step t under the current object reference frame P to the robot end pose X, and the manipulation path is generated according to the reference frame P corresponding to the manipulation pose X and the waypoint parameter ω, where A_p and b_p are the 3×3 rotation matrix and the 3×1 translation vector of the grasped-object reference frame relative to the robot base coordinate system, and {P_p} denotes the set of p object reference frames.
Further, the TD3-based deep reinforcement learning grasping strategy in step S5 is specifically as follows: a policy network μ(·|θ^μ) is used to fit the grasping strategy μ: S → A, and two value networks Q_1(·|θ^{Q1}) and Q_2(·|θ^{Q2}) are used to fit the action value function Q_μ(s_t, a_t); each of these networks also has a target network μ′(·|θ^{μ′}), Q′_1(·|θ^{Q1′}), Q′_2(·|θ^{Q2′}) to alleviate the bootstrapping problem, where μ(·|θ^μ) denotes the policy network with parameters θ^μ.
Further, the learning process is divided into two stages: a simulation stage and an actual manipulation stage. In the simulation stage, a manipulation simulation platform is built in robot simulation software and the grasping model is trained with the training process described above. In the actual manipulation stage, because the simulation environment differs somewhat from the real environment, the domain migration module migrates the model learned in the simulation stage to the real environment using a GAN network.
The invention also provides a TD3-based mechanical arm 6-DOF object manipulation training system, which comprises a cooperative mechanical arm, a six-axis force sensor fixed on the flange at the end of the cooperative mechanical arm, and an electric gripper at the end; a first camera and a second camera capable of acquiring three-dimensional information are mounted on both sides of the workbench, with an industrial CCD camera in the middle. In the manipulation task, the cooperative mechanical arm manipulates an object from a starting point area and conveys it to an end point area according to the information acquired by the six-axis force sensor and the feedback information of the robot control box. The system also comprises a data acquisition and preprocessing module, a manipulation pose evaluation module, a path optimization module and a training module.
Further, the mechanical arm data acquisition and preprocessing module is used for preprocessing point clouds acquired by a plurality of cameras positioned on two sides of the workbench and fusing stress data acquired by a mechanical arm terminal force sensor and joint angles of the mechanical arm to obtain an environment state S;
the manipulation pose evaluation module outputs a proper manipulation pose X and a proper waypoint parameter omega by utilizing a strategy and value network based on a PointNet framework according to the environmental state S obtained by the data acquisition and preprocessing module, and estimates an expected return;
the path optimization module generates an operation path point by utilizing a task parameterized Gaussian mixture model according to the operation pose X and the path point parameter omega output by the operation pose evaluation module;
the training module controls the mechanical arm to operate the target object in a proper direction and track according to the operation waypoint generated by the path optimization module, stores the obtained data in interaction and updates the network, so that the neural network model is trained to operate objects in different shapes in a proper direction according to the input environment state. Compared with the prior art, the method has the following beneficial effects:
1) The method adopts various types of information as the states in the manipulation process, simultaneously considers the influence of the pose and the moving path on the manipulation success rate of the object, and effectively improves the manipulation quality.
2) Compared with traditional data-driven object manipulation algorithms (imitation learning, deep learning, etc.), the method adopts a reinforcement learning algorithm based on TD3 and the PointNet architecture, can learn different manipulation modes by adjusting the reward function, and is a self-supervised algorithm that does not require manual collection of labeled data.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a diagram of hardware components;
FIG. 2 is an overall structural view;
FIG. 3 is a schematic overall flow diagram;
FIG. 4 is a model diagram of an algorithm Markov decision process;
FIG. 5 is a flow diagram of a data acquisition and preprocessing module;
FIG. 6 is a schematic diagram of a maneuver evaluation module;
FIG. 7 is a block diagram of a path optimization module based on a task parameterized Gaussian mixture model;
FIG. 8 is a block diagram of a training module of the TD3 based robot 6-DOF object manipulation training system.
Detailed Description
The technical solution of the present invention is described in detail below with reference to the accompanying drawings and examples.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
The present invention will be described in detail below with reference to the attached drawings, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without any inventive work belong to the protection scope of the present invention.
Example (b): a TD3-based mechanical arm object manipulation simulation training method. As shown in FIG. 1, the hardware of the adopted mechanical arm simulation experiment platform comprises a cooperative mechanical arm 1, a six-axis force sensor 2 fixed on the flange at the end of the cooperative mechanical arm, and an electric gripper 3 at the end; cameras 4 and 5 capable of acquiring three-dimensional information are mounted on both sides of the workbench, with an industrial CCD camera 6 in the middle. In the manipulation task, the cooperative mechanical arm 1 manipulates an object from area 7 and conveys it to area 8 based on the information obtained from the six-axis force sensor 2 and the feedback information of the robot control box.
The overall system of the TD 3-based robot arm 6-DOF object manipulation training system is shown in FIG. 2 and comprises a data acquisition and preprocessing module, a manipulation pose evaluation module, a path optimization module and a training module.
As shown in fig. 3, the TD 3-based robot arm 6-DOF object manipulation training system comprises the following steps:
step S1, establishing a Markov decision process model of a manipulator task, and determining an environmental state, an action and a return in the manipulation process.
And S2, preprocessing the point cloud acquired by the cameras on the two sides of the workbench by the data acquisition and preprocessing module of the mechanical arm, and fusing the stress data acquired by the force sensor at the tail end of the mechanical arm and the joint angle of the mechanical arm to obtain an environment state S.
And S3, outputting a proper control pose X and a proper waypoint parameter omega by the control pose evaluation module according to the environmental state S obtained by the data acquisition and preprocessing module by utilizing a strategy and value network based on a PointNet framework, and estimating an expected return.
And S4, the path optimization module utilizes the task parameterization Gaussian mixture model to generate control road points according to the control pose X and the road point parameter omega output by the control pose evaluation module.
And S5, the training module controls the mechanical arm to manipulate the target object with a suitable pose and trajectory according to the manipulation waypoints generated by the path optimization module, stores the data obtained during interaction and updates the networks, thereby training the neural network model to manipulate objects of different shapes with suitable poses according to the input environment state; an illustrative end-to-end sketch of this pipeline is given below.
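The following minimal Python sketch shows how steps S1 to S5 could be chained within one training episode; the module interfaces (get_state, act, generate, execute, store, update) are hypothetical placeholders, not identifiers taken from the patent.

```python
# Illustrative end-to-end sketch of one training episode following steps S1-S5.
# All module interfaces below are hypothetical placeholders.
def run_episode(env, sensing, pose_evaluator, path_optimizer, trainer):
    s = sensing.get_state(env)                         # S2: fused point cloud + force + joint state
    done = False
    while not done:
        X, omega = pose_evaluator.act(s)               # S3: manipulation pose X, waypoint parameter omega
        waypoints = path_optimizer.generate(X, omega)  # S4: task-parameterized GMM waypoints
        s_next, reward, done = env.execute(waypoints)  # S5: execute the manipulation attempt
        trainer.store(s, (X, omega), reward, s_next, done)
        trainer.update()                               # TD3 policy / value network update
        s = s_next
```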
In the present embodiment, the Markov decision process is used to model the manipulation process in step S1; its overall architecture is shown in fig. 4, and the steps are as follows:
Step S11, assuming the target manipulation process satisfies the Markov property, equation (1) is used to represent the Markov decision process:
M=(S,A,P,R,γ) (1)
In equation (1), S represents the set of all possible states in the environment, which in this manipulation model includes the point cloud C, the end pose X of the mechanical arm, the joint angles θ = (θ_1, θ_2, ..., θ_n) and the feedback information d of the gripper; the information fed back by the gripper is whether the jaws are closed: after a grasping action, if the jaws are not fully closed the grasp is considered successful, while full closure indicates that the grasp has failed;
a represents the set of all possible actions performed by the agent, where the behavior in this manipulation model is the estimated manipulation pose x and waypoint parameters ω of the actuator (in this embodiment the waypoint parameters are the manipulation intermediate poses); in this embodiment, in order to prevent the robot arm from colliding with the platform during the movement process, thereby causing damage, the movement range of the robot arm is limited.
P is the state transition probability, i.e. the probability of reaching state S' after taking action A in state S; in this manipulation model, S' is represented by the environmental state parameters observed after taking action A;
R is the reward; the invention adopts a sparse reward function: after executing manipulation action A in state s, if the manipulation task succeeds a reward is given according to the manipulation pose and stability, and if the task is not completed the reward is 0. The reward simultaneously takes into account the manipulability index of the mechanical arm and the manipulation stability index, so that the mechanical arm manipulates the object with a better pose and higher manipulation quality.
γ is the discount factor, γ ∈ [0,1].
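The exact reward formula is not specified in the text; the short sketch below only illustrates the sparse structure described here, with the manipulability/stability weighting entirely assumed.

```python
# Illustrative sparse reward: a pose/stability-shaped reward on success, 0 otherwise.
# The weights and the two score functions are assumptions, not values from the patent.
def sparse_reward(success, manipulability_score, stability_score,
                  w_manip=0.5, w_stab=0.5):
    if not success:
        return 0.0
    return w_manip * manipulability_score + w_stab * stability_score
```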
At step S12, the robotic arm and the environment interact over a series of discrete time steps (t = 0, 1, 2, 3, ...), represented by a trajectory:
τ = (S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, ...)    (2)
Step S13, the task of the Markov decision process is to find a control strategy μ: S → A that maximizes the expected manipulation return G_t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}    (3)
The action value function Q_μ(s_t, a_t) is used to describe the expected manipulation return under strategy μ:
Q_μ(s_t, a_t) = E_μ[G_t | S_t = s_t, A_t = a_t]    (4)
Step S14, to find the optimal strategy μ*, the optimal action value function Q*(s_t, a_t) must be found, and the optimal strategy μ* is then obtained by solving the optimal action value function; processing equation (4) with the Bellman equation gives:
Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, μ(s_{t+1}))]    (5)
In equation (5), s_{t+1} ~ P is the observation obtained from the environment at time t+1, μ(s_{t+1}) is the action a_{t+1} mapped from state s_{t+1} under strategy μ at time t+1, r(s_t, a_t) is the reward for taking action a_t in state s_t, and E(·) denotes the mathematical expectation of the return G_t under strategy μ.
In an embodiment, the data acquisition and preprocessing module comprises a depth camera, a mechanical arm control cabinet and a six-axis force sensor which are positioned on two sides of a workbench in the simulation. The process of obtaining the environmental status through the data collection and pre-processing module is shown in fig. 5:
Step S21, obtain the focal length (f_x, f_y) and optical center (c_x, c_y) of camera C_i from the intrinsic parameters provided by the camera manufacturer, and calibrate camera C_i against the base coordinate system of the mechanical arm with Zhang Zhengyou's calibration method to obtain the extrinsic matrix P relative to the base coordinate system. For any pixel (u, v, 1)^T in a depth image I_i acquired by camera C_i, there is a mapping to the position (x_c, y_c, z_c)^T of that point relative to the robot base coordinate system:
z_c · (u, v, 1)^T = K · P · (x_c, y_c, z_c, 1)^T,  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]    (6)
In equation (6), K is the camera intrinsic matrix and P is the extrinsic matrix. With equation (6), the frame images I_i acquired by the N cameras C_i (i ∈ N) in the working scene are converted into the corresponding three-dimensional point clouds P_i.
Step S22, the point cloud data P_i acquired by the different depth cameras are stitched to complete a dense map of the scene, filtered with a voxel-grid downsampling algorithm, and denoised with Gaussian filtering to obtain the state point cloud C; here the depth cameras comprise the first camera 4, the second camera 5 and the intermediate industrial CCD camera 6.
Step S23, the robot end pose X, the joint angles θ = (θ_1, ..., θ_n) and the gripper feedback information d are acquired and fused with the point cloud C as the environment state S, where n denotes the number of robot joints and θ_i denotes the angle of the i-th joint.
In the embodiment, the depth cameras and the force sensor are connected to the computer through serial ports; the mechanical arm control cabinet is connected to the computer through TCP/IP communication, the control cabinet acting as a server that sends signals such as joint angles and end pose to the computer acting as the client at a fixed frequency. To facilitate management and information interaction, the different sensors and the mechanical arm can be created and controlled as nodes through ROS.
In this example, the purpose of voxel downsampling the stitched point cloud is to obtain more uniform data, because point clouds collected from different viewing angles overlap after stitching and uneven point density affects the learning of the subsequent network parameters. Gaussian filtering is then applied to the point cloud to remove outliers, and the environment state parameters are formed by combining the robot joint angles, end pose and forces with the gripper feedback information.
In this embodiment, the manipulation evaluation module in step S3 implements a manipulation strategy based on the TD3 algorithm, which includes a policy network μ(·|θ^μ) to fit the manipulation strategy μ: S → A, and two value networks Q_1(·|θ^{Q1}) and Q_2(·|θ^{Q2}) to fit the action value function Q_μ(s_t, a_t); each of these networks also has a target network μ′(·|θ^{μ′}), Q′_1(·|θ^{Q1′}), Q′_2(·|θ^{Q2′}) to alleviate the bootstrapping problem.
Further, the network structure of the policy and value network described in step S3 is shown in fig. 6, and the structure is based on the PointNet architecture. The policy network comprises a point cloud input layer, two affine transformation modules and weight-sharing multilayer perceptrons: the input state point cloud C is rotated by one affine transformation module, features are then extracted point by point by the multilayer perceptrons, and the features are aligned by the second affine transformation module before a max-pooling operation. The resulting vector is concatenated with information such as the current pose of the mechanical arm and the gripper state, and after several fully connected layers the motion command for the end of the mechanical arm and the estimate of the expected return are output.
In this embodiment, the path optimization module in step S4 uses a designed imitation learning algorithm based on a task-parameterized Gaussian mixture model to generate manipulation waypoints according to the manipulation pose X and waypoint parameter ω acquired in step S3, and controls the mechanical arm to perform the manipulation task along the specified waypoints. Compared with other waypoint generation algorithms, its advantage is that it can learn, from human teaching, a motion pattern closer to human manipulation habits, and different waypoints can be flexibly optimized by changing the waypoint parameter ω and the manipulation pose X. The implementation flow is shown in fig. 7:
Step S41, first, a human teaches n manipulation trajectories {ξ_j}, j = 1, ..., n. Each data point contains a time step t and the instantaneous three-dimensional Cartesian pose of the manipulator gripper, ξ = [t, X^T]^T. The task parameters P can be expressed as the reference frames corresponding to the manipulation pose X and the intermediate pose ω, P = {A_p, b_p}, where A_p is the rotation matrix of reference frame p relative to the robot base frame and b_p is the translation vector of reference frame p relative to the robot base frame.
Step S42, the collected manipulation trajectory data are decoded into the reference frame P corresponding to each trajectory to obtain the decoded trajectories ξ^(p), with the time step left unchanged and the pose component transformed as:
X^(p) = A_p^(-1) (X - b_p)    (7), (8)
Step S43, a Gaussian mixture model μ_0 is learned with the EM algorithm, fitting the decoded mapping μ_0: t → X′^T from the time step t to the Cartesian pose of the mechanical arm with several normal distributions; the expression of the Gaussian mixture model is as follows:
ξ^(p) ~ Σ_{m=1}^{M} π_m N(μ_m^(p), Σ_m^(p))    (9)
where N(μ_m^(p), Σ_m^(p)) represents the m-th normal distribution relative to reference frame p, and μ_m^(p) and Σ_m^(p) denote the mean and covariance of that normal distribution.
Step S44, the learned Gaussian mixture model is transformed according to the manipulation pose X and waypoint parameter ω output by the manipulation evaluation module, obtaining the mapping μ_p: t → X^T from the time step t to the Cartesian pose of the mechanical arm under the reference frame P corresponding to the manipulation pose X and waypoint parameter ω; the transformation can be represented as:
μ_m^(p)' = A_p μ_m^(p) + b_p,  Σ_m^(p)' = A_p Σ_m^(p) A_p^T    (10)
Σ_m = ( sum_p (Σ_m^(p)')^(-1) )^(-1),  μ_m = Σ_m · sum_p (Σ_m^(p)')^(-1) μ_m^(p)'    (11)
step S45, performing Gaussian mixture regression on the time step t by using the decoded Gaussian mixture model to obtain a regression waypoint
Figure BDA0004000838030000121
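A compact sketch of steps S42 to S45 follows: per-frame Gaussian components are fitted with scikit-learn, transformed by the task parameters (A_p, b_p) in the spirit of equations (10) and (11), and a waypoint is regressed by Gaussian mixture regression over the time step. The use of scikit-learn, the single-reference-frame simplification and all dimensions are illustrative assumptions.

```python
# Hedged sketch of steps S42-S45 (single reference frame, illustrative only).
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm(xi, n_components=5):
    """xi: (T, 1 + D) data points [t, pose] decoded into one reference frame."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="full").fit(xi)
    return gmm.weights_, gmm.means_, gmm.covariances_

def transform_components(means, covs, A, b):
    """Transform the pose block of each component by frame (A, b), as in eq. (10);
    the time dimension (index 0) is left unchanged."""
    means_t, covs_t = means.copy(), covs.copy()
    means_t[:, 1:] = means[:, 1:] @ A.T + b          # mu' = A mu + b
    covs_t[:, 1:, 1:] = A @ covs[:, 1:, 1:] @ A.T    # Sigma' = A Sigma A^T
    return means_t, covs_t

def gmr_waypoint(t, weights, means, covs):
    """Condition the joint GMM on time step t to regress a pose waypoint (step S45)."""
    h, mu_cond = [], []
    for k in range(len(weights)):
        mu_t, mu_x = means[k, 0], means[k, 1:]
        s_tt, s_xt = covs[k, 0, 0], covs[k, 1:, 0]
        h.append(weights[k] * np.exp(-0.5 * (t - mu_t) ** 2 / s_tt) / np.sqrt(2 * np.pi * s_tt))
        mu_cond.append(mu_x + s_xt / s_tt * (t - mu_t))
    h = np.asarray(h) / np.sum(h)
    return np.sum(h[:, None] * np.asarray(mu_cond), axis=0)
```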
In step S5 of this embodiment, the training module drives the mechanical arm to move along the optimized manipulation waypoints, performs a manipulation attempt when the arm reaches the final waypoint, records the transitions (s_t, a_t, r_t, s_{t+1}) of the trajectory in equation (2) acquired during this motion, and trains the policy network and value networks of the TD3-based manipulation reinforcement learning model by stochastic gradient descent, as follows:
Step S51, first, transitions (s, a, r, s′) are sampled from the trajectories τ acquired in interaction with the simulation environment, and the action at state s′ is computed with the target policy network μ′(·|θ^{μ′}):
a′ = μ′(s′|θ^{μ′})    (12)
Step S52, the target value y is calculated following the idea of the Bellman equation in equation (5):
y = r + γ min_{i=1,2} Q′_i(s′, a′|θ^{Qi′})    (13)
Step S53, a gradient descent algorithm is used to minimize the error between the value networks' estimates Q_i(s, a|θ^{Qi}) of the expected return and the target value y, so as to update the value network parameters θ^{Qi}:
L(θ^{Qi}) = E[ (y - Q_i(s, a|θ^{Qi}))^2 ]    (14)
Step S54, the policy μ(·|θ^μ) is used to compute the action a_new taken in state s, the value network is used to compute Q_1(s, a_new|θ^{Q1}), and this value is maximized with a gradient ascent algorithm, thereby updating the policy network parameters θ^μ.
Step S55, the target networks are updated in a soft manner: a learning rate τ is introduced, the old target network parameters and the new network parameters are weighted-averaged, and the result is assigned to the target networks:
θ^{Qi′} = τ θ^{Qi} + (1 - τ) θ^{Qi′}    (15)
θ^{μ′} = τ θ^μ + (1 - τ) θ^{μ′}    (16)
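Steps S51 to S55 correspond to a clipped double-Q TD3 update; a condensed PyTorch sketch is shown below. The replay-buffer interface, tensor shapes and hyperparameters are assumptions, and TD3's target-policy smoothing noise and delayed actor updates are omitted so that the code mirrors the steps exactly as written above.

```python
# Condensed PyTorch sketch of update steps S51-S55. Buffer interface, shapes and
# hyperparameters are assumptions, not values from the patent.
import torch
import torch.nn.functional as F

def td3_update(buffer, actor, actor_targ, critics, critic_targs,
               actor_opt, critic_opts, gamma=0.99, tau=0.005, batch_size=256):
    s, a, r, s_next = buffer.sample(batch_size)        # step S51: sample (s, a, r, s')

    with torch.no_grad():
        a_next = actor_targ(s_next)                     # equation (12)
        q_next = torch.min(critic_targs[0](s_next, a_next),
                           critic_targs[1](s_next, a_next))
        y = r + gamma * q_next                          # equation (13)

    for critic, opt in zip(critics, critic_opts):       # step S53: critic regression, eq. (14)
        loss = F.mse_loss(critic(s, a), y)
        opt.zero_grad()
        loss.backward()
        opt.step()

    a_new = actor(s)                                    # step S54: gradient ascent on Q_1
    actor_loss = -critics[0](s, a_new).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    for net, targ in zip(list(critics) + [actor], list(critic_targs) + [actor_targ]):
        for p, p_t in zip(net.parameters(), targ.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)   # step S55: soft update, eq. (15)-(16)
```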
the learning process of the 6-DOF object manipulation training system based on the TD3 algorithm is divided into two stages: the simulation phase and the actual manipulation phase, the implementation flow of which is shown in fig. 8:
in the simulation stage, a manipulation simulation platform is built in the mechanical arm simulation software, the arrangement form of the manipulation simulation platform is shown in figure 1, and the acquisition of images acquired by a visual sensor, the direction and the joint angle of the mechanical arm, and feedback data of the force sensor and the clamping jaw in the simulation environment is realized by coding through python. In a simulation environment, the manipulator estimates a manipulation direction X and a waypoint parameter omega by using the pose estimation module, generates a manipulator motion waypoint for manipulation attempt by using the path optimization module, collects a large amount of interaction data in the interaction process with the environment, and finishes the training of an object manipulation reinforcement learning model.
The simulation environment in this embodiment is built in the V-REP. One important reason for collecting data in a simulated environment is that the policy and value network of the TD3 algorithm requires a lot of interaction with the environment to achieve good results, whereas the time cost for interaction in a real environment is high, compared to the efficiency of collecting and learning data in a simulated environment.
In the actual manipulation stage, because the simulation environment differs somewhat from the real environment, a domain migration module is required to migrate the model learned in the simulation stage to the real environment. First, the model learned in the simulation environment is used to execute manipulation tasks in the real environment, and a small amount of real-environment training data is collected; then a GAN network is used to perform domain migration from the simulation data to the real-environment data, obtaining a large amount of generated data.
It should be noted that the above examples are only for illustrating the present invention and are not intended to limit the scope of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention without making creative efforts, fall within the protection scope of the invention.
The above is a specific embodiment of the technical solution of the present invention and the technical principle applied. Any modification, supplement or equivalent replacement by a person skilled in the art on the basis of this embodiment is within the scope of the invention as claimed in the claims.

Claims (10)

1. The TD3-based robot arm 6-DOF object manipulation training method is characterized by comprising the following steps:
s1, establishing a Markov decision process model of a manipulator task, and determining an environmental state, an action and a return in a manipulation process;
s2, preprocessing the point cloud acquired by the cameras on the two sides of the workbench by a mechanical arm data acquisition and preprocessing module, and fusing stress data acquired by a mechanical arm terminal force sensor and joint angles of the mechanical arm to obtain an environment state S;
s3, outputting a proper control pose X and a proper waypoint parameter omega by a control pose evaluation module according to the environmental state S obtained by the data acquisition and preprocessing module by utilizing a strategy and value network based on a PointNet framework, and estimating an expected return;
s4, the path optimization module utilizes a task parameterized Gaussian mixture model to generate operation path points according to the operation pose X and the path point parameter omega output by the operation pose evaluation module;
and S5, controlling the mechanical arm to operate the target object in a proper direction and track by the training module according to the operation waypoint generated by the path optimization module, storing the obtained data in interaction and updating the network, thereby training the operation of the neural network model on objects with different shapes in a proper direction according to the input environment state.
2. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the grabbing process is modeled by a Markov decision process in step S1, as follows:
Step 1.1, the target grabbing process is assumed to satisfy the Markov property, and the Markov decision process is represented by the following formula:
M=(S,A,P,R,γ)
where S represents the set of all possible states in the environment, and in the grasping model the state is the fusion of the three-dimensional point cloud collected by the cameras and the end pose of the robot; A represents the set of all possible actions executed by the agent, and in the grasping model the action is the six-degree-of-freedom pose X of the end effector; P is the state transition probability, i.e. the probability of reaching state s' after taking action A in state s, where s' is represented in the grasping model by the environmental state parameters observed after taking action A; R is the reward, for which a sparse reward function is adopted: after executing grasping action A in state s, if the grasp succeeds a reward is given according to the grasping pose and stability, and if the grasp is not completed the reward is 0; γ is the discount factor, γ ∈ [0,1];
Step 1.2, the robot and the environment interact over a series of discrete time steps (t = 0, 1, 2, 3, ...), represented by a trajectory:
τ = (S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, ..., S_t, A_t, R_t)
where S_t, A_t and R_t respectively represent the state, action and reward of the Markov decision process at time t;
Step 1.3, in the grasping reinforcement learning task, find a control strategy μ: S → A that maximizes the expected grasping return G_t:
G_t = Σ_{k=0}^{∞} γ^k R_{t+k}
The action value function Q_μ(s_t, a_t) is used to describe the expected grasping return under strategy μ:
Q_μ(s_t, a_t) = E_μ[G_t | S_t = s_t, A_t = a_t]
where Q_μ(s_t, a_t) represents the action value function under strategy μ, and E_μ[G_t | S_t = s_t, A_t = a_t] represents the mathematical expectation of the return G_t under strategy μ when action a_t is taken in state s_t;
Step 1.4, to find the optimal strategy μ*, the optimal action value function Q*(s_t, a_t) must be found, and the optimal strategy μ* is then obtained by solving the optimal action value function, which is handled with the Bellman equation:
Q_μ(s_t, a_t) = E[r(s_t, a_t) + γ Q_μ(s_{t+1}, μ(s_{t+1}))]
where s_{t+1} ~ P is the observation obtained from the environment at time t+1, μ(s_{t+1}) is the action a_{t+1} mapped from state s_{t+1} under strategy μ at time t+1, r(s_t, a_t) is the reward for taking action a_t in state s_t, and E(·) denotes the mathematical expectation of the return G_t under strategy μ.
3. The TD3-based robot arm 6-DOF object manipulation training method according to claim 2, wherein the set S of all possible states in the environment comprises the environment state point cloud C, the robot end pose X, the joint angles θ = (θ_1, θ_2, ..., θ_n) and the gripper feedback information d, where n denotes the number of robot joints and θ_i denotes the angle of the i-th joint of the robot.
4. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein in step S2 the data acquisition and preprocessing module is adopted to stitch, denoise and filter the point clouds acquired by the plurality of depth cameras and to fuse the multi-sensor information to obtain the environment state parameter S, as follows:
Step 2.1, obtain the focal length (f_x, f_y) and optical center (c_x, c_y) of camera C_i from the intrinsic parameters provided by the camera manufacturer, and calibrate camera C_i against the robot base coordinate system with Zhang Zhengyou's calibration method to obtain the extrinsic matrix P relative to the robot base coordinate system; for any pixel (u, v, 1)^T in a depth image I_i acquired by camera C_i, there is a mapping to the position (x_c, y_c, z_c)^T of that point relative to the robot base coordinate system:
z_c · (u, v, 1)^T = K · P · (x_c, y_c, z_c, 1)^T,  with K = [[f_x, 0, c_x], [0, f_y, c_y], [0, 0, 1]]
where K is the camera intrinsic matrix and P is the extrinsic matrix; with this formula, the frame images I_i acquired by the N cameras C_i (i ∈ N) in the working scene are converted into the corresponding three-dimensional point cloud data P_i;
Step 2.2, the three-dimensional point cloud data P_i acquired by the different depth cameras are stitched to complete a dense map of the scene, filtered with a voxel-grid downsampling algorithm, and denoised with Gaussian filtering to obtain the environment state point cloud C;
Step 2.3, the robot end pose X, the joint angles θ = (θ_1, ..., θ_n) and the gripper feedback information d are acquired and fused with the environment point cloud C as the environment state S.
5. The TD3-based manipulator 6-DOF object manipulation training method according to claim 1, wherein the strategy and value network in step S3 is based on the PointNet architecture, and the strategy network comprises a point cloud input layer, two affine transformation modules and weight-sharing multilayer perceptrons, wherein the input state point cloud C is rotated by one affine transformation module, features are then extracted by the multilayer perceptrons and aligned by the second affine transformation module before a max-pooling operation, the resulting vector is concatenated with the current pose of the manipulator and the gripper state information, and the motion command for the end of the manipulator and the estimate of the expected return are output through several fully connected layers.
6. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the manipulation waypoint generation in step S4 is specifically:
first, a human teaches n grasping trajectories {ξ_j}, j = 1, ..., n, each data point containing the time step t and the instantaneous three-dimensional Cartesian pose of the robot gripper, ξ = [t, X^T]^T, together with the reference frame of the object to be grasped, P = {A_p, b_p}; a Gaussian mixture model μ_0 is trained to obtain the mapping μ_0: t → X^T from the time step t under the current object reference frame P to the robot end pose X, and the manipulation path is generated according to the reference frame P corresponding to the manipulation pose X and the waypoint parameter ω, where A_p and b_p are the 3×3 rotation matrix and the 3×1 translation vector of the grasped-object reference frame relative to the robot base coordinate system, and {P_p} denotes the set of p object reference frames.
7. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the TD3-based deep reinforcement learning grasping strategy of step S5 specifically comprises: a policy network μ(·|θ^μ) is used to fit the grasping strategy μ: S → A, and two value networks Q_1(·|θ^{Q1}) and Q_2(·|θ^{Q2}) are used to fit the action value function Q_μ(s_t, a_t); each of these networks also has a target network μ′(·|θ^{μ′}), Q′_1(·|θ^{Q1′}), Q′_2(·|θ^{Q2′}) to alleviate the bootstrapping problem, where μ(·|θ^μ) denotes the policy network with parameters θ^μ.
8. The TD3-based robot arm 6-DOF object manipulation training method according to claim 1, wherein the learning process is divided into two stages: a simulation stage and an actual manipulation stage; in the simulation stage, a manipulation simulation platform is built in robot simulation software and the grasping model is trained with the training process of claim 1; in the actual manipulation stage, because the simulation environment differs from the real environment, the domain migration module migrates the model learned in the simulation stage to the real environment using a GAN network.
9. The mechanical arm 6-DOF object manipulation training system based on TD3 is characterized by comprising a cooperative mechanical arm (1), a six-axis force sensor (2) fixed on a flange plate at the tail end of the cooperative mechanical arm and a power-driven clamping jaw (3) at the tail end of the cooperative mechanical arm, wherein a first camera (4), a second camera (5) and an industrial CCD camera (6) in the middle are carried on two sides of a workbench, the first camera and the second camera can acquire three-dimensional information, and the cooperative mechanical arm (1) manipulates an object from a starting point area (7) and carries the object to an end point area (8) according to information acquired by the six-axis force sensor (2) and feedback information of a robot control box in a manipulation task; the system also comprises a data acquisition and preprocessing module, an operation pose evaluation module, a path optimization module and a training module.
10. The TD3 based robotic 6-DOF object manipulation training system of claim 9, wherein: the mechanical arm data acquisition and preprocessing module is used for preprocessing point clouds acquired by a plurality of cameras positioned on two sides of the workbench and fusing stress data acquired by a mechanical arm terminal force sensor and joint angles of the mechanical arm to obtain an environment state S;
the manipulation pose evaluation module outputs a proper manipulation pose X and a proper waypoint parameter omega by utilizing a strategy and value network based on a PointNet framework according to the environmental state S obtained by the data acquisition and preprocessing module, and estimates an expected return;
the path optimization module utilizes a task parameterization Gaussian mixture model to generate control path points according to the control pose X and the path point parameter omega output by the control pose evaluation module;
the training module controls the mechanical arm to operate the target object in a proper direction and track according to the operation waypoint generated by the path optimization module, stores the obtained data in interaction and updates the network, so that the neural network model is trained to operate objects in different shapes in a proper direction according to the input environment state.
CN202211612997.7A 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system Pending CN115890744A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211612997.7A CN115890744A (en) 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211612997.7A CN115890744A (en) 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system

Publications (1)

Publication Number Publication Date
CN115890744A true CN115890744A (en) 2023-04-04

Family

ID=86494711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211612997.7A Pending CN115890744A (en) 2022-12-15 2022-12-15 TD3-based manipulator 6-DOF object manipulation training method and system

Country Status (1)

Country Link
CN (1) CN115890744A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117313546A (en) * 2023-10-26 2023-12-29 北京大学 Trusted smart hand system simulation method and simulation system


Similar Documents

Publication Publication Date Title
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN111251294A (en) Robot grabbing method based on visual pose perception and deep reinforcement learning
CN111421554B (en) Mechanical arm intelligent control system, method and device based on edge calculation
JPWO2003019475A1 (en) Robot device, face recognition method, and face recognition device
CN111331607B (en) Automatic grabbing and stacking method and system based on mechanical arm
CN114912287A (en) Robot autonomous grabbing simulation system and method based on target 6D pose estimation
CN113172629A (en) Object grabbing method based on time sequence tactile data processing
Tang et al. Learning collaborative pushing and grasping policies in dense clutter
CN112734823B (en) Image-based visual servo jacobian matrix depth estimation method
CN115890744A (en) TD 3-based manipulator 6-DOF object manipulation training method and system
CN115464659A (en) Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
Iqbal et al. Toward sim-to-real directional semantic grasping
CN116276998A (en) Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
CN116460843A (en) Multi-robot collaborative grabbing method and system based on meta heuristic algorithm
Eppner et al. Imitation learning with generalized task descriptions
CN113681552B (en) Five-dimensional grabbing method for robot hybrid object based on cascade neural network
CN117001675B (en) Double-arm cooperative control non-cooperative target obstacle avoidance trajectory planning method
Zhao et al. Real-time object detection and robotic manipulation for agriculture using a yolo-based learning approach
Leite et al. Adaptive 3D visual servoing without image velocity measurement for uncertain manipulators
CN109542094B (en) Mobile robot vision stabilization control without desired images
CN115194774A (en) Binocular vision-based control method for double-mechanical-arm gripping system
CN114998573A (en) Grabbing pose detection method based on RGB-D feature depth fusion
CN112857373B (en) Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
CN114888768A (en) Mobile duplex robot cooperative grabbing system and method based on multi-sensor fusion
Su et al. Nonlinear visual mapping model for 3-D visual tracking with uncalibrated eye-in-hand robotic system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination