CN116276998A - Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration - Google Patents

Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Info

Publication number
CN116276998A
Authority
CN
China
Prior art keywords
grabbing
agent
reinforcement learning
hand
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310250707.7A
Other languages
Chinese (zh)
Inventor
张伟
褚石磊
杨硕
宋然
李大猷
魏鹏坤
赵晨琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202310250707.7A
Publication of CN116276998A
Legal status: Pending

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1679 Programme controls characterised by the tasks executed

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a robotic arm grabbing method and system based on reinforcement learning that requires no hand-eye calibration, and relates to the technical field of reinforcement learning. The method comprises: constructing an interactive simulation environment for the agent and defining state values of the simulation environment; acquiring the state data of the agent and the pose of the target object to be grabbed, and identifying the coordinates of grabbing points; adopting a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; and, at the same time, recognizing the agent's behavior in grabbing the target object, defining a reward mechanism for the behaviors generated during grabbing, and supervising the agent's grabbing behavior. The method and system dispense with the tedious steps of hand-eye calibration.

Description

Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
Technical Field
The disclosure relates to the technical field of reinforcement learning, in particular to a hand-eye calibration-free mechanical arm grabbing method and system based on reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the development of industries such as logistics and unmanned factories, higher requirements have been placed on the practicality and intelligence of robotic arm grabbing algorithms. Automatic grabbing by a robotic arm has evolved from traditional teach-and-playback control and fixed-path motion toward the intelligent capability of performing target detection, environment perception, path planning, and execution through a vision sensor and a computing module. In scenarios where the robotic arm must cooperate with a vision sensor to grab objects, hand-eye calibration is an important step: the spatial coordinate transformation between the robotic arm and the vision sensor is obtained by sampling and computation, realizing the mapping from the world coordinate system to the image coordinate system. The general steps of traditional hand-eye calibration are: first, establish a model of the hand-eye relation and define the base coordinate system, end-effector coordinate system, calibration-board coordinate system, and camera coordinate system; second, compute the transformations between these coordinate systems; finally, obtain the calibrated hand-eye model. However, traditional hand-eye calibration has several problems: the steps are tedious, the sampling process requires considerable manual intervention and introduces errors, and the accumulated error grows as the system is used. Whatever calibration method is used, it is never as accurate as the coordination of a human hand and eye.
The inventors found that hand-eye calibration has the following disadvantages:
1) The accuracy of hand-eye calibration is limited by the operator's skill level: the more skilled the operator, the higher the calibration accuracy.
2) The hand-eye calibration procedure is tedious and must be carried out by an operator following fixed steps, so it is time-consuming.
3) The accuracy of hand-eye calibration is strongly affected by the environment, such as illumination and temperature.
4) The accuracy of hand-eye calibration is strongly affected by the robot itself, such as the robot's accuracy and stability.
5) The accuracy of hand-eye calibration is strongly affected by the calibration object, such as its accuracy and stability.
Disclosure of Invention
To solve the above problems, the present disclosure provides a hand-eye-calibration-free robotic arm grabbing method and system based on reinforcement learning. An end-to-end machine learning model is designed by means of a deep reinforcement learning algorithm and imitates human hand-eye coordination, so that the agent can grab a target object autonomously under a continuously changing viewing angle, realizing robotic arm grabbing without hand-eye calibration.
According to some embodiments, the present disclosure employs the following technical solutions:
the manipulator grabbing method for hand-eye calibration free based on reinforcement learning is characterized by comprising the following steps of:
constructing an intelligent agent interactive simulation environment, and defining a state value of the simulation environment for representing state information of the intelligent agent and the environment;
acquiring state data of the intelligent body, the gesture of a target object to be grabbed and identifying coordinates of grabbing points;
adopting an optimal closed loop grabbing strategy, enabling an intelligent body to perceive information of a current environment, calculating optimal grabbing actions according to the environment information, after executing the actions, perceiving again, acquiring new environment information, and repeating until a target object is grabbed successfully; and simultaneously, recognizing the behavior of the agent for grabbing the target object, defining a reward mechanism for the behavior generated in the grabbing process, and supervising the grabbing behavior of the agent.
According to some embodiments, the present disclosure employs the following technical solutions:
the manipulator grabbing system based on reinforcement learning and free of hand-eye calibration comprises an intelligent body and an intelligent simulation platform, wherein the intelligent body comprises a manipulator required by executing tasks; the agent simulation platform comprises:
the model construction module is used for constructing an intelligent agent interactive simulation environment, defining a state value of the simulation environment and representing state information of the intelligent agent and the environment;
the initialization module is used for acquiring state data of the intelligent body and the gesture of the target object to be grabbed, and identifying coordinates of grabbing points;
the action execution module is used for adopting an optimal closed loop grabbing strategy, enabling an intelligent agent to perceive information of the current environment, calculating an optimal grabbing action according to the environment information, after the action is executed, perceiving again, acquiring new environment information, and repeating until a target object is grabbed successfully; and simultaneously, recognizing the behavior of the agent for grabbing the target object, defining a reward mechanism for the behavior generated in the grabbing process, and supervising the grabbing behavior of the agent.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method is based on a PPO algorithm, a Proximal strategy optimization method is used, a strategy gradient and importance sampling-based line learning strategy is used for predicting actions by inputting a 'state', and training is obtained in iteration with the environment, so that certain capability can be learned. Compared with the traditional robot control method, the sensitivity degree of reinforcement learning to parameters is lower than that of the traditional control method; reinforcement learning has advantages for difficult-to-model system control; reinforcement learning is a strategy, not simple parameter adjustment, so that beyond human preset capabilities can be learned. The use of reinforcement learning can help the robotic arm accomplish a particular task without requiring complex manual programming. The reinforcement learning does not need training data, and the cost of collecting training samples is saved by learning through interaction with the environment.
The end-to-end method provided by the present disclosure can be trained once and deployed many times, omitting the tedious step of hand-eye calibration. The approach forms a closed loop in the hand-eye relationship and therefore also eliminates accumulated errors. By designing a corresponding input layer, different sensor configurations can be accommodated, yielding a grabbing strategy that is convenient to deploy and reliable in success rate, applicable to robotic arms in logistics, assembly, and similar scenarios. The method is also an exploration of applying reinforcement learning and demonstrates that reinforcement learning is feasible in a concrete application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a simulated reinforcement learning environment of an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a contrastive learning structure according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a teacher and student network in accordance with an embodiment of the present disclosure;
FIG. 4 is a teacher network flow chart of an embodiment of the present disclosure;
fig. 5 is a student network flow diagram of an embodiment of the present disclosure.
Detailed Description of Embodiments
the disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
An embodiment of the present disclosure provides a robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration, comprising:
Step one: constructing an interactive simulation environment for the agent, and defining state values of the simulation environment to represent state information of the agent and the environment;
Step two: acquiring the state data of the agent and the pose of the target object to be grabbed, and identifying the coordinates of grabbing points;
Step three: adopting a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
As an embodiment, robotic arm grabbing is trained with reinforcement learning. The task is to enable the agent to grab the target object autonomously under a continuously changing viewing angle, so that grabbing without hand-eye calibration can be realized. First, an interactive simulation environment for the agent is constructed, and the states of all objects in the interactive simulation environment are defined.
first, the appropriate type of robotic arm, end effector, and sensor needs to be selected. The model of the mechanical arm is UR5, which is designed and produced by the company Universal robot of Denmark, has 6 degrees of freedom, a payload of 5 kg, a working radius of 850 mm and repeated positioning accuracy of +/-0.1 mm, and is suitable for light assembly, pick-up, placement, grinding and polishing and other scenes. The model of the end effector is Robotiq manipulator 85, which is an adaptive robotic Gripper that can be used to grasp robots and the like for a variety of applications. The model of the vision sensor is Kinect, which is a 3D camera produced by Microsoft corporation, integrates a depth camera and a color camera, and is suitable for advanced computer vision tasks.
The sensor is the means by which the agent perceives the environment; in this task, it consists of multiple Kinect cameras assumed to be at different positions. The base position of the sensor is 0.5 m above the ground with a pitch angle of 40 degrees, facing the agent and several positions of the working area, as shown in FIG. 1, which includes the positions of two Kinect sensors. On this basis the sensor randomly adjusts its position and orientation within a certain range according to a certain probability distribution, acquiring richer viewing-angle data. The sensor collects a depth map and an RGB color image, each with a resolution of 640 × 480, as input for the subsequent steps.
Secondly, the environment is defined: a simulated or real environment composed of the agent and other physical elements, which accepts certain actions and produces corresponding results. In this task, the environment is a CoppeliaSim virtual simulation environment running on a computer; as shown in FIG. 1, its main functions include simulation of the robotic arm, simulation of accessories such as sensors and suction cups, and physics simulation.
Another important function of the simulation environment is the interface it provides for external programs. CoppeliaSim provides a variety of interfaces; the present invention uses its interface and libraries for the Python language, and some functions in the simulation environment are implemented by Lua scripts.
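By way of non-limiting illustration, an external Python program might read scene data through CoppeliaSim's legacy remote API roughly as follows; the port number and the scene object names ('Kinect_rgb', 'UR5_joint1') are assumptions that depend on the particular scene and are introduced only for the example.

```python
# Illustrative sketch: reading scene data from CoppeliaSim over the legacy remote API.
# Assumes CoppeliaSim is running with the remote API server enabled on port 19997.
# Object names such as 'Kinect_rgb' and 'UR5_joint1' are hypothetical scene names.
import sim  # CoppeliaSim legacy remote API Python bindings

client_id = sim.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
if client_id == -1:
    raise RuntimeError('Could not connect to CoppeliaSim')

# Handles for a camera and one arm joint (names depend on the scene).
_, cam = sim.simxGetObjectHandle(client_id, 'Kinect_rgb', sim.simx_opmode_blocking)
_, joint1 = sim.simxGetObjectHandle(client_id, 'UR5_joint1', sim.simx_opmode_blocking)

# Read an RGB image (640 x 480 as configured in the scene) and a joint angle.
_, resolution, image = sim.simxGetVisionSensorImage(client_id, cam, 0, sim.simx_opmode_blocking)
_, angle = sim.simxGetJointPosition(client_id, joint1, sim.simx_opmode_blocking)

sim.simxFinish(client_id)
```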
After the environment is determined, the state must be defined. The state is the set of variables describing the current condition of the environment. In this task, the state comprises the end pose of the robotic arm, the data of each joint of the robotic arm, the pose of the target object to be grabbed and the coordinates of the grabbing point, and the sensor data; a brief code sketch of this state structure is given after the list below.
The definition of each state in this task is as follows:
1. The end pose of the robotic arm, including position and orientation, represented by a 3D position and a quaternion.
2. The data of each joint of the robotic arm, including angle and angular velocity, represented by a vector.
3. The pose of the target object to be grabbed, including position and orientation, represented by a 3D position and a quaternion.
4. The coordinates of the grabbing point, represented by a 3D position.
5. The sensor data, including the depth image and the color image, represented as image data.
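For concreteness, the five state components listed above might be collected into a single observation structure as sketched below; the field names and array shapes are illustrative assumptions and not part of the disclosure.

```python
# Minimal sketch of the observation (state) structure described above.
# Field names and shapes are illustrative assumptions.
import numpy as np

def build_observation(ee_pos, ee_quat, joint_angles, joint_vels,
                      target_pos, target_quat, grasp_point, depth_img, rgb_img):
    return {
        'ee_pose': np.concatenate([ee_pos, ee_quat]),            # (3,) + (4,) -> (7,)
        'joints': np.concatenate([joint_angles, joint_vels]),    # e.g. (6,) + (6,) -> (12,)
        'target_pose': np.concatenate([target_pos, target_quat]),
        'grasp_point': np.asarray(grasp_point),                   # (3,)
        'depth': depth_img,                                        # (480, 640) depth map
        'rgb': rgb_img,                                             # (480, 640, 3) color image
    }
```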
Next, the behavior (action) of the agent in this task is defined as:
a_i = (Δx_i, Δy_i, Δz_i, Δφ_i)
wherein Δx_i, Δy_i, Δz_i respectively denote the offsets of the end of the robotic arm along the three coordinate axes, and Δφ_i denotes the rotation angle of the wrist motor of the robotic arm, i.e. the rotation of the end effector about the z-axis.
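By way of illustration, one step of such an action can be thought of as a small Cartesian offset of the end effector plus a wrist rotation applied on top of the current pose; the clipping limits and helper names in the sketch below are assumptions introduced for the example only.

```python
# Illustrative sketch of applying one action step (dx, dy, dz, dphi).
# 'current_ee_position' and 'current_wrist_angle' are assumed helper values;
# the clipping limits are assumptions, not values taken from the disclosure.
import numpy as np

def apply_action(current_ee_position, current_wrist_angle, action,
                 max_step=0.02, max_rot=np.deg2rad(10)):
    dx, dy, dz, dphi = action
    # Clip increments so a single step stays small.
    delta = np.clip([dx, dy, dz], -max_step, max_step)
    new_position = np.asarray(current_ee_position) + delta
    new_wrist_angle = current_wrist_angle + np.clip(dphi, -max_rot, max_rot)
    return new_position, new_wrist_angle
```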
As an embodiment, in step two, the state data of the agent, the pose of the target object to be grabbed, and the coordinates of the grabbing point are acquired.
In the teacher-network training stage, the state data of the agent, the pose of the target object to be grabbed, and the grabbing point are read directly through the interface of the simulation environment and used as privileged information. In the training and inference stages of the student network, the state of the agent is derived from the kinematics of the robotic arm and its joint information, while the pose and grabbing point of the target object are obtained implicitly by a deep neural network. The planning process computes the optimal action from the environment information, executes it, perceives again to obtain new environment information, and iterates until the target object is grabbed successfully.
In step three, a closed-loop grabbing strategy is adopted: the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to obtain new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
A suitable reinforcement learning algorithm and structure is selected. Proximal Policy Optimization (PPO) is one of the most widely used reinforcement learning algorithms at present. Its core idea is to maximize the return of the policy within a limited number of steps while ensuring that the policy does not change too much. PPO is simple, easy to implement, and efficient, and is applicable to reinforcement learning tasks in many different environments. To achieve this goal, PPO uses a proximal policy optimization method consisting of two parts: policy update, in which the policy is updated using an approximation of the policy gradient; and constrained optimization, in which a constraint limits the magnitude of the policy change, ensuring the stability of the policy.
The closed-loop grabbing strategy is optimized with proximal policy optimization, comprising policy update and constrained optimization: the policy is updated using an approximation of the policy gradient, and a constraint is set to limit the magnitude of the policy change.
The optimization process mainly comprises two steps: collecting experience and updating the policy.
In the experience collection phase, the agent interacts with the environment according to the current policy, generating trajectory and state-transition samples that are then used to update the policy. In the policy update phase, the optimization objective is to maximize the expected return of the collected trajectories. PPO supports two update schemes, a single update and multiple updates. A single update has two steps: compute the objective function of the policy update, then update the policy parameters with a gradient-based method. When computing the objective, PPO uses a technique called the clipped surrogate objective, which controls the update by limiting how much the policy may change in each update; the size of this limit is the clip parameter. The multiple-update scheme repeats the single update, each time with new samples, while keeping the data-collecting policy unchanged; after each update the algorithm saves the current policy parameters as the old policy so that it can roll back to the previous state when needed. Overall, the PPO optimization process is iterative: by continuously collecting experience and updating the policy, the algorithm gradually learns the optimal policy parameters.
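As a brief illustration of the clipped surrogate objective mentioned above (not the full PPO training loop), the following PyTorch-style sketch shows how the probability ratio is clipped to limit the policy change; the tensor names and the clip value of 0.2 are assumptions.

```python
# Sketch of PPO's clipped surrogate objective (assumed tensor shapes; not the full algorithm).
import torch

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio between the new and old policies (importance sampling).
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped objectives; taking the minimum limits the policy change.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the surrogate, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
```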
The traditional grabbing strategy detects a single frame, computes a path, and then executes it. Its drawback is that the frame used for computation is not necessarily taken from the clearest and most distinguishable angle and moment, so its accuracy is very limited. When a human grabs an object, the eyes keep paying attention until the hand has grasped it; the advantage is that the whole process is closed-loop. The closed-loop grabbing strategy is a method for the robot to grab an object in which the robot's grabbing behavior and its perception continually influence each other.
As an embodiment, in the closed-loop grabbing strategy the robot first perceives and acquires information about the current environment, then computes the optimal grabbing action from this information, executes the action, and perceives again to acquire new environment information. This process is repeated until the robot successfully grabs the object. The advantage of the closed-loop grabbing strategy is that the robot can adjust its actions according to real-time perception, which improves the grabbing success rate.
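A non-limiting sketch of this closed-loop perceive-plan-act cycle is given below; env and policy are assumed interfaces (env.observe(), env.step(), policy.act()) introduced only for the example.

```python
# Sketch of the closed-loop perceive-plan-act grabbing loop described above.
# 'env' and 'policy' are assumed interfaces, not part of the disclosure.
def closed_loop_grasp(env, policy, max_steps=100):
    obs = env.observe()                          # perceive the current environment
    for _ in range(max_steps):
        action = policy.act(obs)                 # compute the (estimated) optimal grabbing action
        obs, grasped, failed = env.step(action)  # execute the action, then perceive again
        if grasped:
            return True                          # target object grabbed successfully
        if failed:
            return False                         # collision, out of range, etc.
    return False                                 # step limit exceeded
```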
At the same time, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
Specifically, under the reward mechanism, the grabbing task is considered failed and execution is terminated when the agent's robotic arm collides, moves beyond the operating range, or exceeds the maximum step limit. When the agent fails to complete the grabbing task for a long time, a negative reward is generated to urge the agent to complete the grabbing task quickly.
The agent's actions are rewarded in this task, with the total reward defined as:
R = R_completion + R_failure + R_survival
where R_completion is a positive reward granted only when the target object is grabbed successfully and is zero otherwise; R_failure is a negative penalty applied only when the grabbing task fails (collision, exceeding the operating range, or exceeding the step limit) and is zero otherwise; and R_survival = step * α with α < 0, a penalty on the number of steps taken.
When the agent's robotic arm collides, moves beyond the operating range, or exceeds the maximum step limit, the task is considered failed and execution of the grabbing task is terminated. When the agent fails to complete the task for a long time, a negative reward is generated, urging the agent to complete the task as soon as possible. Finally, the termination conditions of the agent's execution are defined as follows (a small code sketch of the reward and termination logic follows the list):
1) The robotic arm successfully grabs the object;
2) The robotic arm moves beyond the specified range;
3) The robotic arm collides with the environment;
4) The number of executed steps exceeds 100.
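For illustration only, the reward and termination logic described above might be sketched as follows; the numeric constants are assumptions, since the disclosure specifies only their signs.

```python
# Sketch of the reward and termination logic described above.
# R_SUCCESS, R_FAILURE and ALPHA are illustrative assumptions;
# the disclosure only specifies their signs (positive, negative, negative).
R_SUCCESS = 1.0     # granted when the object is grabbed successfully
R_FAILURE = -1.0    # applied on collision / out-of-range / step-limit failure
ALPHA = -0.01       # survival penalty coefficient (alpha < 0)
MAX_STEPS = 100

def reward_and_done(grasped, collided, out_of_range, step):
    failed = collided or out_of_range or step >= MAX_STEPS
    reward = step * ALPHA                 # survival term discourages long episodes
    if grasped:
        reward += R_SUCCESS
    elif failed:
        reward += R_FAILURE
    done = grasped or failed
    return reward, done
```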
Before the grabbing task is executed, the grabbing policy is optimized and trained, and contrastive learning and privileged learning are introduced, so that the distance between the data acquired by the current agent and data from different viewing angles at the same moment is reduced, while the distance to data from different moments is increased; in this way the agent continually captures the relative relation between the robotic arm and the target object to be grabbed.
As shown in FIG. 3, training is divided into two processes: training of the teacher network and training of the student network. The teacher network is trained first; its inputs are the privileged information and the output of the previous frame.
Privileged information refers to environment variables that can be read in the simulation environment but cannot be obtained by the robot or its sensors. In this task, the pose of the robotic arm, the pose of the object to be grabbed, and the positions of the sensors are treated as privileged information.
Although the pose of the robotic arm is readable through the arm's API, it is also treated here as privileged information. The privileged information and the historical output are fed into a multi-layer perceptron to obtain a feature vector containing the required information, which serves as the true state for reinforcement learning. The agent executes the output action to interact with the environment and obtains an environmental reward that evaluates the action in the current state. Through repeated iterative training, the teacher network learns to grab using the privileged information.
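By way of non-limiting illustration, the following PyTorch-style sketch shows a teacher policy of the kind described above; the layer widths and input dimensions are assumptions introduced only for the example.

```python
# Sketch of the teacher policy: an MLP over privileged information plus the previous output.
# Layer widths and input dimensionalities are illustrative assumptions.
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    def __init__(self, priv_dim=20, prev_out_dim=4, feat_dim=128, action_dim=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(priv_dim + prev_out_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.ReLU(),
        )
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, privileged, prev_action):
        feat = self.mlp(torch.cat([privileged, prev_action], dim=-1))  # feature vector
        return self.action_head(feat), feat   # action (dx, dy, dz, dphi) and feature vector
```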
Next comes the training of the student network. Because the student network is the one applied and tested, it cannot use privileged information that would not normally be available. The input of the student network comes from the sensors; since the dimensionality of the image data read from the sensors is too high to be processed by the multi-layer perceptron used in the teacher network, the deep network ResNet-50 is used for dimensionality reduction.
The subsequent steps are consistent with the teacher network. To migrate knowledge from the teacher network to the student network, three operations are employed (a sketch of the two supervision losses follows Step 3 below):
supervision of the feature vector, sharing of the multi-layer perceptron parameters, and supervision of the action output.
Step 1: supervision of the feature vector. The feature vector output by the teacher network is used to supervise the student network, so that the student network's output approaches the teacher network's output.
Step 2: sharing of the multi-layer perceptron parameters. To improve training efficiency, the parameters of the teacher network's multi-layer perceptron are shared with the student network.
Step 3: supervision of the action output. During the training of the student network, the action output by the teacher network is used as supervision to help the student network gradually learn the teacher network's behavior policy.
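A minimal sketch of the feature-vector and action-output supervision signals (Steps 1 and 3) is given below; the weighting factors are illustrative assumptions. Parameter sharing (Step 2) would simply copy or tie the perceptron weights and is not shown.

```python
# Sketch of the two supervision signals used to distill the teacher into the student.
# Tensors are assumed: teacher_feat / teacher_action come from the (frozen) teacher,
# student_feat / student_action from the student network.
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_action, teacher_action,
                      w_feat=1.0, w_act=1.0):
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())      # feature supervision
    act_loss = F.mse_loss(student_action, teacher_action.detach())   # action supervision
    return w_feat * feat_loss + w_act * act_loss
```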
Through these three steps, the knowledge and experience of the teacher network are effectively migrated to the student network. Under this joint supervision and iteration, the student network gradually learns to act on its own; after convergence it has the ability to operate independently, at which point the teacher network can be removed and the student network used alone.
To further improve the training effect, contrastive learning is introduced into the training process. Contrastive learning is a machine learning technique that helps a learning system discover patterns in the training data so as to predict unknown data better. Its basic idea is that by comparing different training samples, the system learns better representations and therefore generalizes better. In reinforcement learning, contrastive learning can be used to help train the agent toward the optimal policy.
The specific implementation is as follows: each training step takes an anchor sample together with one positive sample and one negative sample. The positive sample is chosen from data captured by a different sensor at the same moment, and the negative sample is sensor data from a different moment. The purpose of this arrangement is to let the agent notice more quickly what is common across viewing angles at the same moment: the relative relation between the robotic arm and the object to be grabbed is consistent across viewing angles but differs across moments.
The contrastive learning structure and the triplet loss function are shown in FIG. 2: the distance between the current image and the positive sample is made smaller, while the distance between the current image and the negative sample is made larger. In this process, the agent notices the relative relation between the robotic arm and the object to be grabbed more quickly.
loss = max(0, dist(A, P) - dist(A, N) + margin)
wherein A is the current (anchor) image data; P is a positive sample from a different sensor at the same moment; N is a negative sample from a different moment; and margin is a threshold.
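By way of illustration, the triplet loss above can be sketched as follows, assuming the inputs are feature embeddings of the anchor, positive, and negative images and that dist() is the Euclidean distance (an assumption; another distance could be substituted).

```python
# Sketch of the triplet loss above; Euclidean distance is an assumed choice of dist().
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = F.pairwise_distance(anchor, positive)   # dist(A, P): same moment, other sensor
    d_neg = F.pairwise_distance(anchor, negative)   # dist(A, N): different moment
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```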
Example 2
An embodiment of the present disclosure provides a robotic arm grabbing system based on reinforcement learning and free of hand-eye calibration, comprising an agent and an agent simulation platform, wherein the agent comprises the robotic arm required to execute the task; the agent simulation platform comprises:
a model construction module, configured to construct an interactive simulation environment for the agent and define state values of the simulation environment representing state information of the agent and the environment;
an initialization module, configured to acquire the state data of the agent and the pose of the target object to be grabbed, and to identify the coordinates of grabbing points;
an action execution module, configured to adopt a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (10)

1. A robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration, characterized by comprising the following steps:
constructing an interactive simulation environment for the agent, and defining state values of the simulation environment;
acquiring the state data of the agent and the pose of the target object to be grabbed, and identifying the coordinates of grabbing points;
adopting a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; and meanwhile recognizing the agent's behavior in grabbing the target object, defining a reward mechanism for the behaviors generated during grabbing, and supervising the agent's grabbing behavior.
2. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein the state data of the agent are the end pose of the robotic arm, the data of each joint of the robotic arm, the pose of the object to be grabbed, the coordinates of the grabbing point, and the sensor data.
3. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein under the reward mechanism the grabbing task is considered failed and execution is terminated when the robotic arm collides, moves beyond the operating range, or exceeds the maximum step limit.
4. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 3, wherein when the agent fails to complete the grabbing task for a long time, a reward is generated to urge the agent to complete the grabbing task quickly.
5. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein when the agent grabs the target object successfully, the grabbing task is terminated.
6. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein the closed-loop grabbing strategy is optimized with proximal policy optimization, comprising policy update and constrained optimization: the policy is updated using an approximation of the policy gradient, and a constraint is set to limit the magnitude of the policy change.
7. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 6, wherein contrastive learning and privileged learning are introduced while the closed-loop grabbing strategy is optimized, so that the distance between the data acquired by the current agent and data from different viewing angles at the same moment is reduced and the distance to data from different moments is increased, enabling the agent to continually capture the relative relation between the robotic arm and the object to be grabbed.
8. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 7, wherein training is divided into two processes, training of a teacher network and training of a student network; the input of the teacher network training is privileged information, the privileged information being the pose of the robotic arm, the pose of the target object to be grabbed and the positions of the sensors, and the output is the action executed by the agent; the agent executes the output action to interact with the environment, training is iterated repeatedly, and the teacher network learns to grab using the privileged information.
9. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 8, wherein the training input of the student network is externally read image data; the student network is trained with the externally read image data, the data of the teacher network are then migrated into the student network, and through joint iteration and supervision of the teacher network and the student network the student network finally serves as the network used in deployment.
10. A robotic arm grabbing system based on reinforcement learning and free of hand-eye calibration, characterized by comprising an agent and an agent simulation platform, wherein the agent comprises the robotic arm required to execute the task; the agent simulation platform comprises:
a model construction module, configured to construct an interactive simulation environment for the agent and define state values of the simulation environment;
an initialization module, configured to acquire the state data of the agent and the pose of the target object to be grabbed, and to identify the coordinates of grabbing points;
an action execution module, configured to adopt a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
CN202310250707.7A 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration Pending CN116276998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310250707.7A CN116276998A (en) 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310250707.7A CN116276998A (en) 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Publications (1)

Publication Number Publication Date
CN116276998A true CN116276998A (en) 2023-06-23

Family

ID=86830104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310250707.7A Pending CN116276998A (en) 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Country Status (1)

Country Link
CN (1) CN116276998A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116834018A (en) * 2023-08-07 2023-10-03 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN117313546A (en) * 2023-10-26 2023-12-29 北京大学 Trusted smart hand system simulation method and simulation system
CN117313546B (en) * 2023-10-26 2024-07-26 北京大学 Trusted smart hand system simulation method and simulation system


Similar Documents

Publication Publication Date Title
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN110076772B (en) Grabbing method and device for mechanical arm
JPH06314103A (en) Controller and passive sensing device
Huang et al. Grasping novel objects with a dexterous robotic hand through neuroevolution
CN111223141A (en) Automatic assembly line work efficiency optimization system and method based on reinforcement learning
Li et al. Neural learning and kalman filtering enhanced teaching by demonstration for a baxter robot
CN116276998A (en) Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
Xu et al. Uncalibrated visual servoing of mobile manipulators with an eye-to-hand camera
CN117103277A (en) Mechanical arm sensing method based on multi-mode data fusion
Aleotti et al. Position teaching of a robot arm by demonstration with a wearable input device
Mielke et al. Human-robot co-manipulation of extended objects: Data-driven models and control from analysis of human-human dyads
CN113119073A (en) Mechanical arm system based on computer vision and machine learning and oriented to 3C assembly scene
Hansen et al. Transferring human manipulation knowledge to robots with inverse reinforcement learning
Nazari et al. Deep functional predictive control (deep-fpc): Robot pushing 3-d cluster using tactile prediction
CN111015676A (en) Grabbing learning control method and system based on hands-free eye calibration, robot and medium
Gomes et al. Deep Reinforcement learning applied to a robotic pick-and-place application
Fjerdingen et al. A learning camera platform for remote operations with industrial manipulators
Li et al. Manipulator Motion Planning based on Actor-Critic Reinforcement Learning
US11921492B2 (en) Transfer between tasks in different domains
Zhong et al. Regressor-Based Model Adaptation for Shaping Deformable Linear Objects with Force Control
Ruud Reinforcement learning with the TIAGo research robot: manipulator arm control with actor-critic reinforcement learning
CN115890744B (en) TD 3-based mechanical arm 6-DOF object manipulation training method and system
Elachkar Robot Learning From Human Observation Using Deep Neural Networks
Zürn et al. Software architecture for deformable linear object manipulation: A shape manipulation case study
Tao et al. A Robot Self-learning Grasping Control Method Based on Gaussian Process and Bayesian Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination