CN116276998A - Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration - Google Patents

Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Info

Publication number
CN116276998A
Authority
CN
China
Prior art keywords
grabbing
agent
reinforcement learning
hand
environment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310250707.7A
Other languages
Chinese (zh)
Inventor
张伟
褚石磊
杨硕
宋然
李大猷
魏鹏坤
赵晨琨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202310250707.7A
Publication of CN116276998A
Legal status: Pending

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1679 Programme controls characterised by the tasks executed

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention provides a robotic arm grabbing method and system based on reinforcement learning that requires no hand-eye calibration, and relates to the technical field of reinforcement learning. The method comprises: constructing an interactive simulation environment for the agent and defining state values of the simulation environment; acquiring the state data of the agent and the pose of the target object to be grabbed, and identifying the coordinates of grabbing points; adopting a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; and, at the same time, recognizing the agent's behavior in grabbing the target object, defining a reward mechanism for the behaviors generated during grabbing, and supervising the agent's grabbing behavior. The method and system dispense with the tedious steps of hand-eye calibration.

Description

Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
Technical Field
The disclosure relates to the technical field of reinforcement learning, in particular to a hand-eye calibration-free mechanical arm grabbing method and system based on reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
With the development of industries such as logistics and unmanned factories, higher requirements have been placed on the practicality and intelligence of robotic arm grabbing algorithms. Automatic grabbing by a robotic arm has evolved from traditional teach-and-playback control and fixed-path motion toward the intelligent capability of performing target detection, environment perception, path planning, and execution through a vision sensor and a computing module. In scenarios where the robotic arm must cooperate with a vision sensor to grab objects, hand-eye calibration is an important step: the spatial coordinate transformation between the robotic arm and the vision sensor is obtained by sampling and computation, realizing the mapping from the world coordinate system to the image coordinate system. The general steps of traditional hand-eye calibration are: first, establish a model of the hand-eye relation and define the base coordinate system, end-effector coordinate system, calibration-board coordinate system, and camera coordinate system; second, compute the transformations between these coordinate systems; finally, obtain the calibrated hand-eye model. However, traditional hand-eye calibration has several problems: the steps are tedious, the sampling process requires considerable manual intervention and introduces errors, and the accumulated error grows as the system is used. Whatever calibration method is used, it is never as accurate as the coordination of a human hand and eye.
The inventors found that hand-eye calibration has the following disadvantages:
1) The accuracy of hand-eye calibration is limited by the operator's skill level: the more skilled the operator, the higher the calibration accuracy.
2) The hand-eye calibration procedure is tedious and must be carried out by an operator following fixed steps, so it is time-consuming.
3) The accuracy of hand-eye calibration is strongly affected by the environment, such as illumination and temperature.
4) The accuracy of hand-eye calibration is strongly affected by the robot itself, such as the robot's accuracy and stability.
5) The accuracy of hand-eye calibration is strongly affected by the calibration object, such as its accuracy and stability.
Disclosure of Invention
To solve the above problems, the present disclosure provides a hand-eye-calibration-free robotic arm grabbing method and system based on reinforcement learning. An end-to-end machine learning model is designed by means of a deep reinforcement learning algorithm and imitates human hand-eye coordination, so that the agent can grab a target object autonomously under a continuously changing viewing angle, realizing robotic arm grabbing without hand-eye calibration.
According to some embodiments, the present disclosure employs the following technical solutions:
the manipulator grabbing method for hand-eye calibration free based on reinforcement learning is characterized by comprising the following steps of:
constructing an intelligent agent interactive simulation environment, and defining a state value of the simulation environment for representing state information of the intelligent agent and the environment;
acquiring state data of the intelligent body, the gesture of a target object to be grabbed and identifying coordinates of grabbing points;
adopting an optimal closed loop grabbing strategy, enabling an intelligent body to perceive information of a current environment, calculating optimal grabbing actions according to the environment information, after executing the actions, perceiving again, acquiring new environment information, and repeating until a target object is grabbed successfully; and simultaneously, recognizing the behavior of the agent for grabbing the target object, defining a reward mechanism for the behavior generated in the grabbing process, and supervising the grabbing behavior of the agent.
According to some embodiments, the present disclosure employs the following technical solutions:
the manipulator grabbing system based on reinforcement learning and free of hand-eye calibration comprises an intelligent body and an intelligent simulation platform, wherein the intelligent body comprises a manipulator required by executing tasks; the agent simulation platform comprises:
the model construction module is used for constructing an intelligent agent interactive simulation environment, defining a state value of the simulation environment and representing state information of the intelligent agent and the environment;
the initialization module is used for acquiring state data of the intelligent body and the gesture of the target object to be grabbed, and identifying coordinates of grabbing points;
the action execution module is used for adopting an optimal closed loop grabbing strategy, enabling an intelligent agent to perceive information of the current environment, calculating an optimal grabbing action according to the environment information, after the action is executed, perceiving again, acquiring new environment information, and repeating until a target object is grabbed successfully; and simultaneously, recognizing the behavior of the agent for grabbing the target object, defining a reward mechanism for the behavior generated in the grabbing process, and supervising the grabbing behavior of the agent.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method is based on a PPO algorithm, a Proximal strategy optimization method is used, a strategy gradient and importance sampling-based line learning strategy is used for predicting actions by inputting a 'state', and training is obtained in iteration with the environment, so that certain capability can be learned. Compared with the traditional robot control method, the sensitivity degree of reinforcement learning to parameters is lower than that of the traditional control method; reinforcement learning has advantages for difficult-to-model system control; reinforcement learning is a strategy, not simple parameter adjustment, so that beyond human preset capabilities can be learned. The use of reinforcement learning can help the robotic arm accomplish a particular task without requiring complex manual programming. The reinforcement learning does not need training data, and the cost of collecting training samples is saved by learning through interaction with the environment.
The end-to-end method provided by the present disclosure can be trained once and deployed many times, omitting the tedious step of hand-eye calibration. The approach forms a closed loop in the hand-eye relationship and therefore also eliminates accumulated errors. By designing a corresponding input layer, different sensor configurations can be accommodated, yielding a grabbing strategy that is convenient to deploy and reliable in success rate, applicable to robotic arms in logistics, assembly, and similar scenarios. The method is also an exploration of applying reinforcement learning and demonstrates that reinforcement learning is feasible in a concrete application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a simulated reinforcement learning environment of an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a contrastive learning structure according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a teacher and student network in accordance with an embodiment of the present disclosure;
FIG. 4 is a teacher network flow chart of an embodiment of the present disclosure;
fig. 5 is a student network flow diagram of an embodiment of the present disclosure.
Detailed Description of Embodiments
the disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example 1
An embodiment of the present disclosure provides a robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration, comprising:
Step one: constructing an interactive simulation environment for the agent, and defining state values of the simulation environment to represent state information of the agent and the environment;
Step two: acquiring the state data of the agent and the pose of the target object to be grabbed, and identifying the coordinates of grabbing points;
Step three: adopting a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
As an embodiment, robotic arm grabbing is trained with reinforcement learning. The task is to enable the agent to grab the target object autonomously under a continuously changing viewing angle, so that grabbing without hand-eye calibration can be realized. First, an interactive simulation environment for the agent is constructed, and the states of all objects in the interactive simulation environment are defined.
first, the appropriate type of robotic arm, end effector, and sensor needs to be selected. The model of the mechanical arm is UR5, which is designed and produced by the company Universal robot of Denmark, has 6 degrees of freedom, a payload of 5 kg, a working radius of 850 mm and repeated positioning accuracy of +/-0.1 mm, and is suitable for light assembly, pick-up, placement, grinding and polishing and other scenes. The model of the end effector is Robotiq manipulator 85, which is an adaptive robotic Gripper that can be used to grasp robots and the like for a variety of applications. The model of the vision sensor is Kinect, which is a 3D camera produced by Microsoft corporation, integrates a depth camera and a color camera, and is suitable for advanced computer vision tasks.
The sensor is the means by which the agent perceives the environment; in this task, it consists of multiple Kinect cameras assumed to be at different positions. The base position of the sensor is 0.5 m above the ground with a pitch angle of 40 degrees, facing the agent and several positions of the working area, as shown in FIG. 1, which includes the positions of two Kinect sensors. On this basis the sensor randomly adjusts its position and orientation within a certain range according to a certain probability distribution, acquiring richer viewing-angle data. The sensor collects a depth map and an RGB color image, each with a resolution of 640 × 480, as input for the subsequent steps.
Secondly, the environment is defined: a simulated or real environment composed of the agent and other physical elements, which accepts certain actions and produces corresponding results. In this task, the environment is a CoppeliaSim virtual simulation environment running on a computer; as shown in FIG. 1, its main functions include simulation of the robotic arm, simulation of accessories such as sensors and suction cups, and physics simulation.
Another important function of the simulation environment is the interface it provides for external programs. CoppeliaSim provides a variety of interfaces; the present invention uses its interface and libraries for the Python language, and some functions in the simulation environment are implemented by Lua scripts.
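By way of non-limiting illustration, an external Python program might read scene data through CoppeliaSim's legacy remote API roughly as follows; the port number and the scene object names ('Kinect_rgb', 'UR5_joint1') are assumptions that depend on the particular scene and are introduced only for the example.

```python
# Illustrative sketch: reading scene data from CoppeliaSim over the legacy remote API.
# Assumes CoppeliaSim is running with the remote API server enabled on port 19997.
# Object names such as 'Kinect_rgb' and 'UR5_joint1' are hypothetical scene names.
import sim  # CoppeliaSim legacy remote API Python bindings

client_id = sim.simxStart('127.0.0.1', 19997, True, True, 5000, 5)
if client_id == -1:
    raise RuntimeError('Could not connect to CoppeliaSim')

# Handles for a camera and one arm joint (names depend on the scene).
_, cam = sim.simxGetObjectHandle(client_id, 'Kinect_rgb', sim.simx_opmode_blocking)
_, joint1 = sim.simxGetObjectHandle(client_id, 'UR5_joint1', sim.simx_opmode_blocking)

# Read an RGB image (640 x 480 as configured in the scene) and a joint angle.
_, resolution, image = sim.simxGetVisionSensorImage(client_id, cam, 0, sim.simx_opmode_blocking)
_, angle = sim.simxGetJointPosition(client_id, joint1, sim.simx_opmode_blocking)

sim.simxFinish(client_id)
```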
After the environment is determined, the state must be defined. The state is the set of variables describing the current condition of the environment. In this task, the state comprises the end pose of the robotic arm, the data of each joint of the robotic arm, the pose of the target object to be grabbed and the coordinates of the grabbing point, and the sensor data; a brief code sketch of this state structure is given after the list below.
The definition of each state in this task is as follows:
1. The end pose of the robotic arm, including position and orientation, represented by a 3D position and a quaternion.
2. The data of each joint of the robotic arm, including angle and angular velocity, represented by a vector.
3. The pose of the target object to be grabbed, including position and orientation, represented by a 3D position and a quaternion.
4. The coordinates of the grabbing point, represented by a 3D position.
5. The sensor data, including the depth image and the color image, represented as image data.
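For concreteness, the five state components listed above might be collected into a single observation structure as sketched below; the field names and array shapes are illustrative assumptions and not part of the disclosure.

```python
# Minimal sketch of the observation (state) structure described above.
# Field names and shapes are illustrative assumptions.
import numpy as np

def build_observation(ee_pos, ee_quat, joint_angles, joint_vels,
                      target_pos, target_quat, grasp_point, depth_img, rgb_img):
    return {
        'ee_pose': np.concatenate([ee_pos, ee_quat]),            # (3,) + (4,) -> (7,)
        'joints': np.concatenate([joint_angles, joint_vels]),    # e.g. (6,) + (6,) -> (12,)
        'target_pose': np.concatenate([target_pos, target_quat]),
        'grasp_point': np.asarray(grasp_point),                   # (3,)
        'depth': depth_img,                                        # (480, 640) depth map
        'rgb': rgb_img,                                             # (480, 640, 3) color image
    }
```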
Next, the behavior (action) of the agent in this task is defined as:
a_i = (Δx_i, Δy_i, Δz_i, Δφ_i)
wherein Δx_i, Δy_i, Δz_i respectively denote the offsets of the end of the robotic arm along the three coordinate axes, and Δφ_i denotes the rotation angle of the wrist motor of the robotic arm, i.e. the rotation of the end effector about the z-axis.
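By way of illustration, one step of such an action can be thought of as a small Cartesian offset of the end effector plus a wrist rotation applied on top of the current pose; the clipping limits and helper names in the sketch below are assumptions introduced for the example only.

```python
# Illustrative sketch of applying one action step (dx, dy, dz, dphi).
# 'current_ee_position' and 'current_wrist_angle' are assumed helper values;
# the clipping limits are assumptions, not values taken from the disclosure.
import numpy as np

def apply_action(current_ee_position, current_wrist_angle, action,
                 max_step=0.02, max_rot=np.deg2rad(10)):
    dx, dy, dz, dphi = action
    # Clip increments so a single step stays small.
    delta = np.clip([dx, dy, dz], -max_step, max_step)
    new_position = np.asarray(current_ee_position) + delta
    new_wrist_angle = current_wrist_angle + np.clip(dphi, -max_rot, max_rot)
    return new_position, new_wrist_angle
```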
As an embodiment, in step two, the state data of the agent, the pose of the target object to be grabbed, and the coordinates of the grabbing point are acquired.
In the teacher-network training stage, the state data of the agent, the pose of the target object to be grabbed, and the grabbing point are read directly through the interface of the simulation environment and used as privileged information. In the training and inference stages of the student network, the state of the agent is derived from the kinematics of the robotic arm and its joint information, while the pose and grabbing point of the target object are obtained implicitly by a deep neural network. The planning process computes the optimal action from the environment information, executes it, perceives again to obtain new environment information, and iterates until the target object is grabbed successfully.
In step three, a closed-loop grabbing strategy is adopted: the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to obtain new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
A suitable reinforcement learning algorithm and structure is selected. Proximal Policy Optimization (PPO) is one of the most widely used reinforcement learning algorithms at present. Its core idea is to maximize the return of the policy within a limited number of steps while ensuring that the policy does not change too much. PPO is simple, easy to implement, and efficient, and is applicable to reinforcement learning tasks in many different environments. To achieve this goal, PPO uses a proximal policy optimization method consisting of two parts: policy update, in which the policy is updated using an approximation of the policy gradient; and constrained optimization, in which a constraint limits the magnitude of the policy change, ensuring the stability of the policy.
The closed-loop grabbing strategy is optimized with proximal policy optimization, comprising policy update and constrained optimization: the policy is updated using an approximation of the policy gradient, and a constraint is set to limit the magnitude of the policy change.
The optimization process mainly comprises two steps: collecting experience and updating the policy.
In the experience collection phase, the agent interacts with the environment according to the current policy, generating trajectory and state-transition samples that are then used to update the policy. In the policy update phase, the optimization objective is to maximize the expected return of the collected trajectories. PPO supports two update schemes, a single update and multiple updates. A single update has two steps: compute the objective function of the policy update, then update the policy parameters with a gradient-based method. When computing the objective, PPO uses a technique called the clipped surrogate objective, which controls the update by limiting how much the policy may change in each update; the size of this limit is the clip parameter. The multiple-update scheme repeats the single update, each time with new samples, while keeping the data-collecting policy unchanged; after each update the algorithm saves the current policy parameters as the old policy so that it can roll back to the previous state when needed. Overall, the PPO optimization process is iterative: by continuously collecting experience and updating the policy, the algorithm gradually learns the optimal policy parameters.
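As a brief illustration of the clipped surrogate objective mentioned above (not the full PPO training loop), the following PyTorch-style sketch shows how the probability ratio is clipped to limit the policy change; the tensor names and the clip value of 0.2 are assumptions.

```python
# Sketch of PPO's clipped surrogate objective (assumed tensor shapes; not the full algorithm).
import torch

def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio between the new and old policies (importance sampling).
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Unclipped and clipped objectives; taking the minimum limits the policy change.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # PPO maximizes the surrogate, so the loss is its negation.
    return -torch.min(unclipped, clipped).mean()
```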
The traditional grabbing strategy detects a single frame, computes a path, and then executes it. Its drawback is that the frame used for computation is not necessarily taken from the clearest and most distinguishable angle and moment, so its accuracy is very limited. When a human grabs an object, the eyes keep paying attention until the hand has grasped it; the advantage is that the whole process is closed-loop. The closed-loop grabbing strategy is a method for the robot to grab an object in which the robot's grabbing behavior and its perception continually influence each other.
As an embodiment, in the closed-loop grabbing strategy the robot first perceives and acquires information about the current environment, then computes the optimal grabbing action from this information, executes the action, and perceives again to acquire new environment information. This process is repeated until the robot successfully grabs the object. The advantage of the closed-loop grabbing strategy is that the robot can adjust its actions according to real-time perception, which improves the grabbing success rate.
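A non-limiting sketch of this closed-loop perceive-plan-act cycle is given below; env and policy are assumed interfaces (env.observe(), env.step(), policy.act()) introduced only for the example.

```python
# Sketch of the closed-loop perceive-plan-act grabbing loop described above.
# 'env' and 'policy' are assumed interfaces, not part of the disclosure.
def closed_loop_grasp(env, policy, max_steps=100):
    obs = env.observe()                          # perceive the current environment
    for _ in range(max_steps):
        action = policy.act(obs)                 # compute the (estimated) optimal grabbing action
        obs, grasped, failed = env.step(action)  # execute the action, then perceive again
        if grasped:
            return True                          # target object grabbed successfully
        if failed:
            return False                         # collision, out of range, etc.
    return False                                 # step limit exceeded
```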
At the same time, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
Specifically, under the reward mechanism, the grabbing task is considered failed and execution is terminated when the agent's robotic arm collides, moves beyond the operating range, or exceeds the maximum step limit. When the agent fails to complete the grabbing task for a long time, a negative reward is generated to urge the agent to complete the grabbing task quickly.
The agent's actions are rewarded in this task, with the total reward defined as:
R = R_completion + R_failure + R_survival
where R_completion is a positive reward granted only when the target object is grabbed successfully and is zero otherwise; R_failure is a negative penalty applied only when the grabbing task fails (collision, exceeding the operating range, or exceeding the step limit) and is zero otherwise; and R_survival = step * α with α < 0, a penalty on the number of steps taken.
When the agent's robotic arm collides, moves beyond the operating range, or exceeds the maximum step limit, the task is considered failed and execution of the grabbing task is terminated. When the agent fails to complete the task for a long time, a negative reward is generated, urging the agent to complete the task as soon as possible. Finally, the termination conditions of the agent's execution are defined as follows (a small code sketch of the reward and termination logic follows the list):
1) The robotic arm successfully grabs the object;
2) The robotic arm moves beyond the specified range;
3) The robotic arm collides with the environment;
4) The number of executed steps exceeds 100.
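For illustration only, the reward and termination logic described above might be sketched as follows; the numeric constants are assumptions, since the disclosure specifies only their signs.

```python
# Sketch of the reward and termination logic described above.
# R_SUCCESS, R_FAILURE and ALPHA are illustrative assumptions;
# the disclosure only specifies their signs (positive, negative, negative).
R_SUCCESS = 1.0     # granted when the object is grabbed successfully
R_FAILURE = -1.0    # applied on collision / out-of-range / step-limit failure
ALPHA = -0.01       # survival penalty coefficient (alpha < 0)
MAX_STEPS = 100

def reward_and_done(grasped, collided, out_of_range, step):
    failed = collided or out_of_range or step >= MAX_STEPS
    reward = step * ALPHA                 # survival term discourages long episodes
    if grasped:
        reward += R_SUCCESS
    elif failed:
        reward += R_FAILURE
    done = grasped or failed
    return reward, done
```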
Before the grabbing task is executed, the grabbing policy is optimized and trained, and contrastive learning and privileged learning are introduced, so that the distance between the data acquired by the current agent and data from different viewing angles at the same moment is reduced, while the distance to data from different moments is increased; in this way the agent continually captures the relative relation between the robotic arm and the target object to be grabbed.
As shown in FIG. 3, training is divided into two processes: training of the teacher network and training of the student network. The teacher network is trained first; its inputs are the privileged information and the output of the previous frame.
Privileged information refers to environment variables that can be read in the simulation environment but cannot be obtained by the robot or its sensors. In this task, the pose of the robotic arm, the pose of the object to be grabbed, and the positions of the sensors are treated as privileged information.
Although the pose of the robotic arm is readable through the arm's API, it is also treated here as privileged information. The privileged information and the historical output are fed into a multi-layer perceptron to obtain a feature vector containing the required information, which serves as the true state for reinforcement learning. The agent executes the output action to interact with the environment and obtains an environmental reward that evaluates the action in the current state. Through repeated iterative training, the teacher network learns to grab using the privileged information.
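By way of non-limiting illustration, the following PyTorch-style sketch shows a teacher policy of the kind described above; the layer widths and input dimensions are assumptions introduced only for the example.

```python
# Sketch of the teacher policy: an MLP over privileged information plus the previous output.
# Layer widths and input dimensionalities are illustrative assumptions.
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    def __init__(self, priv_dim=20, prev_out_dim=4, feat_dim=128, action_dim=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(priv_dim + prev_out_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.ReLU(),
        )
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, privileged, prev_action):
        feat = self.mlp(torch.cat([privileged, prev_action], dim=-1))  # feature vector
        return self.action_head(feat), feat   # action (dx, dy, dz, dphi) and feature vector
```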
Next comes the training of the student network. Because the student network is the one applied and tested, it cannot use privileged information that would not normally be available. The input of the student network comes from the sensors; since the dimensionality of the image data read from the sensors is too high to be processed by the multi-layer perceptron used in the teacher network, the deep network ResNet-50 is used for dimensionality reduction.
The subsequent steps are consistent with the teacher network. To migrate knowledge from the teacher network to the student network, three operations are employed (a sketch of the two supervision losses follows Step 3 below):
supervision of the feature vector, sharing of the multi-layer perceptron parameters, and supervision of the action output.
Step 1: supervision of the feature vector. The feature vector output by the teacher network is used to supervise the student network, so that the student network's output approaches the teacher network's output.
Step 2: sharing of the multi-layer perceptron parameters. To improve training efficiency, the parameters of the teacher network's multi-layer perceptron are shared with the student network.
Step 3: supervision of the action output. During the training of the student network, the action output by the teacher network is used as supervision to help the student network gradually learn the teacher network's behavior policy.
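A minimal sketch of the feature-vector and action-output supervision signals (Steps 1 and 3) is given below; the weighting factors are illustrative assumptions. Parameter sharing (Step 2) would simply copy or tie the perceptron weights and is not shown.

```python
# Sketch of the two supervision signals used to distill the teacher into the student.
# Tensors are assumed: teacher_feat / teacher_action come from the (frozen) teacher,
# student_feat / student_action from the student network.
import torch.nn.functional as F

def distillation_loss(student_feat, teacher_feat, student_action, teacher_action,
                      w_feat=1.0, w_act=1.0):
    feat_loss = F.mse_loss(student_feat, teacher_feat.detach())      # feature supervision
    act_loss = F.mse_loss(student_action, teacher_action.detach())   # action supervision
    return w_feat * feat_loss + w_act * act_loss
```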
Through these three steps, the knowledge and experience of the teacher network are effectively migrated to the student network. Under this joint supervision and iteration, the student network gradually learns to act on its own; after convergence it has the ability to operate independently, at which point the teacher network can be removed and the student network used alone.
To further improve the training effect, contrastive learning is introduced into the training process. Contrastive learning is a machine learning technique that helps a learning system discover patterns in the training data so as to predict unknown data better. Its basic idea is that by comparing different training samples, the system learns better representations and therefore generalizes better. In reinforcement learning, contrastive learning can be used to help train the agent toward the optimal policy.
The specific implementation is as follows: each training step takes an anchor sample together with one positive sample and one negative sample. The positive sample is chosen from data captured by a different sensor at the same moment, and the negative sample is sensor data from a different moment. The purpose of this arrangement is to let the agent notice more quickly what is common across viewing angles at the same moment: the relative relation between the robotic arm and the object to be grabbed is consistent across viewing angles but differs across moments.
The contrastive learning structure and the triplet loss function are shown in FIG. 2: the distance between the current image and the positive sample is made smaller, while the distance between the current image and the negative sample is made larger. In this process, the agent notices the relative relation between the robotic arm and the object to be grabbed more quickly.
loss = max(0, dist(A, P) - dist(A, N) + margin)
wherein A is the current (anchor) image data; P is a positive sample from a different sensor at the same moment; N is a negative sample from a different moment; and margin is a threshold.
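By way of illustration, the triplet loss above can be sketched as follows, assuming the inputs are feature embeddings of the anchor, positive, and negative images and that dist() is the Euclidean distance (an assumption; another distance could be substituted).

```python
# Sketch of the triplet loss above; Euclidean distance is an assumed choice of dist().
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    d_pos = F.pairwise_distance(anchor, positive)   # dist(A, P): same moment, other sensor
    d_neg = F.pairwise_distance(anchor, negative)   # dist(A, N): different moment
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```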
Example 2
An embodiment of the present disclosure provides a robotic arm grabbing system based on reinforcement learning and free of hand-eye calibration, comprising an agent and an agent simulation platform, wherein the agent comprises the robotic arm required to execute the task; the agent simulation platform comprises:
a model construction module, configured to construct an interactive simulation environment for the agent and define state values of the simulation environment representing state information of the agent and the environment;
an initialization module, configured to acquire the state data of the agent and the pose of the target object to be grabbed, and to identify the coordinates of grabbing points;
an action execution module, configured to adopt a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the specific embodiments of the present disclosure have been described above with reference to the drawings, it should be understood that the present disclosure is not limited to the embodiments, and that various modifications and changes can be made by one skilled in the art without inventive effort on the basis of the technical solutions of the present disclosure while remaining within the scope of the present disclosure.

Claims (10)

1. A robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration, characterized by comprising the following steps:
constructing an interactive simulation environment for the agent, and defining state values of the simulation environment;
acquiring the state data of the agent and the pose of the target object to be grabbed, and identifying the coordinates of grabbing points;
adopting a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; and meanwhile recognizing the agent's behavior in grabbing the target object, defining a reward mechanism for the behaviors generated during grabbing, and supervising the agent's grabbing behavior.
2. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein the state data of the agent are the end pose of the robotic arm, the data of each joint of the robotic arm, the pose of the object to be grabbed, the coordinates of the grabbing point, and the sensor data.
3. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein under the reward mechanism the grabbing task is considered failed and execution is terminated when the robotic arm collides, moves beyond the operating range, or exceeds the maximum step limit.
4. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 3, wherein when the agent fails to complete the grabbing task for a long time, a reward is generated to urge the agent to complete the grabbing task quickly.
5. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein when the agent grabs the target object successfully, the grabbing task is terminated.
6. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 1, wherein the closed-loop grabbing strategy is optimized with proximal policy optimization, comprising policy update and constrained optimization: the policy is updated using an approximation of the policy gradient, and a constraint is set to limit the magnitude of the policy change.
7. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 6, wherein contrastive learning and privileged learning are introduced while the closed-loop grabbing strategy is optimized, so that the distance between the data acquired by the current agent and data from different viewing angles at the same moment is reduced and the distance to data from different moments is increased, enabling the agent to continually capture the relative relation between the robotic arm and the object to be grabbed.
8. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 7, wherein training is divided into two processes, training of a teacher network and training of a student network; the input of the teacher network training is privileged information, the privileged information being the pose of the robotic arm, the pose of the target object to be grabbed and the positions of the sensors, and the output is the action executed by the agent; the agent executes the output action to interact with the environment, training is iterated repeatedly, and the teacher network learns to grab using the privileged information.
9. The robotic arm grabbing method based on reinforcement learning and free of hand-eye calibration according to claim 8, wherein the training input of the student network is externally read image data; the student network is trained with the externally read image data, the data of the teacher network are then migrated into the student network, and through joint iteration and supervision of the teacher network and the student network the student network finally serves as the network used in deployment.
10. A robotic arm grabbing system based on reinforcement learning and free of hand-eye calibration, characterized by comprising an agent and an agent simulation platform, wherein the agent comprises the robotic arm required to execute the task; the agent simulation platform comprises:
a model construction module, configured to construct an interactive simulation environment for the agent and define state values of the simulation environment;
an initialization module, configured to acquire the state data of the agent and the pose of the target object to be grabbed, and to identify the coordinates of grabbing points;
an action execution module, configured to adopt a closed-loop grabbing strategy in which the agent perceives the current environment, computes the optimal grabbing action from the environment information, executes the action, perceives again to acquire new environment information, and repeats until the target object is grabbed successfully; meanwhile, the agent's behavior in grabbing the target object is recognized, a reward mechanism is defined for the behaviors generated during grabbing, and the agent's grabbing behavior is supervised.
CN202310250707.7A 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration Pending CN116276998A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310250707.7A CN116276998A (en) 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310250707.7A CN116276998A (en) 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Publications (1)

Publication Number Publication Date
CN116276998A true CN116276998A (en) 2023-06-23

Family

ID=86830104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310250707.7A Pending CN116276998A (en) 2023-03-10 2023-03-10 Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration

Country Status (1)

Country Link
CN (1) CN116276998A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116834018A (en) * 2023-08-07 2023-10-03 南京云创大数据科技股份有限公司 Training method and training device for multi-mechanical arm multi-target searching
CN117313546A (en) * 2023-10-26 2023-12-29 北京大学 Trusted smart hand system simulation method and simulation system
CN117313546B (en) * 2023-10-26 2024-07-26 北京大学 Trusted smart hand system simulation method and simulation system


Similar Documents

Publication Publication Date Title
CN110000785B (en) Agricultural scene calibration-free robot motion vision cooperative servo control method and equipment
CN110076772B (en) Grabbing method and device for mechanical arm
JPH06314103A (en) Controller and passive sensing device
Huang et al. Grasping novel objects with a dexterous robotic hand through neuroevolution
CN111223141A (en) Automatic assembly line work efficiency optimization system and method based on reinforcement learning
Li et al. Neural learning and kalman filtering enhanced teaching by demonstration for a baxter robot
CN116276998A (en) Arm grabbing method and system based on reinforcement learning and free of hand-eye calibration
Xu et al. Uncalibrated visual servoing of mobile manipulators with an eye-to-hand camera
CN117103277A (en) Mechanical arm sensing method based on multi-mode data fusion
Aleotti et al. Position teaching of a robot arm by demonstration with a wearable input device
Mielke et al. Human-robot co-manipulation of extended objects: Data-driven models and control from analysis of human-human dyads
CN113119073A (en) Mechanical arm system based on computer vision and machine learning and oriented to 3C assembly scene
Hansen et al. Transferring human manipulation knowledge to robots with inverse reinforcement learning
Nazari et al. Deep functional predictive control (deep-fpc): Robot pushing 3-d cluster using tactile prediction
CN111015676A (en) Grabbing learning control method and system based on hands-free eye calibration, robot and medium
Gomes et al. Deep Reinforcement learning applied to a robotic pick-and-place application
Fjerdingen et al. A learning camera platform for remote operations with industrial manipulators
Li et al. Manipulator Motion Planning based on Actor-Critic Reinforcement Learning
US11921492B2 (en) Transfer between tasks in different domains
Zhong et al. Regressor-Based Model Adaptation for Shaping Deformable Linear Objects with Force Control
Ruud Reinforcement learning with the TIAGo research robot: manipulator arm control with actor-critic reinforcement learning
CN115890744B (en) TD 3-based mechanical arm 6-DOF object manipulation training method and system
Elachkar Robot Learning From Human Observation Using Deep Neural Networks
Zürn et al. Software architecture for deformable linear object manipulation: A shape manipulation case study
Tao et al. A Robot Self-learning Grasping Control Method Based on Gaussian Process and Bayesian Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination