CN118254170A - Mechanical arm progressive training method based on deep reinforcement learning, storage medium and electronic equipment

Info

Publication number
CN118254170A
CN118254170A (application CN202410338659.1A)
Authority
CN
China
Prior art keywords
training
mechanical arm
stage
model
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410338659.1A
Other languages
Chinese (zh)
Inventor
王赞
陈慧彬
连晨轩
董孟豪
吴俊霆
万富瑞
喻方
胡淏泓
孙龙辉
陈立家
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University
Original Assignee
Henan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University filed Critical Henan University
Priority to CN202410338659.1A priority Critical patent/CN118254170A/en
Publication of CN118254170A publication Critical patent/CN118254170A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a mechanical arm progressive training method based on deep reinforcement learning, a storage medium and electronic equipment. The method comprises the following steps: constructing a mechanical arm simulation working environment according to a real scene and completing the first-stage training task of the progressive training system with a model-free algorithm; completing the second-stage training task of the progressive training system with a world model algorithm; and executing specific tasks with the mechanical arm using the trained model-free algorithm and world model algorithm. According to the invention, the complex task is decomposed: the working space is divided from outside to inside into a coarse operation area and a fine operation area, and the training of the mechanical arm is correspondingly divided into a coarse training stage and a fine training stage. By combining a non-end-to-end mode with an end-to-end mode, the training scheme in the system is superior in both training time and training precision to adopting a model-free algorithm or a world model algorithm alone, so that the training efficiency of the mechanical arm is greatly improved.

Description

Mechanical arm progressive training method based on deep reinforcement learning, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of visual mechanical arms, in particular to a mechanical arm progressive training method based on deep reinforcement learning.
Background
At present, deep reinforcement learning has been extended to the field of mechanical arms. However, deep reinforcement learning methods are usually applied in reality in a Sim-to-real mode, i.e. the mechanical arm is trained in a simulation environment and the training result is transferred back to reality, the most representative example being OpenAI's Rubik's Cube manipulation; this application mode, however, cannot be applied to all real scenes. Visual recognition and manipulation of objects is a relatively simple task for humans, but it remains very challenging for mechanical arms. On this basis, researchers have proposed various improvements to the Sim-to-real mode. One is a demonstration-based reinforcement learning mode that uses a model-free algorithm with sparse rewards, takes human demonstrations and actual interaction as experience-pool data, and replaces reward functions that are difficult to tune with human task demonstrations, so that specified tasks can be completed. However, this method is limited by human capability: some tasks may be too dangerous, too tedious or simply beyond the human capability range and cannot be demonstrated, so labor cost is high and flexibility is lacking. Another is the world model algorithm proposed by the *** team, which models the environment the agent interacts with, learns compressed spatial and temporal representations of the environment, and uses the features extracted from the model as the agent's input, so that a compact and simple strategy can be trained to solve the required task.
Deep reinforcement learning can be categorized into non-end-to-end training and end-to-end training. The non-end-to-end mode generally requires manual feature-extraction preprocessing of the raw data and uses the extracted non-image features for training, for example using coordinate information as input. This reduces the number of steps required for training and thereby shortens the training time of the mechanical arm, but errors produced during data preprocessing accumulate gradually over training, and for complex tasks with high precision requirements the training effect of the model is poor, so this mode is generally suitable only for simple tasks with low precision requirements. The end-to-end mode trains the mechanical arm directly from input to output without manual feature extraction or processing steps, typically taking images as input; it can handle complex tasks with high precision requirements well, but during training the image processing produces a large amount of redundant information, which greatly prolongs the training time. Aiming at the problems of these two modes, scholars at home and abroad have proposed various improvements. For example, a mechanical arm grasping framework based on deep reinforcement learning adopts the non-end-to-end mode: a visual-perception control strategy is trained separately, visual perception generates a physical description of the grasped object, and the strategy uses this description to determine the optimal action; good results are obtained after only about 30 minutes of training, but the framework is limited to grasping at a fixed height and lacks versatility. In addition, a mobile manipulation system adopting the end-to-end mode can achieve autonomous grasping in various simulated and real scenes, but the process needs four million training iterations, consumes a great amount of training time and is not suitable for practical deployment.
Disclosure of Invention
The invention aims to provide a deep reinforcement learning-based mechanical arm progressive training method, a storage medium and electronic equipment, which can improve the training effect in the mechanical arm training process and shorten the training time.
The invention adopts the technical scheme that:
A mechanical arm progressive training method based on deep reinforcement learning comprises the following steps:
step 1: constructing a mechanical arm simulation working environment according to a real scene, and specifically:
Step 1.1: selecting the type of the mechanical arm, the required tail end clamping jaw of the mechanical arm and the part to be placed in a simulation environment according to a real scene; placing a plurality of cameras around the mechanical arm to obtain a multi-view image of the mechanical arm;
Step 1.2: the simulation working environment is ensured to be consistent with the reality scene through the calibration of the simulation system;
Step 1.3: a spherical working space taking a preset grabbing and placing target point as a sphere center is divided into a coarse operation area and a fine operation area from outside to inside, the center of the sphere is the fine operation area, and then the number of actions required by the mechanical arm and the execution range of the actions are determined according to the coarse operation area and the fine operation area;
step 1.4: the training of the placement direction of the mechanical arm is a first-stage training task, namely a coarse training stage, and the training of the mechanical arm for continuously completing the complex placement task after the placement direction of the mechanical arm is determined is a second-stage training task, namely a fine training stage, and specific task rewarding functions are respectively set for the coarse training stage and the fine training stage;
step 1.5: randomly performing a number of actions to obtain corresponding feedback;
Step 2: adopting a model-free algorithm to complete a first-stage training task of the progressive training system;
Step 3: adopting a world model algorithm to complete a second stage training task of the progressive training system;
Step 4: and performing specific task execution on the mechanical arm by adopting a trained model-free algorithm and a world model algorithm.
The step 2 specifically comprises the following steps:
Step 2.1: when training the first-stage task model, samples are first selected using prioritized experience replay and temporarily stored;
Step 2.2: training any one of the samples in the batch by adopting an AC architecture network model in a model-free algorithm;
Step 2.3: updating parameters of the AC architecture network model in the first stage training;
Step 2.4: and (3) observing whether the return function is converged, if not, continuing to repeat the step (2.3) for carrying out the next sample iterative training, and if so, enabling the tail end clamping jaw of the mechanical arm to reach the range to be placed, and entering the next training stage.
The step 3 specifically comprises the following steps:
Step 3.1: the mechanical arm performs a number of rounds of random actions in a range to be placed, acquires corresponding step number tracks, and fills the tracks into an experience pool;
step 3.2: randomly selecting a continuous track sequence from the experience pool for training parameters in the world model;
Step 3.3: predicting future rewards values by using the network model in the world model predictor, and updating the AC network model parameters in the world model planner in the second stage training;
step 3.4: the mechanical arm selects actions by utilizing hidden variables in the world model and interacts with the environment, and the experience pool is continuously filled;
step 3.5: and observing whether the return function is converged, if not, continuing to iterate the training, and if so, using the algorithm in a specific task of the mechanical arm.
In step 3.2, a continuous track sequence is randomly selected from the data set; the continuous track sequence comprises local scene images composed of the tail end clamping jaw of the mechanical arm and the target object, so that more useful information in the image is retained; each image is compressed into a hidden variable by an encoder, and after this hidden variable is fused with the hidden variable of the world model, the hidden variable of the world model is learned and represents compressed three-dimensional state information.
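By way of illustration only, such a local scene image might be obtained by cropping the camera frame around the jaw and the target object before encoding; the crop size, the pixel-coordinate interface and the function name below are assumptions and do not appear in the patent:
```python
import numpy as np

def local_scene_crop(frame, jaw_xy, obj_xy, crop=64):
    """Crop a square patch covering the tail end clamping jaw and the target object.

    frame   -- H x W x 3 camera image as a numpy array (assumed at least crop x crop)
    jaw_xy  -- (row, col) pixel position of the jaw centre
    obj_xy  -- (row, col) pixel position of the object centre
    """
    cy = int((jaw_xy[0] + obj_xy[0]) / 2)   # midpoint between jaw and object
    cx = int((jaw_xy[1] + obj_xy[1]) / 2)
    h, w = frame.shape[:2]
    top = min(max(cy - crop // 2, 0), h - crop)
    left = min(max(cx - crop // 2, 0), w - crop)
    return frame[top:top + crop, left:left + crop]
```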
The reward function formula of the coarse training stage of the training system in the step 1.4 is as follows:
wherein dist[0] and dist[1] are the distances from the center point of the end clamping jaw to the center point of the target object in the x and y directions respectively;
The fine training stage reward function of the training system is divided into three parts: the approach reward R_reach, the grasp reward R_grasp and the lift reward R_lift, where the grasp and lift rewards are {0, 0.25} and {0, 1} respectively, and the approach reward R_reach is calculated as follows:
Rreach = (1 - tanh(10*d))   (2)
Where d is the Euclidean distance of the end jaw center from the target object center point, and tanh is the hyperbolic tangent function.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes a device in which the computer readable storage medium resides to perform a deep reinforcement learning based robotic arm progressive training method as described above.
An electronic device, comprising: the system comprises a memory and a processor, wherein a program capable of running on the processor is stored in the memory, and the processor realizes the mechanical arm progressive training method based on deep reinforcement learning when executing the program.
According to the invention, the spherical working space taking the preset grabbing and placing target point as the sphere center is divided into a coarse operation area and a fine operation area from outside to inside, so the training of the mechanical arm is likewise divided into a coarse training stage and a fine training stage according to these two areas. The first stage is coarse training: the task is simpler, only the placement direction of the mechanical arm needs to be determined, and a model-free algorithm is adopted to send the mechanical arm clamping jaw from the coarse operation area into the fine operation area. The second stage is fine training: the mechanical arm is required to continue to complete complex tasks in the fine operation area with high accuracy, so the world model algorithm is adopted for placement training. By combining the model-free algorithm and the world model algorithm and dividing the mechanical arm training into two stages, the non-end-to-end mode used in the first stage effectively solves the problem of long training time in training modes that take images as input and reduces the overall training time of the mechanical arm, while the world model algorithm used in the second stage effectively solves the problem of the poor training effect of the model-free algorithm and significantly improves the placement success rate of the mechanical arm; moreover, since the input adopts a local scene image, more useful information in the image is retained and the training time is greatly shortened.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a graph comparing the performance of the present invention with model-free and world model algorithms used alone.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1 and 2, the present invention includes the steps of:
Step 1: constructing a mechanical arm simulation working environment according to a real scene;
step 1.1: the simulated mechanical arm is a Bronsted BRTIRUS A type six-axis mechanical arm, the tail end clamping jaw of the mechanical arm is a three-finger clamping jaw, and a small block to be placed is positioned in front of the mechanical arm. A plurality of cameras are placed around the mechanical arm to obtain multi-view images of the mechanical arm;
Step 1.2: the simulation working environment is ensured to be consistent with the reality scene through the calibration of the simulation system;
step 1.3: the mechanical arm has six degrees of freedom, and the mechanical claw has only two motion instructions, grab and release, whose action quantity is 0 or 1, so the mechanical arm has seven action quantities in total; the action limit ranges are as follows:
{[-20°,60°],[-60°,20°],[-45°,80°],[-90°,90°],[-30°,60°],[-90°,90°],{0,1}}
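For illustration, the action limits above could be represented as follows; this is only a sketch, and the array layout, the degree units at execution time and the clipping helper are assumptions rather than part of the patent:
```python
import numpy as np

# Joint-angle limits (degrees) for the six axes listed above, plus a binary gripper command.
JOINT_LIMITS_DEG = np.array([
    [-20.0, 60.0],
    [-60.0, 20.0],
    [-45.0, 80.0],
    [-90.0, 90.0],
    [-30.0, 60.0],
    [-90.0, 90.0],
])
GRIPPER_ACTIONS = (0, 1)  # 0 = release, 1 = grasp

def clip_action(raw_action):
    """Clip a 7-dimensional action: six joint targets in degrees plus a gripper flag."""
    joints = np.clip(raw_action[:6], JOINT_LIMITS_DEG[:, 0], JOINT_LIMITS_DEG[:, 1])
    gripper = 1.0 if raw_action[6] >= 0.5 else 0.0
    return np.concatenate([joints, [gripper]])
```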
step 1.4: specific task rewarding functions are respectively set for the coarse training stage and the fine training stage, and the rewarding function formula of the coarse training stage of the training system is as follows:
wherein dist[0] and dist[1] are the distances from the center point of the end clamping jaw to the center point of the target object in the x and y directions respectively;
The fine training stage reward function of the training system is divided into three parts: the approach reward R_reach, the grasp reward R_grasp and the lift reward R_lift, where the grasp and lift rewards are {0, 0.25} and {0, 1} respectively, and the approach reward R_reach is calculated as follows:
Rreach = (1 - tanh(10*d))   (2)
Where d is the Euclidean distance of the end jaw center from the target object center point, and tanh is the hyperbolic tangent function;
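A minimal sketch of the fine-stage reward is given below; it assumes the total reward is the sum of the three parts and that the grasp and lift conditions are reported by the simulator, neither of which is stated explicitly above (the coarse-stage formula (1) is not reproduced here and is therefore omitted):
```python
import math

def fine_stage_reward(d, grasped, lifted):
    """Fine-stage reward, assumed to be the sum R_reach + R_grasp + R_lift.

    d        -- Euclidean distance between the end jaw centre and the object centre
    grasped  -- True once the simulator reports the object is grasped (assumed flag)
    lifted   -- True once the simulator reports the object is lifted (assumed flag)
    """
    r_reach = 1.0 - math.tanh(10.0 * d)   # approach reward, equation (2)
    r_grasp = 0.25 if grasped else 0.0    # grasp reward in {0, 0.25}
    r_lift = 1.0 if lifted else 0.0       # lift reward in {0, 1}
    return r_reach + r_grasp + r_lift
```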
Step 1.5: testing the working consistency between the constructed mechanical arm simulation working environment and the equipment in the real scene, such as the mechanical arm and the cameras, by randomly executing a number of actions and obtaining the corresponding feedback;
step 2: the first stage training task of the progressive training system is completed by adopting a model-free algorithm, and the method is as follows:
step 2.1: when selecting samples from the experience pool, a smaller TD_error indicates a better training effect for the critic network, so samples with a larger TD_error, which are more significant for the current training, are selected from the experience pool; by learning from these samples the TD_error is gradually reduced and the Q estimate of the critic network approaches the real Q value more and more closely. The samples selected from the experience pool are temporarily stored;
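The sample selection described in step 2.1 corresponds to proportional prioritized experience replay; the sketch below illustrates the idea, with the hyperparameters alpha and eps and the class interface being assumptions:
```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized replay: sampling probability grows with |TD_error|."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error=1.0):
        if len(self.buffer) >= self.capacity:   # drop the oldest transition when full
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        return idx, [self.buffer[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):        # refresh priorities after learning
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha
```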
Step 2.2: calculating a loss function from the value-network Q estimate and the real Q value using the temporarily stored samples, so as to optimize the current network and the target network of the value network; the loss function is calculated as follows:
L(θQ) = 1/N·Σi(yi - Q(si,ai|θQ))²   (3)
where N is the number of samples, i is the current time step, yi is the target Q value, Q(si,ai|θQ) is the Q value estimated for the current state si and current action ai, and θQ is the parameter of the current critic network.
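A minimal sketch of how the target value yi and the mean-squared loss might be computed, assuming DDPG-style target networks; the function signature, the discount factor gamma and the (s, a, r, s_next, done) tuple layout are assumptions, not taken from the patent:
```python
import numpy as np

def critic_loss(batch, critic, target_critic, target_actor, gamma=0.99):
    """Mean-squared TD loss of formula (3), computed with assumed target networks.

    batch         -- list of (s, a, r, s_next, done) transitions
    critic        -- callable (s, a) -> Q estimate of the current critic network
    target_critic -- callable (s, a) -> Q value of the critic target network
    target_actor  -- callable s -> action of the actor target network
    """
    losses = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * target_critic(s_next, target_actor(s_next))
        losses.append((y - critic(s, a)) ** 2)   # (y_i - Q(s_i, a_i | theta_Q))^2
    return float(np.mean(losses))
```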
Step 2.3: evaluating the temporarily stored samples by using the optimized value network, and calculating the real Q value and the Q estimate;
Step 2.4: updating the parameters of the current network and the target network of the policy, wherein an update coefficient is introduced in the algorithm as shown in formula (4), so that each update moves the parameters only slightly relative to the previous values;
θμ′ ← τθμ + (1-τ)θμ′   (4)
where μ is the current policy, μ′ is the target policy, τ is the update coefficient, θμ is the parameter of the actor current network, and θμ′ is the parameter of the actor target network.
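Formula (4) is a soft (Polyak) target update; a sketch follows, where the parameter containers and the value of τ are illustrative assumptions:
```python
def soft_update(target_params, online_params, tau=0.01):
    """Polyak averaging per formula (4): theta' <- tau * theta + (1 - tau) * theta'.

    Parameters are plain dicts of numpy arrays here; the value of tau is illustrative.
    """
    for name in target_params:
        target_params[name] = tau * online_params[name] + (1.0 - tau) * target_params[name]
    return target_params
```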
Step 2.5: observing whether the return function is converged or not, if not, transferring to the step 2.1 to collect data into an experience pool, continuing iterative training, if so, enabling the clamping jaw at the tail end of the mechanical arm to reach the range to be placed, and entering the next training stage;
step 3: the training task of the second stage of the progressive training system is completed by adopting a world model algorithm, and the method is as follows:
Step 3.1: before the algorithm is trained, the mechanical arm executes 5 rounds of random actions within the range to be placed, with 500 steps per round, and the corresponding step trajectories are collected and filled into the experience pool with randomly sampled actions At. After the mechanical arm executes action At in any state St, the state changes to St+1 and the corresponding reward value Rt is obtained, so 2500 groups of trajectory tuples (St, At, Rt, St+1) are generated;
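A sketch of the seed-data collection in step 3.1 is shown below; the environment interface (reset, sample_action, step) is an assumption and not the patent's API:
```python
def collect_seed_episodes(env, episodes=5, steps_per_episode=500):
    """Fill the experience pool with random interaction before world-model training.

    env is assumed to expose reset(), sample_action() and
    step(action) -> (s_next, r, done); this interface is an assumption.
    """
    pool = []
    for _ in range(episodes):
        s = env.reset()
        for _ in range(steps_per_episode):
            a = env.sample_action()             # uniformly random action A_t
            s_next, r, done = env.step(a)
            pool.append((s, a, r, s_next))      # (S_t, A_t, R_t, S_t+1)
            s = env.reset() if done else s_next
    return pool                                 # 5 x 500 = 2500 transitions
```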
Step 3.2: randomly selecting a continuous track sequence from the experience pool; the actions and images in the sequence data, together with a random distribution state and a deterministic recurrent state initialized to zero, are input into the recurrent state-space model for iteration, and the loss function of the world model is calculated to update the world model parameter θ; the world model loss function is as follows:
where the world model parameters are denoted by θ, xt is the predicted image of the current state, rt is the predicted reward value of the current state, γt is the predicted discount factor of the current state, ht is the deterministic hidden variable at the current time, zt is the stochastic hidden variable at the current time, qθ is the posterior probability distribution inferred by the parameters θ from the known conditions ht and xt, and pθ is the prior probability distribution inferred by the parameters θ from the known conditions ht and zt.
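Since the world-model loss formula itself is not reproduced above, the following sketch only illustrates how the described terms (image, reward and discount prediction plus a divergence between the posterior qθ and the prior pθ) might be combined; the squared-error stand-ins for the likelihood terms, the diagonal-Gaussian latents and the kl_scale weight are all assumptions:
```python
import numpy as np

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL divergence between two diagonal Gaussians q and p."""
    return np.sum(np.log(std_p / std_q)
                  + (std_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * std_p ** 2) - 0.5)

def world_model_loss(image_err, reward_err, discount_err, post, prior, kl_scale=1.0):
    """Schematic combination of the world-model terms: prediction losses plus a KL term.

    image_err, reward_err, discount_err -- arrays of per-step prediction errors
    post, prior -- lists of (mu, std) pairs for q_theta and p_theta at each step
    """
    recon = (np.mean(np.square(image_err))
             + np.mean(np.square(reward_err))
             + np.mean(np.square(discount_err)))
    kl = np.mean([gaussian_kl(mq, sq, mp, sp)
                  for (mq, sq), (mp, sp) in zip(post, prior)])
    return recon + kl_scale * kl
```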
Step 3.3: the algorithm learns long-term reward behavior in the world model using an actor network and a critic network. The critic is used to predict the future cumulative reward, including rewards beyond the planning horizon. The actor is used for planner learning: from the compact hidden variables of the world model it infers the action that yields the greatest value. Given a continuous trajectory of model states, rewards and discount factors, the critic network can be trained with a squared loss function. The actor network aims to output actions that allow the critic network to maximize the future long-term reward, so the rewards over the H steps predicted by the critic network must be taken into account when designing the loss function of the actor network, and the entropy of the actions is added to the loss function to encourage the agent to perform beneficial exploration. The loss function formulas of the critic and actor networks are as follows:
where the parameters of the critic network are denoted by ξ, H is the number of iteration steps, zt is the stochastic hidden variable at the current time, λ is the weighting value, vξ is the value predicted by the critic network, and rt is the reward value at the current time.
where H is the number of iteration steps, ρ is the weight, zt is the stochastic hidden variable at the current time, at is the action at the current time, the actor's policy is the prior probability distribution of at inferred by the actor network parameters from the known condition zt, η is the entropy coefficient, and H[at|zt] is the action entropy.
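The λ-weighted return over the H imagined steps that the critic is trained towards can be sketched as follows; the backward recursion and the default λ value are assumptions consistent with common world-model planners, not formulas taken from the patent:
```python
def lambda_returns(rewards, values, discounts, lam=0.95):
    """Lambda-weighted return targets over an imagined horizon H, computed backwards.

    rewards, values, discounts -- lists of length H predicted by the world model
    lam                        -- the weighting lambda (its exact value is not given above)
    """
    H = len(rewards)
    returns = [0.0] * H
    last = values[-1]                            # bootstrap from the final predicted value
    for t in reversed(range(H)):
        bootstrap = values[t + 1] if t + 1 < H else values[-1]
        last = rewards[t] + discounts[t] * ((1.0 - lam) * bootstrap + lam * last)
        returns[t] = last
    return returns                               # targets the critic regresses towards
```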
Step 3.4: after the world model, the actor network and the critic network are updated, the mechanical arm executes the actions derived from the compact hidden states in the world model, with some noise added to increase the randomness of exploration. After the interaction ends, the data generated by the interaction between the mechanical arm and the environment are filled into the experience pool;
Step 3.5: and (3) observing whether the return curve is converged or not, if not, turning to the step (3.2) to collect the track sequence, continuing iterative training, and updating the network parameters. If converged, the algorithm may be used in the specific task of the robotic arm.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the device in which the computer readable storage medium resides to perform the deep reinforcement learning-based mechanical arm progressive training method described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory, other memories, and the like.
An electronic device, comprising: the system comprises a memory and a processor, wherein a program capable of running on the processor is stored in the memory, and the processor realizes the mechanical arm progressive training method based on deep reinforcement learning when executing the program.
The integrated modules/units of the electronic device according to the application may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand alone products. Based on this understanding, the present application may also be implemented by a computer program for instructing a relevant hardware device to implement all or part of the procedures of the above-mentioned embodiment method, where the computer program may be stored in a computer readable storage medium and the computer program may be executed by a processor to implement the steps of each of the above-mentioned method embodiments.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
Compared with the prior art, the invention has the beneficial effects that:
The invention provides a deep reinforcement learning-based mechanical arm progressive training method which is divided into two training stages, wherein the first stage uses a model-free algorithm for training and the second stage uses a world model algorithm for training.
According to the embodiment of the invention, the mechanical arm is trained by combining a model-free algorithm and a world model algorithm in a progressive placement scheme: first the model-free algorithm is used to determine the placement direction and send the tail end clamping jaw of the mechanical arm into the range to be placed, and then the world model algorithm is used for accurate placement. As shown in fig. 2, the abscissa indicates the number of training rounds and the ordinate indicates the return value obtained in each round. The solid curve is the training curve of the proposed scheme, which converges at about 600 episodes; the short-dashed curve is the training curve obtained using the world model algorithm alone, which converges at about 1000 episodes, so under a task success rate of 98% the overall training time of the scheme is shortened by 40%. The dotted curve is the effect of training with the model-free algorithm alone, whose convergence value is about 400, while the convergence value of the scheme is about 950, so the training effect is improved by about 60% compared with training with the model-free algorithm alone; the system is therefore a more efficient and accurate mechanical arm training system.
In a three-dimensional scene, a complex task for the mechanical arm exponentially increases the state-action space, so a huge amount of training time is required. Therefore, by decomposing the complex task, the invention divides the working space into a coarse operation area and a fine operation area from outside to inside, and the training of the mechanical arm is accordingly divided into a coarse training stage and a fine training stage, which are trained with a model-free algorithm and a world model algorithm respectively. The fine training stage takes as input a local scene image that contains only the mechanical arm clamping jaw and the target object. By combining model-free reinforcement learning and model-based reinforcement learning the invention can accelerate the training of the mechanical arm, and by combining a non-end-to-end mode with an end-to-end mode the training scheme in the system is superior to the model-free algorithm or the world model algorithm alone in both training time and training precision. The invention greatly improves the training efficiency of the mechanical arm, makes it convenient for the mechanical arm to complete more complex tasks, and at the same time realizes the practical application of artificial intelligence theory. Under the same task, the return value of the scheme converges 40% faster than model-based reinforcement learning alone, and after training with the scheme the training effect of the mechanical arm is improved by 60% compared with model-free reinforcement learning alone.
In the description of the present invention, it should be noted that, for the azimuth words such as "center", "lateral", "longitudinal", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc., the azimuth and positional relationships are based on the azimuth or positional relationships shown in the drawings, it is merely for convenience of describing the present invention and simplifying the description, and it is not to be construed as limiting the specific scope of protection of the present invention that the device or element referred to must have a specific azimuth configuration and operation.
It should be noted that the terms "comprises" and "comprising," along with any variations thereof, in the description and claims of the present application are intended to cover a non-exclusive inclusion, such as a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Note that the above is only a preferred embodiment of the present invention and an illustration of the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements and substitutions can be made without departing from the scope of the invention. Therefore, while the present invention has been described in connection with the above embodiments, it is to be understood that the invention is not limited to the specific embodiments disclosed, and that many other and equally effective embodiments may be devised without departing from the spirit of the invention, the scope of which is determined by the appended claims.

Claims (7)

1. A mechanical arm progressive training method based on deep reinforcement learning is characterized in that:
The method comprises the following steps:
step 1: constructing a mechanical arm simulation working environment according to a real scene, and specifically:
Step 1.1: selecting the type of the mechanical arm, the required tail end clamping jaw of the mechanical arm and the part to be placed in a simulation environment according to a real scene; placing a plurality of cameras around the mechanical arm to obtain a multi-view image of the mechanical arm;
Step 1.2: the simulation working environment is ensured to be consistent with the reality scene through the calibration of the simulation system;
Step 1.3: a spherical working space taking a preset grabbing and placing target point as a sphere center is divided into a coarse operation area and a fine operation area from outside to inside, the center of the sphere is the fine operation area, and then the number of actions required by the mechanical arm and the execution range of the actions are determined according to the coarse operation area and the fine operation area;
step 1.4: the training of the placement direction of the mechanical arm is a first-stage training task, namely a coarse training stage, and the training of the mechanical arm for continuously completing the complex placement task after the placement direction of the mechanical arm is determined is a second-stage training task, namely a fine training stage, and specific task rewarding functions are respectively set for the coarse training stage and the fine training stage;
step 1.5: randomly performing a number of actions to obtain corresponding feedback;
Step 2: adopting a model-free algorithm to complete a first-stage training task of the progressive training system;
Step 3: adopting a world model algorithm to complete a second stage training task of the progressive training system;
Step 4: and performing specific task execution on the mechanical arm by adopting a trained model-free algorithm and a world model algorithm.
2. The deep reinforcement learning-based mechanical arm progressive training method of claim 1, wherein: the step 2 specifically comprises the following steps:
Step 2.1: when training the first-stage task model, samples are first selected using prioritized experience replay and temporarily stored;
Step 2.2: training any one of the samples in the batch by adopting an AC architecture network model in a model-free algorithm;
Step 2.3: updating parameters of the AC architecture network model in the first stage training;
Step 2.4: and (3) observing whether the return function is converged, if not, continuing to repeat the step (2.3) for carrying out the next sample iterative training, and if so, enabling the tail end clamping jaw of the mechanical arm to reach the range to be placed, and entering the next training stage.
3. The deep reinforcement learning-based mechanical arm progressive training method of claim 1, wherein: the step 3 specifically comprises the following steps:
Step 3.1: the mechanical arm performs a number of rounds of random actions in a range to be placed, acquires corresponding step number tracks, and fills the tracks into an experience pool;
step 3.2: randomly selecting a continuous track sequence from the experience pool for training parameters in the world model;
Step 3.3: predicting future rewards values by using the network model in the world model predictor, and updating the AC network model parameters in the world model planner in the second stage training;
step 3.4: the mechanical arm selects actions by utilizing hidden variables in the world model and interacts with the environment, and the experience pool is continuously filled;
step 3.5: and observing whether the return function is converged, if not, continuing to iterate the training, and if so, using the algorithm in a specific task of the mechanical arm.
4. The deep reinforcement learning-based mechanical arm progressive training method of claim 1, wherein: in step 3.2, a continuous track sequence is randomly selected from the data set; the continuous track sequence comprises local scene images composed of the tail end clamping jaw of the mechanical arm and the target object, so that more useful information in the image is retained; each image is compressed into a hidden variable by an encoder, and after this hidden variable is fused with the hidden variable of the world model, the hidden variable of the world model is learned and represents compressed three-dimensional state information.
5. The deep reinforcement learning-based mechanical arm progressive training method of claim 1, wherein: the reward function formula of the coarse training stage of the training system in the step 1.4 is as follows:
(1)
wherein dist[0] and dist[1] are the distances from the center point of the tail end clamping jaw to the center point of the target object in the x and y directions respectively;
The reward function of the training system in the fine training stage is divided into three parts: the approach reward R_reach, the grasp reward R_grasp and the lift reward R_lift, where the grasp and lift rewards are {0, 0.25} and {0, 1} respectively, and the approach reward R_reach is calculated as follows:
Rreach = (1 - tanh(10*d))   (2)
where d is the Euclidean distance from the center of the tail end clamping jaw to the center point of the target object, and tanh is the hyperbolic tangent function.
6. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, causes a device in which the computer readable storage medium is located to perform the deep reinforcement learning-based mechanical arm progressive training method according to any one of claims 1-5.
7. An electronic device, comprising: the system comprises a memory and a processor, wherein the memory stores a program capable of running on the processor, and the processor realizes the mechanical arm progressive training method based on deep reinforcement learning according to any one of claims 1-5 when executing the program.
CN202410338659.1A 2024-03-22 2024-03-22 Mechanical arm progressive training method based on deep reinforcement learning, storage medium and electronic equipment Pending CN118254170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410338659.1A CN118254170A (en) 2024-03-22 2024-03-22 Mechanical arm progressive training method based on deep reinforcement learning, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN118254170A true CN118254170A (en) 2024-06-28

Family

ID=91607131

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410338659.1A Pending CN118254170A (en) 2024-03-22 2024-03-22 Mechanical arm progressive training method based on deep reinforcement learning, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN118254170A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination