CN114415507B - Deep neural network-based smart hand-held process dynamics model building and training method - Google Patents

Deep neural network-based smart hand-held process dynamics model building and training method

Info

Publication number
CN114415507B
Authority
CN
China
Prior art keywords
state
model
training
data
smart
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210017100.XA
Other languages
Chinese (zh)
Other versions
CN114415507A (en)
Inventor
周锦瑜
盛士能
王壮
祝雯豪
俞冰清
鲍官军
胥芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202210017100.XA priority Critical patent/CN114415507B/en
Publication of CN114415507A publication Critical patent/CN114415507A/en
Application granted granted Critical
Publication of CN114415507B publication Critical patent/CN114415507B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Automation & Control Theory (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a deep neural network-based dexterous hand grasping process dynamics model building and training method, which comprises the following steps. Step 1: the dexterous hand interacts with the grasped object in the environment and trains using the SAC algorithm policy, and the system state transition data are collected and placed into a sample buffer. Step 2: the number of fuzzy clustering categories is set and fuzzy clustering is performed on the state transition data in the sample buffer. Step 3: a dexterous hand dynamics model f containing a state increment direction probability sub-network model and a state increment amplitude sub-network model is built. Step 4: for each fuzzy clustering category, sample sampling probabilities are generated according to the membership degrees, and training samples are obtained by sampling. Step 5: the dexterous hand dynamics model is trained and used to predict the state of the environmental system. The method improves the accuracy of the dynamics model, reduces the local over-fitting phenomenon during dynamics model training, reduces the prediction error of the dynamics model, and improves stability.

Description

Deep neural network-based smart hand-held process dynamics model building and training method
Technical Field
The invention belongs to the field of reinforcement learning control of dexterous hands, and particularly relates to a deep neural network-based dynamics model building and training method for the grasping process of a dexterous hand.
Background
Because of the high degree of freedom of the dexterous hand, improving the control effect of dexterous hand reinforcement learning control algorithms and the utilization rate of training sample data has become a difficulty in this field. Currently, reinforcement learning control algorithms can be classified into model-free and model-based reinforcement learning algorithms according to whether the agent understands the dynamics model of the environment and of itself. Through a reinforcement learning algorithm, the agent repeatedly interacts with the same environment by trial and error for a specific task and receives environmental feedback rewards in the process, thereby changing its behavior so as to maximize the environmental feedback reward in the next interaction with the environment.
Model-free reinforcement learning control algorithms currently in mainstream use for dexterous hand control include Deep Deterministic Policy Gradient (DDPG), Soft Actor-Critic (SAC), Proximal Policy Optimization (PPO) and the like. These model-free algorithms achieve good control performance, but their sample utilization is low and a large number of samples must be collected, which requires a large time cost and is often difficult to apply in reality. Model-based reinforcement learning control algorithms, by contrast, have the advantage of better utilizing the sample data. The current mainstream model-based reinforcement learning control algorithms include AlphaZero, Imagination-Augmented Agents (I2A), MBMF and the like. These algorithms either require domain experts to provide a system dynamics model based on expert knowledge, or require the system dynamics to be learned from interaction with the environment. However, a system composed of a dexterous hand and a grasped object contains a large number of collision scenarios, so it is difficult to give a system dynamics model in advance, and an approximate system dynamics model must be obtained by supervised learning. Examples of supervised learning of dynamics models include Sparse Identification of Nonlinear Dynamics (SINDy), dynamics parameter identification, and neural network fitting of dynamics models. SINDy and related methods require a dictionary of candidate functions to be given, and are therefore difficult to apply. Dynamics parameter identification requires the structure of the system dynamics model to be given in advance and is not applicable to a collision-rich system consisting of a dexterous hand and a grasped object. Existing methods that fit the dynamics model with a neural network suffer from poor stability and are prone to over-fitting.
Disclosure of Invention
The invention aims to provide a deep neural network-based dynamics model building and training method for the dexterous hand grasping process, in order to solve the technical problems that the system dynamics model cannot be given in advance, that providing a dictionary of candidate functions is difficult, and that a neural network fitting the dynamics model has poor stability and is prone to over-fitting.
To solve the above technical problems, the specific technical scheme of the deep neural network-based dexterous hand grasping process dynamics model building and training method is as follows:
A deep neural network-based dynamics model building and training method for the dexterous hand grasping process comprises the following steps:
Step 1: the dexterous hand interacts with the grasped object in the environment and trains using the SAC algorithm policy π_θ; system state transition data are collected and placed into a sample buffer;
Step 2: the number of fuzzy clustering categories is set and fuzzy clustering is performed on the state transition data in the sample buffer;
Step 3: a dexterous hand dynamics model f comprising a state increment direction probability sub-network model f_d and a state increment amplitude sub-network model f_a is built;
Step 4: for each fuzzy clustering category, sample sampling probabilities are generated according to the membership degrees, and training samples are obtained by sampling;
Step 5: the dexterous hand dynamics model is trained and used to predict the state of the environmental system.
Further, step 1 uses a MuJoCo physics simulation engine to simulate the grasping process of the dexterous hand and the grasped object; the simulation environment continuously generates Gaussian-distributed external force and torque noise applied to the centroid of the grasped object and to the torques of the dexterous hand joints, so as to simulate random external force disturbances in a real scene. As time passes in the simulator, the states of the dexterous hand and the grasped object change, and the whole process conforms to a Markov decision process represented by the quintuple <S, A, P, R, γ>, wherein S represents the system state space formed by the dexterous hand and the grasped object, A represents the dexterous hand joint action space, P represents the state transition probability, R represents the reward space, and γ represents the reward discount coefficient.
Further, step 1 uses the Actor network of the model-free reinforcement learning algorithm SAC as the dexterous hand control policy π_θ and sets the system goal g as grasping the object to a random orientation; if the grasped object falls, the simulation is regarded as ended and the simulation environment is reset. The state transition data (s, a, s', r) of the dexterous hand and the grasped object in the simulator are recorded, where s is the system state at the current time, a is the system input action at the current time, s' is the system state at the next time, and r is the reward value calculated according to the grasping goal. The state transition data are stored to obtain the data set Data:
Data = {(s_1, a_1, s_2, r_1), (s_2, a_2, s_3, r_2), ..., (s_{n-1}, a_{n-1}, s_n, r_{n-1})};
and the Actor and Critic networks are trained using Data.
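As an illustration of step 1, the following Python sketch shows how the state transition tuples could be collected; make_grasp_env-style environment objects and a policy object exposing select_action are hypothetical stand-ins for the MuJoCo grasping environment and the SAC Actor, which are not specified here beyond their roles.

    # Minimal sketch, assuming a Gym-style MuJoCo grasping environment and an SAC policy object.
    def collect_transitions(env, policy, buffer, n_steps=10000):
        """Roll out the policy pi_theta and store (s, a, s', r) tuples in the sample buffer."""
        s = env.reset()
        for _ in range(n_steps):
            a = policy.select_action(s)            # dexterous hand joint action from pi_theta
            s_next, r, done, _ = env.step(a)       # reward r computed from the grasping goal
            buffer.append((s, a, s_next, r))       # system state transition data
            s = env.reset() if done else s_next    # reset when the grasped object falls
        return buffer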
Further, step 2 comprises the following specific steps:
Fuzzy clustering is performed on the data set Data, and a fuzzy clustering center set C = {c_1, c_2, ..., c_k} is set randomly, where each clustering center c contains the same number of elements as the system state s. The Euclidean distance d between each state s in the data set Data and each clustering center c is calculated to obtain the distance matrix D_t = [d_ij], where d_ij = ‖s_i - c_j‖ represents the Euclidean distance between the i-th state and the j-th clustering center. The fuzzy clustering center set C is adjusted so that the sum of squares of all elements of the distance matrix D_t is minimized. The membership degree u of each state s in the data set Data to each clustering category is then calculated to obtain the membership matrix U = [u_ij], where u_ij represents the membership of the i-th state to the j-th clustering category.
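The membership computation is not written out explicitly above; the sketch below uses the standard fuzzy C-means update with fuzzifier m = 2 as an assumed concrete choice for obtaining the cluster centers C and the membership matrix U from the collected states.

    import numpy as np

    def fuzzy_c_means(states, k, m=2.0, iters=100, eps=1e-9):
        """Standard fuzzy C-means: returns cluster centers (k x dim) and membership matrix U (n x k)."""
        n = states.shape[0]
        centers = states[np.random.choice(n, k, replace=False)]    # random initial centers c_1..c_k
        for _ in range(iters):
            d = np.linalg.norm(states[:, None, :] - centers[None, :, :], axis=2) + eps      # d_ij
            u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)  # u_ij
            centers = (u.T ** m @ states) / np.sum(u.T ** m, axis=1, keepdims=True)         # update C
        return centers, u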
Further, step 3 comprises the following specific steps:
The state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a are built using the PyTorch deep neural network framework; the inputs of f_d and f_a comprise the state s of the dexterous hand and grasped object system and the dexterous hand joint input action a; each sub-network is composed of three linear layers, two ReLU layers and two positive/negative polarity channel layers, and f_d additionally has a Sigmoid layer at the tail of the network; the outputs of f_d and f_a are respectively the direction and the absolute value of the system state change Δs.
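The positive and negative polarity channel layer is not defined in detail, so the sketch below treats it as a layer that splits each activation into its positive and negative parts (an assumption); the three linear layers, two ReLU layers and the Sigmoid tail of f_d follow the description above, while the hidden width and the state/action dimensions are placeholders.

    import torch
    import torch.nn as nn

    class PolarityChannel(nn.Module):
        """Assumed polarity channel layer: concatenates the positive and negative parts of the input."""
        def forward(self, x):
            return torch.cat([torch.relu(x), torch.relu(-x)], dim=-1)

    class DeltaSubNet(nn.Module):
        """Skeleton shared by f_d and f_a: three linear layers, two ReLU layers, two polarity layers."""
        def __init__(self, state_dim, action_dim, hidden=256, sigmoid_tail=False):
            super().__init__()
            layers = [nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), PolarityChannel(),
                      nn.Linear(2 * hidden, hidden), nn.ReLU(), PolarityChannel(),
                      nn.Linear(2 * hidden, state_dim)]
            if sigmoid_tail:
                layers.append(nn.Sigmoid())          # f_d outputs a direction probability
            self.net = nn.Sequential(*layers)

        def forward(self, s, a):
            return self.net(torch.cat([s, a], dim=-1))

    f_d = DeltaSubNet(state_dim=61, action_dim=20, sigmoid_tail=True)    # placeholder dimensions
    f_a = DeltaSubNet(state_dim=61, action_dim=20, sigmoid_tail=False)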
Further, step 4 comprises the following specific steps:
A dynamics model training sample is drawn for each clustering category. The sampling probability p of each state s of the data set Data within each category is calculated from the membership matrix U to obtain the probability matrix P = [p_ij], where p_ij represents the probability that the i-th state is sampled in the j-th clustering category; if state s_i is sampled, (s_i, a_i, s'_i) is taken as a training sample.
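The rule turning membership degrees into sampling probabilities is only stated qualitatively, so the sketch below normalizes each membership column into a probability distribution (one plausible reading) and draws one training sample per cluster category.

    import numpy as np

    def sample_per_category(u, data, rng=None):
        """u: (n, k) membership matrix U; data: list of (s, a, s_next, r) tuples from the buffer."""
        rng = rng or np.random.default_rng()
        p = u / u.sum(axis=0, keepdims=True)        # p_ij: probability of state i within category j
        samples = []
        for j in range(u.shape[1]):
            i = rng.choice(len(data), p=p[:, j])    # draw a state index for cluster category j
            s, a, s_next, _ = data[i]
            samples.append((s, a, s_next))          # training sample (s_i, a_i, s'_i)
        return samples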
Further, step 5 comprises the following specific steps:
Training f_d, with the loss function set as:
J_traind(α) = E_{(s,a,s')~Data(P)}[(f_d(s,a) - g(s'-s))²] + 0.0005‖α‖²
where g(·) maps the state increment s'-s to its direction label, and α denotes all parameters of f_d;
a gradient descent method is used, with Adam as the optimizer;
Training f_a, with the loss function set as:
J_traina(β) = E_{(s,a,s')~Data(P)}[(f_a(s,a) - |s'-s|)²] + 0.0005‖β‖²
where β denotes all parameters of f_a;
a gradient descent method is used, with Adam as the optimizer;
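A minimal training sketch for the two losses follows; it assumes g(x) is the elementwise 0/1 indicator of a positive state increment (consistent with the Sigmoid output of f_d but not stated explicitly above), adds the 0.0005-weighted squared parameter norm directly to the loss, and uses an assumed learning rate for Adam.

    import torch

    def l2_penalty(model, coef=5e-4):
        """The 0.0005 * ||parameters||^2 term appearing in both loss functions."""
        return coef * sum((p ** 2).sum() for p in model.parameters())

    def train_step(f_d, f_a, opt_d, opt_a, s, a, s_next):
        """One gradient descent step on J_traind(alpha) and J_traina(beta)."""
        delta = s_next - s
        g = (delta > 0).float()                                    # assumed direction label g(s'-s)
        loss_d = ((f_d(s, a) - g) ** 2).mean() + l2_penalty(f_d)
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
        loss_a = ((f_a(s, a) - delta.abs()) ** 2).mean() + l2_penalty(f_a)
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        return loss_d.item(), loss_a.item()

    # opt_d = torch.optim.Adam(f_d.parameters(), lr=1e-3)   # assumed learning rate
    # opt_a = torch.optim.Adam(f_a.parameters(), lr=1e-3)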
Further, step 5 uses the dexterous hand dynamics model f comprising the state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a: the state s of the current dexterous hand and grasped object and the dexterous hand joint input action a are fed into f_d and f_a to obtain the state increment direction probability value and the state increment amplitude value, from which the predicted system state at the next time is obtained, where the direction dir ~ f_d(s, a).
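The exact expression combining the two sub-network outputs into the next-state prediction is given only implicitly; a plausible reconstruction, assuming the direction probability is mapped to a signed direction in {-1, +1} before being multiplied by the amplitude, is sketched below.

    import torch

    def predict_next_state(f_d, f_a, s, a, sample=False):
        """Assumed reconstruction: s_hat' = s + dir * f_a(s, a), with dir derived from f_d(s, a)."""
        prob = f_d(s, a)                                    # state increment direction probability
        if sample:
            dir_ = torch.bernoulli(prob) * 2.0 - 1.0        # sampled direction, dir ~ f_d(s, a)
        else:
            dir_ = (prob > 0.5).float() * 2.0 - 1.0         # most likely direction
        return s + dir_ * f_a(s, a)                         # add the amplitude along that direction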
The deep neural network-based dexterous hand grasping process dynamics model building and training method of the invention has the following advantages. The invention designs a deep neural network dynamics model comprising a state increment direction probability sub-network model and a state increment amplitude sub-network model, and uses the two sub-networks to separately predict the direction and the amplitude of the system state increment, which improves the accuracy of the dynamics model. At the same time, the data samples are fuzzy-clustered and the training samples are preprocessed, which reduces the local over-fitting phenomenon during dynamics model training. This further reduces the prediction error of the dynamics model, improves stability, and improves the control effect at the control algorithm level.
Drawings
FIG. 1 is a block diagram of a dexterous hand grasping process in accordance with the present invention;
FIG. 2 is a fuzzy clustering flow chart in accordance with the present invention;
FIG. 3 is a diagram of the model structure of f_d in the present invention;
FIG. 4 is a diagram of the model structure of f_a in the present invention;
FIG. 5 is a diagram of the usage framework of the dexterous hand dynamics model f of the present invention.
Detailed Description
To better understand the purpose, structure and functions of the invention, the deep neural network-based dexterous hand grasping process dynamics model building and training method of the invention is described in further detail below with reference to the accompanying drawings.
In the deep neural network-based dynamics model building and training method for the dexterous hand grasping process, system state transition data of the system formed by the dexterous hand and the grasped object are collected from the environment, fuzzy clustering preprocessing is carried out on the system state transition data, the preprocessed data are sampled to obtain dynamics model training samples, the dexterous hand dynamics model is trained, and the trained model is used to predict the state of the system formed by the dexterous hand and the grasped object at the next moment.
The method comprises the following steps: (1) the dexterous hand interacts with the grasped object in the environment and trains using the SAC algorithm policy π_θ, and system state transition data are collected and placed into a sample buffer; (2) the number of fuzzy clustering categories is set and fuzzy clustering is performed on the state transition data in the sample buffer; (3) a dexterous hand dynamics model f comprising a state increment direction probability sub-network model f_d and a state increment amplitude sub-network model f_a is built; (4) for each fuzzy clustering category, sample sampling probabilities are generated according to the membership degrees, and training samples are obtained by sampling; (5) the dexterous hand dynamics model is trained and used to predict the state of the environmental system.
The grasping process of the dexterous hand and the grasped object is preferably simulated using the MuJoCo physics simulation engine. The simulation environment continuously generates Gaussian-distributed external force and torque noise applied to the centroid of the grasped object and to the torques of the dexterous hand joints, simulating random external force disturbances in a real scene. As the states of the dexterous hand and the grasped object change over time within the simulator, the overall process conforms to a Markov Decision Process (MDP), which can be represented by the quintuple <S, A, P, R, γ>, where S represents the system state space formed by the dexterous hand and the grasped object, A represents the dexterous hand joint action space, P represents the state transition probability, R represents the reward space, and γ represents the reward discount coefficient.
The Actor network of the model-free reinforcement learning algorithm SAC is preferably used as the dexterous hand control policy π_θ, and the system goal g is set as grasping the object to a random orientation; if the grasped object falls, the simulation is regarded as ended and the simulation environment is reset. The state transition data (s, a, s', r) of the dexterous hand and the grasped object in the simulator are recorded, where s is the system state at the current time, a is the system input action at the current time, s' is the system state at the next time, and r is the reward value calculated according to the grasping goal. The state transition data are stored to obtain the data set Data:
Data = {(s_1, a_1, s_2, r_1), (s_2, a_2, s_3, r_2), ..., (s_{n-1}, a_{n-1}, s_n, r_{n-1})};
and the Actor and Critic networks are trained using Data.
Preferably, fuzzy clustering is performed on the data set Data, and a fuzzy clustering center set C = {c_1, c_2, ..., c_k} is set randomly, where each clustering center c contains the same number of elements as the system state s. The Euclidean distance d between each state s in the data set Data and each clustering center c is calculated to obtain the distance matrix D_t = [d_ij], where d_ij = ‖s_i - c_j‖ represents the Euclidean distance between the i-th state and the j-th clustering center. The fuzzy clustering center set C is adjusted so that the sum of squares of all elements of the distance matrix D_t is minimized. The membership degree u of each state s in the data set Data to each clustering category is then calculated to obtain the membership matrix U = [u_ij], where u_ij represents the membership of the i-th state to the j-th clustering category.
Preferably, a dynamics model training sample is drawn for each clustering category. The sampling probability p of each state s of the data set Data within each category is calculated from the membership matrix U to obtain the probability matrix P = [p_ij], where p_ij represents the probability that the i-th state is sampled in the j-th clustering category; if state s_i is sampled, (s_i, a_i, s'_i) is taken as a training sample.
The state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a are preferably built using the PyTorch deep neural network framework. The inputs of f_d and f_a comprise the state s of the dexterous hand and grasped object system and the dexterous hand joint input action a; each sub-network is formed by three linear layers, two ReLU layers and two positive/negative polarity channel layers, and f_d additionally has a Sigmoid layer at the tail of the network. The outputs of f_d and f_a are respectively the direction and the absolute value of the system state change Δs.
f_d is trained with the loss function set as:
J_traind(α) = E_{(s,a,s')~Data(P)}[(f_d(s,a) - g(s'-s))²] + 0.0005‖α‖²
where g(·) maps the state increment s'-s to its direction label, and α denotes all parameters of f_d.
A gradient descent method is used, with Adam as the optimizer.
f_a is trained with the loss function set as:
J_traina(β) = E_{(s,a,s')~Data(P)}[(f_a(s,a) - |s'-s|)²] + 0.0005‖β‖²
where β denotes all parameters of f_a.
A gradient descent method is used, with Adam as the optimizer.
The dexterous hand dynamics model f comprising the state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a is preferably used: the state s of the current dexterous hand and grasped object and the dexterous hand joint input action a are fed into f_d and f_a to obtain the state increment direction probability value and the state increment amplitude value, from which the predicted system state at the next time is obtained, where the direction dir ~ f_d(s, a).
The invention will be further elucidated with reference to specific examples.
The invention designs a deep neural network-based dynamics model building and training method for the dexterous hand grasping process, which is used for reinforcement learning of object grasping; the grasping process is shown in the structural block diagram of Fig. 1.
Step 1: A simulation environment is built in the MuJoCo simulator according to the three-dimensional model of the dexterous hand, the three-dimensional model of the grasped object, and the dynamics parameters. Dexterous hand joint drivers, joint angle sensors, angular velocity sensors and torque sensors, dexterous fingertip touch sensors, and position and velocity sensors for the grasped object are arranged. Gaussian-distributed external force noise is applied to the dexterous hand joints and the centroid of the grasped object to simulate unpredictable noise disturbances in a real environment. The system state s comprises the joint angles, angular velocities and torques, the position and velocity of the grasped object, and the contact forces at the dexterous fingertips. The system input action a contains the output values of the dexterous hand joint drivers.
Step 2: Using the policy π_θ, a dexterous hand joint driver action a is generated according to the current system state s in the MuJoCo simulation environment, and the simulation is advanced to obtain the system state s' at the next time; the current reward value r is calculated according to the grasping goal. The state transition probability P is set to 1, i.e. a deterministic environment, and the reward discount coefficient γ is set to 0.99. The simulation time step is 0.02 seconds. The MDP data are stored in the sample buffer, and the Actor and Critic models are trained on the data in the sample buffer using the SAC algorithm.
Step 3: Fuzzy clustering is performed on the data in the sample buffer. First, the number of category centers of the fuzzy clustering is determined and the category centers are randomly initialized. Then, as shown in Fig. 2, the Euclidean distance between each sample and each category center is calculated, and the category centers are updated according to the total squared Euclidean distance until convergence. Finally, the membership degree of each sample to each category is calculated from the Euclidean distance between the sample and the category center, and the probability of the sample being sampled within the category is then obtained from the membership degree.
Step 4: The state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a are built using the PyTorch deep neural network framework; the structure of f_d is shown in Fig. 3 and the structure of f_a is shown in Fig. 4. The inputs of f_d and f_a comprise the state s of the dexterous hand and grasped object system and the dexterous hand joint input action a; each sub-network consists of three linear layers, two ReLU layers and two positive/negative polarity channel layers, and f_d additionally has a Sigmoid layer at the tail of the network. The outputs of f_d and f_a are respectively the direction and the absolute value of the system state change Δs. Training samples are drawn for each category as in step 3, and f_d and f_a are trained separately.
Step 5: The dexterous hand dynamics model f comprising the state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a is used. As shown in Fig. 5, a state s is obtained by sampling the sample buffer, and an action a is then generated according to the current policy π_θ. The sampled state s and action a are fed into f_d and f_a to obtain the state increment direction probability value and the state increment amplitude value, from which the predicted state at the next time is obtained, where the direction dir ~ f_d(s, a). The reward value is calculated according to the set grasping goal, and the resulting predicted transition is used to train the Actor and Critic networks.
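Step 5 can be read as a model-based data augmentation loop; the sketch below illustrates that reading, where compute_reward, policy.select_action and agent.update are hypothetical stand-ins for the grasping reward function and the SAC agent interfaces, and the direction mapping repeats the assumption made in the prediction sketch above.

    import random
    import torch

    def imagined_update(buffer, policy, f_d, f_a, agent, compute_reward, n_updates=64):
        """Generate imagined transitions with the learned dynamics model f and train the SAC networks."""
        for _ in range(n_updates):
            s, _, _, _ = random.choice(buffer)                  # state sampled from the sample buffer
            a = policy.select_action(s)                         # action from the current policy pi_theta
            s_t = torch.as_tensor(s, dtype=torch.float32)
            a_t = torch.as_tensor(a, dtype=torch.float32)
            with torch.no_grad():
                prob = f_d(s_t, a_t)                            # state increment direction probability
                direction = (prob > 0.5).float() * 2.0 - 1.0    # assumed mapping to {-1, +1}
                s_pred = s_t + direction * f_a(s_t, a_t)        # predicted next state
            r_pred = compute_reward(s_pred)                     # reward from the set grasping goal
            agent.update(s_t, a_t, s_pred, r_pred)              # SAC Actor/Critic update (hypothetical API)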
It will be understood that the application has been described in terms of several embodiments, and that various changes and equivalents may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the application. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the application without departing from the essential scope thereof. Therefore, it is intended that the application not be limited to the particular embodiment disclosed, but that the application will include all embodiments falling within the scope of the appended claims.

Claims (1)

1. A deep neural network-based dexterous hand grasping process dynamics model building and training method, characterized by comprising the following steps:
Step 1: the dexterous hand interacts with the grasped object in the environment and trains using the SAC algorithm policy π_θ; system state transition data are collected and placed into a sample buffer;
using a MuJoCo physics simulation engine to simulate the grasping process of the dexterous hand and the grasped object, the simulation environment continuously generating Gaussian-distributed external force and torque noise applied to the centroid of the grasped object and to the torques of the dexterous hand joints so as to simulate random external force disturbances in a real scene; as time passes within the simulator, the states of the dexterous hand and the grasped object change, and the whole process conforms to a Markov decision process represented by the quintuple <S, A, P, R, γ>, wherein S represents the system state space formed by the dexterous hand and the grasped object, A represents the dexterous hand joint action space, P represents the state transition probability, R represents the reward space, and γ represents the reward discount coefficient;
using the Actor network of the model-free reinforcement learning algorithm SAC as the dexterous hand control policy π_θ, setting the system goal g as grasping the object to a random orientation, regarding the simulation as ended and resetting the simulation environment if the grasped object falls, recording the state transition data (s, a, s', r) of the dexterous hand and the grasped object in the simulator, wherein s is the system state at the current time, a is the system input action at the current time, s' is the system state at the next time, and r is the reward value calculated according to the grasping goal, and storing the state transition data to obtain the data set Data:
Data = {(s_1, a_1, s_2, r_1), (s_2, a_2, s_3, r_2), ..., (s_{n-1}, a_{n-1}, s_n, r_{n-1})};
training the Actor and Critic networks using Data;
Step 2: setting the number of fuzzy clustering categories and carrying out fuzzy clustering on the state transition data in the sample buffer;
carrying out fuzzy clustering on the data set Data, and randomly setting a fuzzy clustering center set C = {c_1, c_2, ..., c_k}, wherein each clustering center c contains the same number of elements as the system state s; calculating the Euclidean distance d between each state s in the data set Data and each clustering center c to obtain the distance matrix D_t = [d_ij], wherein d_ij = ‖s_i - c_j‖ represents the Euclidean distance between the i-th state and the j-th clustering center; adjusting the fuzzy clustering center set C so that the sum of squares of all elements of the distance matrix D_t is minimized; calculating the membership degree u of each state s in the data set Data to each clustering category to obtain the membership matrix U = [u_ij], wherein u_ij represents the membership of the i-th state to the j-th clustering category;
Step 3: building a dexterous hand dynamics model f comprising a state increment direction probability sub-network model f_d and a state increment amplitude sub-network model f_a;
building the state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a using the PyTorch deep neural network framework, wherein the inputs of f_d and f_a comprise the state s of the dexterous hand and grasped object system and the dexterous hand joint input action a, each sub-network is composed of three linear layers, two ReLU layers and two positive/negative polarity channel layers, and f_d additionally has a Sigmoid layer at the tail of the network; the outputs of f_d and f_a are respectively the direction and the absolute value of the system state change Δs;
Step 4: generating sample sampling probabilities according to the membership degrees for each fuzzy clustering category, and sampling to obtain training samples;
sampling a dynamics model training sample for each clustering category; calculating the sampling probability p of each state s of the data set Data within each category according to the membership matrix U to obtain the probability matrix P = [p_ij], wherein p_ij represents the probability that the i-th state is sampled in the j-th clustering category; if state s_i is sampled, taking (s_i, a_i, s'_i) as a training sample;
Step 5: training the dexterous hand dynamics model, and predicting the state of the environmental system;
training f_d with the loss function set as:
J_traind(α) = E_{(s,a,s')~Data(P)}[(f_d(s,a) - g(s'-s))²] + 0.0005‖α‖²
wherein g(·) maps the state increment s'-s to its direction label, and α denotes all parameters of f_d;
using a gradient descent method with Adam as the optimizer;
training f_a with the loss function set as:
J_traina(β) = E_{(s,a,s')~Data(P)}[(f_a(s,a) - |s'-s|)²] + 0.0005‖β‖²
wherein β denotes all parameters of f_a;
using a gradient descent method with Adam as the optimizer;
using the dexterous hand dynamics model f comprising the state increment direction probability sub-network model f_d and the state increment amplitude sub-network model f_a, feeding the state s of the current dexterous hand and grasped object and the dexterous hand joint input action a into f_d and f_a to obtain the state increment direction probability value and the state increment amplitude value, thereby obtaining the predicted system state at the next time, wherein the direction dir ~ f_d(s, a).
CN202210017100.XA 2022-01-07 2022-01-07 Deep neural network-based smart hand-held process dynamics model building and training method Active CN114415507B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210017100.XA CN114415507B (en) 2022-01-07 2022-01-07 Deep neural network-based smart hand-held process dynamics model building and training method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210017100.XA CN114415507B (en) 2022-01-07 2022-01-07 Deep neural network-based smart hand-held process dynamics model building and training method

Publications (2)

Publication Number Publication Date
CN114415507A CN114415507A (en) 2022-04-29
CN114415507B true CN114415507B (en) 2024-05-28

Family

ID=81272280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210017100.XA Active CN114415507B (en) 2022-01-07 2022-01-07 Deep neural network-based smart hand-held process dynamics model building and training method

Country Status (1)

Country Link
CN (1) CN114415507B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116088307B (en) * 2022-12-28 2024-01-30 中南大学 Multi-working-condition industrial process prediction control method, device, equipment and medium based on error triggering self-adaptive sparse identification
CN115816466B (en) * 2023-02-02 2023-06-16 中国科学技术大学 Method for improving control stability of vision observation robot

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101317794A (en) * 2008-03-11 2008-12-10 清华大学 Myoelectric control ability detecting and training method for hand-prosthesis with multiple fingers and multiple degrees of freedom
CN103592932A (en) * 2013-12-02 2014-02-19 哈尔滨工业大学 Modularized embedded control system for multi-finger myoelectric artificial hand with various sensing functions
CN107030694A (en) * 2017-04-20 2017-08-11 南京航空航天大学 Tendon drives manipulator tendon tension restriction end power bit manipulation control method and device
CN109657706A (en) * 2018-12-03 2019-04-19 浙江工业大学 Flexible part assembling process contact condition recognition methods based on gauss hybrid models bayesian algorithm
CN110298886A (en) * 2019-07-01 2019-10-01 中国科学技术大学 A kind of Dextrous Hand Grasp Planning method based on level Four convolutional neural networks
CN112668190A (en) * 2020-12-30 2021-04-16 长安大学 Method, system, equipment and storage medium for constructing three-finger smart hand controller
CN113657533A (en) * 2021-08-24 2021-11-16 河海大学 Multi-element time sequence segmentation clustering method for space-time scene construction

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101317794A (en) * 2008-03-11 2008-12-10 清华大学 Myoelectric control ability detecting and training method for hand-prosthesis with multiple fingers and multiple degrees of freedom
CN103592932A (en) * 2013-12-02 2014-02-19 哈尔滨工业大学 Modularized embedded control system for multi-finger myoelectric artificial hand with various sensing functions
CN107030694A (en) * 2017-04-20 2017-08-11 南京航空航天大学 Tendon drives manipulator tendon tension restriction end power bit manipulation control method and device
CN109657706A (en) * 2018-12-03 2019-04-19 浙江工业大学 Flexible part assembling process contact condition recognition methods based on gauss hybrid models bayesian algorithm
CN110298886A (en) * 2019-07-01 2019-10-01 中国科学技术大学 A kind of Dextrous Hand Grasp Planning method based on level Four convolutional neural networks
CN112668190A (en) * 2020-12-30 2021-04-16 长安大学 Method, system, equipment and storage medium for constructing three-finger smart hand controller
CN113657533A (en) * 2021-08-24 2021-11-16 河海大学 Multi-element time sequence segmentation clustering method for space-time scene construction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Neural and genetic basis of dexterous hand movements"; Yutaka Yoshida et al.; Elsevier Ltd; 2018-04-24; entire document *
"Research status, trends and challenges of robotic multi-fingered dexterous hands"; Cai Shibo et al.; Journal of Mechanical Engineering; 2021-08-31; vol. 57, no. 15, pp. 1-14 *
"Fingertip contact mechanics modeling for soft multi-fingered hands"; Zhang Lingfeng et al.; High Technology Letters; 2020-04-30; vol. 30, no. 4, pp. 391-401 *

Also Published As

Publication number Publication date
CN114415507A (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN110262511B (en) Biped robot adaptive walking control method based on deep reinforcement learning
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN114415507B (en) Deep neural network-based smart hand-held process dynamics model building and training method
Muratore et al. Data-efficient domain randomization with bayesian optimization
Grzeszczuk et al. Neuroanimator: Fast neural network emulation and control of physics-based models
CN110991027A (en) Robot simulation learning method based on virtual scene training
CN109702740B (en) Robot compliance control method, device, equipment and storage medium
Yao et al. Direct policy transfer via hidden parameter markov decision processes
CN110442129B (en) Control method and system for multi-agent formation
CN111260027A (en) Intelligent agent automatic decision-making method based on reinforcement learning
WO2020118730A1 (en) Compliance control method and apparatus for robot, device, and storage medium
Kebria et al. Deep imitation learning: The impact of depth on policy performance
Belmonte-Baeza et al. Meta reinforcement learning for optimal design of legged robots
CN113281999A (en) Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning
CN116027669A (en) Self-adaptive sliding mode control method and system for high-speed train and electronic equipment
Torabi et al. Sample-efficient adversarial imitation learning from observation
Ma et al. An efficient robot precision assembly skill learning framework based on several demonstrations
Hilleli et al. Toward deep reinforcement learning without a simulator: An autonomous steering example
Brosseit et al. Distilled domain randomization
CN115366099A (en) Mechanical arm depth certainty strategy gradient training method based on forward kinematics
Hachiya et al. Efficient sample reuse in EM-based policy search
CN114386620A (en) Offline multi-agent reinforcement learning method based on action constraint
El-Fakdi et al. Autonomous underwater vehicle control using reinforcement learning policy search methods
Zhang et al. Tracking control for mobile robot based on deep reinforcement learning
Muratore Randomizing physics simulations for robot learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant