CN110297490A - Heterogeneous modular robot self-reconfiguration planning method based on reinforcement learning algorithm - Google Patents
Heterogeneous modular robot self-reconfiguration planning method based on reinforcement learning algorithm Download PDF Info
- Publication number
- CN110297490A CN110297490A CN201910523043.0A CN201910523043A CN110297490A CN 110297490 A CN110297490 A CN 110297490A CN 201910523043 A CN201910523043 A CN 201910523043A CN 110297490 A CN110297490 A CN 110297490A
- Authority
- CN
- China
- Prior art keywords
- configuration
- module
- value
- node
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 20
- 238000013528 artificial neural network Methods 0.000 claims abstract description 13
- 238000012549 training Methods 0.000 claims abstract description 12
- 230000033001 locomotion Effects 0.000 claims description 37
- 230000009471 action Effects 0.000 claims description 23
- 239000011159 matrix material Substances 0.000 claims description 23
- 230000006870 function Effects 0.000 claims description 21
- 230000004913 activation Effects 0.000 claims description 18
- 238000011156 evaluation Methods 0.000 claims description 11
- 230000008569 process Effects 0.000 claims description 11
- 238000010606 normalization Methods 0.000 claims description 8
- 230000008859 change Effects 0.000 claims description 4
- 239000000203 mixture Substances 0.000 claims description 4
- 238000012935 Averaging Methods 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000010276 construction Methods 0.000 claims description 3
- 239000000243 solution Substances 0.000 description 5
- 238000005457 optimization Methods 0.000 description 4
- 238000009826 distribution Methods 0.000 description 3
- 238000012804 iterative process Methods 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000003860 storage Methods 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000004321 preservation Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The present invention relates to a heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm. First, an initial modular robot configuration and a target configuration are given, the total number of modules N is input, and the graph structure of the modules is established through an initialization procedure. With the initial configuration as the root node, a Monte Carlo tree search is built, and the search stops when a termination condition is reached (the target configuration is found, or the set number of explorations has been carried out). After each search a planned path is output and samples are saved; once the number of samples reaches a given value, the samples are fed into a neural network for training and the training parameters are updated. After the parameters are updated, the Monte Carlo search is carried out again; the average number of steps of this search result should be smaller. After each search completes, the planned path with the fewest steps is updated.
Description
Technical field
The invention belongs to the field of artificial intelligence planning, and specifically relates to using a reinforcement learning algorithm to optimize the planning of a modular robot during its autonomous deformation process.
Background art
As robots are widely applied in various fields, they are increasingly used to perform operations in unstructured environments. For unknown working environments and different tasks, a robot must be able to meet the requirement of adapting to the environment by changing its configuration. A robot whose configuration can be changed dynamically and autonomously to adapt to the needs of the environment and the task is called a self-reconfigurable modular robot. A self-reconfigurable modular robot is composed of a series of structurally simple unit modules with various functions. To achieve better task and environmental adaptability, when the multi-module system operates, each module unit performs connection and disconnection actions to change the configuration autonomously and satisfy the needs of the environment and the task. Unlike an ordinary robot, it has shed the limitation of a fixed configuration and can complete configuration changes autonomously; it therefore has significant advantages for operations in unknown environments, such as disaster rescue, nuclear power plant maintenance, and space exploration.
The deformation process of a self-reconfigurable robot composed of multiple modules is shown in Fig. 1. Although a modular robot can change its structure according to the environment and the task so as to reach an optimal configuration, motion planning is needed for every participating module during the reconfiguration movement, and the way of transforming from the current configuration to the target configuration is not unique. In theory, the more modules a configuration contains, the more solutions reach the target configuration. How to find an optimal reconfiguration solution, by reducing the number of participating modules, reducing the number of module-carrying steps, and shortening the reconfiguration time, has therefore become a key issue in the field of self-reconfiguration research.
General self-reconfiguration algorithms all target modular robots with relatively simple configurations, and their emphasis is on the deformation during the motion process. For robots with complex configurations and many modules, existing algorithms introduce intermediate configurations as transitions in order to guarantee that a solution exists from the initial configuration to the target configuration; this leads to an excessive number of module moves and low efficiency. How to eliminate intermediate configurations and realize a fast transition of a spatial modular robot directly from the initial configuration to the target configuration is a problem urgently awaiting a solution.
Intelligent algorithms have flourished in recent years, among which deep reinforcement learning has been extensively researched and applied in fields such as machine planning and control. In particular, the celebrated successes of AlphaGo and AlphaZero have demonstrated the ability of deep reinforcement learning in task planning to the fullest.
Summary of the invention
Technical problems to be solved
In order to avoid the shortcomings of the prior art and improve self-reconfiguration efficiency, the present invention proposes a heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm.
Technical solution
A heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm, wherein the modules of the robot are of identical specification and each module has at least two faces that can be used for docking, characterized by the following steps:
Step 1: Given an initial modular robot configuration and a target configuration with a total of N modules, convert the two configurations into N × N × N matrices respectively, and initialize the neural network parameters:
The configuration state s is defined as an N × N × N matrix. The modules are numbered from 1 to N, and the modular robot configuration is placed into the N × N × N matrix: a cell occupied by a module holds the corresponding module number, and a cell not occupied by a module is set to 0;
An executable action a is defined as detaching module i and connecting it to face k of module j. Assuming each module has M docking faces, the size of the action space is N × (N − 1) × M, which is also the output dimension of the policy neural network;
The reward function R is defined as the average of the sum of the distance differences of all modules. The position of module No. 1 is always taken as the datum point, with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. R is calculated from the Cartesian coordinates x_i′, y_i′ and z_i′ of each module i in the target configuration and the Cartesian coordinates x_i, y_i and z_i of module i in the current configuration;
The average action value Q is defined as the mean of the value evaluations v output by the policy-value network; the total action value W is defined as the sum of the value evaluations v; n is defined as the visit count of an action node, where Q = W/n;
The vector P is defined as the policy evaluation output by the policy-value network, i.e., the prior probabilities corresponding to the executable actions a under a given configuration; the dimension of P is exactly the dimension of the action space, namely N × (N − 1) × M;
Step 2: Take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is invoked: the policy-value network provides the prior probabilities P under the initial configuration and the score v of the current state, and the environment provides the reward function R(s). The child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion, the backup step of the Monte Carlo tree is carried out, updating the statistics saved in each node on the path from the leaf node to the root node. With the root node expanded, the selection phase can be carried out in the Monte Carlo tree, selecting the branch with the highest score as the next action. Selection, expansion and evaluation, and backup are then repeated a certain number of times; afterwards, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute. After each search, once the target configuration is found or the preset step limit is reached, the planned path is output and planning samples are saved, the sample form being (s, π, z);
The Monte Carlo tree search comprises selection, expansion and evaluation, and backup;
A) Selection
The selection phase starts from a configuration node and, following the established tree structure, selects the branch with the largest f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. U is given by
U(s, a) = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a))
where c_puct controls the balance between exploration and exploitation, n(s, b) denotes the visit counts of the parent node, and n(s, a) is the visit count of action child node a under that parent;
The whole process stops when a completely unknown configuration is encountered;
B) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is invoked: the configuration matrices s of the current configuration and the target configuration are fed into the network as input, the policy-value network returns the prior probabilities P and the score v of the current state, and the environment provides the reward function R(s) as the evaluation of the current configuration. After all feasible actions under the current configuration and the corresponding prior probabilities are obtained, the child nodes of the unknown configuration can be expanded, and the information stored in each child node is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R;
C) Backup
After the expansion and evaluation phase completes, the statistics saved in each node on the path from the leaf node to the root node are updated; the statistics include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v)/(n + 1)
D) Execution
After a) to c) are repeated a certain number of times, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute;
Step 3: After the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training. The training objective is to minimize the loss function l = (z − v)² − πᵀ log p + c‖θ‖², and the neural network parameters are updated after training completes;
The input of the policy-value network is the current module configuration matrix and the target module configuration matrix; the output is the probability P of each possible action under this configuration and the state score v;
The input current module configuration matrix and target module configuration matrix first pass through a convolutional layer consisting of 256 convolution kernels of size 3 × 3 with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a string of residual modules. Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the nonlinear activation function ReLU, a second convolutional layer consisting of 256 3 × 3 kernels with stride 1, and a batch normalization layer; the result is then summed with the skip-connected input signal and finally output through the nonlinear activation function ReLU;
After passing through the string of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation function, and finally a fully connected layer that outputs a vector of dimension N × (N − 1) × M, corresponding to the probabilities of all possible move actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation function, followed by a fully connected layer with a 256-dimensional output and a ReLU activation function, and finally a fully connected layer with a 1-dimensional output that yields the value evaluation of the current configuration;
Step 4: After the parameters are updated, the Monte Carlo search is carried out again from the initial configuration. Steps 2-3 are repeated, continuously and iteratively searching for the optimal result; after each search completes, the planned path with the fewest steps is updated.
The number of residual modules in the string in step 3 is 3 to 5.
Beneficial effects
The invention proposes an algorithm that uses reinforcement learning to solve the modular robot self-reconfiguration problem. Under the premise of a known initial configuration and target configuration, it obtains an efficient module-carrying plan and markedly improves self-reconfiguration efficiency.
Brief description of the drawings
Fig. 1: Schematic diagram of the self-reconfiguration mechanism (the figure shows the process of a two-dimensional configuration undergoing self-reconfiguration; the actual algorithm can fully solve the self-reconfiguration problem of three-dimensional configurations)
Fig. 2: Flow of the Monte Carlo algorithm (part a represents the selection phase, part b the expansion and evaluation phase, part c the backup phase, and part d the execution phase)
Fig. 3: Overall algorithm block diagram
Specific embodiments
The invention will now be further described in conjunction with the embodiments and the accompanying drawings:
The purpose of the present invention is, for a modular robot composed of n (n > 100) modules, to use a reinforcement learning algorithm to obtain the module detachment and assembly order that takes the self-reconfigurable modular robot from an arbitrary initial configuration to a specified target configuration, so that the number of assembly operations is as small as possible and self-reconfiguration efficiency is improved.
The algorithm mainly applies to robots composed of modules of identical specification, where each module has at least two faces that can be used for docking.
To achieve the above purpose, the technical solution adopted by the present invention comprises the following steps:
Step 1: Given an initial modular robot configuration and a target configuration with a total of N modules, convert the two configurations into N × N × N matrices respectively, and initialize the neural network parameters.
Step 2: Take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is invoked: the policy-value network provides the prior probabilities P under the initial configuration and the score v of the current state, and the environment provides the reward function R(s). The child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion, the backup step of the Monte Carlo tree is carried out, updating the statistics saved in each node on the path from the leaf node to the root node. With the root node expanded, the selection phase can be carried out in the Monte Carlo tree, selecting the branch with the highest score as the next action. Selection, expansion and evaluation, and backup are then repeated a certain number of times; afterwards, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute. After each search (the target configuration is found, or the preset step limit is reached), the planned path is output and planning samples are saved, the sample form being (s, π, z).
Step 3: After the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training. The training objective is to minimize the loss function l = (z − v)² − πᵀ log p + c‖θ‖², and the neural network parameters are updated after training completes.
Step 4: After the parameters are updated, the Monte Carlo search is carried out again from the initial configuration. Steps 2 and 3 are repeated, continuously and iteratively searching for the optimal result. After each search completes, the planned path with the fewest steps is updated. As training improves, fewer and fewer steps are needed to complete the configuration transformation.
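For illustration only, the outer loop of steps 2-4 can be organized as in the following Python sketch; the callables run_search and train are hypothetical placeholders for the Monte Carlo search of step 2 and the network training of step 3.

```python
def self_reconfiguration_planning(initial, target, run_search, train,
                                  sample_threshold=100, max_rounds=50):
    """Alternate Monte Carlo tree search and network training, keeping
    the shortest plan found so far.

    run_search(initial, target) -> (path or None, list of (s, pi, z));
    train(samples) updates the policy-value network in place.
    """
    buffer, best_path = [], None
    for _ in range(max_rounds):
        path, samples = run_search(initial, target)   # step 2
        buffer.extend(samples)
        # Step 4: retain only the plan with the fewest steps.
        if path is not None and (best_path is None or len(path) < len(best_path)):
            best_path = path
        # Step 3: train once enough samples have accumulated.
        if len(buffer) >= sample_threshold:
            train(buffer)
            buffer.clear()
    return best_path
```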
The specific steps are as follows:
Step 1: Parameter definitions
Let N be the total number of modules constituting the spatial modular robot.
The configuration state s is defined as an N × N × N matrix. The modules are numbered from 1 to N, and the modular robot configuration is placed into the N × N × N matrix: a cell occupied by a module holds the corresponding module number, and a cell not occupied by a module is set to 0.
An executable action a is defined as detaching module i and connecting it to face k of module j. Assuming each module has M docking faces, the size of the action space is N × (N − 1) × M, which is also the output dimension of the policy neural network.
The reward function R is defined as the average of the sum of the distance differences of all modules. The position of module No. 1 is always taken as the datum point, with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. R is calculated from the Cartesian coordinates x_i′, y_i′ and z_i′ of each module i in the target configuration and the Cartesian coordinates x_i, y_i and z_i of module i in the current configuration.
The average action value Q is defined as the mean of the value evaluations v output by the policy-value network; the total action value W is defined as the sum of the value evaluations v; n is defined as the visit count of an action node, where Q = W/n.
The vector P is defined as the policy evaluation output by the policy-value network; the dimension of this vector is exactly the dimension of the action space, namely N × (N − 1) × M.
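As a minimal sketch of these definitions, the state encoding, action-space size and reward can be written as below. The Euclidean distance and the negative sign inside the reward are assumptions: the text only defines R as the average of the module distance differences, without reproducing the formula.

```python
import numpy as np

def configuration_matrix(positions, N):
    """Encode a configuration as an N x N x N matrix: a cell occupied by
    a module holds the module number (1..N), empty cells hold 0.
    positions maps module number -> integer lattice coordinate (x, y, z),
    with module No. 1 fixed at the datum point (0, 0, 0)."""
    s = np.zeros((N, N, N), dtype=np.int32)
    for module_id, (x, y, z) in positions.items():
        s[x, y, z] = module_id
    return s

def action_space_size(N, M):
    """One action a = (detach module i, attach to face k of module j)."""
    return N * (N - 1) * M

def reward(current, target, N):
    """Average distance between each module's current and target
    coordinates, negated so that a larger reward means closer to the
    target (sign convention assumed here)."""
    dists = [np.linalg.norm(np.subtract(current[i], target[i]))
             for i in range(1, N + 1)]
    return -float(np.mean(dists))
```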
Step 2: Building the Monte Carlo tree
Each node of the Monte Carlo tree stores the information that determines how an action a is selected under a configuration s. This information includes the visit count n(s, a) of action node a at configuration s, the reward function R(s), the total action value W(s, a), the average action value Q(s, a), and the prior probability P(s, a) of selecting action a at configuration s.
The process of building the tree comprises selection, expansion and evaluation, and backup. Through a continuous iterative process, the action probability distribution π under each configuration s is obtained, and the next action is executed according to the probability distribution π.
A) Selection
The selection phase starts from a configuration node and, following the established tree structure, selects the branch with the largest f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. U is given by
U(s, a) = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a))
where c_puct controls the balance between exploration and exploitation, P(s, a) is the prior probability, n(s, b) denotes the visit counts of the parent node, and n(s, a) is the visit count of action child node a under that parent.
The whole process stops when a completely unknown configuration is encountered. The selection process comprehensively considers the action value, the degree to which a node has been explored, and the node's success rate in the configuration transformation process.
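A minimal sketch of the selection rule follows; the Edge record and the exact form of U are reconstructed from the quantities named above (the AlphaZero-style bound is an assumption).

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    n: int = 0       # visit count n(s, a)
    W: float = 0.0   # total action value W(s, a)
    Q: float = 0.0   # average action value Q(s, a)
    P: float = 0.0   # prior probability P(s, a)
    R: float = 0.0   # reward R(s') of the resulting configuration

def select_action(children, c_puct=1.0):
    """children: dict mapping action -> Edge.
    Returns the action maximizing f(s) = Q + U + R."""
    parent_visits = sum(e.n for e in children.values())  # sum_b n(s, b)
    def f(e):
        u = c_puct * e.P * math.sqrt(parent_visits) / (1 + e.n)
        return e.Q + u + e.R
    return max(children, key=lambda a: f(children[a]))
```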
B) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is invoked: the configuration matrices s of the current configuration and the target configuration are fed into the network as input, the policy-value network returns the prior probabilities P and the score v of the current state, and the environment provides the reward function R(s) as the evaluation of the current configuration, where the vector P gives the probability corresponding to each executable action a under the current configuration. After all feasible actions under the current configuration and the corresponding prior probabilities are obtained, the child nodes of the unknown configuration can be expanded, and the information stored in each child node is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R.
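Continuing the Edge sketch above, the expansion step might look as follows; network and environment are hypothetical callables standing in for the policy-value network and the reward computation.

```python
def expand(node_children, state, target, feasible_actions, network, environment):
    """Query the policy-value network once for the unknown configuration,
    then create one child edge per feasible action with the initial
    statistics n = 0, W = 0, Q = 0, P = p, R = R(s')."""
    priors, v = network(state, target)        # P over all actions, score v
    for a in feasible_actions:
        node_children[a] = Edge(n=0, W=0.0, Q=0.0,
                                P=priors[a],
                                R=environment.reward_after(state, a))
    return v                                  # backed up along the path
```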
C) Backup
After the expansion and evaluation phase completes, the statistics saved in each node on the path from the leaf node to the root node are updated; these include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v)/(n + 1)
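A sketch of the backup step, applying the three update rules to every edge on the path from the root to the expanded leaf:

```python
def backup(path_edges, v):
    """path_edges: Edge objects traversed from root to leaf;
    v: value score of the leaf returned by the policy-value network."""
    for edge in path_edges:
        edge.n += 1                 # n = n + 1
        edge.W += v                 # W = W + v
        edge.Q = edge.W / edge.n    # equals (W + v)/(n + 1) after the updates
    # statistics are updated in place; nothing is returned
```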
D) Execution
After steps a) to c) are repeated m times (m is set to 400 here, and can be adjusted according to the configuration complexity and the number of modules), the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute.
Steps a) to d) are iterated until the target configuration is found or the iteration limit k is reached (k is determined by the number of modules and the configuration complexity). After the stopping criterion is met (the target configuration is found, or the iteration limit is reached), the iterative process is saved as a series of samples. Each sample is a tuple (s, π, z), where s is the description of the current configuration, π is the action probability distribution returned by MCTS under that configuration, and z is the evaluation of the episode after the planning finishes: z is 1 if the target configuration is found within the iteration limit k, and 0 if the iteration limit k is exceeded without finding the target configuration. These samples are later used for the training and parameter optimization of the policy-value network.
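Continuing the same sketch, the execution step and sample generation might be written as below; z is filled in only once the episode ends.

```python
import numpy as np

def visit_policy(children):
    """pi proportional to the visit counts of the branches."""
    actions = list(children)
    counts = np.array([children[a].n for a in actions], dtype=np.float64)
    pi = counts / counts.sum()
    return actions, pi

def choose_and_record(s, children, episode_samples):
    """Record (s, pi, z-to-be-filled) and return the most-visited action."""
    actions, pi = visit_policy(children)
    episode_samples.append([s, dict(zip(actions, pi)), None])
    return actions[int(np.argmax(pi))]

def finalize_episode(episode_samples, target_found):
    """z = 1 if the target configuration was found within the limit, else 0."""
    z = 1 if target_found else 0
    for sample in episode_samples:
        sample[2] = z
```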
Step 3: Building the policy-value network
By iterating the Monte Carlo tree search, training samples become available for optimizing the neural network; the generation process of the samples is shown in Fig. 2.
The input of the policy-value network is defined here as the current module configuration matrix and the target module configuration matrix; the output is the probability P of each possible action under the configuration (the policy part) and the state score v (the value part).
The input configuration matrices s first pass through a convolutional layer consisting of 256 convolution kernels of size 3 × 3 with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a string of residual modules (3 to 5 of them). Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the nonlinear activation function ReLU, a second convolutional layer consisting of 256 3 × 3 kernels with stride 1, and a batch normalization layer; the result is then summed with the skip-connected input signal and finally output through the nonlinear activation function ReLU.
After passing through the series of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation function, and finally a fully connected layer that outputs a vector of dimension N × (N − 1) × M, corresponding to the probabilities of all possible move actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation function, followed by a fully connected layer with a 256-dimensional output and a ReLU activation function, and finally a fully connected layer with a 1-dimensional output that yields the value evaluation of the current configuration.
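A PyTorch sketch of this architecture is given below. Since the configuration state is an N × N × N grid, 3-D convolutions are used here, which is an adaptation assumption (the text describes the kernels in the 2-D AlphaZero style); the two configuration matrices are stacked as two input channels, and the policy head returns logits to which a softmax is applied to obtain P.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv1 = nn.Conv3d(ch, ch, 3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm3d(ch)
        self.conv2 = nn.Conv3d(ch, ch, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm3d(ch)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)            # skip connection, then ReLU

class PolicyValueNet(nn.Module):
    def __init__(self, N, M, blocks=4, ch=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(2, ch, 3, stride=1, padding=1),
            nn.BatchNorm3d(ch), nn.ReLU())
        self.res = nn.Sequential(*[ResidualBlock(ch) for _ in range(blocks)])
        # Policy head: two 1x1 kernels -> BN -> ReLU -> fully connected.
        self.policy_head = nn.Sequential(
            nn.Conv3d(ch, 2, 1), nn.BatchNorm3d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * N ** 3, N * (N - 1) * M))
        # Value head: one 1x1 kernel -> BN -> ReLU -> FC(256) -> ReLU -> FC(1).
        self.value_head = nn.Sequential(
            nn.Conv3d(ch, 1, 1), nn.BatchNorm3d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(N ** 3, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, current, target):
        # current, target: (batch, N, N, N) configuration matrices.
        x = torch.stack([current, target], dim=1).float()
        x = self.res(self.stem(x))
        return self.policy_head(x), self.value_head(x)  # policy logits, v
```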
During the iteration of the Monte Carlo tree, a certain number of samples accumulate. When the number of samples reaches a given value (generally 100 or more), training of the neural network can begin. The goal of optimizing the neural network is to make the action probabilities p predicted by the network and its value feedback v for a configuration s fit π and z, respectively, from the samples (s, π, z). To this end the loss function is defined as:
l = (z − v)² − πᵀ log p + c‖θ‖²
The goal of policy-value network training is to minimize this loss function on the saved data set, where θ denotes the neural network parameters and c is a parameter controlling the degree of regularization.
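A sketch of one training step under this objective follows; implementing the c‖θ‖² term through the optimizer's weight_decay is an equivalent, commonly used choice, not something prescribed by the text.

```python
import torch
import torch.nn.functional as F

def loss_fn(policy_logits, v, pi_target, z_target):
    """l = (z - v)^2 - pi^T log p; the L2 term is handled by weight_decay."""
    value_loss = F.mse_loss(v.squeeze(-1), z_target)
    log_p = F.log_softmax(policy_logits, dim=1)
    policy_loss = -(pi_target * log_p).sum(dim=1).mean()
    return value_loss + policy_loss

def train_step(net, optimizer, batch):
    s_cur, s_tgt, pi, z = batch               # tensors built from (s, pi, z)
    policy_logits, v = net(s_cur, s_tgt)
    loss = loss_fn(policy_logits, v, pi, z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Example: optimizer = torch.optim.Adam(net.parameters(), weight_decay=1e-4)
```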
Step 4: Saving the shortest-path search result
After one complete Monte Carlo search is carried out (i.e., the target configuration is reached), the planned path from the initial configuration to the target configuration and the total number of steps are saved. During the iterative process, only the search result with the fewest steps is retained. Clearly, as the number of iterations increases, the total number of steps decreases and the search result improves.
Claims (2)
1. A heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm, wherein the modules of the robot are of identical specification and each module has at least two faces that can be used for docking, characterized by the following steps:
Step 1: Given an initial modular robot configuration and a target configuration with a total of N modules, convert the two configurations into N × N × N matrices respectively, and initialize the neural network parameters:
The configuration state s is defined as an N × N × N matrix. The modules are numbered from 1 to N, and the modular robot configuration is placed into the N × N × N matrix: a cell occupied by a module holds the corresponding module number, and a cell not occupied by a module is set to 0;
An executable action a is defined as detaching module i and connecting it to face k of module j. Assuming each module has M docking faces, the size of the action space is N × (N − 1) × M, which is also the output dimension of the policy neural network;
The reward function R is defined as the average of the sum of the distance differences of all modules. The position of module No. 1 is always taken as the datum point, with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. R is calculated from the Cartesian coordinates x_i′, y_i′ and z_i′ of each module i in the target configuration and the Cartesian coordinates x_i, y_i and z_i of module i in the current configuration;
The average action value Q is defined as the mean of the value evaluations v output by the policy-value network; the total action value W is defined as the sum of the value evaluations v; n is defined as the visit count of an action node, where Q = W/n;
The vector P is defined as the policy evaluation output by the policy-value network, i.e., the prior probabilities corresponding to the executable actions a under a given configuration; the dimension of P is exactly the dimension of the action space, namely N × (N − 1) × M;
Step 2: Take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is invoked: the policy-value network provides the prior probabilities P under the initial configuration and the score v of the current state, and the environment provides the reward function R(s). The child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion, the backup step of the Monte Carlo tree is carried out, updating the statistics saved in each node on the path from the leaf node to the root node. With the root node expanded, the selection phase can be carried out in the Monte Carlo tree, selecting the branch with the highest score as the next action. Selection, expansion and evaluation, and backup are then repeated a certain number of times; afterwards, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute. After each search, once the target configuration is found or the preset step limit is reached, the planned path is output and planning samples are saved, the sample form being (s, π, z);
The Monte Carlo tree search comprises selection, expansion and evaluation, and backup;
A) Selection
The selection phase starts from a configuration node and, following the established tree structure, selects the branch with the largest f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. U is given by
U(s, a) = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a))
where c_puct controls the balance between exploration and exploitation, n(s, b) denotes the visit counts of the parent node, and n(s, a) is the visit count of action child node a under that parent;
The whole process stops when a completely unknown configuration is encountered;
B) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is invoked: the configuration matrices s of the current configuration and the target configuration are fed into the network as input, the policy-value network returns the prior probabilities P and the score v of the current state, and the environment provides the reward function R(s) as the evaluation of the current configuration. After all feasible actions under the current configuration and the corresponding prior probabilities are obtained, the child nodes of the unknown configuration can be expanded, and the information stored in each child node is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R;
C) Backup
After the expansion and evaluation phase completes, the statistics saved in each node on the path from the leaf node to the root node are updated; the statistics include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v)/(n + 1)
D) Execution
After a) to c) are repeated a certain number of times, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute;
Step 3: After the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training. The training objective is to minimize the loss function l = (z − v)² − πᵀ log p + c‖θ‖², and the neural network parameters are updated after training completes;
The input of the policy-value network is the current module configuration matrix and the target module configuration matrix; the output is the probability P of each possible action under this configuration and the state score v;
The input current module configuration matrix and target module configuration matrix first pass through a convolutional layer consisting of 256 convolution kernels of size 3 × 3 with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a string of residual modules. Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the nonlinear activation function ReLU, a second convolutional layer consisting of 256 3 × 3 kernels with stride 1, and a batch normalization layer; the result is then summed with the skip-connected input signal and finally output through the nonlinear activation function ReLU;
After passing through the string of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation function, and finally a fully connected layer that outputs a vector of dimension N × (N − 1) × M, corresponding to the probabilities of all possible move actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation function, followed by a fully connected layer with a 256-dimensional output and a ReLU activation function, and finally a fully connected layer with a 1-dimensional output that yields the value evaluation of the current configuration;
Step 4: After the parameters are updated, the Monte Carlo search is carried out again from the initial configuration. Steps 2-3 are repeated, continuously and iteratively searching for the optimal result; after each search completes, the planned path with the fewest steps is updated.
2. The heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm according to claim 1, characterized in that the number of residual modules in the string in step 3 is 3 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910523043.0A CN110297490B (en) | 2019-06-17 | 2019-06-17 | Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910523043.0A CN110297490B (en) | 2019-06-17 | 2019-06-17 | Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110297490A true CN110297490A (en) | 2019-10-01 |
CN110297490B CN110297490B (en) | 2022-06-07 |
Family
ID=68028152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910523043.0A Active CN110297490B (en) | 2019-06-17 | 2019-06-17 | Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110297490B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909146A (en) * | 2019-11-29 | 2020-03-24 | 支付宝(杭州)信息技术有限公司 | Label pushing model training method, device and equipment for pushing question-back labels |
CN111104732A (en) * | 2019-12-03 | 2020-05-05 | 中国人民解放军国防科技大学 | Intelligent planning method for mobile communication network based on deep reinforcement learning |
CN111230875A (en) * | 2020-02-06 | 2020-06-05 | 北京凡川智能机器人科技有限公司 | Double-arm robot humanoid operation planning method based on deep learning |
CN111679679A (en) * | 2020-07-06 | 2020-09-18 | 哈尔滨工业大学 | Robot state planning method based on Monte Carlo tree search algorithm |
CN112264999A (en) * | 2020-10-28 | 2021-01-26 | 复旦大学 | Method, device and storage medium for intelligent agent continuous space action planning |
CN113704098A (en) * | 2021-08-18 | 2021-11-26 | 武汉大学 | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling |
CN114020024A (en) * | 2021-11-05 | 2022-02-08 | 南京理工大学 | Unmanned aerial vehicle path planning method based on Monte Carlo tree search |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503373A (en) * | 2016-11-04 | 2017-03-15 | 湘潭大学 | The method for planning track that a kind of Dual-robot coordination based on B-spline curves is assembled |
CN106931970A (en) * | 2015-12-30 | 2017-07-07 | 北京雷动云合智能技术有限公司 | Robot security's contexture by self air navigation aid in a kind of dynamic environment |
CN107161357A (en) * | 2017-04-27 | 2017-09-15 | 西北工业大学 | A kind of via Self-reconfiguration Method of restructural spacecraft |
CN107471206A (en) * | 2017-08-16 | 2017-12-15 | 大连交通大学 | A kind of modularization industrial robot reconfiguration system and its control method |
CN107591844A (en) * | 2017-09-22 | 2018-01-16 | 东南大学 | Consider the probabilistic active distribution network robust reconstructing method of node injecting power |
WO2018154153A2 (en) * | 2017-11-27 | 2018-08-30 | Erle Robotics, S.L. | Method for designing modular robots |
CN109871943A (en) * | 2019-02-20 | 2019-06-11 | 华南理工大学 | A kind of depth enhancing learning method for big two three-wheel arrangement of pineapple playing card |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106931970A (en) * | 2015-12-30 | 2017-07-07 | 北京雷动云合智能技术有限公司 | Robot security's contexture by self air navigation aid in a kind of dynamic environment |
CN106503373A (en) * | 2016-11-04 | 2017-03-15 | 湘潭大学 | The method for planning track that a kind of Dual-robot coordination based on B-spline curves is assembled |
CN107161357A (en) * | 2017-04-27 | 2017-09-15 | 西北工业大学 | A kind of via Self-reconfiguration Method of restructural spacecraft |
CN107471206A (en) * | 2017-08-16 | 2017-12-15 | 大连交通大学 | A kind of modularization industrial robot reconfiguration system and its control method |
CN107591844A (en) * | 2017-09-22 | 2018-01-16 | 东南大学 | Consider the probabilistic active distribution network robust reconstructing method of node injecting power |
WO2018154153A2 (en) * | 2017-11-27 | 2018-08-30 | Erle Robotics, S.L. | Method for designing modular robots |
CN109871943A (en) * | 2019-02-20 | 2019-06-11 | 华南理工大学 | A kind of depth enhancing learning method for big two three-wheel arrangement of pineapple playing card |
Non-Patent Citations (5)
Title |
---|
FEILI HOU: "Graph-based optimal reconfiguration planning for self-reconfigurable robots", Robotics and Autonomous Systems *
YIFEI ZHANG: "Reconfiguration Planning for Heterogeneous Cellular", Proceedings of the 2017 18th *
YUAN DANDAN: "Workspace analysis of modular robots based on the Monte Carlo method", Machine Tool & Hydraulics *
FEI YANQIONG: "Structure of self-reconfigurable modular robots", Journal of Shanghai Jiao Tong University *
HUANG PANFENG: "Attitude takeover control of spacecraft with unknown parameters", Control and Decision *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909146B (en) * | 2019-11-29 | 2022-09-09 | 支付宝(杭州)信息技术有限公司 | Label pushing model training method, device and equipment for pushing question-back labels |
CN110909146A (en) * | 2019-11-29 | 2020-03-24 | 支付宝(杭州)信息技术有限公司 | Label pushing model training method, device and equipment for pushing question-back labels |
CN111104732A (en) * | 2019-12-03 | 2020-05-05 | 中国人民解放军国防科技大学 | Intelligent planning method for mobile communication network based on deep reinforcement learning |
CN111104732B (en) * | 2019-12-03 | 2022-09-13 | 中国人民解放军国防科技大学 | Intelligent planning method for mobile communication network based on deep reinforcement learning |
CN111230875A (en) * | 2020-02-06 | 2020-06-05 | 北京凡川智能机器人科技有限公司 | Double-arm robot humanoid operation planning method based on deep learning |
CN111230875B (en) * | 2020-02-06 | 2023-05-12 | 北京凡川智能机器人科技有限公司 | Double-arm robot humanoid operation planning method based on deep learning |
CN111679679A (en) * | 2020-07-06 | 2020-09-18 | 哈尔滨工业大学 | Robot state planning method based on Monte Carlo tree search algorithm |
WO2022007199A1 (en) * | 2020-07-06 | 2022-01-13 | 哈尔滨工业大学 | Robot state planning method based on monte carlo tree search algorithm |
CN112264999A (en) * | 2020-10-28 | 2021-01-26 | 复旦大学 | Method, device and storage medium for intelligent agent continuous space action planning |
CN113704098A (en) * | 2021-08-18 | 2021-11-26 | 武汉大学 | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling |
CN113704098B (en) * | 2021-08-18 | 2023-09-22 | 武汉大学 | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling |
CN114020024A (en) * | 2021-11-05 | 2022-02-08 | 南京理工大学 | Unmanned aerial vehicle path planning method based on Monte Carlo tree search |
CN114020024B (en) * | 2021-11-05 | 2023-03-31 | 南京理工大学 | Unmanned aerial vehicle path planning method based on Monte Carlo tree search |
Also Published As
Publication number | Publication date |
---|---|
CN110297490B (en) | 2022-06-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110297490A (en) | Heterogeneous modular robot self-reconfiguration planning method based on reinforcement learning algorithm | |
CN108573303A (en) | Complex network local-failure self-recovery strategy based on improved reinforcement learning | |
CN109241291A (en) | Knowledge graph optimal path query system and method based on deep reinforcement learning | |
WO2020029583A1 (en) | Multiplication and addition calculation method and calculation circuit suitable for neural network | |
CN105467997A (en) | Warehouse robot path planning method based on linear temporal logic theory | |
CN105509749A (en) | Mobile robot path planning method and system based on genetic ant colony algorithm | |
CN109409510A (en) | Neuron circuit, chip, system and method, storage medium | |
CN108921298A (en) | Multi-agent communication and decision-making method based on reinforcement learning | |
CN110188880A (en) | Quantization method and device for a deep neural network | |
CN110883776A (en) | Robot path planning algorithm based on improved DQN under a fast search mechanism | |
CN105978732A (en) | Method and system for optimizing parameters of minimum complexity echo state network based on particle swarm | |
CN103646008A (en) | Web service composition method | |
CN104050505A (en) | Multilayer-perceptron training method based on bee colony algorithm with learning factor | |
CN108536144A (en) | Path planning method fusing a dense convolutional network and a dueling architecture | |
CN111159489A (en) | Searching method | |
CN113807040A (en) | Optimal design method for microwave circuit | |
Du et al. | Application of an improved whale optimization algorithm in time-optimal trajectory planning for manipulators | |
CN116841303A (en) | Intelligent preferential high-order iterative self-learning control method for underwater robot | |
CN107273970B (en) | Reconfigurable platform of convolutional neural network supporting online learning and construction method thereof | |
CN115327926A (en) | Multi-agent dynamic coverage control method and system based on deep reinforcement learning | |
CN115271254A (en) | Short-term wind power prediction method for optimizing extreme learning machine based on gull algorithm | |
CN114564039A (en) | Flight path planning method based on deep Q network and fast search random tree algorithm | |
CN112001558A (en) | Method and device for researching optimal operation mode of power distribution network equipment | |
Verma et al. | A novel evolutionary neural learning algorithm | |
Han et al. | An improved ant colony optimization algorithm based on dynamic control of solution construction and mergence of local search solutions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |