CN110297490A - Heterogeneous modular robot self-reconfiguration planning method based on reinforcement learning algorithm - Google Patents
Heterogeneous modular robot self-reconfiguration planning method based on reinforcement learning algorithm Download PDF Info
- Publication number
- CN110297490A CN110297490A CN201910523043.0A CN201910523043A CN110297490A CN 110297490 A CN110297490 A CN 110297490A CN 201910523043 A CN201910523043 A CN 201910523043A CN 110297490 A CN110297490 A CN 110297490A
- Authority
- CN
- China
- Prior art keywords
- configuration
- module
- value
- node
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 20
- 238000013528 artificial neural network Methods 0.000 claims abstract description 13
- 238000012549 training Methods 0.000 claims abstract description 12
- 230000033001 locomotion Effects 0.000 claims description 37
- 230000009471 action Effects 0.000 claims description 23
- 239000011159 matrix material Substances 0.000 claims description 23
- 230000006870 function Effects 0.000 claims description 21
- 230000004913 activation Effects 0.000 claims description 18
- 238000011156 evaluation Methods 0.000 claims description 11
- 230000008569 process Effects 0.000 claims description 11
- 238000010606 normalization Methods 0.000 claims description 8
- 230000008859 change Effects 0.000 claims description 4
- 239000000203 mixture Substances 0.000 claims description 4
- 238000012935 Averaging Methods 0.000 claims description 3
- 238000004364 calculation method Methods 0.000 claims description 3
- 238000010276 construction Methods 0.000 claims description 3
- 239000000243 solution Substances 0.000 description 5
- 238000005457 optimization Methods 0.000 description 4
- 238000009826 distribution Methods 0.000 description 3
- 238000012804 iterative process Methods 0.000 description 3
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 238000003860 storage Methods 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000008901 benefit Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000012423 maintenance Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000004321 preservation Methods 0.000 description 1
- 230000001737 promoting effect Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0212—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
The present invention relates to a heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm. First, an initial modular robot configuration and a target configuration are given, the total number of modules N is input, and the graph structure of the modules is established through an initialization procedure. With the initial configuration as the root node, a Monte Carlo tree search is built, and the search stops when a termination condition is reached (the target configuration is found, or the set number of explorations has been carried out). After each search a planned path is output and samples are saved; once the number of samples reaches a given value, the samples are fed into a neural network for training and the training parameters are updated. After the parameters are updated, the Monte Carlo search is carried out again; the average number of steps of this search result should be smaller. After each search completes, the planned path with the fewest steps is updated.
Description
Technical field
The invention belongs to the field of artificial intelligence planning, and specifically relates to using a reinforcement learning algorithm to optimize the planning of a modular robot during its autonomous deformation process.
Background art
As robots are widely applied in various fields, they are increasingly used to perform operations in unstructured environments. For unknown working environments and different tasks, a robot must be able to meet the requirement of adapting to the environment by changing its configuration. A robot whose configuration can be changed dynamically and autonomously to adapt to the needs of the environment and the task is called a self-reconfigurable modular robot. A self-reconfigurable modular robot is composed of a series of structurally simple unit modules with various functions. To achieve better task and environmental adaptability, when the multi-module system operates, each module unit performs connection and disconnection actions to change the configuration autonomously and satisfy the needs of the environment and the task. Unlike an ordinary robot, it has shed the limitation of a fixed configuration and can complete configuration changes autonomously; it therefore has significant advantages for operations in unknown environments, such as disaster rescue, nuclear power plant maintenance, and space exploration.
The deformation process of a self-reconfigurable robot composed of multiple modules is shown in Fig. 1. Although a modular robot can change its structure according to the environment and the task so as to reach an optimal configuration, motion planning is needed for every participating module during the reconfiguration movement, and the way of transforming from the current configuration to the target configuration is not unique. In theory, the more modules a configuration contains, the more solutions reach the target configuration. How to find an optimal reconfiguration solution, by reducing the number of participating modules, reducing the number of module-carrying steps, and shortening the reconfiguration time, has therefore become a key issue in the field of self-reconfiguration research.
General self-reconfiguration algorithms all target modular robots with relatively simple configurations, and their emphasis is on the deformation during the motion process. For robots with complex configurations and many modules, existing algorithms introduce intermediate configurations as transitions in order to guarantee that a solution exists from the initial configuration to the target configuration; this leads to an excessive number of module moves and low efficiency. How to eliminate intermediate configurations and realize a fast transition of a spatial modular robot directly from the initial configuration to the target configuration is a problem urgently awaiting a solution.
Intelligent algorithms have flourished in recent years, among which deep reinforcement learning has been extensively researched and applied in fields such as machine planning and control. In particular, the celebrated successes of AlphaGo and AlphaZero have demonstrated the ability of deep reinforcement learning in task planning to the fullest.
Summary of the invention
Technical problems to be solved
In order to avoid the shortcomings of the prior art and improve self-reconfiguration efficiency, the present invention proposes a heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm.
Technical solution
A heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm, wherein the modules of the robot are of identical specification and each module has at least two faces that can be used for docking, characterized by the following steps:
Step 1: Given an initial modular robot configuration and a target configuration with a total of N modules, convert the two configurations into N × N × N matrices respectively, and initialize the neural network parameters:
The configuration state s is defined as an N × N × N matrix. The modules are numbered from 1 to N, and the modular robot configuration is placed into the N × N × N matrix: a cell occupied by a module holds the corresponding module number, and a cell not occupied by a module is set to 0;
An executable action a is defined as detaching module i and connecting it to face k of module j. Assuming each module has M docking faces, the size of the action space is N × (N − 1) × M, which is also the output dimension of the policy neural network;
The reward function R is defined as the average of the sum of the distance differences of all modules. The position of module No. 1 is always taken as the datum point, with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. R is calculated from the Cartesian coordinates x_i′, y_i′ and z_i′ of each module i in the target configuration and the Cartesian coordinates x_i, y_i and z_i of module i in the current configuration;
The average action value Q is defined as the mean of the value evaluations v output by the policy-value network; the total action value W is defined as the sum of the value evaluations v; n is defined as the visit count of an action node, where Q = W/n;
The vector P is defined as the policy evaluation output by the policy-value network, i.e., the prior probabilities corresponding to the executable actions a under a given configuration; the dimension of P is exactly the dimension of the action space, namely N × (N − 1) × M;
Step 2: Take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is invoked: the policy-value network provides the prior probabilities P under the initial configuration and the score v of the current state, and the environment provides the reward function R(s). The child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion, the backup step of the Monte Carlo tree is carried out, updating the statistics saved in each node on the path from the leaf node to the root node. With the root node expanded, the selection phase can be carried out in the Monte Carlo tree, selecting the branch with the highest score as the next action. Selection, expansion and evaluation, and backup are then repeated a certain number of times; afterwards, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute. After each search, once the target configuration is found or the preset step limit is reached, the planned path is output and planning samples are saved, the sample form being (s, π, z);
The Monte Carlo tree search comprises selection, expansion and evaluation, and backup;
A) Selection
The selection phase starts from a configuration node and, following the established tree structure, selects the branch with the largest f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. U is given by
U(s, a) = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a))
where c_puct controls the balance between exploration and exploitation, n(s, b) denotes the visit counts of the parent node, and n(s, a) is the visit count of action child node a under that parent;
The whole process stops when a completely unknown configuration is encountered;
B) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is invoked: the configuration matrices s of the current configuration and the target configuration are fed into the network as input, the policy-value network returns the prior probabilities P and the score v of the current state, and the environment provides the reward function R(s) as the evaluation of the current configuration. After all feasible actions under the current configuration and the corresponding prior probabilities are obtained, the child nodes of the unknown configuration can be expanded, and the information stored in each child node is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R;
C) Backup
After the expansion and evaluation phase completes, the statistics saved in each node on the path from the leaf node to the root node are updated; the statistics include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v)/(n + 1)
D) Execution
After a) to c) are repeated a certain number of times, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute;
Step 3: After the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training. The training objective is to minimize the loss function l = (z − v)² − πᵀ log p + c‖θ‖², and the neural network parameters are updated after training completes;
The input of the policy-value network is the current module configuration matrix and the target module configuration matrix; the output is the probability P of each possible action under this configuration and the state score v;
The input current module configuration matrix and target module configuration matrix first pass through a convolutional layer consisting of 256 convolution kernels of size 3 × 3 with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a string of residual modules. Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the nonlinear activation function ReLU, a second convolutional layer consisting of 256 3 × 3 kernels with stride 1, and a batch normalization layer; the result is then summed with the skip-connected input signal and finally output through the nonlinear activation function ReLU;
After passing through the string of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation function, and finally a fully connected layer that outputs a vector of dimension N × (N − 1) × M, corresponding to the probabilities of all possible move actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation function, followed by a fully connected layer with a 256-dimensional output and a ReLU activation function, and finally a fully connected layer with a 1-dimensional output that yields the value evaluation of the current configuration;
Step 4: After the parameters are updated, the Monte Carlo search is carried out again from the initial configuration. Steps 2-3 are repeated, continuously and iteratively searching for the optimal result; after each search completes, the planned path with the fewest steps is updated.
The number of residual modules in the string in step 3 is 3 to 5.
Beneficial effects
The invention proposes an algorithm that uses reinforcement learning to solve the modular robot self-reconfiguration problem. Under the premise of a known initial configuration and target configuration, it obtains an efficient module-carrying plan and markedly improves self-reconfiguration efficiency.
Brief description of the drawings
Fig. 1: Schematic diagram of the self-reconfiguration mechanism (the figure shows the process of a two-dimensional configuration undergoing self-reconfiguration; the actual algorithm can fully solve the self-reconfiguration problem of three-dimensional configurations)
Fig. 2: Flow of the Monte Carlo algorithm (part a represents the selection phase, part b the expansion and evaluation phase, part c the backup phase, and part d the execution phase)
Fig. 3: Overall algorithm block diagram
Specific embodiments
The invention will now be further described in conjunction with the embodiments and the accompanying drawings:
The purpose of the present invention is, for a modular robot composed of n (n > 100) modules, to use a reinforcement learning algorithm to obtain the module detachment and assembly order that takes the self-reconfigurable modular robot from an arbitrary initial configuration to a specified target configuration, so that the number of assembly operations is as small as possible and self-reconfiguration efficiency is improved.
The algorithm mainly applies to robots composed of modules of identical specification, where each module has at least two faces that can be used for docking.
To achieve the above purpose, the technical solution adopted by the present invention comprises the following steps:
Step 1: Given an initial modular robot configuration and a target configuration with a total of N modules, convert the two configurations into N × N × N matrices respectively, and initialize the neural network parameters.
Step 2: Take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is invoked: the policy-value network provides the prior probabilities P under the initial configuration and the score v of the current state, and the environment provides the reward function R(s). The child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion, the backup step of the Monte Carlo tree is carried out, updating the statistics saved in each node on the path from the leaf node to the root node. With the root node expanded, the selection phase can be carried out in the Monte Carlo tree, selecting the branch with the highest score as the next action. Selection, expansion and evaluation, and backup are then repeated a certain number of times; afterwards, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute. After each search (the target configuration is found, or the preset step limit is reached), the planned path is output and planning samples are saved, the sample form being (s, π, z).
Step 3: After the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training. The training objective is to minimize the loss function l = (z − v)² − πᵀ log p + c‖θ‖², and the neural network parameters are updated after training completes.
Step 4: After the parameters are updated, the Monte Carlo search is carried out again from the initial configuration. Steps 2 and 3 are repeated, continuously and iteratively searching for the optimal result. After each search completes, the planned path with the fewest steps is updated. As training improves, fewer and fewer steps are needed to complete the configuration transformation.
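For illustration only, the outer loop of steps 2-4 can be organized as in the following Python sketch; the callables run_search and train are hypothetical placeholders for the Monte Carlo search of step 2 and the network training of step 3.

```python
def self_reconfiguration_planning(initial, target, run_search, train,
                                  sample_threshold=100, max_rounds=50):
    """Alternate Monte Carlo tree search and network training, keeping
    the shortest plan found so far.

    run_search(initial, target) -> (path or None, list of (s, pi, z));
    train(samples) updates the policy-value network in place.
    """
    buffer, best_path = [], None
    for _ in range(max_rounds):
        path, samples = run_search(initial, target)   # step 2
        buffer.extend(samples)
        # Step 4: retain only the plan with the fewest steps.
        if path is not None and (best_path is None or len(path) < len(best_path)):
            best_path = path
        # Step 3: train once enough samples have accumulated.
        if len(buffer) >= sample_threshold:
            train(buffer)
            buffer.clear()
    return best_path
```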
The specific steps are as follows:
Step 1: Parameter definitions
Let N be the total number of modules constituting the spatial modular robot.
The configuration state s is defined as an N × N × N matrix. The modules are numbered from 1 to N, and the modular robot configuration is placed into the N × N × N matrix: a cell occupied by a module holds the corresponding module number, and a cell not occupied by a module is set to 0.
An executable action a is defined as detaching module i and connecting it to face k of module j. Assuming each module has M docking faces, the size of the action space is N × (N − 1) × M, which is also the output dimension of the policy neural network.
The reward function R is defined as the average of the sum of the distance differences of all modules. The position of module No. 1 is always taken as the datum point, with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. R is calculated from the Cartesian coordinates x_i′, y_i′ and z_i′ of each module i in the target configuration and the Cartesian coordinates x_i, y_i and z_i of module i in the current configuration.
The average action value Q is defined as the mean of the value evaluations v output by the policy-value network; the total action value W is defined as the sum of the value evaluations v; n is defined as the visit count of an action node, where Q = W/n.
The vector P is defined as the policy evaluation output by the policy-value network; the dimension of this vector is exactly the dimension of the action space, namely N × (N − 1) × M.
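As a minimal sketch of these definitions, the state encoding, action-space size and reward can be written as below. The Euclidean distance and the negative sign inside the reward are assumptions: the text only defines R as the average of the module distance differences, without reproducing the formula.

```python
import numpy as np

def configuration_matrix(positions, N):
    """Encode a configuration as an N x N x N matrix: a cell occupied by
    a module holds the module number (1..N), empty cells hold 0.
    positions maps module number -> integer lattice coordinate (x, y, z),
    with module No. 1 fixed at the datum point (0, 0, 0)."""
    s = np.zeros((N, N, N), dtype=np.int32)
    for module_id, (x, y, z) in positions.items():
        s[x, y, z] = module_id
    return s

def action_space_size(N, M):
    """One action a = (detach module i, attach to face k of module j)."""
    return N * (N - 1) * M

def reward(current, target, N):
    """Average distance between each module's current and target
    coordinates, negated so that a larger reward means closer to the
    target (sign convention assumed here)."""
    dists = [np.linalg.norm(np.subtract(current[i], target[i]))
             for i in range(1, N + 1)]
    return -float(np.mean(dists))
```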
Step 2: Building the Monte Carlo tree
Each node of the Monte Carlo tree stores the information that determines how an action a is selected under a configuration s. This information includes the visit count n(s, a) of action node a at configuration s, the reward function R(s), the total action value W(s, a), the average action value Q(s, a), and the prior probability P(s, a) of selecting action a at configuration s.
The process of building the tree comprises selection, expansion and evaluation, and backup. Through a continuous iterative process, the action probability distribution π under each configuration s is obtained, and the next action is executed according to the probability distribution π.
A) Selection
The selection phase starts from a configuration node and, following the established tree structure, selects the branch with the largest f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. U is given by
U(s, a) = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a))
where c_puct controls the balance between exploration and exploitation, P(s, a) is the prior probability, n(s, b) denotes the visit counts of the parent node, and n(s, a) is the visit count of action child node a under that parent.
The whole process stops when a completely unknown configuration is encountered. The selection process comprehensively considers the action value, the degree to which a node has been explored, and the node's success rate in the configuration transformation process.
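A minimal sketch of the selection rule follows; the Edge record and the exact form of U are reconstructed from the quantities named above (the AlphaZero-style bound is an assumption).

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    n: int = 0       # visit count n(s, a)
    W: float = 0.0   # total action value W(s, a)
    Q: float = 0.0   # average action value Q(s, a)
    P: float = 0.0   # prior probability P(s, a)
    R: float = 0.0   # reward R(s') of the resulting configuration

def select_action(children, c_puct=1.0):
    """children: dict mapping action -> Edge.
    Returns the action maximizing f(s) = Q + U + R."""
    parent_visits = sum(e.n for e in children.values())  # sum_b n(s, b)
    def f(e):
        u = c_puct * e.P * math.sqrt(parent_visits) / (1 + e.n)
        return e.Q + u + e.R
    return max(children, key=lambda a: f(children[a]))
```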
B) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is invoked: the configuration matrices s of the current configuration and the target configuration are fed into the network as input, the policy-value network returns the prior probabilities P and the score v of the current state, and the environment provides the reward function R(s) as the evaluation of the current configuration, where the vector P gives the probability corresponding to each executable action a under the current configuration. After all feasible actions under the current configuration and the corresponding prior probabilities are obtained, the child nodes of the unknown configuration can be expanded, and the information stored in each child node is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R.
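Continuing the Edge sketch above, the expansion step might look as follows; network and environment are hypothetical callables standing in for the policy-value network and the reward computation.

```python
def expand(node_children, state, target, feasible_actions, network, environment):
    """Query the policy-value network once for the unknown configuration,
    then create one child edge per feasible action with the initial
    statistics n = 0, W = 0, Q = 0, P = p, R = R(s')."""
    priors, v = network(state, target)        # P over all actions, score v
    for a in feasible_actions:
        node_children[a] = Edge(n=0, W=0.0, Q=0.0,
                                P=priors[a],
                                R=environment.reward_after(state, a))
    return v                                  # backed up along the path
```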
C) Backup
After the expansion and evaluation phase completes, the statistics saved in each node on the path from the leaf node to the root node are updated; these include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v)/(n + 1)
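A sketch of the backup step, applying the three update rules to every edge on the path from the root to the expanded leaf:

```python
def backup(path_edges, v):
    """path_edges: Edge objects traversed from root to leaf;
    v: value score of the leaf returned by the policy-value network."""
    for edge in path_edges:
        edge.n += 1                 # n = n + 1
        edge.W += v                 # W = W + v
        edge.Q = edge.W / edge.n    # equals (W + v)/(n + 1) after the updates
    # statistics are updated in place; nothing is returned
```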
D) Execution
After steps a) to c) are repeated m times (m is set to 400 here, and can be adjusted according to the configuration complexity and the number of modules), the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute.
Steps a) to d) are iterated until the target configuration is found or the iteration limit k is reached (k is determined by the number of modules and the configuration complexity). After the stopping criterion is met (the target configuration is found, or the iteration limit is reached), the iterative process is saved as a series of samples. Each sample is a tuple (s, π, z), where s is the description of the current configuration, π is the action probability distribution returned by MCTS under that configuration, and z is the evaluation of the episode after the planning finishes: z is 1 if the target configuration is found within the iteration limit k, and 0 if the iteration limit k is exceeded without finding the target configuration. These samples are later used for the training and parameter optimization of the policy-value network.
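Continuing the same sketch, the execution step and sample generation might be written as below; z is filled in only once the episode ends.

```python
import numpy as np

def visit_policy(children):
    """pi proportional to the visit counts of the branches."""
    actions = list(children)
    counts = np.array([children[a].n for a in actions], dtype=np.float64)
    pi = counts / counts.sum()
    return actions, pi

def choose_and_record(s, children, episode_samples):
    """Record (s, pi, z-to-be-filled) and return the most-visited action."""
    actions, pi = visit_policy(children)
    episode_samples.append([s, dict(zip(actions, pi)), None])
    return actions[int(np.argmax(pi))]

def finalize_episode(episode_samples, target_found):
    """z = 1 if the target configuration was found within the limit, else 0."""
    z = 1 if target_found else 0
    for sample in episode_samples:
        sample[2] = z
```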
Step 3: Building the policy-value network
By iterating the Monte Carlo tree search, training samples become available for optimizing the neural network; the generation process of the samples is shown in Fig. 2.
The input of the policy-value network is defined here as the current module configuration matrix and the target module configuration matrix; the output is the probability P of each possible action under the configuration (the policy part) and the state score v (the value part).
The input configuration matrices s first pass through a convolutional layer consisting of 256 convolution kernels of size 3 × 3 with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a string of residual modules (3 to 5 of them). Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the nonlinear activation function ReLU, a second convolutional layer consisting of 256 3 × 3 kernels with stride 1, and a batch normalization layer; the result is then summed with the skip-connected input signal and finally output through the nonlinear activation function ReLU.
After passing through the series of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation function, and finally a fully connected layer that outputs a vector of dimension N × (N − 1) × M, corresponding to the probabilities of all possible move actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation function, followed by a fully connected layer with a 256-dimensional output and a ReLU activation function, and finally a fully connected layer with a 1-dimensional output that yields the value evaluation of the current configuration.
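A PyTorch sketch of this architecture is given below. Since the configuration state is an N × N × N grid, 3-D convolutions are used here, which is an adaptation assumption (the text describes the kernels in the 2-D AlphaZero style); the two configuration matrices are stacked as two input channels, and the policy head returns logits to which a softmax is applied to obtain P.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.conv1 = nn.Conv3d(ch, ch, 3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm3d(ch)
        self.conv2 = nn.Conv3d(ch, ch, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm3d(ch)
        self.relu = nn.ReLU()

    def forward(self, x):
        y = self.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return self.relu(x + y)            # skip connection, then ReLU

class PolicyValueNet(nn.Module):
    def __init__(self, N, M, blocks=4, ch=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv3d(2, ch, 3, stride=1, padding=1),
            nn.BatchNorm3d(ch), nn.ReLU())
        self.res = nn.Sequential(*[ResidualBlock(ch) for _ in range(blocks)])
        # Policy head: two 1x1 kernels -> BN -> ReLU -> fully connected.
        self.policy_head = nn.Sequential(
            nn.Conv3d(ch, 2, 1), nn.BatchNorm3d(2), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * N ** 3, N * (N - 1) * M))
        # Value head: one 1x1 kernel -> BN -> ReLU -> FC(256) -> ReLU -> FC(1).
        self.value_head = nn.Sequential(
            nn.Conv3d(ch, 1, 1), nn.BatchNorm3d(1), nn.ReLU(), nn.Flatten(),
            nn.Linear(N ** 3, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, current, target):
        # current, target: (batch, N, N, N) configuration matrices.
        x = torch.stack([current, target], dim=1).float()
        x = self.res(self.stem(x))
        return self.policy_head(x), self.value_head(x)  # policy logits, v
```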
During the iteration of the Monte Carlo tree, a certain number of samples accumulate. When the number of samples reaches a given value (generally 100 or more), training of the neural network can begin. The goal of optimizing the neural network is to make the action probabilities p predicted by the network and its value feedback v for a configuration s fit π and z, respectively, from the samples (s, π, z). To this end the loss function is defined as:
l = (z − v)² − πᵀ log p + c‖θ‖²
The goal of policy-value network training is to minimize this loss function on the saved data set, where θ denotes the neural network parameters and c is a parameter controlling the degree of regularization.
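A sketch of one training step under this objective follows; implementing the c‖θ‖² term through the optimizer's weight_decay is an equivalent, commonly used choice, not something prescribed by the text.

```python
import torch
import torch.nn.functional as F

def loss_fn(policy_logits, v, pi_target, z_target):
    """l = (z - v)^2 - pi^T log p; the L2 term is handled by weight_decay."""
    value_loss = F.mse_loss(v.squeeze(-1), z_target)
    log_p = F.log_softmax(policy_logits, dim=1)
    policy_loss = -(pi_target * log_p).sum(dim=1).mean()
    return value_loss + policy_loss

def train_step(net, optimizer, batch):
    s_cur, s_tgt, pi, z = batch               # tensors built from (s, pi, z)
    policy_logits, v = net(s_cur, s_tgt)
    loss = loss_fn(policy_logits, v, pi, z)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

# Example: optimizer = torch.optim.Adam(net.parameters(), weight_decay=1e-4)
```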
Step 4: Saving the shortest-path search result
After one complete Monte Carlo search is carried out (i.e., the target configuration is reached), the planned path from the initial configuration to the target configuration and the total number of steps are saved. During the iterative process, only the search result with the fewest steps is retained. Clearly, as the number of iterations increases, the total number of steps decreases and the search result improves.
Claims (2)
1. A heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm, wherein the modules of the robot are of identical specification and each module has at least two faces that can be used for docking, characterized by the following steps:
Step 1: Given an initial modular robot configuration and a target configuration with a total of N modules, convert the two configurations into N × N × N matrices respectively, and initialize the neural network parameters:
The configuration state s is defined as an N × N × N matrix. The modules are numbered from 1 to N, and the modular robot configuration is placed into the N × N × N matrix: a cell occupied by a module holds the corresponding module number, and a cell not occupied by a module is set to 0;
An executable action a is defined as detaching module i and connecting it to face k of module j. Assuming each module has M docking faces, the size of the action space is N × (N − 1) × M, which is also the output dimension of the policy neural network;
The reward function R is defined as the average of the sum of the distance differences of all modules. The position of module No. 1 is always taken as the datum point, with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. R is calculated from the Cartesian coordinates x_i′, y_i′ and z_i′ of each module i in the target configuration and the Cartesian coordinates x_i, y_i and z_i of module i in the current configuration;
The average action value Q is defined as the mean of the value evaluations v output by the policy-value network; the total action value W is defined as the sum of the value evaluations v; n is defined as the visit count of an action node, where Q = W/n;
The vector P is defined as the policy evaluation output by the policy-value network, i.e., the prior probabilities corresponding to the executable actions a under a given configuration; the dimension of P is exactly the dimension of the action space, namely N × (N − 1) × M;
Step 2: Take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is invoked: the policy-value network provides the prior probabilities P under the initial configuration and the score v of the current state, and the environment provides the reward function R(s). The child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion, the backup step of the Monte Carlo tree is carried out, updating the statistics saved in each node on the path from the leaf node to the root node. With the root node expanded, the selection phase can be carried out in the Monte Carlo tree, selecting the branch with the highest score as the next action. Selection, expansion and evaluation, and backup are then repeated a certain number of times; afterwards, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute. After each search, once the target configuration is found or the preset step limit is reached, the planned path is output and planning samples are saved, the sample form being (s, π, z);
The Monte Carlo tree search comprises selection, expansion and evaluation, and backup;
A) Selection
The selection phase starts from a configuration node and, following the established tree structure, selects the branch with the largest f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. U is given by
U(s, a) = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a))
where c_puct controls the balance between exploration and exploitation, n(s, b) denotes the visit counts of the parent node, and n(s, a) is the visit count of action child node a under that parent;
The whole process stops when a completely unknown configuration is encountered;
B) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is invoked: the configuration matrices s of the current configuration and the target configuration are fed into the network as input, the policy-value network returns the prior probabilities P and the score v of the current state, and the environment provides the reward function R(s) as the evaluation of the current configuration. After all feasible actions under the current configuration and the corresponding prior probabilities are obtained, the child nodes of the unknown configuration can be expanded, and the information stored in each child node is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R;
C) Backup
After the expansion and evaluation phase completes, the statistics saved in each node on the path from the leaf node to the root node are updated; the statistics include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v)/(n + 1)
D) Execution
After a) to c) are repeated a certain number of times, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action with the highest visit probability is selected as the next action to execute;
Step 3: After the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training. The training objective is to minimize the loss function l = (z − v)² − πᵀ log p + c‖θ‖², and the neural network parameters are updated after training completes;
The input of the policy-value network is the current module configuration matrix and the target module configuration matrix; the output is the probability P of each possible action under this configuration and the state score v;
The input current module configuration matrix and target module configuration matrix first pass through a convolutional layer consisting of 256 convolution kernels of size 3 × 3 with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a string of residual modules. Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the nonlinear activation function ReLU, a second convolutional layer consisting of 256 3 × 3 kernels with stride 1, and a batch normalization layer; the result is then summed with the skip-connected input signal and finally output through the nonlinear activation function ReLU;
After passing through the string of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation function, and finally a fully connected layer that outputs a vector of dimension N × (N − 1) × M, corresponding to the probabilities of all possible move actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation function, followed by a fully connected layer with a 256-dimensional output and a ReLU activation function, and finally a fully connected layer with a 1-dimensional output that yields the value evaluation of the current configuration;
Step 4: After the parameters are updated, the Monte Carlo search is carried out again from the initial configuration. Steps 2-3 are repeated, continuously and iteratively searching for the optimal result; after each search completes, the planned path with the fewest steps is updated.
2. The heterogeneous modular robot self-reconfiguration planning method based on a reinforcement learning algorithm according to claim 1, characterized in that the number of residual modules in the string in step 3 is 3 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910523043.0A CN110297490B (en) | 2019-06-17 | 2019-06-17 | Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910523043.0A CN110297490B (en) | 2019-06-17 | 2019-06-17 | Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110297490A true CN110297490A (en) | 2019-10-01 |
CN110297490B CN110297490B (en) | 2022-06-07 |
Family
ID=68028152
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910523043.0A Active CN110297490B (en) | 2019-06-17 | 2019-06-17 | Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110297490B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909146A (en) * | 2019-11-29 | 2020-03-24 | 支付宝(杭州)信息技术有限公司 | Label pushing model training method, device and equipment for pushing question-back labels |
CN111104732A (en) * | 2019-12-03 | 2020-05-05 | 中国人民解放军国防科技大学 | Intelligent planning method for mobile communication network based on deep reinforcement learning |
CN111230875A (en) * | 2020-02-06 | 2020-06-05 | 北京凡川智能机器人科技有限公司 | Double-arm robot humanoid operation planning method based on deep learning |
CN111679679A (en) * | 2020-07-06 | 2020-09-18 | 哈尔滨工业大学 | Robot state planning method based on Monte Carlo tree search algorithm |
CN112264999A (en) * | 2020-10-28 | 2021-01-26 | 复旦大学 | Method, device and storage medium for intelligent agent continuous space action planning |
CN113704098A (en) * | 2021-08-18 | 2021-11-26 | 武汉大学 | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling |
CN114020024A (en) * | 2021-11-05 | 2022-02-08 | 南京理工大学 | Unmanned aerial vehicle path planning method based on Monte Carlo tree search |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106503373A (en) * | 2016-11-04 | 2017-03-15 | 湘潭大学 | The method for planning track that a kind of Dual-robot coordination based on B-spline curves is assembled |
CN106931970A (en) * | 2015-12-30 | 2017-07-07 | 北京雷动云合智能技术有限公司 | Robot security's contexture by self air navigation aid in a kind of dynamic environment |
CN107161357A (en) * | 2017-04-27 | 2017-09-15 | 西北工业大学 | A kind of via Self-reconfiguration Method of restructural spacecraft |
CN107471206A (en) * | 2017-08-16 | 2017-12-15 | 大连交通大学 | A kind of modularization industrial robot reconfiguration system and its control method |
CN107591844A (en) * | 2017-09-22 | 2018-01-16 | 东南大学 | Consider the probabilistic active distribution network robust reconstructing method of node injecting power |
WO2018154153A2 (en) * | 2017-11-27 | 2018-08-30 | Erle Robotics, S.L. | Method for designing modular robots |
CN109871943A (en) * | 2019-02-20 | 2019-06-11 | 华南理工大学 | A kind of depth enhancing learning method for big two three-wheel arrangement of pineapple playing card |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106931970A (en) * | 2015-12-30 | 2017-07-07 | 北京雷动云合智能技术有限公司 | Robot security's contexture by self air navigation aid in a kind of dynamic environment |
CN106503373A (en) * | 2016-11-04 | 2017-03-15 | 湘潭大学 | The method for planning track that a kind of Dual-robot coordination based on B-spline curves is assembled |
CN107161357A (en) * | 2017-04-27 | 2017-09-15 | 西北工业大学 | A kind of via Self-reconfiguration Method of restructural spacecraft |
CN107471206A (en) * | 2017-08-16 | 2017-12-15 | 大连交通大学 | A kind of modularization industrial robot reconfiguration system and its control method |
CN107591844A (en) * | 2017-09-22 | 2018-01-16 | 东南大学 | Consider the probabilistic active distribution network robust reconstructing method of node injecting power |
WO2018154153A2 (en) * | 2017-11-27 | 2018-08-30 | Erle Robotics, S.L. | Method for designing modular robots |
CN109871943A (en) * | 2019-02-20 | 2019-06-11 | 华南理工大学 | A kind of depth enhancing learning method for big two three-wheel arrangement of pineapple playing card |
Non-Patent Citations (5)
Title |
---|
FEILI HOU: "Graph-based optimal reconfiguration planning for self-reconfigurable robots", Robotics and Autonomous Systems *
YIFEI ZHANG: "Reconfiguration Planning for Heterogeneous Cellular", Proceedings of the 2017 18th *
YUAN DANDAN: "Workspace analysis of modular robots based on the Monte Carlo method", Machine Tool & Hydraulics *
FEI YANQIONG: "Structure of self-reconfigurable modular robots", Journal of Shanghai Jiao Tong University *
HUANG PANFENG: "Attitude takeover control of spacecraft with unknown parameters", Control and Decision *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110909146B (en) * | 2019-11-29 | 2022-09-09 | 支付宝(杭州)信息技术有限公司 | Label pushing model training method, device and equipment for pushing question-back labels |
CN110909146A (en) * | 2019-11-29 | 2020-03-24 | 支付宝(杭州)信息技术有限公司 | Label pushing model training method, device and equipment for pushing question-back labels |
CN111104732A (en) * | 2019-12-03 | 2020-05-05 | 中国人民解放军国防科技大学 | Intelligent planning method for mobile communication network based on deep reinforcement learning |
CN111104732B (en) * | 2019-12-03 | 2022-09-13 | 中国人民解放军国防科技大学 | Intelligent planning method for mobile communication network based on deep reinforcement learning |
CN111230875A (en) * | 2020-02-06 | 2020-06-05 | 北京凡川智能机器人科技有限公司 | Double-arm robot humanoid operation planning method based on deep learning |
CN111230875B (en) * | 2020-02-06 | 2023-05-12 | 北京凡川智能机器人科技有限公司 | Double-arm robot humanoid operation planning method based on deep learning |
CN111679679A (en) * | 2020-07-06 | 2020-09-18 | 哈尔滨工业大学 | Robot state planning method based on Monte Carlo tree search algorithm |
WO2022007199A1 (en) * | 2020-07-06 | 2022-01-13 | 哈尔滨工业大学 | Robot state planning method based on monte carlo tree search algorithm |
CN112264999A (en) * | 2020-10-28 | 2021-01-26 | 复旦大学 | Method, device and storage medium for intelligent agent continuous space action planning |
CN113704098A (en) * | 2021-08-18 | 2021-11-26 | 武汉大学 | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling |
CN113704098B (en) * | 2021-08-18 | 2023-09-22 | 武汉大学 | Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling |
CN114020024A (en) * | 2021-11-05 | 2022-02-08 | 南京理工大学 | Unmanned aerial vehicle path planning method based on Monte Carlo tree search |
CN114020024B (en) * | 2021-11-05 | 2023-03-31 | 南京理工大学 | Unmanned aerial vehicle path planning method based on Monte Carlo tree search |
Also Published As
Publication number | Publication date |
---|---|
CN110297490B (en) | 2022-06-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110297490A (en) | Heterogeneous modular robot self-reconfiguration planning method based on reinforcement learning algorithm | |
CN108573303A (en) | Complex network local-failure self-recovery strategy based on improved reinforcement learning | |
CN109241291A (en) | Knowledge graph optimal path query system and method based on deep reinforcement learning | |
WO2020029583A1 (en) | Multiplication and addition calculation method and calculation circuit suitable for neural network | |
CN105467997A (en) | Warehouse robot path planning method based on linear temporal logic theory | |
CN105509749A (en) | Mobile robot path planning method and system based on genetic ant colony algorithm | |
CN109409510A (en) | Neuron circuit, chip, system and method, storage medium | |
CN108921298A (en) | Multi-agent communication and decision-making method based on reinforcement learning | |
CN110188880A (en) | Quantization method and device for a deep neural network | |
CN110883776A (en) | Robot path planning algorithm based on improved DQN under a fast search mechanism | |
CN105978732A (en) | Method and system for optimizing parameters of minimum complexity echo state network based on particle swarm | |
CN103646008A (en) | Web service composition method | |
CN104050505A (en) | Multilayer-perceptron training method based on bee colony algorithm with learning factor | |
CN108536144A (en) | Path planning method fusing a dense convolutional network and a dueling architecture | |
CN111159489A (en) | Searching method | |
CN113807040A (en) | Optimal design method for microwave circuit | |
Du et al. | Application of an improved whale optimization algorithm in time-optimal trajectory planning for manipulators | |
CN116841303A (en) | Intelligent preferential high-order iterative self-learning control method for underwater robot | |
CN107273970B (en) | Reconfigurable platform of convolutional neural network supporting online learning and construction method thereof | |
CN115327926A (en) | Multi-agent dynamic coverage control method and system based on deep reinforcement learning | |
CN115271254A (en) | Short-term wind power prediction method for optimizing extreme learning machine based on gull algorithm | |
CN114564039A (en) | Flight path planning method based on deep Q network and fast search random tree algorithm | |
CN112001558A (en) | Method and device for researching optimal operation mode of power distribution network equipment | |
Verma et al. | A novel evolutionary neural learning algorithm | |
Han et al. | An improved ant colony optimization algorithm based on dynamic control of solution construction and mergence of local search solutions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |