CN110297490A - Self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm - Google Patents

Self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm

Info

Publication number
CN110297490A
CN110297490A
Authority
CN
China
Prior art keywords
configuration
module
value
node
search
Prior art date
Legal status
Granted
Application number
CN201910523043.0A
Other languages
Chinese (zh)
Other versions
CN110297490B (en)
Inventor
张夷斋
王文卉
黄攀峰
孟中杰
常海涛
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201910523043.0A priority Critical patent/CN110297490B/en
Publication of CN110297490A publication Critical patent/CN110297490A/en
Application granted granted Critical
Publication of CN110297490B publication Critical patent/CN110297490B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The present invention relates to a self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm. First, an initial modular-robot configuration and a target configuration are given, the total number of modules N is input, and the graph structure of the modules is established through an initialization procedure. Taking the initial configuration as the root node, a Monte Carlo tree search is built and is stopped when a termination condition is reached (the target configuration is found or a preset number of explorations has been carried out). After each search a planned path is output and the samples are saved. Once the number of samples reaches a given value, the samples are fed into a neural network for training and the training parameters are updated. After the parameters are updated, Monte Carlo tree search is carried out again; the average number of steps of the new search results should be smaller. After each search is completed, the plan with the fewest steps is updated.

Description

Self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm
Technical field
The invention belongs to the field of artificial intelligence planning, and in particular relates to using a reinforcement learning algorithm to optimize the planning of a modular robot during its autonomous deformation process.
Background art
As robots are widely used in various fields, they are increasingly employed to complete tasks in unstructured environments. For unknown working environments and different tasks, a robot needs to adapt to the environment by changing its configuration. A robot whose configuration can be changed dynamically and autonomously to suit the needs of environment and task is called a self-reconfigurable modular robot. A self-reconfigurable modular robot is composed of a series of structurally simple unit modules with various functions. To achieve better task and environmental adaptability, when the multi-module system is operating, each module unit realizes configuration changes through connection and disconnection actions, autonomously meeting the needs of the environment and the task. The difference from an ordinary robot is that it is no longer restricted to a fixed configuration and can change its configuration autonomously; it therefore has significant advantages for operation in unknown environments, for example in disaster rescue, nuclear power plant maintenance, and space exploration.
The deformation process of a self-reconfigurable robot composed of multiple modules is shown in Fig. 1. Although a modular robot can change its structure according to the environment and the task in order to reach an optimal configuration, motion planning must be carried out for every participating module during reconfiguration, and the way of transforming from the current configuration to the target configuration is not unique. In theory, the more modules a configuration contains, the more solutions there are for reaching the target configuration. How to find an optimal reconfiguration solution, in terms of reducing the number of participating modules, reducing the number of module-carrying steps, and shortening the reconfiguration time, has therefore become a key problem in the field of self-reconfiguration research.
General self-reconfiguration algorithms are aimed at modular robots with relatively simple configurations, and their emphasis is on the deformation during the motion process. For robots with complex configurations and many modules, existing algorithms introduce intermediate configurations as transitions in order to guarantee that a solution exists from the initial configuration to the target configuration, which leads to an excessive number of module movements and low efficiency. How to eliminate intermediate configurations and realize a fast transition of a spatial modular robot directly from the initial configuration to the target configuration is a problem that urgently needs to be solved.
Intelligent algorithms have flourished in recent years, and deep reinforcement learning algorithms in particular have been extensively studied and applied in fields such as machine planning and control. The celebrated successes of AlphaGo and AlphaZero have pushed the capability of deep reinforcement learning in task planning to the extreme.
Summary of the invention
Technical problems to be solved
In order to avoid the shortcomings of the prior art and improve self-reconfiguration efficiency, the present invention proposes a self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm.
Technical solution
A self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm, in which the modules of the robot have identical specifications and each module has at least two faces that can be used for docking, characterized by the following steps:
Step 1: given an initial modular-robot configuration and a target configuration, both with N modules, convert the two configurations into N × N × N matrices and initialize the neural network parameters:
Define the configuration state s as an N × N × N matrix: number the modules from 1 to N, then place the modular-robot configuration into the N × N × N matrix, where each position occupied by a module holds the corresponding module number and positions without a module are set to 0;
Define an executable action a as detaching module i and connecting it to the k-th face of module j; assuming each module has M docking faces, the size of the action space is N × (N-1) × M, which is also the output dimension of the policy neural network;
Define the reward function R as the average of the sum of the distance differences of all modules; the position of module No. 1 is always taken as the reference point with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. The calculation formula of the reward function R is as follows:
where xᵢ′, yᵢ′ and zᵢ′ are the Cartesian coordinates of module i in the target configuration, and xᵢ, yᵢ and zᵢ are the Cartesian coordinates of module i in the current configuration;
Define the average action value Q as the mean of the value estimates v output by the policy-value network, define the total action value W as the sum of the value estimates v, and define n as the visit count of the action node, so that Q = W/n;
Define the vector P as the policy estimate output by the policy-value network, i.e. the prior probabilities corresponding to each executable action a under a given configuration; the dimension of the vector P is the dimension of the action space, namely N × (N-1) × M;
Step 2: take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is called: the policy-value network returns the prior probabilities P and the value score v of the current state under the initial configuration, and the environment returns the reward function R(s); the child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion is completed, the backup step of the Monte Carlo tree is carried out, updating the statistics stored in each node on the path from the leaf node to the root node. Since the root node has now been expanded, the selection phase of the Monte Carlo tree can be performed, choosing the branch with the highest overall score as the next action. The three steps of selection, expansion-and-evaluation, and backup are then repeated a certain number of times. The visit probability of each configuration is then obtained from the visit counts of the branches under the current configuration, and the action of the most-visited node is selected as the action to execute next. After each search, when the target configuration is found or the preset step limit is reached, a planned path is output and planning samples of the form (s, π, z) are saved;
The Monte Carlo tree search comprises selection, expansion-and-evaluation, and backup;
a) Selection
The selection phase starts from a configuration node and, following the tree structure already built, selects the branch with the largest value of f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. The specific formula for U is as follows:
where c_puct controls the balance between exploration and exploitation, n(s, b) denotes the visit count of the parent node, and n(s, a) denotes the visit count of the child node of action a under that parent node;
The whole process stops when a completely unknown configuration is encountered;
b) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is called with the configuration matrices s of the current configuration and the target configuration as input; the policy-value network returns the prior probabilities P and the value score v of the current state, and the environment also provides the reward function R(s) as an evaluation of the current configuration. Once all feasible actions under the current configuration and their corresponding prior probabilities have been obtained, the child nodes of the unknown configuration can be expanded, and the information stored in the child nodes is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R;
c) Backup
When the expansion and evaluation phase is completed, the statistics stored in each node on the path from the leaf node to the root node are updated; the statistics include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v) / (n + 1)
d) Execution
After steps a) to c) have been repeated a certain number of times, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action of the most-visited node is selected as the action to execute next;
Step 3: once the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training; the training objective is to minimize the loss function l = (z − v)² − π^T log p + c||θ||², and the neural network parameters are updated after training is completed;
The input of the policy-value network is the current module-configuration matrix and the target module-configuration matrix, and its output is the probability P of each possible action under that configuration and the state value score v;
The input current and target module-configuration matrices first pass through a convolutional layer consisting of 256 3 × 3 convolution kernels with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a series of residual modules; inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the ReLU activation, another convolutional layer of 256 3 × 3 kernels with stride 1 and a batch normalization layer, is then added to the skip-connected input signal, and is finally output through the ReLU activation;
After the series of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation, and finally a fully connected layer that outputs a vector of dimension N × (N-1) × M, corresponding to the probabilities of all possible movement actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation, followed by a fully connected layer with 256 outputs and a ReLU activation, and finally a fully connected layer with a single output giving the value estimate of the current configuration;
Step 4: after the parameters are updated, Monte Carlo tree search is carried out again from the initial configuration; steps 2-3 are repeated to iteratively search for the optimal result; after each search is completed, the planned path with the fewest steps is updated.
The number of residual modules in step 3 is 3 to 5.
Beneficial effects
The invention proposes an algorithm that uses reinforcement learning to solve the self-reconfiguration planning problem of modular robots. Given the initial configuration and the target configuration, an efficient module-carrying plan is obtained, and self-reconfiguration efficiency is significantly improved.
Description of the drawings
Fig. 1 Schematic diagram of self-reconfiguration (the figure shows the self-reconfiguration transformation process of a two-dimensional configuration; the actual algorithm can fully solve the self-reconfiguration problem of three-dimensional configurations)
Fig. 2 Monte Carlo algorithm process (part a represents the selection phase, part b the expansion and evaluation phase, part c the backup phase, and part d the execution phase)
Fig. 3 overall algorithm block diagram
Detailed description of the embodiments
The invention is further described below in conjunction with the embodiments and the drawings:
The purpose of the invention is, for a modular robot composed of n (n > 100) modules, to use a reinforcement learning algorithm to obtain the module disassembly and assembly order that takes the self-reconfigurable modular robot from an arbitrary initial configuration to a specified target configuration with as few assembly operations as possible, thereby improving self-reconfiguration efficiency.
The algorithm is mainly applied to robots composed of modules of identical specification, where each module has at least two faces that can be used for docking.
To achieve the above goals, the technical solution adopted by the invention comprises the following steps:
Step 1: given an initial modular-robot configuration and a target configuration, both with N modules, convert the two configurations into N × N × N matrices and initialize the neural network parameters.
Step 2: take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is called: the policy-value network returns the prior probabilities P and the value score v of the current state under the initial configuration, and the environment returns the reward function R(s); the child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion is completed, the backup step of the Monte Carlo tree is carried out, updating the statistics stored in each node on the path from the leaf node to the root node. Since the root node has now been expanded, the selection phase of the Monte Carlo tree can be performed, choosing the branch with the highest overall score as the next action. The three steps of selection, expansion-and-evaluation, and backup are then repeated a certain number of times. The visit probability of each configuration is then obtained from the visit counts of the branches under the current configuration, and the action of the most-visited node is selected as the action to execute next. After each search (when the target configuration is found or the preset step limit is reached), a planned path is output and planning samples of the form (s, π, z) are saved.
Step 3: once the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training; the training objective is to minimize the loss function l = (z − v)² − π^T log p + c||θ||², and the neural network parameters are updated after training is completed.
Step 4: after the parameters are updated, Monte Carlo tree search is carried out again from the initial configuration. Steps 2 and 3 are repeated to iteratively search for the optimal result. After each search is completed, the planned path with the fewest steps is updated. As training is continuously optimized, the number of steps needed to complete the configuration conversion becomes smaller.
The specific steps are as follows:
Step 1: parameter definitions
Define N as the total number of modules constituting the spatial modular robot.
Define the configuration state s as an N × N × N matrix: number the modules from 1 to N, then place the modular-robot configuration into the N × N × N matrix, where each position occupied by a module holds the corresponding module number and positions without a module are set to 0.
Define an executable action a as detaching module i and connecting it to the k-th face of module j; assuming each module has M docking faces, the size of the action space is N × (N-1) × M, which is also the output dimension of the policy neural network. A small sketch of this encoding follows.
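To make the state and action encoding above concrete, the following is a minimal Python sketch. The helper names encode_configuration and decode_action, and the coordinate-dictionary input format, are illustrative assumptions and not part of the patent.

```python
import numpy as np

def encode_configuration(module_coords, N):
    """Build the N x N x N state matrix s: the cell at grid position (x, y, z)
    holds the number (1..N) of the module occupying it, and 0 where empty.
    module_coords: dict mapping module number -> (x, y, z) grid coordinates."""
    s = np.zeros((N, N, N), dtype=np.int32)
    for module_id, (x, y, z) in module_coords.items():
        s[x, y, z] = module_id
    return s

def decode_action(a, N, M):
    """Map a flat action index a in [0, N*(N-1)*M) to the triple
    (detached module i, destination module j, destination face k)."""
    i, rest = divmod(a, (N - 1) * M)
    j, k = divmod(rest, M)
    if j >= i:          # skip index i itself so j ranges over the other N-1 modules
        j += 1
    return i + 1, j + 1, k   # module numbers are 1-based in the patent
```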
Define the reward function R as the average of the sum of the distance differences of all modules; the position of module No. 1 is always taken as the reference point with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. The specific calculation formula of R is as follows:
where xᵢ′, yᵢ′ and zᵢ′ are the Cartesian coordinates of module i in the target configuration, and xᵢ, yᵢ and zᵢ are the Cartesian coordinates of module i in the current configuration.
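The reward formula itself appears in the original as an image and is not reproduced in this text, so the following Python sketch implements only one plausible reading of the definition above: the negative mean of the per-module distance between current and target positions, with module No. 1 fixed at the origin. The sign convention and the use of Euclidean distance are assumptions.

```python
import numpy as np

def reward(current_coords, target_coords):
    """Plausible reading of the reward R described above (the patent's formula
    image is not reproduced here): the negative mean, over all modules, of the
    distance between each module's current and target positions, with module
    No. 1 fixed at (0, 0, 0) as the reference point.
    current_coords / target_coords: arrays of shape (N, 3); row i-1 holds the
    coordinates of module i relative to module No. 1."""
    current = np.asarray(current_coords, dtype=float)
    target = np.asarray(target_coords, dtype=float)
    dists = np.linalg.norm(current - target, axis=1)   # per-module distance difference
    return -dists.mean()                               # higher (less negative) is better
```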
Define the average action value Q as the mean of the value estimates v output by the policy-value network, define the total action value W as the sum of the value estimates v, and define n as the visit count of the action node, so that Q = W/n.
Define the vector P as the policy estimate output by the policy-value network; the dimension of the vector is the dimension of the action space, namely N × (N-1) × M.
Step 2: building the Monte Carlo tree
Each node of the Monte Carlo tree stores the information used to decide how to select an action a under a configuration s. This information includes the visit count n(s, a) of action node a at configuration s, the reward function R(s), the total action value W(s, a), the average action value Q(s, a), and the prior probability P(s, a) of selecting action a at configuration s.
The tree-building process comprises selection, expansion-and-evaluation, and backup. Through continuous iteration, the action probability distribution π under each configuration s is obtained, and the next action is executed according to the probability distribution π.
a) Selection
The selection phase starts from a configuration node and, following the tree structure already built, selects the branch with the largest value of f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. The specific formula for U is as follows:
where c_puct controls the balance between exploration and exploitation, P(s, a) is the prior probability, n(s, b) denotes the visit count of the parent node, and n(s, a) denotes the visit count of the child node of action a under that parent node.
The whole process stops when a completely unknown configuration is encountered. The selection process comprehensively considers the action value, the degree to which the node has been explored, and the node's success rate in the configuration conversion process.
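The specific formula for U is likewise given as an image in the original. The sketch below therefore assumes the standard AlphaZero-style PUCT bound U = c_puct · P(s, a) · √(Σ_b n(s, b)) / (1 + n(s, a)), which is consistent with the quantities defined here but is an assumption; the node data structure (a children dict with n, P, Q, R attributes) is also illustrative.

```python
import math

def select_child(node, c_puct):
    """Selection step f(s) = Q + U + R. The exact U formula image is not
    reproduced in the patent text; this sketch assumes the AlphaZero-style
    bound U = c_puct * P(s,a) * sqrt(sum_b n(s,b)) / (1 + n(s,a))."""
    total_visits = sum(child.n for child in node.children.values())
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        u = c_puct * child.P * math.sqrt(total_visits) / (1 + child.n)
        score = child.Q + u + child.R        # f(s) = Q + U + R
        if score > best_score:
            best_action, best_score = action, score
    return best_action
```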
b) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is called with the configuration matrices s of the current configuration and the target configuration as input; the policy-value network returns the prior probabilities P and the value score v of the current state, and the environment also provides the reward function R(s) as an evaluation of the current configuration, where the vector P contains the probabilities corresponding to each executable action a under the current configuration. Once all feasible actions under the current configuration and their corresponding prior probabilities have been obtained, the child nodes of the unknown configuration can be expanded, and the information stored in the child nodes is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R.
c) Backup
When the expansion and evaluation phase is completed, the statistics stored in each node on the path from the leaf node to the root node are updated; they specifically include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v) / (n + 1)
d) Execution
After steps a) to c) have been repeated m times (m is set to 400 here and can be adjusted according to the configuration complexity and the number of modules), the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action of the most-visited node is selected as the action to execute next.
Steps a) to d) are iterated until the target configuration is found or the iteration limit k is reached (k is determined by the number of modules and the configuration complexity). When the termination condition is met (the target configuration is found or the iteration limit is reached), the iteration process is saved as a series of samples, each being a tuple (s, π, z), where s is the description of the current configuration, π is the action probability distribution returned by MCTS under that configuration, and z is the evaluation of the planning episode after it finishes: z is 1 if the target configuration is found within the iteration limit k, and 0 if the target configuration is not found within the iteration limit k. These samples are later used for the training and parameter optimization of the policy-value network. A sketch of this step is given below.
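A small Python sketch of the execution step and the sample collection described above; the trajectory format and helper names are illustrative assumptions, not part of the patent.

```python
def visit_count_policy(root):
    """Execution step: turn the visit counts of the root's children into the
    action probability distribution pi, then act greedily on it."""
    total = sum(child.n for child in root.children.values())
    pi = {a: child.n / total for a, child in root.children.items()}
    best_action = max(pi, key=pi.get)
    return pi, best_action

def collect_samples(trajectory, reached_target):
    """Turn one planning episode into training samples (s, pi, z):
    z = 1 if the target configuration was reached within the iteration
    limit k, otherwise z = 0.  trajectory is a list of (s, pi) pairs."""
    z = 1 if reached_target else 0
    return [(s, pi, z) for (s, pi) in trajectory]
```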
Step 3: building the policy-value network
By iterating the Monte Carlo tree search, training samples are obtained for optimizing the neural network; the sample generation process is shown in Fig. 2.
Here the input of the policy-value network is defined as the current module-configuration matrix and the target module-configuration matrix, and the output is the probability P of each possible action under that configuration (the policy part) and the state value score v (the value part).
The input configuration matrix s first passes through a convolutional layer consisting of 256 3 × 3 convolution kernels with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a series of residual modules (3 to 5 of them). Inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the ReLU activation, another convolutional layer of 256 3 × 3 kernels with stride 1 and a batch normalization layer, is then added to the skip-connected input signal, and is finally output through the ReLU activation.
After the series of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation, and finally a fully connected layer that outputs a vector of dimension N × (N-1) × M, corresponding to the probabilities of all possible movement actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation, followed by a fully connected layer with 256 outputs and a ReLU activation, and finally a fully connected layer with a single output giving the value estimate of the current configuration.
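A condensed PyTorch-style sketch of the network described above. The description does not state how the current and target configuration matrices are combined, so stacking them as two input channels of a 3-D convolution, and the tanh squashing of the value output, are assumptions made here; the layer sizes follow the text (256 3 × 3 kernels with stride 1, 3 to 5 residual modules, a two-kernel policy head and a one-kernel value head).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, 3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm3d(channels)
        self.conv2 = nn.Conv3d(channels, channels, 3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm3d(channels)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = self.bn2(self.conv2(y))
        return F.relu(x + y)                       # skip connection, then ReLU

class PolicyValueNet(nn.Module):
    def __init__(self, N, M, n_res_blocks=4):      # 3-5 residual modules per the text
        super().__init__()
        # current and target N x N x N configuration matrices stacked as 2 channels
        self.stem = nn.Sequential(
            nn.Conv3d(2, 256, 3, stride=1, padding=1),
            nn.BatchNorm3d(256), nn.ReLU())
        self.res_blocks = nn.Sequential(*[ResidualBlock() for _ in range(n_res_blocks)])
        # policy head: 2 kernels of size 1 -> FC -> N*(N-1)*M action probabilities
        self.policy_conv = nn.Sequential(
            nn.Conv3d(256, 2, 1), nn.BatchNorm3d(2), nn.ReLU())
        self.policy_fc = nn.Linear(2 * N * N * N, N * (N - 1) * M)
        # value head: 1 kernel of size 1 -> FC(256) -> FC(1)
        self.value_conv = nn.Sequential(
            nn.Conv3d(256, 1, 1), nn.BatchNorm3d(1), nn.ReLU())
        self.value_fc = nn.Sequential(
            nn.Linear(N * N * N, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, s_current, s_target):
        x = torch.stack([s_current, s_target], dim=1).float()   # (batch, 2, N, N, N)
        x = self.res_blocks(self.stem(x))
        p = F.softmax(self.policy_fc(self.policy_conv(x).flatten(1)), dim=1)
        v = torch.tanh(self.value_fc(self.value_conv(x).flatten(1)))  # tanh is an assumption
        return p, v
```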
A batch of samples is accumulated while iterating the Monte Carlo tree search; once the number of samples reaches a given value (generally 100 or more), training of the neural network can begin. The goal of optimizing the neural network is to make the action probabilities p predicted by the network and the value v of configuration s fit the π and z in the samples (s, π, z), respectively. For this purpose the loss function is defined as:
l = (z − v)² − π^T log p + c||θ||²
The goal of policy-value network training is to minimize the above loss function on the saved data set, where θ denotes the neural network parameters and c is a parameter controlling the degree of regularization.
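A sketch of one training step minimizing the loss above with the network sketched earlier; realizing the regularization term c||θ||² through the optimizer's weight decay, and the choice of Adam, are implementation assumptions.

```python
import torch

def train_step(net, optimizer, batch):
    """One gradient step on l = (z - v)^2 - pi^T log p + c * ||theta||^2.
    `batch` holds tensors s_current, s_target, pi (target action distribution)
    and z (episode outcome). The L2 term is handled via the optimizer's
    weight_decay, which plays the role of the regularization parameter c."""
    s_cur, s_tgt, pi, z = batch
    p, v = net(s_cur, s_tgt)
    value_loss = torch.mean((z - v.squeeze(-1)) ** 2)
    policy_loss = -torch.mean(torch.sum(pi * torch.log(p + 1e-8), dim=1))
    loss = value_loss + policy_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch:
# optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)
```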
Step 4: saving the shortest-path search result
After a complete Monte Carlo search has been carried out (i.e. the target configuration is reached), the planned path from the initial configuration to the target configuration and the total number of steps are saved. During iteration, only the search result with the fewest steps is retained. Clearly, as the number of iterations increases, the total number of steps decreases and the search result improves.
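Putting the pieces together, the outer loop over steps 2-4 could look like the following sketch; run_mcts_episode and make_batches are placeholders for the search and batching logic described earlier, and the function names are illustrative.

```python
def self_reconfiguration_planning(initial_config, target_config, net, optimizer,
                                  n_rounds, samples_per_update):
    """Outer loop of the method: repeatedly run MCTS from the initial
    configuration, save samples, train the policy-value network once enough
    samples are collected, and keep only the plan with the fewest steps."""
    best_path, best_steps = None, float("inf")
    replay_buffer = []
    for _ in range(n_rounds):
        path, samples, reached_target = run_mcts_episode(   # step 2 (see sketches above)
            initial_config, target_config, net)
        replay_buffer.extend(samples)
        if reached_target and len(path) < best_steps:       # step 4: keep shortest plan
            best_path, best_steps = path, len(path)
        if len(replay_buffer) >= samples_per_update:        # step 3: update parameters
            for batch in make_batches(replay_buffer):
                train_step(net, optimizer, batch)
            replay_buffer.clear()
    return best_path, best_steps
```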

Claims (2)

1. A self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm, wherein the modules of the robot have identical specifications and each module has at least two faces that can be used for docking, characterized by the following steps:
Step 1: given an initial modular-robot configuration and a target configuration, both with N modules, convert the two configurations into N × N × N matrices and initialize the neural network parameters:
Define the configuration state s as an N × N × N matrix: number the modules from 1 to N, then place the modular-robot configuration into the N × N × N matrix, where each position occupied by a module holds the corresponding module number and positions without a module are set to 0;
Define an executable action a as detaching module i and connecting it to the k-th face of module j; assuming each module has M docking faces, the size of the action space is N × (N-1) × M, which is also the output dimension of the policy neural network;
Define the reward function R as the average of the sum of the distance differences of all modules; the position of module No. 1 is always taken as the reference point with its coordinates fixed at (0, 0, 0), and the remaining modules are located relative to module No. 1. The calculation formula of the reward function R is as follows:
where xᵢ′, yᵢ′ and zᵢ′ are the Cartesian coordinates of module i in the target configuration, and xᵢ, yᵢ and zᵢ are the Cartesian coordinates of module i in the current configuration;
Define the average action value Q as the mean of the value estimates v output by the policy-value network, define the total action value W as the sum of the value estimates v, and define n as the visit count of the action node, so that Q = W/n;
Define the vector P as the policy estimate output by the policy-value network, i.e. the prior probabilities corresponding to each executable action a under a given configuration; the dimension of the vector P is the dimension of the action space, namely N × (N-1) × M;
Step 2: take the initial configuration as the root node. At this point the initial configuration has not yet been explored, so the expansion and evaluation phase of the Monte Carlo tree is called: the policy-value network returns the prior probabilities P and the value score v of the current state under the initial configuration, and the environment returns the reward function R(s); the child nodes of the initial configuration can then be expanded, and the information stored in the child nodes is initialized. After expansion is completed, the backup step of the Monte Carlo tree is carried out, updating the statistics stored in each node on the path from the leaf node to the root node. Since the root node has now been expanded, the selection phase of the Monte Carlo tree can be performed, choosing the branch with the highest overall score as the next action. The three steps of selection, expansion-and-evaluation, and backup are then repeated a certain number of times. The visit probability of each configuration is then obtained from the visit counts of the branches under the current configuration, and the action of the most-visited node is selected as the action to execute next. After each search, when the target configuration is found or the preset step limit is reached, a planned path is output and planning samples of the form (s, π, z) are saved;
The Monte Carlo tree search comprises selection, expansion-and-evaluation, and backup;
a) Selection
The selection phase starts from a configuration node and, following the tree structure already built, selects the branch with the largest value of f(s) = Q + U + R, where Q is the average action value, U is the upper confidence bound, and R is the reward function. The specific formula for U is as follows:
where c_puct controls the balance between exploration and exploitation, n(s, b) denotes the visit count of the parent node, and n(s, a) denotes the visit count of the child node of action a under that parent node;
The whole process stops when a completely unknown configuration is encountered;
b) Expansion and evaluation
When a completely unknown configuration is encountered, the policy-value network is called with the configuration matrices s of the current configuration and the target configuration as input; the policy-value network returns the prior probabilities P and the value score v of the current state, and the environment also provides the reward function R(s) as an evaluation of the current configuration. Once all feasible actions under the current configuration and their corresponding prior probabilities have been obtained, the child nodes of the unknown configuration can be expanded, and the information stored in the child nodes is initialized as n(s, a) = 0, W(s, a) = 0, Q(s, a) = 0, P(s, a) = p, R(s) = R;
c) Backup
When the expansion and evaluation phase is completed, the statistics stored in each node on the path from the leaf node to the root node are updated; the statistics include the node visit count, the total action value and the average action value, and the update formulas are:
n = n + 1
W = W + v
Q = (W + v) / (n + 1)
d) Execution
After steps a) to c) have been repeated a certain number of times, the visit probability of each configuration is obtained from the visit counts of the branches under the current configuration, and the action of the most-visited node is selected as the action to execute next;
Step 3: once the number of samples reaches a given value, the collected samples (s, π, z) are fed into the policy-value network for training; the training objective is to minimize the loss function l = (z − v)² − π^T log p + c||θ||², and the neural network parameters are updated after training is completed;
The input of the policy-value network is the current module-configuration matrix and the target module-configuration matrix, and its output is the probability P of each possible action under that configuration and the state value score v;
The input current and target module-configuration matrices first pass through a convolutional layer consisting of 256 3 × 3 convolution kernels with stride 1, followed by batch normalization and the nonlinear activation function ReLU. The signal then passes through a series of residual modules; inside each residual module, the input signal successively passes through a convolutional layer consisting of 256 3 × 3 kernels with stride 1, a batch normalization layer, the ReLU activation, another convolutional layer of 256 3 × 3 kernels with stride 1 and a batch normalization layer, is then added to the skip-connected input signal, and is finally output through the ReLU activation;
After the series of residual modules, the signal finally enters the output module, which is divided into a policy output part and a value output part. The policy output part first passes through a convolutional layer containing two 1 × 1 convolution kernels, then batch normalization and the ReLU activation, and finally a fully connected layer that outputs a vector of dimension N × (N-1) × M, corresponding to the probabilities of all possible movement actions. The value output part first passes through a convolutional layer containing one 1 × 1 convolution kernel, then batch normalization and the ReLU activation, followed by a fully connected layer with 256 outputs and a ReLU activation, and finally a fully connected layer with a single output giving the value estimate of the current configuration;
Step 4: after the parameters are updated, Monte Carlo tree search is carried out again from the initial configuration; steps 2-3 are repeated to iteratively search for the optimal result; after each search is completed, the planned path with the fewest steps is updated.
2. The self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm according to claim 1, characterized in that the number of residual modules in step 3 is 3 to 5.
CN201910523043.0A 2019-06-17 2019-06-17 Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm Active CN110297490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910523043.0A CN110297490B (en) 2019-06-17 2019-06-17 Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910523043.0A CN110297490B (en) 2019-06-17 2019-06-17 Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm

Publications (2)

Publication Number Publication Date
CN110297490A true CN110297490A (en) 2019-10-01
CN110297490B CN110297490B (en) 2022-06-07

Family

ID=68028152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910523043.0A Active CN110297490B (en) 2019-06-17 2019-06-17 Self-reconstruction planning method of heterogeneous modular robot based on reinforcement learning algorithm

Country Status (1)

Country Link
CN (1) CN110297490B (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106931970A (en) * 2015-12-30 2017-07-07 北京雷动云合智能技术有限公司 Robot security's contexture by self air navigation aid in a kind of dynamic environment
CN106503373A (en) * 2016-11-04 2017-03-15 湘潭大学 The method for planning track that a kind of Dual-robot coordination based on B-spline curves is assembled
CN107161357A (en) * 2017-04-27 2017-09-15 西北工业大学 A kind of via Self-reconfiguration Method of restructural spacecraft
CN107471206A (en) * 2017-08-16 2017-12-15 大连交通大学 A kind of modularization industrial robot reconfiguration system and its control method
CN107591844A (en) * 2017-09-22 2018-01-16 东南大学 Consider the probabilistic active distribution network robust reconstructing method of node injecting power
WO2018154153A2 (en) * 2017-11-27 2018-08-30 Erle Robotics, S.L. Method for designing modular robots
CN109871943A (en) * 2019-02-20 2019-06-11 华南理工大学 A kind of depth enhancing learning method for big two three-wheel arrangement of pineapple playing card

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
FEILIHOU: "Graph-based optimal reconfiguration planning for self-reconfigurable robots", 《ROBOTICS AND AUTONOMOUS SYSTEMS》 *
YIFEI ZHANG: "Reconfiguration Planning for Heterogeneous Cellular", 《PROCEEDINGS OF THE 2017 18TH》 *
苑丹丹: "基于蒙特卡洛法的模块化机器人工作空间分析", 《机床与液压》 *
费燕琼: "自重构模块化机器人的结构", 《上海交通大学学报》 *
黄攀峰: "参数未知航天器的姿态接管控制", 《控制与决策》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909146B (en) * 2019-11-29 2022-09-09 支付宝(杭州)信息技术有限公司 Label pushing model training method, device and equipment for pushing question-back labels
CN110909146A (en) * 2019-11-29 2020-03-24 支付宝(杭州)信息技术有限公司 Label pushing model training method, device and equipment for pushing question-back labels
CN111104732A (en) * 2019-12-03 2020-05-05 中国人民解放军国防科技大学 Intelligent planning method for mobile communication network based on deep reinforcement learning
CN111104732B (en) * 2019-12-03 2022-09-13 中国人民解放军国防科技大学 Intelligent planning method for mobile communication network based on deep reinforcement learning
CN111230875A (en) * 2020-02-06 2020-06-05 北京凡川智能机器人科技有限公司 Double-arm robot humanoid operation planning method based on deep learning
CN111230875B (en) * 2020-02-06 2023-05-12 北京凡川智能机器人科技有限公司 Double-arm robot humanoid operation planning method based on deep learning
CN111679679A (en) * 2020-07-06 2020-09-18 哈尔滨工业大学 Robot state planning method based on Monte Carlo tree search algorithm
WO2022007199A1 (en) * 2020-07-06 2022-01-13 哈尔滨工业大学 Robot state planning method based on monte carlo tree search algorithm
CN112264999A (en) * 2020-10-28 2021-01-26 复旦大学 Method, device and storage medium for intelligent agent continuous space action planning
CN113704098A (en) * 2021-08-18 2021-11-26 武汉大学 Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling
CN113704098B (en) * 2021-08-18 2023-09-22 武汉大学 Deep learning fuzzy test method based on Monte Carlo search tree seed scheduling
CN114020024A (en) * 2021-11-05 2022-02-08 南京理工大学 Unmanned aerial vehicle path planning method based on Monte Carlo tree search
CN114020024B (en) * 2021-11-05 2023-03-31 南京理工大学 Unmanned aerial vehicle path planning method based on Monte Carlo tree search

Also Published As

Publication number Publication date
CN110297490B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
CN110297490A Self-reconfiguration planning method for heterogeneous modular robots based on a reinforcement learning algorithm
CN108573303A Self-recovery strategy for local failures in complex networks based on improved reinforcement learning
CN109241291A Knowledge graph optimal path query system and method based on deep reinforcement learning
WO2020029583A1 (en) Multiplication and addition calculation method and calculation circuit suitable for neural network
CN105467997A Warehouse robot path planning method based on linear temporal logic theory
CN105509749A (en) Mobile robot path planning method and system based on genetic ant colony algorithm
CN109409510A (en) Neuron circuit, chip, system and method, storage medium
CN108921298A Multi-agent communication and decision-making method based on reinforcement learning
CN110188880A (en) A kind of quantization method and device of deep neural network
CN110883776A (en) Robot path planning algorithm for improving DQN under quick search mechanism
CN105978732A (en) Method and system for optimizing parameters of minimum complexity echo state network based on particle swarm
CN103646008A (en) Web service combination method
CN104050505A (en) Multilayer-perceptron training method based on bee colony algorithm with learning factor
CN108536144A (en) A kind of paths planning method of fusion dense convolutional network and competition framework
CN111159489A (en) Searching method
CN113807040A (en) Optimal design method for microwave circuit
Du et al. Application of an improved whale optimization algorithm in time-optimal trajectory planning for manipulators
CN116841303A (en) Intelligent preferential high-order iterative self-learning control method for underwater robot
CN107273970B (en) Reconfigurable platform of convolutional neural network supporting online learning and construction method thereof
CN115327926A (en) Multi-agent dynamic coverage control method and system based on deep reinforcement learning
CN115271254A (en) Short-term wind power prediction method for optimizing extreme learning machine based on gull algorithm
CN114564039A (en) Flight path planning method based on deep Q network and fast search random tree algorithm
CN112001558A (en) Method and device for researching optimal operation mode of power distribution network equipment
Verma et al. A novel evolutionary neural learning algorithm
Han et al. An improved ant colony optimization algorithm based on dynamic control of solution construction and mergence of local search solutions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant