CN112356031B - On-line planning method based on Kernel sampling strategy under uncertain environment - Google Patents

On-line planning method based on Kernel sampling strategy under uncertain environment

Info

Publication number
CN112356031B
CN112356031B (application CN202011220903.2A)
Authority
CN
China
Prior art keywords
belief
node
robot
kernel
tree
Prior art date
Legal status
Active
Application number
CN202011220903.2A
Other languages
Chinese (zh)
Other versions
CN112356031A (en)
Inventor
陈彦杰
黄益斌
林依凡
吴铮
何炳蔚
林立雄
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202011220903.2A
Publication of CN112356031A
Application granted
Publication of CN112356031B
Status: Active

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B25: HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00: Programme-controlled manipulators
    • B25J9/16: Programme controls
    • B25J9/1656: Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664: Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides an online planning method based on a Kernel sampling strategy in an uncertain environment, used for planning a robot's actions while it executes a task. In such an environment, the uncertainty, expressed as a POMDP model, is the main factor restricting the reliable operation of the robot. In the POMDP model the robot can observe only part of its own state, and it continuously interacts with the environment to obtain the policy with the maximum return. In the online planning method, partial observability is handled by expressing the robot's state as a belief, denoted belief, defined over the state set; the POMDP algorithm performs a forward search by constructing a belief tree to obtain the optimal policy under the current belief. Each node of the belief tree represents a belief, and a parent node is connected to its child nodes through behavior-observation branches. The POMDP algorithm is the online POMDP planning algorithm Kernel-DESPOT. The performance of the proposed algorithm is superior to that of DESPOT and POMCP, and its convergence speed and quality are also superior.

Description

On-line planning method based on Kernel sampling strategy under uncertain environment
Technical Field
The invention relates to the technical field of robots, in particular to an online planning method based on a Kernel sampling strategy in an uncertain environment.
Background
With the rapid development of information technology, robots have gradually merged into daily life. Motion planning has received much attention as an important field of robot research.
Sampling-based motion planning methods convert a continuous model into a discrete one; traditional sampling-based motion planning algorithms include RRT, PRM, the random potential field method, and others. The RRT algorithm can quickly find a path in complex environments, but suffers from poor path quality, low efficiency in narrow passages, and blind search. PRM and RRT are both probabilistically complete but not optimal; PRM differs from RRT essentially in how random points are placed and searched, but the imperfect placement of discrete points and the collision detection make it time-consuming. The random potential field method overcomes the local-optimum problem that arises in the traditional potential field method when the attractive and repulsive forces balance, but it cannot avoid colliding with obstacles or failing to reach the destination when the object is far from the target point or an obstacle lies near the target point. The above algorithms all sidestep the problem of model uncertainty, yet in real environments the model is often uncertain, and decision-making planning is of great significance for planning under uncertainty; motion planning research based on reinforcement learning is therefore a worthwhile research topic.
Reinforcement learning is an interdisciplinary field spanning robotics, deep learning, control engineering, and other areas. In a reinforcement learning system, the agent obtains an optimal policy by continually interacting with an uncertain environment so as to maximize its long-term reward. Reinforcement learning is therefore an important method for solving optimal decision problems. For a reinforcement learning system, agent and environment uncertainty can generally be expressed as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP). The difference is that in the POMDP model the agent's own state is only partially observable, whereas in the MDP model the state is fully observable.
Early research on MDP problems focused primarily on discrete state and action spaces. In real tasks or continuous state spaces, however, the learning efficiency and applicability of these algorithms still need improvement. Approximate reinforcement learning methods can effectively alleviate these problems, for example value-function approximation, approximate policy iteration, and actor-critic algorithms. Reinforcement learning no longer focuses only on a single long-term reward objective; multi-objective reinforcement learning has also developed greatly, for example using multi-objective learning to improve the flexibility of the longitudinal control of autonomous land vehicles.
The POMDP framework can solve real-world sequential decision problems well. When a robot executes a task, uncertainty is an important factor restricting its reliable operation; this uncertainty includes robot control errors, sensor noise, rapidly changing environments, and so on. Realizing planning in an uncertain environment with large state and observation spaces is therefore an important issue in the robotics field. POMDPs provide a basic framework for planning in an uncertain environment. A basic method of handling partial observability is to represent uncertainty as a belief; more concretely, the planning algorithm performs a forward search in a belief tree, each node of which represents a belief, with parent nodes connected to child nodes through behavior-observation branches. However, exact planning is computationally infeasible in the worst case, and approximate simulation-based POMDP algorithms have been widely used for tasks such as resource management, unmanned driving, navigation, and robotic arms. POMDP planning nevertheless remains computationally difficult because of the "curse of dimensionality" and the "curse of history".
Recent online POMDP algorithms such as DESPOT and POMCP use Monte Carlo methods for belief updates and tree search to deal with the curse of dimensionality. These algorithms represent the belief as a set of sampled states to overcome the computational difficulty caused by a large state space, and DESPOT further achieves fast approximate evaluation of different policies at the current belief by sampling behaviors and observation trajectories. In POMDP tasks with large state spaces, the latest online algorithms can compute a near-optimal policy, and theoretical analysis shows that a small number of sampled sequences suffices to guarantee near-optimal online planning. In practical applications, the sampling strategy plays an important role in such probabilistic sampling algorithms and has a great influence on their performance. Both DESPOT and POMCP construct the state set by random sampling; however, random sampling is blind and cannot represent the convergence direction of the whole belief space well, so an approximately optimal policy computed on such a state set is likely to be only locally near-optimal.
The motion planning of robots has a long history of development, and many different planning methods have been formed, such as methods based on artificial potential fields, mathematical models, graph search, nodes, and sampling. To increase the flexibility and maneuverability of robotic arms, redundant and even hyper-redundant arms with more than 6 degrees of freedom have been developed; the complexity of arm motion planning increases markedly with the number of degrees of freedom, so the motion planning of redundant arms usually has to consider the curse of dimensionality. Sampling-based methods have received widespread attention in the field of robotic arm motion planning because of their significant advantages in high-dimensional spaces.
A Kernel sampling strategy is introduced on top of the basic DESPOT framework, and the Kernel-DESPOT online planning algorithm is proposed. Kernel methods have been widely used to improve the utility of many well-known algorithms, such as perceptron networks, support vector machines, and natural language processing. Although the state or environment information of the robot is only partially observable for an online POMDP planning algorithm, the robot's state can be divided into a partially observable component and an observable component. POMCP uses historical behavior-observation information to guide behavior selection towards better policies; we likewise regard the historical observation information as one of the important components of the robot's currently observable information. The core of the Kernel sampling strategy is to take the observable information components of the robot's current state as important influencing factors for sampling: the correlation between the robot's observable information and the state information in the belief space is calculated through a kernel function and used as the sampling basis. The state set obtained by the Kernel sampling strategy avoids the blindness of random sampling, represents the belief space better, and reduces the probability of local-optimum problems caused by poorly correlated states. DESPOT assigns uniform weights to the sampled states when constructing the belief tree, but as the environment uncertainty gradually decreases the importance of the states to the current robot should differ, so at each sampling step we weight the states according to their degree of correlation.
Disclosure of Invention
The invention provides an online planning method based on a Kernel sampling strategy in an uncertain environment, which overcomes the defects of the existing reinforcement learning motion planning algorithm, has better algorithm performance than DESPOT and POMCP, and has advantages in convergence speed and quality.
The invention adopts the following technical scheme.
An online planning method based on a Kernel sampling strategy in an uncertain environment is used for planning a robot while it executes a task in such an environment; in the uncertain environment, the uncertainty, expressed as a POMDP model, is the main factor restricting the reliable operation of the robot. In the POMDP model the robot can observe only part of its own state, and it continuously interacts with the environment to obtain the policy with the maximum return. In the online planning method, partial observability is handled by expressing the robot's state as a belief, denoted belief, defined over the state set, and the POMDP algorithm performs a forward search by constructing a belief tree so as to obtain the optimal policy under the current belief. Each node of the belief tree represents a belief, and a parent node is connected to its child nodes through behavior-observation branches.
The POMDP algorithm is the online POMDP planning algorithm Kernel-DESPOT, which comprises the following steps.
Step S1: in the current belief space b of the robot, sample K states according to the Kernel sampling strategy to construct a sampling state set Φ_b, and assign a weight to each state;
Step S2: construct a belief tree D with b as the root node using the Kernel-DESPOT algorithm;
Step S3: initialize the upper bound U(b) and lower bound L(b) of the empirical value of the robot's current belief b, and the upper and lower bounds μ(b) and l(b) of the RK-WDU optimal value V*(b);
Step S4: define the uncertainty of the robot's current belief as ε(b) ← μ(b) - l(b);
Step S5: if the uncertainty ε(b) is greater than the target value ε0 and the total running time of the algorithm is less than T_max, expand the root node b0;
Step S6: when the belief tree stops expanding, execute BACKUP(D, b);
after BACKUP(D, b) is completed, the uncertainty ε(b) of the root node is updated, and it is checked again whether the uncertainty is smaller than ε0 or whether the running time is greater than T_max; if either condition is satisfied, the Kernel-DESPOT method returns the value l(b) of b;
Step S7: finally, for the root node b, the algorithm selects an optimal behavior a* that maximizes the lower bound returned by the belief tree, i.e. a* ← argmax_{a∈A} l(b, a);
the value l(b, a*) corresponding to the optimal behavior a* computed by the belief tree is compared with the value L(b) initialized by the default policy π0; if L(b) is larger, the optimal behavior is changed to that of the default policy, i.e. a* ← π0(b);
Step S8: the robot repeats the above process until the target point is finally reached.
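Steps S1-S8 can be summarized in the following Python sketch of one planning step; the object names and interfaces (sampler, tree, default_policy and their methods) are illustrative assumptions made only to show the control flow, not the patent's reference implementation.

```python
import time

def kernel_despot_step(b, sampler, tree, default_policy, epsilon0=0.01, t_max=1.0):
    """One planning step of Kernel-DESPOT (steps S1-S7); step S8 is the outer
    loop that executes the returned behavior and re-plans at the next belief."""
    phi_b = sampler.sample(b)                 # S1: K kernel-weighted states
    root = tree.make_root(b, phi_b)           # S2-S3: root with U(b), L(b), mu(b), l(b)
    start = time.time()
    # S4-S6: expand while the gap eps(b) = mu(b) - l(b) exceeds the target
    while (root.mu - root.l) > epsilon0 and time.time() - start < t_max:
        leaf = tree.explore(root)             # S5: forward exploration from the root
        tree.backup(leaf)                     # S6: Bellman backup along the path
    # S7: behavior with the largest lower bound l(b, a) ...
    a_star, l_star = max(root.action_lower_bounds().items(), key=lambda kv: kv[1])
    # ... unless the default policy's initial lower bound L(b) is larger
    if default_policy.value(b) > l_star:
        a_star = default_policy.action(b)
    return a_star
```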
The specific implementation of step S1 is as follows: the Kernel sampling strategy defines a kernel function K(x, x_i) in terms of the signal variance σ_f², the noise variance σ_n² and a Kronecker delta (the exact expression is given as an equation image in the original), where x represents the observable information of the current robot state, x_i represents the observable information of a state in the belief space, and ||x|| is the norm of x. K(x, x_i) denotes the correlation between x and x_i, so K states highly correlated with the current state information can be sampled according to K(x, x_i). Each node b of the Kernel-DESPOT belief tree contains a set Φ_b, which represents all sequences passing through node b; the initial states of these sequences constitute the sampling state set. For the current belief b, the starting state s_0 of a sequence φ has a weight determined by its kernel correlation (the expression is likewise given as an image in the original), where φ ∈ Φ_b and x_i is the observable information component of the state s_0.
σ_n² is defined as the measurement noise variance, i.e. the variance of the K(x, x_i) values over all states of the belief space in the last sampling period, and σ_f² is defined as the signal variance.
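The exact kernel and weight expressions above appear only as images in the patent, so the following Python sketch should be read as an assumption: it uses a standard squared-exponential kernel with signal variance σ_f², noise variance σ_n² and a Kronecker-delta term, and weights each sampled start state by its normalised kernel value. All function and parameter names (rbf_kernel, kernel_sample, x_of, length_scale) are illustrative.

```python
import numpy as np

def rbf_kernel(x, xi, sigma_f=1.0, sigma_n=0.1, length_scale=1.0):
    """Assumed form of K(x, x_i): a squared-exponential term with signal
    variance sigma_f^2 plus a Kronecker-delta noise term sigma_n^2."""
    x, xi = np.asarray(x, dtype=float), np.asarray(xi, dtype=float)
    sq_dist = np.sum((x - xi) ** 2)
    delta = 1.0 if np.array_equal(x, xi) else 0.0
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * length_scale ** 2)) + sigma_n ** 2 * delta

def kernel_sample(belief_states, observable, x_of, K=500, rng=None):
    """Sample K states from the belief space with probability proportional to
    K(x, x_i) and return them with normalised kernel values as weights.
    `x_of(s)` extracts the observable component x_i of a state s."""
    rng = rng or np.random.default_rng()
    scores = np.array([rbf_kernel(observable, x_of(s)) for s in belief_states])
    probs = scores / scores.sum()
    idx = rng.choice(len(belief_states), size=K, p=probs, replace=True)
    weights = scores[idx] / scores[idx].sum()      # weight of each sampled start state
    return [(belief_states[i], w) for i, w in zip(idx, weights)]
```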
The specific implementation of step S2 is as follows: the empirical value is computed in the form given by an equation image in the original, where V_{π,φ} denotes the total discounted reward of each sampled sequence φ ∈ Φ_b computed by simulating the policy π. U(b) is determined by assuming that all states are fully observable, i.e. converting the POMDP problem into an MDP problem, and then calculating the optimal value V_MDP in the MDP environment (the resulting expression is also given as an image in the original).
The lower bound L(b) of the empirical value is calculated by giving a default policy π0: for each sequence in Φ_b of each node b in DESPOT, the default policy is simulated for a limited number of time steps, the total discounted reward of each sequence is computed, and the results are averaged.
The upper and lower bounds μ(b) and l(b) of the RK-WDU can be determined from U(b) and L(b) (two equation images in the original), where γ is the discount factor, Δ(b) is the depth of node b in the tree, π_b is the subtree rooted at b, |π_b| is the size of π_b, and the state weights give an empirical estimate of the probability of reaching b.
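A minimal Python sketch of the default-policy lower bound described above is given below. Weighting the rollout returns by the kernel-derived state weights is an assumption about how Kernel-DESPOT combines the two ideas; `simulate` and `default_policy` are caller-supplied stand-ins for the black-box simulator and the default policy π0.

```python
def default_policy_lower_bound(phi_b, simulate, default_policy, gamma=0.95, horizon=90):
    """Estimate L(b): roll out the default policy for a bounded number of steps
    from each sampled start state and average the discounted returns,
    weighting each rollout by its kernel-derived state weight."""
    total, weight_sum = 0.0, 0.0
    for state, w in phi_b:                    # phi_b as returned by kernel_sample()
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = default_policy.action(s)
            s, r, done = simulate(s, a)       # one step of the black-box simulator
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += w * ret
        weight_sum += w
    return total / max(weight_sum, 1e-12)
```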
The specific implementation of step S5 is as follows: define b' = τ(b, a, z) as the child node reached when b takes action a and obtains observation z. When performing the tree-expansion operation on node b, the values U(b'), L(b'), μ(b') and l(b') of all child nodes b' of b are first initialized as in step S3; each exploration then aims to reduce the current gap ε(b) at the root node b to the target gap ξ·ε(b), where ξ ∈ (0, 1) is a constant. During exploration, the optimal behavior selection at each node b depends on the upper bound μ(b) of b (the selection rule is given as an equation image in the original).
After executing a*, the child node b' = τ(b, a*, z*) is obtained by selecting the observation z* that maximizes the excess uncertainty (also given as an equation image).
The tree-expansion process is repeated until Δ(b) > D, meaning the tree has been expanded to its maximum depth; or until the uncertainty of node b has fallen below the expected value, i.e. E(b) < 0, so that continued exploration is no longer meaningful for node b. There is a further case in which tree expansion stops the forward exploration, namely when the parent node b of b' no longer has enough sampled sequences (the condition is given as an equation image), where l(b, b') denotes the number of nodes on the path from b to b'. If the sampled sequences of the parent node b are insufficient, continuing to expand b increases the number of child policy trees of b', which may cause overfitting and weaken the effect of regularization at b'. If the above condition is satisfied while expanding the tree, a pruning operation PRUNE(D, b) needs to be performed.
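The exploration rules above (pick the behavior with the largest upper bound, then the observation child with the largest excess uncertainty, and stop on depth, uncertainty, or insufficient sequences) can be sketched in Python as follows. The node interface (expand, action_upper_bound, excess_uncertainty, too_few_sequences) is an illustrative assumption rather than the patent's data structure.

```python
def explore(root, max_depth=90):
    """Forward exploration of step S5: descend along the most promising
    behavior/observation branch until one of the stopping conditions holds,
    and return the node at which the descent stopped."""
    node = root
    while node.depth <= max_depth and node.excess_uncertainty() > 0:
        if not node.expanded:
            node.expand()                      # initialise U, L, mu, l of the children
        # behavior with the largest upper bound mu(b, a)
        a_star = max(node.actions, key=node.action_upper_bound)
        # observation child with the largest excess uncertainty
        node = max(node.children(a_star), key=lambda c: c.excess_uncertainty())
        if node.too_few_sequences():           # parent lacks sampled sequences: PRUNE case
            break
    return node
```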
The operation PRUNE(D, b) is implemented as follows: after the forward search stops because the parent node b does not have enough sampled sequences, the initial lower bounds are assigned to the corresponding upper bounds in the value calculation of node b, indicating that the uncertainty of the current node b already meets the requirement, i.e.:
U(b)←L(b)
μ(b)←l(b)
BACKUP (D, b) is also performed thereafter.
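A correspondingly tiny sketch of PRUNE(D, b), under the same assumed node interface as above:

```python
def prune(node):
    """PRUNE(D, b): collapse the node's bounds so that its uncertainty is
    treated as already satisfied; BACKUP(D, b) is then performed on this node
    (see the BACKUP sketch below)."""
    node.U = node.L
    node.mu = node.l
```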
The operation BACKUP(D, b) is implemented as follows: when BACKUP(D, b) needs to be executed, Kernel-DESPOT follows the Bellman rule and updates the values μ(b), l(b) and U(b) of the nodes in the belief tree from bottom to top along the path (the three update equations are given as images in the original), where b' = τ(b, a, z) is a child node of b.
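The three update equations are given only as images in the patent, so the sketch below uses the standard DESPOT-style Bellman backup (maximise over behaviors the regularised immediate term plus the sum over observation children, floored by the node's initial lower bound) purely as an assumption about their form; rho, reward, l0 and the rest of the node interface are illustrative.

```python
def backup(node):
    """BACKUP(D, b): walk from the given node back to the root, recomputing the
    bounds mu(b), l(b) and U(b) of every expanded node from its children."""
    while node is not None:
        if node.expanded:
            node.mu = max(node.l0, max(node.rho(a) + sum(c.mu for c in node.children(a))
                                       for a in node.actions))
            node.l = max(node.l0, max(node.rho(a) + sum(c.l for c in node.children(a))
                                      for a in node.actions))
            node.U = max(node.reward(a) + sum(c.U for c in node.children(a))
                         for a in node.actions)
        node = node.parent
```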
Compared with the prior art, the invention has the following beneficial effects:
Compared with the latest online POMDP planning algorithms DESPOT and POMCP, the algorithm provided by the invention has the following advantages:
(1) the algorithm proposed by the invention is approximately optimal;
(2) the core of the Kernel sampling strategy is to take the observable information components of the robot's current state as important influencing factors for sampling; the correlation between the robot's observable information and the state information in the belief space is calculated through a kernel function and used as the sampling basis; the state set obtained by the Kernel sampling strategy avoids the blindness of random sampling, represents the belief space better, and reduces the probability of local-optimum problems caused by poorly correlated states;
(3) DESPOT assigns uniform weights to the sampled states when constructing the belief tree, but as the environment uncertainty gradually decreases, the importance of the states to the current robot should differ; therefore, at each sampling step Kernel-DESPOT assigns weights to the states according to their degree of correlation;
(4) the Kernel sampling strategy and the weight assignment each improve the performance of the algorithm.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of the expansion of the belief tree of the present invention (solid circles represent belief nodes and black squares represent belief-behavior nodes; in the tree, the different solid circles represent states sampled according to the sampling strategy, solid triangles of different sizes represent the weights of those states, and the different trajectories represent the sampling sequences corresponding to the sampled states);
FIG. 3 is a schematic diagram of four common POMDP evaluation simulation environments: (a) Tag, in which the robot chases a target that tries to escape; (b) LaserTag, in which the robot chases a target in a grid with randomly scattered obstacles and carries a radar for searching for the target and detecting obstacles; (c) RockSample, in which the robot rover can sense rocks in order to identify and sample the "good" ones; and (d) Pocman, a partially observable variant of the video game of the same name;
FIG. 4 is a schematic diagram showing the effect of different sampling sequences on algorithm performance in the present invention ((a) Tag. (b) Lasertag. (c) RockSample.);
FIG. 5 is a graphical representation of the runtime of the present invention after normalization (average runtime comparison of different task algorithms);
FIG. 6 is a schematic diagram of the normalized convergence performance of the present invention (comparison of convergence degrees of different task algorithms);
FIG. 7 shows the algorithm performance of the present invention in common POMDP evaluation simulation environments (the average discounted rewards of the algorithms on six different tasks);
FIG. 8 shows the effect of the two components of the invention on algorithm performance (Kernel-DESPOT contains two main parts, the kernel-based sampling of the state set and the weight assignment for each sampled state; the figure compares variants that use only one of the two components).
Detailed Description
As shown in the figures, the online planning method based on a Kernel sampling strategy in an uncertain environment is used for planning a robot while it executes a task in such an environment; in the uncertain environment, the uncertainty, expressed as a POMDP model, is the main factor restricting the reliable operation of the robot. In the POMDP model the robot can observe only part of its own state, and it continuously interacts with the environment to obtain the policy with the maximum return.
In the online planning method, partial observability is handled by expressing the robot's state as a belief, denoted belief, defined over the state set, and the POMDP algorithm performs a forward search by constructing a belief tree so as to obtain the optimal policy under the current belief; each node of the belief tree represents a belief, and a parent node is connected to its child nodes through behavior-observation branches.
The POMDP algorithm is the online POMDP planning algorithm Kernel-DESPOT, which comprises the following steps.
Step S1: in the current belief space b of the robot, sample K states according to the Kernel sampling strategy to construct a sampling state set Φ_b, and assign a weight to each state;
Step S2: construct a belief tree D with b as the root node using the Kernel-DESPOT algorithm;
Step S3: initialize the upper bound U(b) and lower bound L(b) of the empirical value of the robot's current belief b, and the upper and lower bounds μ(b) and l(b) of the RK-WDU optimal value V*(b);
Step S4: define the uncertainty of the robot's current belief as ε(b) ← μ(b) - l(b);
Step S5: if the uncertainty ε(b) is greater than the target value ε0 and the total running time of the algorithm is less than T_max, expand the root node b0;
Step S6: when the belief tree stops expanding, execute BACKUP(D, b);
after BACKUP(D, b) is completed, the uncertainty ε(b) of the root node is updated, and it is checked again whether the uncertainty is smaller than ε0 or whether the running time is greater than T_max; if either condition is satisfied, the Kernel-DESPOT method returns the value l(b) of b;
Step S7: finally, for the root node b, the algorithm selects an optimal behavior a* that maximizes the lower bound returned by the belief tree, i.e. a* ← argmax_{a∈A} l(b, a);
the value l(b, a*) corresponding to the optimal behavior a* computed by the belief tree is compared with the value L(b) initialized by the default policy π0; if L(b) is larger, the optimal behavior is changed to that of the default policy, i.e. a* ← π0(b);
Step S8: the robot repeats the above process until the target point is finally reached.
The specific implementation of step S1 is as follows: the Kernel sampling strategy defines a kernel function K(x, x_i) in terms of the signal variance σ_f², the noise variance σ_n² and a Kronecker delta (the exact expression is given as an equation image in the original), where x represents the observable information of the current robot state, x_i represents the observable information of a state in the belief space, and ||x|| is the norm of x. K(x, x_i) denotes the correlation between x and x_i, so K states highly correlated with the current state information can be sampled according to K(x, x_i). Each node b of the Kernel-DESPOT belief tree contains a set Φ_b, which represents all sequences passing through node b; the initial states of these sequences constitute the sampling state set. For the current belief b, the starting state s_0 of a sequence φ has a weight determined by its kernel correlation (the expression is likewise given as an image in the original), where φ ∈ Φ_b and x_i is the observable information component of the state s_0.
σ_n² is defined as the measurement noise variance, i.e. the variance of the K(x, x_i) values over all states of the belief space in the last sampling period, and σ_f² is defined as the signal variance.
The specific implementation of step S2 is as follows: the empirical value is computed in the form given by an equation image in the original, where V_{π,φ} denotes the total discounted reward of each sampled sequence φ ∈ Φ_b computed by simulating the policy π. U(b) is determined by assuming that all states are fully observable, i.e. converting the POMDP problem into an MDP problem, and then calculating the optimal value V_MDP in the MDP environment (the resulting expression is also given as an image in the original).
The lower bound L(b) of the empirical value is calculated by giving a default policy π0: for each sequence in Φ_b of each node b in DESPOT, the default policy is simulated for a limited number of time steps, the total discounted reward of each sequence is computed, and the results are averaged.
The upper and lower bounds μ(b) and l(b) of the RK-WDU can be determined from U(b) and L(b) (two equation images in the original), where γ is the discount factor, Δ(b) is the depth of node b in the tree, π_b is the subtree rooted at b, |π_b| is the size of π_b, and the state weights give an empirical estimate of the probability of reaching b.
The specific implementation of step S5 is as follows: define b' = τ(b, a, z) as the child node reached when b takes action a and obtains observation z. When performing the tree-expansion operation on node b, the values U(b'), L(b'), μ(b') and l(b') of all child nodes b' of b are first initialized as in step S3; each exploration then aims to reduce the current gap ε(b) at the root node b to the target gap ξ·ε(b), where ξ ∈ (0, 1) is a constant. During exploration, the optimal behavior selection at each node b depends on the upper bound μ(b) of b (the selection rule is given as an equation image in the original).
After executing a*, the child node b' = τ(b, a*, z*) is obtained by selecting the observation z* that maximizes the excess uncertainty (also given as an equation image).
The tree-expansion process is repeated until Δ(b) > D, meaning the tree has been expanded to its maximum depth; or until the uncertainty of node b has fallen below the expected value, i.e. E(b) < 0, so that continued exploration is no longer meaningful for node b. There is a further case in which tree expansion stops the forward exploration, namely when the parent node b of b' no longer has enough sampled sequences (the condition is given as an equation image), where l(b, b') denotes the number of nodes on the path from b to b'. If the sampled sequences of the parent node b are insufficient, continuing to expand b increases the number of child policy trees of b', which may cause overfitting and weaken the effect of regularization at b'. If the above condition is satisfied while expanding the tree, a pruning operation PRUNE(D, b) needs to be performed.
The operation PRUNE(D, b) is implemented as follows: after the forward search stops because the parent node b does not have enough sampled sequences, the initial lower bounds are assigned to the corresponding upper bounds in the value calculation of node b, indicating that the uncertainty of the current node b already meets the requirement, i.e.:
U(b)←L(b)
μ(b)←l(b)
BACKUP (D, b) is also performed thereafter.
The operation BACKUP(D, b) is implemented as follows: when BACKUP(D, b) needs to be executed, Kernel-DESPOT follows the Bellman rule and updates the values μ(b), l(b) and U(b) of the nodes in the belief tree from bottom to top along the path (the three update equations are given as images in the original), where b' = τ(b, a, z) is a child node of b.
Example:
In this embodiment, the complete pseudo code of the algorithm is given as a set of images in the original patent.
The embodiment of the invention is explained in detail through specific experiments. The invention provides a planning method based on a Kernel sampling strategy in an uncertain environment; since standard POMDP simulation evaluation environments exist for POMDP planning problems, the performance of the Kernel-DESPOT algorithm is mainly tested in common simulation evaluation environments and compared with the latest POMDP algorithms. The specific experimental setup is as follows:
simulation experiment:
the simulation experiment was performed in the Ubuntu system.
We evaluated the performance of the Kernel-DESPOT algorithm (Table 1) on six simulation tasks, which are common benchmarks for evaluating POMDP algorithms. The number of sampled sequences is set to K = 500 in the Tag, LaserTag and RockSample environments and to K = 100 in the Pocman environment, and the calculation methods for the initial upper and lower bounds are chosen heuristically according to the specific requirements of each environment. All algorithms were simulated on a unified platform, and the maximum running time of the online POMDP algorithms was set to 1 second. Kernel-DESPOT consists essentially of two parts: kernel-based correlation sampling and weight reassignment of the sampling state set; Table 2 shows the effect of these two parts on algorithm performance separately. To further illustrate the performance of the Kernel-DESPOT algorithm, we consider three further aspects: FIG. 4 shows the influence of different numbers of sampled sequences on algorithm performance, FIG. 5 shows the normalized running times, and FIG. 6 shows the normalized convergence performance; together they explain the advantages of Kernel-DESPOT.
From the kernel function definition in step S1 it can be seen that the variable x, i.e. the information available about the current robot state, is the key to computing the kernel function. For each POMDP environment the information available about the robot's current state differs, so the variable x depends on the characteristics of the environment. In every case, however, we convert the information available from the robot's current state into a positional relationship, so x is in fact a position vector, which is then used for correlation calculations with the states in the belief space.
(1) Simulation environment one: TAG
Tag is a standard POMDP benchmark proposed by Pineau et al. in 2003. One robot and one target move among 29 possible grid positions (FIG. 3a). The goal of the robot is to find and catch a target that always tries to escape from it. Their initial positions are random; the robot knows its own position but observes the target only when they are in the same position. The robot may stay still or move to one of four adjacent positions, each move costing a reward of -1. In addition, the robot can perform a "tag" action: it obtains a reward of +10 if it is in the same location as the target, otherwise the reward is -10.
In the Tag environment, since the robot's observation contains only its own position information, we consider the role of the historical observation information when constructing the vector x. The current robot position and the robot positions in the historical observations form a relative position vector x. As historical observations accumulate, only the history within three time steps is considered in order to reduce the amount of computation; if the current robot position deviates greatly from the position vector x constructed from that three-step history, the information closest in time takes precedence. This mainly prevents the robot from repeatedly searching the same area within a short time.
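A small Python sketch of this construction is shown below; the three-step window and the deviation threshold used to fall back to the most recent observation are illustrative values, and the function name and inputs are assumptions.

```python
def tag_position_vector(current_pos, history, window=3, max_dev=5.0):
    """Tag: build the relative position vector x from the current robot
    position and the robot positions observed over the last `window` steps;
    if that history deviates too much, use only the most recent observation."""
    recent = list(history)[-window:]          # history of past (x, y) robot positions
    if not recent:
        return (0.0, 0.0)
    avg = (sum(p[0] for p in recent) / len(recent),
           sum(p[1] for p in recent) / len(recent))
    x = (current_pos[0] - avg[0], current_pos[1] - avg[1])
    if (x[0] ** 2 + x[1] ** 2) ** 0.5 > max_dev:   # large deviation from the history
        last = recent[-1]
        x = (current_pos[0] - last[0], current_pos[1] - last[1])
    return x
```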
(2) Simulation scene two: LaserTag
LaserTag is an extended version of the Tag environment, characterized by a large observation space. In the LaserTag environment the robot moves in a 7 × 11 rectangular grid in which 8 obstacles are randomly placed (FIG. 3b). The robot and the target behave as in Tag, except that the robot does not know its own position exactly; at the start, its position information is uniformly distributed over the grid. To localize itself, the robot carries a radar that measures distances in 8 directions. The radar reading in each direction follows a normal distribution centered at the true distance from the robot to the nearest obstacle in that direction, with a standard deviation of 2.5.
In the LaserTag environment we construct the vector x from the radar observations. The robot can measure the closest distance to an obstacle or to the target in the eight directions by radar, but that closest distance is only observed through the normal distribution above. So when the target appears in one of the eight directions observed by the robot's radar, we construct a relative position vector x using the closest distance in the direction most likely to contain the target. The similarity between the belief-space states and the current robot state is then calculated with the kernel function to construct the sampling state set. Even though the robot does not know its current position and its observations are noisy, we can still approximate a relative position vector x between the target and the robot. Although the target position estimated from the observation differs from the true target position, it provides important approximate direction information about the target, and so prevents sampling states that deviate greatly from that direction.
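The following sketch illustrates this choice of direction; the inputs (an estimated per-direction probability of containing the target and the corresponding radar readings) are an assumed encoding, not the patent's data structures.

```python
import numpy as np

# Unit offsets for the eight radar directions (N, NE, E, SE, S, SW, W, NW).
DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def lasertag_position_vector(target_probs, distances):
    """LaserTag: pick the direction with the highest probability of containing
    the target and scale its unit vector by the radar reading to obtain an
    approximate relative position vector x."""
    i = int(np.argmax(target_probs))
    dx, dy = DIRS[i]
    norm = (dx * dx + dy * dy) ** 0.5
    return (distances[i] * dx / norm, distances[i] * dy / norm)
```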
(3) A third simulation environment: RockSample
RockSample (Smith & Simmons, 2004) is a well-established benchmark with a large state space (FIG. 3c). RS(n, k) denotes a robotic rover searching an n × n grid containing k rocks, each of which may be "good" or "bad". The purpose of the robot is to search for these rocks, sample the good ones, and exit from the far right of the grid when the search is complete. At each step the robot may move to an adjacent grid cell, sense the quality of a rock, or sample a rock. When the sampling action is performed, a reward of +10 is obtained if the rock is good, otherwise the reward is -10. Moving and sensing carry no reward; moving and sampling produce no observation, and only the sensing action produces an observation, "good" or "bad", whose accuracy decreases exponentially with the distance between the robot and the rock. To obtain a higher reward, the robot navigates through the grid and senses the quality of the rocks, while using the information obtained to visit and sample the good ones.
In the RockSample environment, the idea behind constructing the position vector x is simple: if the last behavior was a sensing action and the observation obtained was "good", the position vector x is constructed from the position of that rock and the current robot position, and states highly correlated with x are then sampled in the belief space. The position vector remains unchanged until the next sensing action is performed. If the robot performs sensing actions repeatedly and the observations obtained are all "good", the position vector x is constructed from the observation with the highest accuracy. The core idea of this method is that the robot trusts its observations and uses the observed information to guide sampling; states with low similarity to the observed information are regarded as useless states that contribute nothing positive to policy computation and can cause local-optimum problems.
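A sketch of this rule follows; representing the highest-accuracy observation by the nearest rock sensed as good relies on the statement that sensing accuracy decays exponentially with distance, and the inputs are illustrative.

```python
def rocksample_position_vector(robot_pos, sensed_good_rocks):
    """RockSample: point x from the robot towards the rock most recently (or
    most reliably) sensed as 'good'; with several candidates, trust the
    nearest one, whose sensing reading is the most accurate."""
    if not sensed_good_rocks:                  # no 'good' observation yet
        return (0, 0)
    nearest = min(sensed_good_rocks,
                  key=lambda r: abs(r[0] - robot_pos[0]) + abs(r[1] - robot_pos[1]))
    return (nearest[0] - robot_pos[0], nearest[1] - robot_pos[1])
```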
(4) And (4) simulating an environment: pocman
Pocman (Silver & Veness, 2010) is a partially observable variant of the popular video game of the same name (FIG. 3d). The agent and four ghosts move in a 17 × 19 maze with food distributed throughout it. Each move of the agent incurs a penalty of -1, each piece of food gives a reward of +10, and if the agent is caught by a ghost the game terminates with a penalty of -100. In addition, there are four power foods in the maze; within 15 steps of eating a power food the agent can catch ghosts, and it obtains a reward of +25 for each ghost it successfully catches. A ghost will pursue the agent when the Manhattan distance between them is within 5, but if the agent has eaten a power food the ghosts move away from the agent. The agent does not know the exact locations of the ghosts, but it can obtain the following information: whether there is a ghost in each of its principal directions, whether a ghost is within Manhattan distance 2, whether there is a wall in each principal direction, and whether there is food in an adjacent or diagonally adjacent position. Pocman has a very large state space, close to 10^56 states.
In the Pocman environment the construction of the position vector x is easy, since the agent knows whether there is food in an adjacent or diagonally adjacent position. If there is, the vector x is constructed from the position of that food and the current agent position, with adjacent positions taking priority over diagonal ones. This method effectively avoids the blind wandering of the agent caused by random sampling.
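A minimal sketch of this rule, assuming the observation is encoded as two boolean lists aligned with the offset tables below (an illustrative encoding):

```python
ADJACENT = [(0, 1), (1, 0), (0, -1), (-1, 0)]
DIAGONAL = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def pocman_position_vector(food_adjacent, food_diagonal):
    """Pocman: point x at the first reported food cell, preferring the four
    adjacent cells over the four diagonally adjacent ones."""
    for offsets, flags in ((ADJACENT, food_adjacent), (DIAGONAL, food_diagonal)):
        for off, has_food in zip(offsets, flags):
            if has_food:
                return off
    return (0, 0)
```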
(5) Comparison of the contributions of the Kernel sampling strategy and state reweighting to algorithm performance. The main contribution of Kernel-DESPOT is the kernel-function-based sampling strategy and the weight assignment of the sampling state set. To better illustrate the impact of these two components on algorithm performance, we tested the average discounted reward in six environments according to the controlled-variable method. The larger the observation space of a POMDP environment, the larger the corresponding belief tree, and the more obvious the overfitting caused by random sampling, which manifests as the robot easily falling into local optima. Kernel-DESPOT uses the similarity between the belief-space states and the current state as the sampling basis to construct the sampling state set, and assigns state weights according to this similarity; we expect this to mitigate the local-optimum problem caused by overfitting. As can be seen from Table 2, in the Tag, LaserTag and Pocman environments with larger observation spaces, both sampling only according to the kernel function and random sampling combined with weight assignment of the sampling state set significantly improve the performance of the algorithm, and the superiority of Kernel-DESPOT becomes more pronounced as the number of states increases in the RockSample environment.
(6) Comparison of Kernel-DESPOT running time and convergence performance
For each node b, assume that the initial upper bound U_0(b) approximates the true upper bound with an error δ (the precise condition is given as an equation image in the original); if the error δ of the initial upper bound is 0, then U_0(b) is the true upper bound.
Let T_max be finite, and let the real-time Kernel-DESPOT algorithm construct a partial belief tree whose root node b0 has a gap ε(b0) between its upper and lower bounds. Then the regularized optimized policy derived from this partial tree satisfies a guarantee relating its value to ν*(b0) (the inequality is given as an equation image in the original), where ν*(b0) is the value of the regularized optimal policy computed from the complete Kernel-DESPOT belief tree. ε(b0) decreases monotonically as T_max increases, so the computed policy gradually approaches the optimal regularized policy as more time is available, and the error of the initial upper bound affects the final result by at most δ.
The sampling state set obtained by Kernel-DESPOT is more consistent with the convergence direction of the belief space, so that ε(b0) decreases more rapidly within the limited time and the computed policy converges to the optimal policy more quickly; Kernel-DESPOT is therefore faster in overall computation time. FIG. 5 compares the running times of the different algorithms on the same tasks; the ordinate represents the proportion of the running time of each algorithm on the same task after normalization. Overall, the computation speed of the Kernel-DESPOT algorithm is advantageous. Since Kernel-DESPOT is an online POMDP planning algorithm whose maximum computation time per time step is 1 s, it can hardly reduce the uncertainty of the current belief to 0 within the limited time, i.e. ε(b0) is generally greater than 0. Because ε(b0) reflects the convergence of the algorithm, we use the average ε(b0) value per time step as the measure of algorithm convergence performance, shown in FIG. 6, which compares the convergence of the algorithms on different tasks; the ordinate represents the normalized proportion of the average ε(b0) of the different algorithms on the same task.

Claims (5)

1. An on-line planning method based on a Kernel sampling strategy under an uncertain environment is used for planning a robot when a task is executed under the uncertain environment, and is characterized in that: in the uncertainty environment, the uncertainty expressed as the POMDP model is the main factor that restricts the reliable operation of the robot; in the POMDP model, the robot can observe partial states of the robot, and the robot continuously interacts with the environment to obtain a strategy with the maximum return;
in the online planning method, when an observable part is processed, the state of the robot is expressed as a belief, the belief is recorded as belief, the belief belongs to a state set, and forward search is executed by a POMDP algorithm in a mode of constructing a belief tree so as to obtain an optimal strategy under the current belief; each node of the belief tree represents a belief, and a father node is connected with a child node through a behavior-observation branch;
the POMDP algorithm is an online POMDP planning algorithm Kernel-DESPOT and comprises the following steps;
Step S1: in the current belief space b of the robot, sample K states according to the Kernel sampling strategy to construct a sampling state set Φ_b, and assign a weight to each state;
Step S2: construct a belief tree D with b as the root node using the Kernel-DESPOT algorithm;
Step S3: initialize the upper bound U(b) and lower bound L(b) of the empirical value of the robot's current belief b, and the upper and lower bounds μ(b) and l(b) of the RK-WDU optimal value V*(b);
Step S4: define the uncertainty of the robot's current belief as ε(b) ← μ(b) - l(b);
Step S5: if the uncertainty ε(b) is greater than the target value ε0 and the total running time of the algorithm is less than T_max, expand the root node b0;
Step S6: when the belief tree stops expanding, execute BACKUP(D, b);
after BACKUP(D, b) is completed, the uncertainty ε(b) of the root node is updated, and it is checked again whether the uncertainty is smaller than ε0 or whether the running time is greater than T_max; if either condition is satisfied, the Kernel-DESPOT algorithm returns the value l(b) of b;
Step S7: finally, for the root node b, the algorithm selects an optimal behavior a* that maximizes the lower bound returned by the belief tree, i.e. a* ← argmax_{a∈A} l(b, a);
the value l(b, a*) corresponding to the optimal behavior a* computed by the belief tree is compared with the value L(b) initialized by the default policy π0; if L(b) is larger, the optimal behavior is changed to that of the default policy, i.e. a* ← π0(b);
Step S8: the robot repeats the above steps until the target point is finally reached;
the specific implementation manner of step S1 is: kernel sampling strategy Kernel function definition
Figure FDA0003472405240000021
Wherein the content of the first and second substances,
Figure FDA0003472405240000022
what is represented in the kernel function is the transpose of the vector; x represents the observable information of the current robot state, xiRepresenting observable information of the state in the belief space, | | x | | is the norm of x,
Figure FDA0003472405240000023
is a kronecker symbol; k (x, x)i) Denotes x and xiSo that it can be based on K (x, x)i) Sampling K states highly correlated with the current state information; each node b of the Kernel-DESPOT belief tree contains a set phibThe set represents all sequences passing through node b; the initial state of each sequence constitutes a sampling state set; for the current belief b, the starting state s of the sequence phi0Has a weight of
Figure FDA0003472405240000024
Wherein φ e φb,xiIs a state s0The observable fraction of information;
definition of σn 2To measure the noise variance, K (x, x) representing all states of the belief space in the last sample periodi) Variance of the values; definition of σf 2Is the signal variance.
2. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 1, characterized in that step S3 is implemented as follows: the empirical value is computed in the form given by an equation image in the original, where V_{π,φ} denotes the total discounted reward of each sampled sequence φ ∈ Φ_b computed by simulating the policy π; the resulting quantity expresses the empirical value obtained by executing the policy π at the current belief b; U(b) is determined by assuming that all states are fully observable, i.e. converting the POMDP problem into an MDP problem, and then calculating the optimal value V_MDP in the MDP environment (the resulting expression is also given as an image in the original);
the lower bound of the empirical value is calculated by giving a default policy π0: for each sequence in Φ_b of each node b in DESPOT, the default policy is simulated for a limited number of time steps, the total discounted reward of each sequence is computed, and the results are averaged, where s_φ is the starting state s_0 of the sequence φ;
the upper and lower bounds μ(b) and l(b) of the RK-WDU can be determined from U(b) and L(b) by formulas four and five (given as images in the original), where γ is the discount factor, Δ(b) is the depth of node b in the tree, the state weights give an empirical estimate of the probability of reaching b, λ > 0 is the regularization term, which effectively avoids the overfitting problem during belief-tree construction, and the remaining factor is the kernel-weight discount effect; formulas four and five embody the RK-WDU idea in the algorithm.
3. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 2, characterized in that step S5 is implemented as follows: define b' = τ(b, a, z) as the child node reached when b takes action a and obtains observation z; when performing the tree-expansion operation on node b, the values U(b'), L(b'), μ(b') and l(b') of all child nodes b' of b are first initialized as in step S3; each exploration then aims to reduce the current gap ε(b) at the root node b to the target gap ξ·ε(b), where ξ ∈ (0, 1) is a constant;
during exploration, the optimal behavior selection at each node b depends on the upper bound μ(b) of b (the selection rule is given as an equation image in the original), in which ρ(b, a) represents the construction process that introduces the regularization term into the belief tree; after executing a*, the child node b' = τ(b, a*, z*) is obtained by selecting the observation z* that maximizes the excess uncertainty (also given as an equation image);
the tree-expansion process is repeated until Δ(b) > D, meaning the tree has been expanded to its maximum depth, or until the uncertainty of node b has fallen below the expected value, i.e. E(b) < 0, so that continued exploration is no longer meaningful for node b; there is a further case in which tree expansion stops the forward exploration, namely when the parent node b of b' no longer has enough sampled sequences (the condition is given as an equation image), where l(b, b') denotes the number of nodes on the path from b to b'; if the sampled sequences of the parent node b are insufficient, continuing to expand b increases the number of child policy trees of b', which may cause overfitting and weaken the effect of regularization at b'; if the above condition is satisfied while expanding the tree, a pruning operation PRUNE(D, b) needs to be performed.
4. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 3, characterized in that the operation PRUNE(D, b) is implemented as follows: after the forward search stops because the parent node b does not have enough sampled sequences, the initial lower bounds are assigned to the corresponding upper bounds in the value calculation of node b, indicating that the uncertainty of the current node b already meets the requirement, i.e.:
U(b)←L(b)
μ(b)←l(b)
BACKUP (D, b) is also performed thereafter.
5. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 4, characterized in that the operation BACKUP(D, b) is implemented as follows: when BACKUP(D, b) needs to be executed, Kernel-DESPOT follows the Bellman rule and updates the values μ(b), l(b) and U(b) of the nodes in the belief tree from bottom to top along the path (the three update equations are given as images in the original), where b' = τ(b, a, z) is a child node of b.
CN202011220903.2A 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment Active CN112356031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011220903.2A CN112356031B (en) 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011220903.2A CN112356031B (en) 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment

Publications (2)

Publication Number Publication Date
CN112356031A CN112356031A (en) 2021-02-12
CN112356031B true CN112356031B (en) 2022-04-01

Family

ID=74514086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011220903.2A Active CN112356031B (en) 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment

Country Status (1)

Country Link
CN (1) CN112356031B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113189985B (en) * 2021-04-16 2022-09-23 南京大学 Partially observable driving planning method based on adaptive particle and belief filling
CN114118441A (en) * 2021-11-24 2022-03-01 福州大学 Online planning method based on efficient search strategy under uncertain environment
CN115338862B (en) * 2022-08-16 2024-05-28 哈尔滨工业大学 Manipulator movement path planning method based on partially observable Markov

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN105930944A (en) * 2016-07-12 2016-09-07 中国人民解放军空军装备研究院雷达与电子对抗研究所 DEC-POMDP-based collaborative optimization decision method and device
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108680155A (en) * 2018-02-01 2018-10-19 苏州大学 The robot optimum path planning method of mahalanobis distance map process is perceived based on part
CN109655066A (en) * 2019-01-25 2019-04-19 南京邮电大学 One kind being based on the unmanned plane paths planning method of Q (λ) algorithm
CN110989602A (en) * 2019-12-12 2020-04-10 齐鲁工业大学 Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103926933A (en) * 2014-03-29 2014-07-16 北京航空航天大学 Indoor simultaneous locating and environment modeling method for unmanned aerial vehicle
US9272418B1 (en) * 2014-09-02 2016-03-01 The Johns Hopkins University System and method for flexible human-machine collaboration
US11400587B2 (en) * 2016-09-15 2022-08-02 Google Llc Deep reinforcement learning for robotic manipulation
CN107292344B (en) * 2017-06-26 2020-09-18 苏州大学 Robot real-time control method based on environment interaction
CN108638054B (en) * 2018-04-08 2021-05-04 河南科技学院 Control method for intelligent explosive disposal robot five-finger dexterous hand
EP3628453A1 (en) * 2018-09-28 2020-04-01 Siemens Aktiengesellschaft A control system and method for a robot
CN109672406B (en) * 2018-12-20 2020-07-07 福州大学 Photovoltaic power generation array fault diagnosis and classification method based on sparse representation and SVM
CN110909605B (en) * 2019-10-24 2022-04-26 西北工业大学 Cross-modal pedestrian re-identification method based on contrast correlation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN105930944A (en) * 2016-07-12 2016-09-07 中国人民解放军空军装备研究院雷达与电子对抗研究所 DEC-POMDP-based collaborative optimization decision method and device
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108680155A (en) * 2018-02-01 2018-10-19 苏州大学 The robot optimum path planning method of mahalanobis distance map process is perceived based on part
CN109655066A (en) * 2019-01-25 2019-04-19 南京邮电大学 One kind being based on the unmanned plane paths planning method of Q (λ) algorithm
CN110989602A (en) * 2019-12-12 2020-04-10 齐鲁工业大学 Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory

Also Published As

Publication number Publication date
CN112356031A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN112356031B (en) On-line planning method based on Kernel sampling strategy under uncertain environment
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Chen et al. POMDP-lite for robust robot planning under uncertainty
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
Wurm et al. Bridging the gap between feature-and grid-based SLAM
CN110083165A (en) A kind of robot paths planning method under complicated narrow environment
CN113110509A (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
Mohanty et al. Cuckoo search algorithm for the mobile robot navigation
Chatterjee et al. A Geese PSO tuned fuzzy supervisor for EKF based solutions of simultaneous localization and mapping (SLAM) problems in mobile robots
Hagras et al. A fuzzy-genetic based embedded-agent approach to learning and control in agricultural autonomous vehicles
Gao et al. Path planning algorithm of robot arm based on improved RRT* and BP neural network algorithm
Jiang et al. iTD3-CLN: Learn to navigate in dynamic scene through Deep Reinforcement Learning
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Riviere et al. Neural tree expansion for multi-robot planning in non-cooperative environments
Tan et al. On-Policy deep reinforcement learning approach to multi agent problems
Wang et al. Robot navigation by waypoints
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Raiesdana A hybrid method for industrial robot navigation
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
Bar et al. Deep Reinforcement Learning Approach with adaptive reward system for robot navigation in Dynamic Environments
Chen et al. Deep reinforcement learning-based robot exploration for constructing map of unknown environment
Castellini et al. Online Monte Carlo Planning for Autonomous Robots: Exploiting Prior Knowledge on Task Similarities.
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Wurm et al. Improved Simultaneous Localization and Mapping using a Dual Representation of the Environment.
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant