CN112356031B - On-line planning method based on Kernel sampling strategy under uncertain environment - Google Patents

On-line planning method based on Kernel sampling strategy under uncertain environment

Info

Publication number
CN112356031B
CN112356031B (application CN202011220903.2A)
Authority
CN
China
Prior art keywords
belief
node
robot
kernel
tree
Prior art date
Legal status
Active
Application number
CN202011220903.2A
Other languages
Chinese (zh)
Other versions
CN112356031A (en)
Inventor
陈彦杰
黄益斌
林依凡
吴铮
何炳蔚
林立雄
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202011220903.2A
Publication of CN112356031A
Application granted
Publication of CN112356031B
Status: Active

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B25: HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00: Programme-controlled manipulators
    • B25J9/16: Programme controls
    • B25J9/1656: Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664: Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides an online planning method based on a Kernel sampling strategy in an uncertain environment, used for planning a robot's actions while it executes a task. In such an environment, the uncertainty, expressed as a POMDP model, is the main factor restricting the reliable operation of the robot. In the POMDP model the robot can observe only part of its own state, and it continuously interacts with the environment to obtain the policy with the maximum return. In the online planning method, partial observability is handled by expressing the robot's state as a belief, denoted belief, defined over the state set; the POMDP algorithm performs a forward search by constructing a belief tree to obtain the optimal policy under the current belief. Each node of the belief tree represents a belief, and a parent node is connected to its child nodes through behavior-observation branches. The POMDP algorithm is the online POMDP planning algorithm Kernel-DESPOT. The performance of the proposed algorithm is superior to that of DESPOT and POMCP, and its convergence speed and quality are also superior.

Description

On-line planning method based on Kernel sampling strategy under uncertain environment
Technical Field
The invention relates to the technical field of robots, in particular to an online planning method based on a Kernel sampling strategy in an uncertain environment.
Background
With the rapid development of information technology, robots have gradually merged into daily life. Motion planning has received much attention as an important field of robot research.
Sampling-based motion planning methods convert a continuous model into a discrete one; traditional sampling-based motion planning algorithms include RRT, PRM, the random potential field method, and others. The RRT algorithm can quickly find a path in complex environments, but suffers from poor path quality, low efficiency in narrow passages, and blind search. PRM and RRT are both probabilistically complete but not optimal; PRM differs from RRT essentially in how random points are placed and searched, but the imperfect placement of discrete points and the collision detection make it time-consuming. The random potential field method overcomes the local-optimum problem that arises in the traditional potential field method when the attractive and repulsive forces balance, but it cannot avoid colliding with obstacles or failing to reach the destination when the object is far from the target point or an obstacle lies near the target point. The above algorithms all sidestep the problem of model uncertainty, yet in real environments the model is often uncertain, and decision-making planning is of great significance for planning under uncertainty; motion planning research based on reinforcement learning is therefore a worthwhile research topic.
Reinforcement learning is an interdisciplinary field spanning robotics, deep learning, control engineering, and other areas. In a reinforcement learning system, the agent obtains an optimal policy by continually interacting with an uncertain environment so as to maximize its long-term reward. Reinforcement learning is therefore an important method for solving optimal decision problems. For a reinforcement learning system, agent and environment uncertainty can generally be expressed as a Markov Decision Process (MDP) or a Partially Observable Markov Decision Process (POMDP). The difference is that in the POMDP model the agent's own state is only partially observable, whereas in the MDP model the state is fully observable.
Early research on MDP problems focused primarily on discrete state and action spaces. In real tasks or continuous state spaces, however, the learning efficiency and applicability of these algorithms still need improvement. Approximate reinforcement learning methods can effectively alleviate these problems, for example value-function approximation, approximate policy iteration, and actor-critic algorithms. Reinforcement learning no longer focuses only on a single long-term reward objective; multi-objective reinforcement learning has also developed greatly, for example using multi-objective learning to improve the flexibility of the longitudinal control of autonomous land vehicles.
The POMDP framework can solve real-world sequential decision problems well. When a robot executes a task, uncertainty is an important factor restricting its reliable operation; this uncertainty includes robot control errors, sensor noise, rapidly changing environments, and so on. Realizing planning in an uncertain environment with large state and observation spaces is therefore an important issue in the robotics field. POMDPs provide a basic framework for planning in an uncertain environment. A basic method of handling partial observability is to represent uncertainty as a belief; more concretely, the planning algorithm performs a forward search in a belief tree, each node of which represents a belief, with parent nodes connected to child nodes through behavior-observation branches. However, exact planning is computationally infeasible in the worst case, and approximate simulation-based POMDP algorithms have been widely used for tasks such as resource management, unmanned driving, navigation, and robotic arms. POMDP planning nevertheless remains computationally difficult because of the "curse of dimensionality" and the "curse of history".
Recent online POMDP algorithms such as DESPOT and POMCP use Monte Carlo methods for belief updates and tree search to deal with the curse of dimensionality. These algorithms represent the belief as a set of sampled states to overcome the computational difficulty caused by a large state space, and DESPOT further achieves fast approximate evaluation of different policies at the current belief by sampling behaviors and observation trajectories. In POMDP tasks with large state spaces, the latest online algorithms can compute a near-optimal policy, and theoretical analysis shows that a small number of sampled sequences suffices to guarantee near-optimal online planning. In practical applications, the sampling strategy plays an important role in such probabilistic sampling algorithms and has a great influence on their performance. Both DESPOT and POMCP construct the state set by random sampling; however, random sampling is blind and cannot represent the convergence direction of the whole belief space well, so an approximately optimal policy computed on such a state set is likely to be only locally near-optimal.
The motion planning of robots has a long history of development, and many different planning methods have been formed, such as methods based on artificial potential fields, mathematical models, graph search, nodes, and sampling. To increase the flexibility and maneuverability of robotic arms, redundant and even hyper-redundant arms with more than 6 degrees of freedom have been developed; the complexity of arm motion planning increases markedly with the number of degrees of freedom, so the motion planning of redundant arms usually has to consider the curse of dimensionality. Sampling-based methods have received widespread attention in the field of robotic arm motion planning because of their significant advantages in high-dimensional spaces.
A Kernel sampling strategy is introduced on top of the basic DESPOT framework, and the Kernel-DESPOT online planning algorithm is proposed. Kernel methods have been widely used to improve the utility of many well-known algorithms, such as perceptron networks, support vector machines, and natural language processing. Although the state or environment information of the robot is only partially observable for an online POMDP planning algorithm, the robot's state can be divided into a partially observable component and an observable component. POMCP uses historical behavior-observation information to guide behavior selection towards better policies; we likewise regard the historical observation information as one of the important components of the robot's currently observable information. The core of the Kernel sampling strategy is to take the observable information components of the robot's current state as important influencing factors for sampling: the correlation between the robot's observable information and the state information in the belief space is calculated through a kernel function and used as the sampling basis. The state set obtained by the Kernel sampling strategy avoids the blindness of random sampling, represents the belief space better, and reduces the probability of local-optimum problems caused by poorly correlated states. DESPOT assigns uniform weights to the sampled states when constructing the belief tree, but as the environment uncertainty gradually decreases the importance of the states to the current robot should differ, so at each sampling step we weight the states according to their degree of correlation.
Disclosure of Invention
The invention provides an online planning method based on a Kernel sampling strategy in an uncertain environment, which overcomes the defects of the existing reinforcement learning motion planning algorithm, has better algorithm performance than DESPOT and POMCP, and has advantages in convergence speed and quality.
The invention adopts the following technical scheme.
An online planning method based on a Kernel sampling strategy in an uncertain environment is used for planning a robot while it executes a task in such an environment; in the uncertain environment, the uncertainty, expressed as a POMDP model, is the main factor restricting the reliable operation of the robot. In the POMDP model the robot can observe only part of its own state, and it continuously interacts with the environment to obtain the policy with the maximum return. In the online planning method, partial observability is handled by expressing the robot's state as a belief, denoted belief, defined over the state set, and the POMDP algorithm performs a forward search by constructing a belief tree so as to obtain the optimal policy under the current belief. Each node of the belief tree represents a belief, and a parent node is connected to its child nodes through behavior-observation branches.
The POMDP algorithm is the online POMDP planning algorithm Kernel-DESPOT, which comprises the following steps.
Step S1: in the current belief space b of the robot, sample K states according to the Kernel sampling strategy to construct a sampling state set Φ_b, and assign a weight to each state;
Step S2: construct a belief tree D with b as the root node using the Kernel-DESPOT algorithm;
Step S3: initialize the upper bound U(b) and lower bound L(b) of the empirical value of the robot's current belief b, and the upper and lower bounds μ(b) and l(b) of the RK-WDU optimal value V*(b);
Step S4: define the uncertainty of the robot's current belief as ε(b) ← μ(b) - l(b);
Step S5: if the uncertainty ε(b) is greater than the target value ε0 and the total running time of the algorithm is less than T_max, expand the root node b0;
Step S6: when the belief tree stops expanding, execute BACKUP(D, b);
after BACKUP(D, b) is completed, the uncertainty ε(b) of the root node is updated, and it is checked again whether the uncertainty is smaller than ε0 or whether the running time is greater than T_max; if either condition is satisfied, the Kernel-DESPOT method returns the value l(b) of b;
Step S7: finally, for the root node b, the algorithm selects an optimal behavior a* that maximizes the lower bound returned by the belief tree, i.e. a* ← argmax_{a∈A} l(b, a);
the value l(b, a*) corresponding to the optimal behavior a* computed by the belief tree is compared with the value L(b) initialized by the default policy π0; if L(b) is larger, the optimal behavior is changed to that of the default policy, i.e. a* ← π0(b);
Step S8: the robot repeats the above process until the target point is finally reached.
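Steps S1-S8 can be summarized in the following Python sketch of one planning step; the object names and interfaces (sampler, tree, default_policy and their methods) are illustrative assumptions made only to show the control flow, not the patent's reference implementation.

```python
import time

def kernel_despot_step(b, sampler, tree, default_policy, epsilon0=0.01, t_max=1.0):
    """One planning step of Kernel-DESPOT (steps S1-S7); step S8 is the outer
    loop that executes the returned behavior and re-plans at the next belief."""
    phi_b = sampler.sample(b)                 # S1: K kernel-weighted states
    root = tree.make_root(b, phi_b)           # S2-S3: root with U(b), L(b), mu(b), l(b)
    start = time.time()
    # S4-S6: expand while the gap eps(b) = mu(b) - l(b) exceeds the target
    while (root.mu - root.l) > epsilon0 and time.time() - start < t_max:
        leaf = tree.explore(root)             # S5: forward exploration from the root
        tree.backup(leaf)                     # S6: Bellman backup along the path
    # S7: behavior with the largest lower bound l(b, a) ...
    a_star, l_star = max(root.action_lower_bounds().items(), key=lambda kv: kv[1])
    # ... unless the default policy's initial lower bound L(b) is larger
    if default_policy.value(b) > l_star:
        a_star = default_policy.action(b)
    return a_star
```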
The specific implementation of step S1 is as follows: the Kernel sampling strategy defines a kernel function K(x, x_i) in terms of the signal variance σ_f², the noise variance σ_n² and a Kronecker delta (the exact expression is given as an equation image in the original), where x represents the observable information of the current robot state, x_i represents the observable information of a state in the belief space, and ||x|| is the norm of x. K(x, x_i) denotes the correlation between x and x_i, so K states highly correlated with the current state information can be sampled according to K(x, x_i). Each node b of the Kernel-DESPOT belief tree contains a set Φ_b, which represents all sequences passing through node b; the initial states of these sequences constitute the sampling state set. For the current belief b, the starting state s_0 of a sequence φ has a weight determined by its kernel correlation (the expression is likewise given as an image in the original), where φ ∈ Φ_b and x_i is the observable information component of the state s_0.
σ_n² is defined as the measurement noise variance, i.e. the variance of the K(x, x_i) values over all states of the belief space in the last sampling period, and σ_f² is defined as the signal variance.
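The exact kernel and weight expressions above appear only as images in the patent, so the following Python sketch should be read as an assumption: it uses a standard squared-exponential kernel with signal variance σ_f², noise variance σ_n² and a Kronecker-delta term, and weights each sampled start state by its normalised kernel value. All function and parameter names (rbf_kernel, kernel_sample, x_of, length_scale) are illustrative.

```python
import numpy as np

def rbf_kernel(x, xi, sigma_f=1.0, sigma_n=0.1, length_scale=1.0):
    """Assumed form of K(x, x_i): a squared-exponential term with signal
    variance sigma_f^2 plus a Kronecker-delta noise term sigma_n^2."""
    x, xi = np.asarray(x, dtype=float), np.asarray(xi, dtype=float)
    sq_dist = np.sum((x - xi) ** 2)
    delta = 1.0 if np.array_equal(x, xi) else 0.0
    return sigma_f ** 2 * np.exp(-sq_dist / (2.0 * length_scale ** 2)) + sigma_n ** 2 * delta

def kernel_sample(belief_states, observable, x_of, K=500, rng=None):
    """Sample K states from the belief space with probability proportional to
    K(x, x_i) and return them with normalised kernel values as weights.
    `x_of(s)` extracts the observable component x_i of a state s."""
    rng = rng or np.random.default_rng()
    scores = np.array([rbf_kernel(observable, x_of(s)) for s in belief_states])
    probs = scores / scores.sum()
    idx = rng.choice(len(belief_states), size=K, p=probs, replace=True)
    weights = scores[idx] / scores[idx].sum()      # weight of each sampled start state
    return [(belief_states[i], w) for i, w in zip(idx, weights)]
```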
The specific implementation of step S2 is as follows: the empirical value is computed in the form given by an equation image in the original, where V_{π,φ} denotes the total discounted reward of each sampled sequence φ ∈ Φ_b computed by simulating the policy π. U(b) is determined by assuming that all states are fully observable, i.e. converting the POMDP problem into an MDP problem, and then calculating the optimal value V_MDP in the MDP environment (the resulting expression is also given as an image in the original).
The lower bound L(b) of the empirical value is calculated by giving a default policy π0: for each sequence in Φ_b of each node b in DESPOT, the default policy is simulated for a limited number of time steps, the total discounted reward of each sequence is computed, and the results are averaged.
The upper and lower bounds μ(b) and l(b) of the RK-WDU can be determined from U(b) and L(b) (two equation images in the original), where γ is the discount factor, Δ(b) is the depth of node b in the tree, π_b is the subtree rooted at b, |π_b| is the size of π_b, and the state weights give an empirical estimate of the probability of reaching b.
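A minimal Python sketch of the default-policy lower bound described above is given below. Weighting the rollout returns by the kernel-derived state weights is an assumption about how Kernel-DESPOT combines the two ideas; `simulate` and `default_policy` are caller-supplied stand-ins for the black-box simulator and the default policy π0.

```python
def default_policy_lower_bound(phi_b, simulate, default_policy, gamma=0.95, horizon=90):
    """Estimate L(b): roll out the default policy for a bounded number of steps
    from each sampled start state and average the discounted returns,
    weighting each rollout by its kernel-derived state weight."""
    total, weight_sum = 0.0, 0.0
    for state, w in phi_b:                    # phi_b as returned by kernel_sample()
        s, ret, discount = state, 0.0, 1.0
        for _ in range(horizon):
            a = default_policy.action(s)
            s, r, done = simulate(s, a)       # one step of the black-box simulator
            ret += discount * r
            discount *= gamma
            if done:
                break
        total += w * ret
        weight_sum += w
    return total / max(weight_sum, 1e-12)
```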
The specific implementation of step S5 is as follows: define b' = τ(b, a, z) as the child node reached when b takes action a and obtains observation z. When performing the tree-expansion operation on node b, the values U(b'), L(b'), μ(b') and l(b') of all child nodes b' of b are first initialized as in step S3; each exploration then aims to reduce the current gap ε(b) at the root node b to the target gap ξ·ε(b), where ξ ∈ (0, 1) is a constant. During exploration, the optimal behavior selection at each node b depends on the upper bound μ(b) of b (the selection rule is given as an equation image in the original).
After executing a*, the child node b' = τ(b, a*, z*) is obtained by selecting the observation z* that maximizes the excess uncertainty (also given as an equation image).
The tree-expansion process is repeated until Δ(b) > D, meaning the tree has been expanded to its maximum depth; or until the uncertainty of node b has fallen below the expected value, i.e. E(b) < 0, so that continued exploration is no longer meaningful for node b. There is a further case in which tree expansion stops the forward exploration, namely when the parent node b of b' no longer has enough sampled sequences (the condition is given as an equation image), where l(b, b') denotes the number of nodes on the path from b to b'. If the sampled sequences of the parent node b are insufficient, continuing to expand b increases the number of child policy trees of b', which may cause overfitting and weaken the effect of regularization at b'. If the above condition is satisfied while expanding the tree, a pruning operation PRUNE(D, b) needs to be performed.
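The exploration rules above (pick the behavior with the largest upper bound, then the observation child with the largest excess uncertainty, and stop on depth, uncertainty, or insufficient sequences) can be sketched in Python as follows. The node interface (expand, action_upper_bound, excess_uncertainty, too_few_sequences) is an illustrative assumption rather than the patent's data structure.

```python
def explore(root, max_depth=90):
    """Forward exploration of step S5: descend along the most promising
    behavior/observation branch until one of the stopping conditions holds,
    and return the node at which the descent stopped."""
    node = root
    while node.depth <= max_depth and node.excess_uncertainty() > 0:
        if not node.expanded:
            node.expand()                      # initialise U, L, mu, l of the children
        # behavior with the largest upper bound mu(b, a)
        a_star = max(node.actions, key=node.action_upper_bound)
        # observation child with the largest excess uncertainty
        node = max(node.children(a_star), key=lambda c: c.excess_uncertainty())
        if node.too_few_sequences():           # parent lacks sampled sequences: PRUNE case
            break
    return node
```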
The operation PRUNE(D, b) is implemented as follows: after the forward search stops because the parent node b does not have enough sampled sequences, the initial lower bounds are assigned to the corresponding upper bounds in the value calculation of node b, indicating that the uncertainty of the current node b already meets the requirement, i.e.:
U(b)←L(b)
μ(b)←l(b)
BACKUP (D, b) is also performed thereafter.
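A correspondingly tiny sketch of PRUNE(D, b), under the same assumed node interface as above:

```python
def prune(node):
    """PRUNE(D, b): collapse the node's bounds so that its uncertainty is
    treated as already satisfied; BACKUP(D, b) is then performed on this node
    (see the BACKUP sketch below)."""
    node.U = node.L
    node.mu = node.l
```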
The operation BACKUP(D, b) is implemented as follows: when BACKUP(D, b) needs to be executed, Kernel-DESPOT follows the Bellman rule and updates the values μ(b), l(b) and U(b) of the nodes in the belief tree from bottom to top along the path (the three update equations are given as images in the original), where b' = τ(b, a, z) is a child node of b.
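The three update equations are given only as images in the patent, so the sketch below uses the standard DESPOT-style Bellman backup (maximise over behaviors the regularised immediate term plus the sum over observation children, floored by the node's initial lower bound) purely as an assumption about their form; rho, reward, l0 and the rest of the node interface are illustrative.

```python
def backup(node):
    """BACKUP(D, b): walk from the given node back to the root, recomputing the
    bounds mu(b), l(b) and U(b) of every expanded node from its children."""
    while node is not None:
        if node.expanded:
            node.mu = max(node.l0, max(node.rho(a) + sum(c.mu for c in node.children(a))
                                       for a in node.actions))
            node.l = max(node.l0, max(node.rho(a) + sum(c.l for c in node.children(a))
                                      for a in node.actions))
            node.U = max(node.reward(a) + sum(c.U for c in node.children(a))
                         for a in node.actions)
        node = node.parent
```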
Compared with the prior art, the invention has the following beneficial effects:
Compared with the latest online POMDP planning algorithms DESPOT and POMCP, the algorithm provided by the invention has the following advantages:
(1) the algorithm proposed by the invention is approximately optimal;
(2) the core of the Kernel sampling strategy is to take the observable information components of the robot's current state as important influencing factors for sampling; the correlation between the robot's observable information and the state information in the belief space is calculated through a kernel function and used as the sampling basis; the state set obtained by the Kernel sampling strategy avoids the blindness of random sampling, represents the belief space better, and reduces the probability of local-optimum problems caused by poorly correlated states;
(3) DESPOT assigns uniform weights to the sampled states when constructing the belief tree, but as the environment uncertainty gradually decreases, the importance of the states to the current robot should differ; therefore, at each sampling step Kernel-DESPOT assigns weights to the states according to their degree of correlation;
(4) the Kernel sampling strategy and the weight assignment each improve the performance of the algorithm.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of the expansion of the belief tree of the present invention (solid circles represent belief nodes and black squares represent belief-behavior nodes; in the tree, the different solid circles represent states sampled according to the sampling strategy, solid triangles of different sizes represent the weights of those states, and the different trajectories represent the sampling sequences corresponding to the sampled states);
FIG. 3 is a schematic diagram of four common POMDP evaluation simulation environments: (a) Tag, in which the robot chases a target that tries to escape; (b) LaserTag, in which the robot chases a target in a grid with randomly scattered obstacles and carries a radar for searching for the target and detecting obstacles; (c) RockSample, in which the robot rover can sense rocks in order to identify and sample the "good" ones; and (d) Pocman, a partially observable variant of the video game of the same name;
FIG. 4 is a schematic diagram showing the effect of different sampling sequences on algorithm performance in the present invention ((a) Tag. (b) Lasertag. (c) RockSample.);
FIG. 5 is a graphical representation of the runtime of the present invention after normalization (average runtime comparison of different task algorithms);
FIG. 6 is a schematic diagram of the normalized convergence performance of the present invention (comparison of convergence degrees of different task algorithms);
FIG. 7 shows the algorithm performance of the present invention in common POMDP evaluation simulation environments (the average discounted rewards of the algorithms on six different tasks);
FIG. 8 shows the effect of the two components of the invention on algorithm performance (Kernel-DESPOT contains two main parts, the kernel-based sampling of the state set and the weight assignment for each sampled state; the figure compares variants that use only one of the two components).
Detailed Description
As shown in the figures, the online planning method based on a Kernel sampling strategy in an uncertain environment is used for planning a robot while it executes a task in such an environment; in the uncertain environment, the uncertainty, expressed as a POMDP model, is the main factor restricting the reliable operation of the robot. In the POMDP model the robot can observe only part of its own state, and it continuously interacts with the environment to obtain the policy with the maximum return.
In the online planning method, partial observability is handled by expressing the robot's state as a belief, denoted belief, defined over the state set, and the POMDP algorithm performs a forward search by constructing a belief tree so as to obtain the optimal policy under the current belief; each node of the belief tree represents a belief, and a parent node is connected to its child nodes through behavior-observation branches.
The POMDP algorithm is the online POMDP planning algorithm Kernel-DESPOT, which comprises the following steps.
Step S1: in the current belief space b of the robot, sample K states according to the Kernel sampling strategy to construct a sampling state set Φ_b, and assign a weight to each state;
Step S2: construct a belief tree D with b as the root node using the Kernel-DESPOT algorithm;
Step S3: initialize the upper bound U(b) and lower bound L(b) of the empirical value of the robot's current belief b, and the upper and lower bounds μ(b) and l(b) of the RK-WDU optimal value V*(b);
Step S4: define the uncertainty of the robot's current belief as ε(b) ← μ(b) - l(b);
Step S5: if the uncertainty ε(b) is greater than the target value ε0 and the total running time of the algorithm is less than T_max, expand the root node b0;
Step S6: when the belief tree stops expanding, execute BACKUP(D, b);
after BACKUP(D, b) is completed, the uncertainty ε(b) of the root node is updated, and it is checked again whether the uncertainty is smaller than ε0 or whether the running time is greater than T_max; if either condition is satisfied, the Kernel-DESPOT method returns the value l(b) of b;
Step S7: finally, for the root node b, the algorithm selects an optimal behavior a* that maximizes the lower bound returned by the belief tree, i.e. a* ← argmax_{a∈A} l(b, a);
the value l(b, a*) corresponding to the optimal behavior a* computed by the belief tree is compared with the value L(b) initialized by the default policy π0; if L(b) is larger, the optimal behavior is changed to that of the default policy, i.e. a* ← π0(b);
Step S8: the robot repeats the above process until the target point is finally reached.
The specific implementation of step S1 is as follows: the Kernel sampling strategy defines a kernel function K(x, x_i) in terms of the signal variance σ_f², the noise variance σ_n² and a Kronecker delta (the exact expression is given as an equation image in the original), where x represents the observable information of the current robot state, x_i represents the observable information of a state in the belief space, and ||x|| is the norm of x. K(x, x_i) denotes the correlation between x and x_i, so K states highly correlated with the current state information can be sampled according to K(x, x_i). Each node b of the Kernel-DESPOT belief tree contains a set Φ_b, which represents all sequences passing through node b; the initial states of these sequences constitute the sampling state set. For the current belief b, the starting state s_0 of a sequence φ has a weight determined by its kernel correlation (the expression is likewise given as an image in the original), where φ ∈ Φ_b and x_i is the observable information component of the state s_0.
σ_n² is defined as the measurement noise variance, i.e. the variance of the K(x, x_i) values over all states of the belief space in the last sampling period, and σ_f² is defined as the signal variance.
The specific implementation of step S2 is as follows: the empirical value is computed in the form given by an equation image in the original, where V_{π,φ} denotes the total discounted reward of each sampled sequence φ ∈ Φ_b computed by simulating the policy π. U(b) is determined by assuming that all states are fully observable, i.e. converting the POMDP problem into an MDP problem, and then calculating the optimal value V_MDP in the MDP environment (the resulting expression is also given as an image in the original).
The lower bound L(b) of the empirical value is calculated by giving a default policy π0: for each sequence in Φ_b of each node b in DESPOT, the default policy is simulated for a limited number of time steps, the total discounted reward of each sequence is computed, and the results are averaged.
The upper and lower bounds μ(b) and l(b) of the RK-WDU can be determined from U(b) and L(b) (two equation images in the original), where γ is the discount factor, Δ(b) is the depth of node b in the tree, π_b is the subtree rooted at b, |π_b| is the size of π_b, and the state weights give an empirical estimate of the probability of reaching b.
The specific implementation of step S5 is as follows: define b' = τ(b, a, z) as the child node reached when b takes action a and obtains observation z. When performing the tree-expansion operation on node b, the values U(b'), L(b'), μ(b') and l(b') of all child nodes b' of b are first initialized as in step S3; each exploration then aims to reduce the current gap ε(b) at the root node b to the target gap ξ·ε(b), where ξ ∈ (0, 1) is a constant. During exploration, the optimal behavior selection at each node b depends on the upper bound μ(b) of b (the selection rule is given as an equation image in the original).
After executing a*, the child node b' = τ(b, a*, z*) is obtained by selecting the observation z* that maximizes the excess uncertainty (also given as an equation image).
The tree-expansion process is repeated until Δ(b) > D, meaning the tree has been expanded to its maximum depth; or until the uncertainty of node b has fallen below the expected value, i.e. E(b) < 0, so that continued exploration is no longer meaningful for node b. There is a further case in which tree expansion stops the forward exploration, namely when the parent node b of b' no longer has enough sampled sequences (the condition is given as an equation image), where l(b, b') denotes the number of nodes on the path from b to b'. If the sampled sequences of the parent node b are insufficient, continuing to expand b increases the number of child policy trees of b', which may cause overfitting and weaken the effect of regularization at b'. If the above condition is satisfied while expanding the tree, a pruning operation PRUNE(D, b) needs to be performed.
The operation PRUNE(D, b) is implemented as follows: after the forward search stops because the parent node b does not have enough sampled sequences, the initial lower bounds are assigned to the corresponding upper bounds in the value calculation of node b, indicating that the uncertainty of the current node b already meets the requirement, i.e.:
U(b)←L(b)
μ(b)←l(b)
BACKUP (D, b) is also performed thereafter.
The operation BACKUP(D, b) is implemented as follows: when BACKUP(D, b) needs to be executed, Kernel-DESPOT follows the Bellman rule and updates the values μ(b), l(b) and U(b) of the nodes in the belief tree from bottom to top along the path (the three update equations are given as images in the original), where b' = τ(b, a, z) is a child node of b.
Example:
In this embodiment, the complete pseudo code of the algorithm is given as a set of images in the original patent.
The embodiment of the invention is explained in detail through specific experiments. The invention provides a planning method based on a Kernel sampling strategy in an uncertain environment; since standard POMDP simulation evaluation environments exist for POMDP planning problems, the performance of the Kernel-DESPOT algorithm is mainly tested in common simulation evaluation environments and compared with the latest POMDP algorithms. The specific experimental setup is as follows:
simulation experiment:
the simulation experiment was performed in the Ubuntu system.
We evaluated the performance of the Kernel-DESPOT algorithm (Table 1) on six simulation tasks, which are common benchmarks for evaluating POMDP algorithms. The number of sampled sequences is set to K = 500 in the Tag, LaserTag and RockSample environments and to K = 100 in the Pocman environment, and the calculation methods for the initial upper and lower bounds are chosen heuristically according to the specific requirements of each environment. All algorithms were simulated on a unified platform, and the maximum running time of the online POMDP algorithms was set to 1 second. Kernel-DESPOT consists essentially of two parts: kernel-based correlation sampling and weight reassignment of the sampling state set; Table 2 shows the effect of these two parts on algorithm performance separately. To further illustrate the performance of the Kernel-DESPOT algorithm, we consider three further aspects: FIG. 4 shows the influence of different numbers of sampled sequences on algorithm performance, FIG. 5 shows the normalized running times, and FIG. 6 shows the normalized convergence performance; together they explain the advantages of Kernel-DESPOT.
From the kernel function definition in step S1 it can be seen that the variable x, i.e. the information available about the current robot state, is the key to computing the kernel function. For each POMDP environment the information available about the robot's current state differs, so the variable x depends on the characteristics of the environment. In every case, however, we convert the information available from the robot's current state into a positional relationship, so x is in fact a position vector, which is then used for correlation calculations with the states in the belief space.
(1) Simulation environment one: TAG
Tag is a standard POMDP benchmark proposed by Pineau et al. in 2003. One robot and one target move among 29 possible grid positions (FIG. 3a). The goal of the robot is to find and catch a target that always tries to escape from it. Their initial positions are random; the robot knows its own position but observes the target only when they are in the same position. The robot may stay still or move to one of four adjacent positions, each move costing a reward of -1. In addition, the robot can perform a "tag" action: it obtains a reward of +10 if it is in the same location as the target, otherwise the reward is -10.
In the Tag environment, since the robot's observation contains only its own position information, we consider the role of the historical observation information when constructing the vector x. The current robot position and the robot positions in the historical observations form a relative position vector x. As historical observations accumulate, only the history within three time steps is considered in order to reduce the amount of computation; if the current robot position deviates greatly from the position vector x constructed from that three-step history, the information closest in time takes precedence. This mainly prevents the robot from repeatedly searching the same area within a short time.
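A small Python sketch of this construction is shown below; the three-step window and the deviation threshold used to fall back to the most recent observation are illustrative values, and the function name and inputs are assumptions.

```python
def tag_position_vector(current_pos, history, window=3, max_dev=5.0):
    """Tag: build the relative position vector x from the current robot
    position and the robot positions observed over the last `window` steps;
    if that history deviates too much, use only the most recent observation."""
    recent = list(history)[-window:]          # history of past (x, y) robot positions
    if not recent:
        return (0.0, 0.0)
    avg = (sum(p[0] for p in recent) / len(recent),
           sum(p[1] for p in recent) / len(recent))
    x = (current_pos[0] - avg[0], current_pos[1] - avg[1])
    if (x[0] ** 2 + x[1] ** 2) ** 0.5 > max_dev:   # large deviation from the history
        last = recent[-1]
        x = (current_pos[0] - last[0], current_pos[1] - last[1])
    return x
```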
(2) Simulation scene two: LaserTag
LaserTag is an extended version of the Tag environment, characterized by a large observation space. In the LaserTag environment the robot moves in a 7 × 11 rectangular grid in which 8 obstacles are randomly placed (FIG. 3b). The robot and the target behave as in Tag, except that the robot does not know its own position exactly; at the start, its position information is uniformly distributed over the grid. To localize itself, the robot carries a radar that measures distances in 8 directions. The radar reading in each direction follows a normal distribution centered at the true distance from the robot to the nearest obstacle in that direction, with a standard deviation of 2.5.
In the LaserTag environment we construct the vector x from the radar observations. The robot can measure the closest distance to an obstacle or to the target in the eight directions by radar, but that closest distance is only observed through the normal distribution above. So when the target appears in one of the eight directions observed by the robot's radar, we construct a relative position vector x using the closest distance in the direction most likely to contain the target. The similarity between the belief-space states and the current robot state is then calculated with the kernel function to construct the sampling state set. Even though the robot does not know its current position and its observations are noisy, we can still approximate a relative position vector x between the target and the robot. Although the target position estimated from the observation differs from the true target position, it provides important approximate direction information about the target, and so prevents sampling states that deviate greatly from that direction.
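The following sketch illustrates this choice of direction; the inputs (an estimated per-direction probability of containing the target and the corresponding radar readings) are an assumed encoding, not the patent's data structures.

```python
import numpy as np

# Unit offsets for the eight radar directions (N, NE, E, SE, S, SW, W, NW).
DIRS = [(0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1), (-1, 0), (-1, 1)]

def lasertag_position_vector(target_probs, distances):
    """LaserTag: pick the direction with the highest probability of containing
    the target and scale its unit vector by the radar reading to obtain an
    approximate relative position vector x."""
    i = int(np.argmax(target_probs))
    dx, dy = DIRS[i]
    norm = (dx * dx + dy * dy) ** 0.5
    return (distances[i] * dx / norm, distances[i] * dy / norm)
```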
(3) A third simulation environment: RockSample
RockSample (Smith & Simmons, 2004) is a well-established benchmark with a large state space (FIG. 3c). RS(n, k) denotes a robotic rover searching an n × n grid containing k rocks, each of which may be "good" or "bad". The purpose of the robot is to search for these rocks, sample the good ones, and exit from the far right of the grid when the search is complete. At each step the robot may move to an adjacent grid cell, sense the quality of a rock, or sample a rock. When the sampling action is performed, a reward of +10 is obtained if the rock is good, otherwise the reward is -10. Moving and sensing carry no reward; moving and sampling produce no observation, and only the sensing action produces an observation, "good" or "bad", whose accuracy decreases exponentially with the distance between the robot and the rock. To obtain a higher reward, the robot navigates through the grid and senses the quality of the rocks, while using the information obtained to visit and sample the good ones.
In the RockSample environment, the idea behind constructing the position vector x is simple: if the last behavior was a sensing action and the observation obtained was "good", the position vector x is constructed from the position of that rock and the current robot position, and states highly correlated with x are then sampled in the belief space. The position vector remains unchanged until the next sensing action is performed. If the robot performs sensing actions repeatedly and the observations obtained are all "good", the position vector x is constructed from the observation with the highest accuracy. The core idea of this method is that the robot trusts its observations and uses the observed information to guide sampling; states with low similarity to the observed information are regarded as useless states that contribute nothing positive to policy computation and can cause local-optimum problems.
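A sketch of this rule follows; representing the highest-accuracy observation by the nearest rock sensed as good relies on the statement that sensing accuracy decays exponentially with distance, and the inputs are illustrative.

```python
def rocksample_position_vector(robot_pos, sensed_good_rocks):
    """RockSample: point x from the robot towards the rock most recently (or
    most reliably) sensed as 'good'; with several candidates, trust the
    nearest one, whose sensing reading is the most accurate."""
    if not sensed_good_rocks:                  # no 'good' observation yet
        return (0, 0)
    nearest = min(sensed_good_rocks,
                  key=lambda r: abs(r[0] - robot_pos[0]) + abs(r[1] - robot_pos[1]))
    return (nearest[0] - robot_pos[0], nearest[1] - robot_pos[1])
```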
(4) And (4) simulating an environment: pocman
Pocman (Silver & Veness, 2010) is a partially observable variant of the popular video game of the same name (FIG. 3d). The agent and four ghosts move in a 17 × 19 maze with food distributed throughout it. Each move of the agent incurs a penalty of -1, each piece of food gives a reward of +10, and if the agent is caught by a ghost the game terminates with a penalty of -100. In addition, there are four power foods in the maze; within 15 steps of eating a power food the agent can catch ghosts, and it obtains a reward of +25 for each ghost it successfully catches. A ghost will pursue the agent when the Manhattan distance between them is within 5, but if the agent has eaten a power food the ghosts move away from the agent. The agent does not know the exact locations of the ghosts, but it can obtain the following information: whether there is a ghost in each of its principal directions, whether a ghost is within Manhattan distance 2, whether there is a wall in each principal direction, and whether there is food in an adjacent or diagonally adjacent position. Pocman has a very large state space, close to 10^56 states.
In the Pocman environment the construction of the position vector x is easy, since the agent knows whether there is food in an adjacent or diagonally adjacent position. If there is, the vector x is constructed from the position of that food and the current agent position, with adjacent positions taking priority over diagonal ones. This method effectively avoids the blind wandering of the agent caused by random sampling.
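A minimal sketch of this rule, assuming the observation is encoded as two boolean lists aligned with the offset tables below (an illustrative encoding):

```python
ADJACENT = [(0, 1), (1, 0), (0, -1), (-1, 0)]
DIAGONAL = [(1, 1), (1, -1), (-1, 1), (-1, -1)]

def pocman_position_vector(food_adjacent, food_diagonal):
    """Pocman: point x at the first reported food cell, preferring the four
    adjacent cells over the four diagonally adjacent ones."""
    for offsets, flags in ((ADJACENT, food_adjacent), (DIAGONAL, food_diagonal)):
        for off, has_food in zip(offsets, flags):
            if has_food:
                return off
    return (0, 0)
```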
(5) Comparison of the contributions of the Kernel sampling strategy and state reweighting to algorithm performance. The main contribution of Kernel-DESPOT is the kernel-function-based sampling strategy and the weight assignment of the sampling state set. To better illustrate the impact of these two components on algorithm performance, we tested the average discounted reward in six environments according to the controlled-variable method. The larger the observation space of a POMDP environment, the larger the corresponding belief tree, and the more obvious the overfitting caused by random sampling, which manifests as the robot easily falling into local optima. Kernel-DESPOT uses the similarity between the belief-space states and the current state as the sampling basis to construct the sampling state set, and assigns state weights according to this similarity; we expect this to mitigate the local-optimum problem caused by overfitting. As can be seen from Table 2, in the Tag, LaserTag and Pocman environments with larger observation spaces, both sampling only according to the kernel function and random sampling combined with weight assignment of the sampling state set significantly improve the performance of the algorithm, and the superiority of Kernel-DESPOT becomes more pronounced as the number of states increases in the RockSample environment.
(6) Comparison of Kernel-DESPOT running time and convergence performance
For each node b, assume that the initial upper bound U_0(b) approximates the true upper bound with an error δ (the precise condition is given as an equation image in the original); if the error δ of the initial upper bound is 0, then U_0(b) is the true upper bound.
Let T_max be finite, and let the real-time Kernel-DESPOT algorithm construct a partial belief tree whose root node b0 has a gap ε(b0) between its upper and lower bounds. Then the regularized optimized policy derived from this partial tree satisfies a guarantee relating its value to ν*(b0) (the inequality is given as an equation image in the original), where ν*(b0) is the value of the regularized optimal policy computed from the complete Kernel-DESPOT belief tree. ε(b0) decreases monotonically as T_max increases, so the computed policy gradually approaches the optimal regularized policy as more time is available, and the error of the initial upper bound affects the final result by at most δ.
The sampling state set obtained by Kernel-DESPOT is more consistent with the convergence direction of the belief space, so that ε(b0) decreases more rapidly within the limited time and the computed policy converges to the optimal policy more quickly; Kernel-DESPOT is therefore faster in overall computation time. FIG. 5 compares the running times of the different algorithms on the same tasks; the ordinate represents the proportion of the running time of each algorithm on the same task after normalization. Overall, the computation speed of the Kernel-DESPOT algorithm is advantageous. Since Kernel-DESPOT is an online POMDP planning algorithm whose maximum computation time per time step is 1 s, it can hardly reduce the uncertainty of the current belief to 0 within the limited time, i.e. ε(b0) is generally greater than 0. Because ε(b0) reflects the convergence of the algorithm, we use the average ε(b0) value per time step as the measure of algorithm convergence performance, shown in FIG. 6, which compares the convergence of the algorithms on different tasks; the ordinate represents the normalized proportion of the average ε(b0) of the different algorithms on the same task.

Claims (5)

1. An on-line planning method based on a Kernel sampling strategy under an uncertain environment is used for planning a robot when a task is executed under the uncertain environment, and is characterized in that: in the uncertainty environment, the uncertainty expressed as the POMDP model is the main factor that restricts the reliable operation of the robot; in the POMDP model, the robot can observe partial states of the robot, and the robot continuously interacts with the environment to obtain a strategy with the maximum return;
in the online planning method, when an observable part is processed, the state of the robot is expressed as a belief, the belief is recorded as belief, the belief belongs to a state set, and forward search is executed by a POMDP algorithm in a mode of constructing a belief tree so as to obtain an optimal strategy under the current belief; each node of the belief tree represents a belief, and a father node is connected with a child node through a behavior-observation branch;
the POMDP algorithm is an online POMDP planning algorithm Kernel-DESPOT and comprises the following steps;
Step S1: in the current belief space b of the robot, sample K states according to the Kernel sampling strategy to construct a sampling state set Φ_b, and assign a weight to each state;
Step S2: construct a belief tree D with b as the root node using the Kernel-DESPOT algorithm;
Step S3: initialize the upper bound U(b) and lower bound L(b) of the empirical value of the robot's current belief b, and the upper and lower bounds μ(b) and l(b) of the RK-WDU optimal value V*(b);
Step S4: define the uncertainty of the robot's current belief as ε(b) ← μ(b) - l(b);
Step S5: if the uncertainty ε(b) is greater than the target value ε0 and the total running time of the algorithm is less than T_max, expand the root node b0;
Step S6: when the belief tree stops expanding, execute BACKUP(D, b);
after BACKUP(D, b) is completed, the uncertainty ε(b) of the root node is updated, and it is checked again whether the uncertainty is smaller than ε0 or whether the running time is greater than T_max; if either condition is satisfied, the Kernel-DESPOT algorithm returns the value l(b) of b;
Step S7: finally, for the root node b, the algorithm selects an optimal behavior a* that maximizes the lower bound returned by the belief tree, i.e. a* ← argmax_{a∈A} l(b, a);
the value l(b, a*) corresponding to the optimal behavior a* computed by the belief tree is compared with the value L(b) initialized by the default policy π0; if L(b) is larger, the optimal behavior is changed to that of the default policy, i.e. a* ← π0(b);
Step S8: the robot repeats the above steps until the target point is finally reached;
the specific implementation manner of step S1 is: kernel sampling strategy Kernel function definition
Figure FDA0003472405240000021
Wherein the content of the first and second substances,
Figure FDA0003472405240000022
what is represented in the kernel function is the transpose of the vector; x represents the observable information of the current robot state, xiRepresenting observable information of the state in the belief space, | | x | | is the norm of x,
Figure FDA0003472405240000023
is a kronecker symbol; k (x, x)i) Denotes x and xiSo that it can be based on K (x, x)i) Sampling K states highly correlated with the current state information; each node b of the Kernel-DESPOT belief tree contains a set phibThe set represents all sequences passing through node b; the initial state of each sequence constitutes a sampling state set; for the current belief b, the starting state s of the sequence phi0Has a weight of
Figure FDA0003472405240000024
Wherein φ e φb,xiIs a state s0The observable fraction of information;
definition of σn 2To measure the noise variance, K (x, x) representing all states of the belief space in the last sample periodi) Variance of the values; definition of σf 2Is the signal variance.
2. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 1, characterized in that step S3 is implemented as follows: the empirical value is computed in the form given by an equation image in the original, where V_{π,φ} denotes the total discounted reward of each sampled sequence φ ∈ Φ_b computed by simulating the policy π; the resulting quantity expresses the empirical value obtained by executing the policy π at the current belief b; U(b) is determined by assuming that all states are fully observable, i.e. converting the POMDP problem into an MDP problem, and then calculating the optimal value V_MDP in the MDP environment (the resulting expression is also given as an image in the original);
the lower bound of the empirical value is calculated by giving a default policy π0: for each sequence in Φ_b of each node b in DESPOT, the default policy is simulated for a limited number of time steps, the total discounted reward of each sequence is computed, and the results are averaged, where s_φ is the starting state s_0 of the sequence φ;
the upper and lower bounds μ(b) and l(b) of the RK-WDU can be determined from U(b) and L(b) by formulas four and five (given as images in the original), where γ is the discount factor, Δ(b) is the depth of node b in the tree, the state weights give an empirical estimate of the probability of reaching b, λ > 0 is the regularization term, which effectively avoids the overfitting problem during belief-tree construction, and the remaining factor is the kernel-weight discount effect; formulas four and five embody the RK-WDU idea in the algorithm.
3. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 2, characterized in that step S5 is implemented as follows: define b' = τ(b, a, z) as the child node reached when b takes action a and obtains observation z; when performing the tree-expansion operation on node b, the values U(b'), L(b'), μ(b') and l(b') of all child nodes b' of b are first initialized as in step S3; each exploration then aims to reduce the current gap ε(b) at the root node b to the target gap ξ·ε(b), where ξ ∈ (0, 1) is a constant;
during exploration, the optimal behavior selection at each node b depends on the upper bound μ(b) of b (the selection rule is given as an equation image in the original), in which ρ(b, a) represents the construction process that introduces the regularization term into the belief tree; after executing a*, the child node b' = τ(b, a*, z*) is obtained by selecting the observation z* that maximizes the excess uncertainty (also given as an equation image);
the tree-expansion process is repeated until Δ(b) > D, meaning the tree has been expanded to its maximum depth, or until the uncertainty of node b has fallen below the expected value, i.e. E(b) < 0, so that continued exploration is no longer meaningful for node b; there is a further case in which tree expansion stops the forward exploration, namely when the parent node b of b' no longer has enough sampled sequences (the condition is given as an equation image), where l(b, b') denotes the number of nodes on the path from b to b'; if the sampled sequences of the parent node b are insufficient, continuing to expand b increases the number of child policy trees of b', which may cause overfitting and weaken the effect of regularization at b'; if the above condition is satisfied while expanding the tree, a pruning operation PRUNE(D, b) needs to be performed.
4. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 3, characterized in that the operation PRUNE(D, b) is implemented as follows: after the forward search stops because the parent node b does not have enough sampled sequences, the initial lower bounds are assigned to the corresponding upper bounds in the value calculation of node b, indicating that the uncertainty of the current node b already meets the requirement, i.e.:
U(b)←L(b)
μ(b)←l(b)
BACKUP (D, b) is also performed thereafter.
5. The on-line planning method based on the Kernel sampling strategy in an uncertain environment according to claim 4, characterized in that the operation BACKUP(D, b) is implemented as follows: when BACKUP(D, b) needs to be executed, Kernel-DESPOT follows the Bellman rule and updates the values μ(b), l(b) and U(b) of the nodes in the belief tree from bottom to top along the path (the three update equations are given as images in the original), where b' = τ(b, a, z) is a child node of b.
CN202011220903.2A 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment Active CN112356031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011220903.2A CN112356031B (en) 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011220903.2A CN112356031B (en) 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment

Publications (2)

Publication Number Publication Date
CN112356031A CN112356031A (en) 2021-02-12
CN112356031B true CN112356031B (en) 2022-04-01

Family

ID=74514086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011220903.2A Active CN112356031B (en) 2020-11-11 2020-11-11 On-line planning method based on Kernel sampling strategy under uncertain environment

Country Status (1)

Country Link
CN (1) CN112356031B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113189985B (en) * 2021-04-16 2022-09-23 南京大学 Partially observable driving planning method based on adaptive particle and belief filling
CN114118441A (en) * 2021-11-24 2022-03-01 福州大学 Online planning method based on efficient search strategy under uncertain environment
CN115338862B (en) * 2022-08-16 2024-05-28 哈尔滨工业大学 Manipulator movement path planning method based on partially observable Markov

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN105930944A (en) * 2016-07-12 2016-09-07 中国人民解放军空军装备研究院雷达与电子对抗研究所 DEC-POMDP-based collaborative optimization decision method and device
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108680155A (en) * 2018-02-01 2018-10-19 苏州大学 The robot optimum path planning method of mahalanobis distance map process is perceived based on part
CN109655066A (en) * 2019-01-25 2019-04-19 南京邮电大学 One kind being based on the unmanned plane paths planning method of Q (λ) algorithm
CN110989602A (en) * 2019-12-12 2020-04-10 齐鲁工业大学 Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103926933A (en) * 2014-03-29 2014-07-16 北京航空航天大学 Indoor simultaneous locating and environment modeling method for unmanned aerial vehicle
US9272418B1 (en) * 2014-09-02 2016-03-01 The Johns Hopkins University System and method for flexible human-machine collaboration
US11400587B2 (en) * 2016-09-15 2022-08-02 Google Llc Deep reinforcement learning for robotic manipulation
CN107292344B (en) * 2017-06-26 2020-09-18 苏州大学 Robot real-time control method based on environment interaction
CN108638054B (en) * 2018-04-08 2021-05-04 河南科技学院 Control method for intelligent explosive disposal robot five-finger dexterous hand
EP3628453A1 (en) * 2018-09-28 2020-04-01 Siemens Aktiengesellschaft A control system and method for a robot
CN109672406B (en) * 2018-12-20 2020-07-07 福州大学 Photovoltaic power generation array fault diagnosis and classification method based on sparse representation and SVM
CN110909605B (en) * 2019-10-24 2022-04-26 西北工业大学 Cross-modal pedestrian re-identification method based on contrast correlation

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102929281A (en) * 2012-11-05 2013-02-13 西南科技大学 Robot k-nearest-neighbor (kNN) path planning method under incomplete perception environment
CN105930944A (en) * 2016-07-12 2016-09-07 中国人民解放军空军装备研究院雷达与电子对抗研究所 DEC-POMDP-based collaborative optimization decision method and device
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN108282587A (en) * 2018-01-19 2018-07-13 重庆邮电大学 Mobile customer service dialogue management method under being oriented to strategy based on status tracking
CN108680155A (en) * 2018-02-01 2018-10-19 苏州大学 The robot optimum path planning method of mahalanobis distance map process is perceived based on part
CN109655066A (en) * 2019-01-25 2019-04-19 南京邮电大学 One kind being based on the unmanned plane paths planning method of Q (λ) algorithm
CN110989602A (en) * 2019-12-12 2020-04-10 齐鲁工业大学 Method and system for planning paths of autonomous guided vehicle in medical pathological examination laboratory

Also Published As

Publication number Publication date
CN112356031A (en) 2021-02-12

Similar Documents

Publication Publication Date Title
CN112356031B (en) On-line planning method based on Kernel sampling strategy under uncertain environment
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Chen et al. POMDP-lite for robust robot planning under uncertainty
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
Wurm et al. Bridging the gap between feature-and grid-based SLAM
CN110083165A (en) A kind of robot paths planning method under complicated narrow environment
CN113110509A (en) Warehousing system multi-robot path planning method based on deep reinforcement learning
Mohanty et al. Cuckoo search algorithm for the mobile robot navigation
Chatterjee et al. A Geese PSO tuned fuzzy supervisor for EKF based solutions of simultaneous localization and mapping (SLAM) problems in mobile robots
Hagras et al. A fuzzy-genetic based embedded-agent approach to learning and control in agricultural autonomous vehicles
Gao et al. Path planning algorithm of robot arm based on improved RRT* and BP neural network algorithm
Jiang et al. iTD3-CLN: Learn to navigate in dynamic scene through Deep Reinforcement Learning
CN117590867B (en) Underwater autonomous vehicle connection control method and system based on deep reinforcement learning
Riviere et al. Neural tree expansion for multi-robot planning in non-cooperative environments
Tan et al. On-Policy deep reinforcement learning approach to multi agent problems
Wang et al. Robot navigation by waypoints
CN114118441A (en) Online planning method based on efficient search strategy under uncertain environment
Raiesdana A hybrid method for industrial robot navigation
CN116430891A (en) Deep reinforcement learning method oriented to multi-agent path planning environment
Bar et al. Deep Reinforcement Learning Approach with adaptive reward system for robot navigation in Dynamic Environments
Chen et al. Deep reinforcement learning-based robot exploration for constructing map of unknown environment
Castellini et al. Online Monte Carlo Planning for Autonomous Robots: Exploiting Prior Knowledge on Task Similarities.
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Wurm et al. Improved Simultaneous Localization and Mapping using a Dual Representation of the Environment.
Toan et al. Environment exploration for mapless navigation based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant