CN110378439B - Single robot path planning method based on Q-Learning algorithm - Google Patents


Info

Publication number
CN110378439B
CN110378439B (application CN201910737476.6A)
Authority
CN
China
Prior art keywords
action
path
state parameter
updating
epsilon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910737476.6A
Other languages
Chinese (zh)
Other versions
CN110378439A (en)
Inventor
Li Bo (李波)
Yi Jie (易洁)
Liang Hongbin (梁宏斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201910737476.6A priority Critical patent/CN110378439B/en
Publication of CN110378439A publication Critical patent/CN110378439A/en
Application granted granted Critical
Publication of CN110378439B publication Critical patent/CN110378439B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047Optimisation of routes or paths, e.g. travelling salesman problem

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the technical field of robot path planning, in particular to a single robot path planning method based on the Q-Learning algorithm, which comprises the following steps: initializing the parameters of the algorithm; selecting an action instruction, and calculating the running state parameter and the reward function generated by that instruction; if the running state parameter is equal to the termination state parameter and equal to the target state parameter, storing the successful path into a successful path storage table; otherwise, when the update start time is less than or equal to the current time and the access count of the state-action pair is equal to the maximum count threshold, updating the action value function and storing the running state parameter into the successful path; repeating these steps until the maximum number of iterations is reached; and finally repeating action instruction selection and state parameter generation according to the action value function to obtain the optimal path of the single robot. When the Q-Learning algorithm is used for single robot path planning, the method improves both the update and learning speed of the learning system and the path planning effect.

Description

Single robot path planning method based on Q-Learning algorithm
Technical Field
The invention relates to the technical field of robot path planning, in particular to a single robot path planning method based on a Q-Learning algorithm.
Background
Mobile robots have a wide range of applications and can be found in fields such as the home, agriculture, industry and the military. The three core topics in the research on controlling robot motion are robot positioning, task assignment, and path planning. Path planning is a precondition for a mobile robot to reach its task target and complete the task content. For example, a household cleaning robot needs reasonable path planning in the indoor environment to complete the cleaning task; an agricultural picking robot needs path planning to pass through the crops and complete the picking task; and industrial robots likewise require path planning to accomplish a given task in a shared workspace.
Single robot systems are widely applied in home service, agricultural assistance, industrial environments, military assistance and similar areas. In these applications, path planning for the single robot system is particularly important: it refers to finding a path in the working environment from the initial state to the target state that avoids all obstacles. This requires machine learning, and in the related art the most common learning method is reinforcement learning.
The Q-Learning algorithm is an important reinforcement Learning algorithm, and in the related art it has been applied to path planning for robot systems. The Learning process of the Q-Learning algorithm is an iterative process that gradually updates the Q-value table (the action value function) through constant trial and error and action selection. Namely: a reward function is set; the robot selects an action instruction according to an ε-greedy strategy (epsilon-greedy, where ε is the exploration factor, 0 ≤ ε ≤ 1), executes the action instruction, updates the Q-value table according to the reward function, and then generates a state parameter; the next action is selected according to the state parameter and the ε-greedy strategy, and action execution and Q-value table updating continue until the final Q-value table is obtained; the optimal path is then read off from the Q-value table.
The existing Q-Learning algorithm has to select actions and update the Q value continuously to improve action selection, that is, the Q value is updated every time the environment is explored, which makes the update and learning speed of the learning system slow. In the related art, to maintain the update and learning speed, the value of the exploration factor ε is continuously reduced as the training time of the algorithm increases, i.e. the currently best action is performed more often so that the algorithm converges quickly to a corresponding solution. The cost of this approach is that the learning system, having explored the environment insufficiently (the value of the exploration factor ε is too small), misses the optimal solution and can only converge to a suboptimal solution, or more likely an ordinary solution, and this defect degrades the path planning effect.
Disclosure of Invention
Aiming at the defects of the prior art, the technical problems to be solved by the invention are as follows: when the Q-Learning algorithm is used for path planning of a single robot, the updating Learning speed and the path planning effect of the Learning system are improved better.
In order to solve the technical problems, the invention adopts the following technical scheme:
the single robot path planning method based on the Q-Learning algorithm comprises the following steps:
s1: initializing an exploration factor, the maximum iteration number, a termination state parameter, a target state parameter, a maximum counting threshold value, an update starting time, an iteration number, a current time, an action value function, the access number of a state-action pair, a successful path and a successful path storage table of the single robot system;
s2: judging whether the iteration times are larger than the maximum iteration times, if so: step S6 is executed; if not: initializing the current state parameters and executing the next step;
s3: generating a random number, comparing the random number with the exploration factor, and selecting an action instruction; according to the action instruction, calculating the running state parameter and the reward function generated after the robot executes the action instruction; then judging whether the running state parameter is equal to the termination state parameter; if so: continuing to judge whether the running state parameter is equal to the target state parameter; if it is, storing the successful path into the successful path storage table, incrementing the iteration count by one, and returning to step S2; otherwise, incrementing the iteration count by one and returning to step S2; if not: executing the next step;
s4: judging whether the update start time is less than or equal to the current time; if so: storing the reward function, incrementing the access count of the state-action pair by one, and executing the next step; if not: judging whether the access count of the state-action pair is equal to the maximum count threshold; if so, updating the action value function and executing the next step; if not, executing the next step;
s5: storing the running state parameter into the successful path, incrementing the current time by one, and returning to step S3;
s6: acquiring the action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and then repeating the following: executing the action instruction to generate a state parameter and selecting the next action instruction according to that state parameter; when the generated state parameter is equal to the preset target state parameter, the optimal path of the single robot system is obtained.
In this way, whether to update the action value function (the Q value) is decided by setting a maximum count threshold in the Q-Learning algorithm and comparing the number of accesses of each state-action pair with that threshold; namely, the action value function is updated only when the access count of a state-action pair reaches the maximum count threshold. First, this scheme does not reduce exploration of the environment, so the path planning effect is preserved. Second, it reduces the amount of computation and greatly improves the update and learning speed. Third, this way of updating the action value function has a multi-step prediction capability, since it takes into account the influence of several future state-action pairs on the action value function, so the learned control strategy is more reasonable. Finally, because the access count of the state-action pair is used as the basis for updating the action value function, the preceding steps are not affected while the update and learning speed is maintained; that is, the value of the exploration factor does not need to be reduced, which avoids the problem of the learning system missing the optimal solution through insufficient exploration of the environment.
Therefore, when the Q-Learning algorithm is used for path planning of the single robot, the updating Learning speed and the path planning effect of the Learning system can be better improved.
Preferably, in step S4, the action value function is updated according to the formula Q(s, a) = U(s, a)/h, where Q(s, a) is the action value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
Therefore, the updated action value function is the average of all the stored rewards: on the one hand, the averaging reduces the amount of computation and improves the accuracy of the action value function; on the other hand, this way of computing no longer requires an eligibility trace matrix, which further reduces the computational complexity.
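A minimal Python sketch of this averaged update (the dictionary-based tables and the function name are illustrative assumptions rather than identifiers from the patent):

```python
from collections import defaultdict

# Q value table, stored-reward table U(s, a) and access counter C(s, a)
Q = defaultdict(float)
U = defaultdict(float)
C = defaultdict(int)

def record_and_maybe_update(state, action, reward_estimate, h):
    """Accumulate a reward estimate for (state, action); rewrite Q only after h visits."""
    U[(state, action)] += reward_estimate              # store the reward into U(s, a)
    C[(state, action)] += 1                            # one more access of the state-action pair
    if C[(state, action)] == h:                        # access count reached the maximum count threshold h
        Q[(state, action)] = U[(state, action)] / h    # Q(s, a) = U(s, a) / h, the average of h stored values
        U[(state, action)] = 0.0                       # clear the stored rewards
        C[(state, action)] = 0                         # clear the access count
```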
Preferably, in step S3, if the running state parameter is equal to the termination state parameter and equal to the target state parameter, a preset successful path-finding count is incremented by one before step S2 is executed.
Therefore, recording and updating the successful path-finding count feeds back the learning effect of the learning system in time and can assist in deciding whether to update the exploration factor, which helps address the balance between exploration and utilization in the Q-Learning algorithm.
Preferably, in step S2, if the iteration count is not greater than the maximum iteration count, it is first judged whether the successful path-finding count is greater than a pre-stored maximum successful path-finding count; if so: the value of the exploration factor is updated and the next step is executed; if not: the next step is executed.
Therefore, the successful path-finding count feeds back the learning effect of the learning system, and the value of the exploration factor is continuously updated according to that effect, so the ε-greedy strategy is more adaptive and better matches the operating conditions.
Preferably, in step S3, if the running state parameter is equal to the termination state parameter and equal to the target state parameter, the successful path number is incremented by one before step S2 is executed.
Therefore, recording and updating the successful path number also feeds back the learning effect of the learning system in time and helps decide whether to update the exploration factor subsequently, which likewise assists in addressing the balance between exploration and utilization in the Q-Learning algorithm.
Preferably, in step S2, when the value of the exploration factor is updated, it is first determined whether the successful path number is smaller than the pre-stored minimum path number; if so: execute ε′ = ε + eSize × (MinPathNum − PathNum) and then let ε = ε′, where ε is the exploration factor before updating, ε′ is the exploration factor after updating, eSize is the pre-stored single update step of the exploration factor, MinPathNum is the minimum path number, and PathNum is the successful path number; if not: execute ε′ = ε − eSize × (i/eCycle) and then let ε = ε′, where i is the iteration count and eCycle is the stored change period of the exploration factor.
Therefore, the combination of the successful path number and the successful path-finding count feeds back the learning effect of the learning system more accurately and in time, so that the value of the exploration factor is continuously updated according to that effect, the ε-greedy strategy is more adaptive, and the operating conditions are better matched.
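A small Python sketch of this feedback-driven update of the exploration factor, under the assumption (not stated in the patent) that the result is clamped to the valid range 0 ≤ ε ≤ 1:

```python
def update_exploration_factor(epsilon, path_num, min_path_num, e_size, i, e_cycle):
    """Update epsilon from the learning feedback: PathNum, MinPathNum and the iteration i."""
    if path_num < min_path_num:
        # too few distinct successful paths: enlarge epsilon to explore more
        epsilon = epsilon + e_size * (min_path_num - path_num)
    else:
        # enough successful paths: shrink epsilon as the iteration count i grows
        epsilon = epsilon - e_size * (i / e_cycle)
    return min(max(epsilon, 0.0), 1.0)   # clamp to [0, 1] (an added assumption)
```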
Preferably, in step S3, when comparing the random number with the search factor, if the random number is greater than the search factor, selecting an action command according to a pre-stored probability model; and if the random number is less than or equal to the exploration factor, randomly selecting an action command from a pre-stored action set.
Therefore, a probability-based action selection strategy is realized through the comparison of the random number with the exploration factor and through the probability model, so that actions with larger action-value estimates are more likely to be selected, and the maximization bias that noise introduces when always selecting the maximum value is alleviated.
Preferably, in step S3, the probability model selects the action command according to the formula
P(s | a_k) = Q(s, a_k) / Σ_j Q(s, a_j),
where P(s | a_k) is the probability of selecting action command a_k under the state parameter s, Q(s, a_k) is the Q value of action command a_k under the state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all the action commands under the state parameter s.
Therefore, with pre-training and learning, the probability model makes actions with larger action-value estimates more likely to be selected, which helps solve the maximization bias problem.
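A short Python sketch of this probability-proportional selection; clamping negative Q values to zero is an added assumption that the patent does not spell out:

```python
import random

def select_action_by_probability(Q, state, actions):
    """Choose an action with probability proportional to its Q value in the given state,
    i.e. P(s | a_k) = Q(s, a_k) / sum_j Q(s, a_j)."""
    weights = [max(Q.get((state, a), 0.0), 0.0) for a in actions]
    total = sum(weights)
    if total <= 0.0:
        return random.choice(actions)       # degenerate case: fall back to a uniform choice
    return random.choices(actions, weights=weights, k=1)[0]
```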
Drawings
For purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made in detail to the present invention as illustrated in the accompanying drawings, in which:
FIG. 1 is a logic diagram of a single robot path planning method based on a Q-Learning algorithm according to an embodiment;
FIG. 2 is a flowchart illustrating a method for planning a path of a single robot based on a Q-Learning algorithm according to an embodiment;
FIG. 3 is a flowchart illustrating a method for updating a Q value table according to an embodiment;
FIG. 4 is a schematic diagram of a path planned by a conventional Q-Learning algorithm in experiment one of the second embodiment;
FIG. 5 is a schematic diagram of the path planned by the improved Q-Learning algorithm of the first experiment of the second embodiment of the present invention;
FIG. 6 is a line graph of the time taken for the conventional Q-Learning algorithm to converge in experiment one of example two;
FIG. 7 is a line graph of the time taken for the improved Q-Learning algorithm to converge in experiment one of example two;
FIG. 8 is a schematic diagram of a path planned by the improved Q-Learning algorithm in experiment two of the second embodiment;
FIG. 9 is a line graph of the number of training times required for the convergence of the conventional Q-Learning algorithm in experiment two of example two;
FIG. 10 is a line graph of the number of training passes required for convergence of the improved Q-Learning algorithm in experiment two of example two.
Detailed Description
The following is further detailed by the specific embodiments:
the Q-Learning algorithm was proposed by Watkins in 1989, and is an important algorithm in the reinforcement Learning algorithm.
1. Update rule of the Q-Learning algorithm
The robot under the Q-Learning algorithm does not know the overall environment, only knows the selectable action set in the current state, and usually needs to construct an immediate reward return matrix R for representing the action reward value from the state s to the next state s'. And constructing and calculating a Q value table (or called a Q matrix) for guiding the action of the robot through R.
Thus, each state-action pair is written as <s, a>, and the Q-learning algorithm estimates the value function Q(s, a) of the state-action pair to obtain the control strategy. One of the simplest forms of Q-learning is single-step Q-learning, in which the Q value satisfies:
Q_t(s_t, a_t) = r_{t+1} + γ max_a Q_t(s_{t+1}, a)    (1-1)
Equation (1-1) holds only under an optimal strategy; at the beginning of learning the two sides of (1-1) are not equal, and the error is:
ΔQ_t(s_t, a_t) = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)    (1-2)
The resulting update rule is:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α ΔQ_t(s_t, a_t)    (1-3)
namely:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α [r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]    (1-4)
where: s_t is the current state; a_t is the action selected in the current state; s_{t+1} is the next state after performing action a_t; r_{t+1} is the immediate reward after performing action a_t; Q(s_t, a_t) is the cumulative weighted reward obtained when the robot performs action a_t in state s_t, i.e. the value function of the state-action pair.
α is the learning rate controlling convergence, 0 < α < 1; by continually trying to explore the space, the Q value gradually approaches its optimal value.
γ is the discount factor, 0 ≤ γ < 1: γ = 0 means only the immediate reward matters, while γ approaching 1 means future rewards weigh more; that is, γ determines how strongly time affects the reward and how much of the current reward is sacrificed in exchange for long-term benefit.
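As a brief numeric illustration with assumed values α = 0.1, γ = 0.8, r_{t+1} = −1, Q_t(s_t, a_t) = 0 and max_a Q_t(s_{t+1}, a) = 5 (numbers chosen only for illustration), one application of (1-4) gives Q(s_t, a_t) ← 0 + 0.1 × (−1 + 0.8 × 5 − 0) = 0.3.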
2. Steps of the Q-Learning algorithm
Firstly, a Q(s, a) value is initialized; in a state s, the robot selects an action a according to the action selection strategy π, obtains the next state s′ and a reward value r, and then corrects the Q(s, a) value according to the update rule; the action selection and the correction of the Q(s, a) value are repeated until learning is finished.
A typical Q-Learning algorithm is completed as follows:
(The figure lists the standard tabular Q-Learning procedure: initialize Q(s, a); for each training episode, initialize the state s and repeat the following until a termination state is reached: select an action a with the ε-greedy strategy, execute it, observe the reward r and the next state s′, update Q(s, a) by rule (1-4), and set s ← s′.)
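A runnable Python sketch of one episode of this standard procedure; the environment interface env.reset() and env.step(state, action) -> (next_state, reward, done) is an assumption used only for illustration:

```python
import random

def q_learning_episode(Q, env, actions, alpha, gamma, epsilon, max_steps):
    """One training episode of the standard tabular Q-Learning procedure sketched above."""
    state = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy action selection, formula (1-5)
        if random.random() <= epsilon:
            action = random.choice(actions)                               # explore
        else:
            action = max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
        next_state, reward, done = env.step(state, action)
        # update rule (1-4): bootstrap from the best next action unless the episode ended
        best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in actions)
        old_q = Q.get((state, action), 0.0)
        Q[(state, action)] = old_q + alpha * (reward + gamma * best_next - old_q)
        state = next_state
        if done:
            break
    return Q
```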
convergence of the three, Q-Learning algorithm
After the following four convergence conditions are satisfied, Q (s, a) can converge on Q with a hundred percent probability*(s,a):
1) The environment has the property of Markov decision process;
2) expressing the value function by using a lookup table, namely using the table to store Q (s, a) values (Q matrix);
3) Each state-action pair <s, a> is visited infinitely often, so that the Q(s, a) update formula can be applied to it an unlimited number of times;
4) a reasonable learning rate alpha.
4. The balance problem of exploration and utilization
The balance between exploration and utilization is a very fundamental concept in reinforcement learning: every time a selection is made, the learner must decide when to exploit the best action that has already been found and when to explore an untried action; this is the exploration-utilization balance problem.
The epsilon greedy strategy (epsilon-greedy) is the most commonly used method for solving the exploration-utilization balance problem, and is also the exploration strategy mu adopted in the Q-Learning algorithm. The epsilon-greedy strategy formula is as follows:
a_t = argmax_{a∈A} Q(s_t, a) if σ > ε, and a_t = a random action from A if σ ≤ ε    (1-5)
where ε is the exploration factor (0 ≤ ε ≤ 1), and σ is a number between 0 and 1 generated randomly by the algorithm at each step.
As can be seen from the formula (1-5), when the exploration factor epsilon is larger, the learning system is biased to explore the environment and try random actions; when the exploration factor epsilon is small, the learning system then tends to choose to perform the best known action. Therefore, the choice of the value of epsilon is very important.
5. Defects of the Q-Learning algorithm
Through analysis of the existing Q-Learning algorithm, the following problems are found when the Q-Learning algorithm is applied to path planning of a single robot:
1) balancing the problem of exploration and exploitation
In order to ensure the convergence speed of the existing Q-Learning algorithm, the value of the search factor epsilon is usually continuously decreased with the increase of the training time of the algorithm, i.e. more optimal actions are performed to quickly converge into the corresponding solution.
However, the learning system may miss the optimal solution and converge to a sub-optimal solution, or more likely to converge to a normal solution, because the environment is not fully explored.
2) The maximization bias problem
In the Q-Learning algorithm, the strategy updating method is that
of formula (1-4) above, where the max operator is used: when updating the policy, the action a with the largest Q(s, a) is always chosen. However, because of noise, this selection method can bias the final result; this is the maximization bias problem.
Although repeatedly applying the action with the maximum Q(s, a) yields an action strategy that maximizes the cumulative reward, the performance of the Q-Learning algorithm deteriorates when the Q(s, a) estimates are not accurate. Moreover, when learning keeps taking the action with the maximum Q(s, a), it is likely to latch onto state-actions that happened to have high Q(s, a) values early in training, so that learning converges too fast and some potentially optimal strategies are missed.
3) Slow update speed problem
The Learning process of the Q-Learning algorithm is an iterative process: the mapping strategy from states to actions is improved gradually through continuous trial and error and action selection, which requires the learning system to carry out many trials and corrections on the feedback information for each possible state-action in order to obtain a suitable control strategy.
Aiming at the problems, the invention provides a single robot path planning method based on a Q-Learning algorithm, which comprises the following steps:
s1: initializing an exploration factor, the maximum iteration number, a termination state parameter, a target state parameter, a maximum counting threshold value, an update starting time, an iteration number, a current time, an action value function, the access number of a state-action pair, a successful path and a successful path storage table of the single robot system;
s2: judging whether the iteration times are larger than the maximum iteration times, if so: step S6 is executed; if not: initializing the current state parameters and executing the next step;
s3: generating a random number, comparing the random number with the exploration factor, and selecting an action instruction; according to the action instruction, calculating the running state parameter and the reward function generated after the robot executes the action instruction; then judging whether the running state parameter is equal to the termination state parameter; if so: continuing to judge whether the running state parameter is equal to the target state parameter; if it is, storing the successful path into the successful path storage table, incrementing the iteration count by one, and returning to step S2; otherwise, incrementing the iteration count by one and returning to step S2; if not: executing the next step;
s4: judging whether the update start time is less than or equal to the current time; if so: storing the reward function, incrementing the access count of the state-action pair by one, and executing the next step; if not: judging whether the access count of the state-action pair is equal to the maximum count threshold; if so, updating the action value function and executing the next step; if not, executing the next step;
s5: storing the running state parameter into the successful path, incrementing the current time by one, and returning to step S3;
s6: acquiring the action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and then repeating the following: executing the action instruction to generate a state parameter and selecting the next action instruction according to that state parameter; when the generated state parameter is equal to the preset target state parameter, the optimal path of the single robot system is obtained.
In this way, whether to update the action value function (the Q value) is decided by setting a maximum count threshold in the Q-Learning algorithm and comparing the number of accesses of each state-action pair with that threshold; namely, the action value function is updated only when the access count of a state-action pair reaches the maximum count threshold. First, this scheme does not reduce exploration of the environment, so the path planning effect is preserved. Second, it reduces the amount of computation and greatly improves the update and learning speed. Third, this way of updating the action value function has a multi-step prediction capability, since it takes into account the influence of several future state-action pairs on the action value function, so the learned control strategy is more reasonable. Finally, because the access count of the state-action pair is used as the basis for updating the action value function, the preceding steps are not affected while the update and learning speed is maintained; that is, the value of the exploration factor does not need to be reduced, which avoids the problem of the learning system missing the optimal solution through insufficient exploration of the environment.
The first embodiment is as follows:
as shown in fig. 1: the single robot path planning method based on the Q-Learning algorithm comprises the following steps:
s1: the method comprises the steps of initializing an action set A, a state set S, a maximum iteration number n, a maximum searching step number m, a minimum path number MinPathNum, a maximum successful path searching number MaxSuccessNum, a searching factor epsilon, a searching factor single updating step size eSize, a searching factor changing period eCycle, a maximum counting threshold h, an updating starting time B (S, a), an updating finishing time, an action value function Q (S, a), an access number C (S, a) of a state action pair, a reward function storage U (S, a), a successful path searching number SuccessNum, a successful path number PathNum, a successful path PathList, a successful path storage List, an iteration number i, a current time t and a target state parameter.
S2: judging whether the iteration times i are larger than the maximum iteration times n, if so: step S6 is executed; if not: judging whether the successful path searching times Successfum are greater than the prestored maximum successful path searching times MaxSuccessfum or not, if so, judging that: updating the value of the exploration factor epsilon and executing the next step; if not: the next step is performed.
When the value of the exploration factor is updated, it is judged whether the successful path number PathNum is smaller than the pre-stored minimum path number MinPathNum; if so: execute ε′ = ε + eSize × (MinPathNum − PathNum) and then let ε = ε′, where ε is the exploration factor before updating, ε′ is the exploration factor after updating, eSize is the pre-stored single update step of the exploration factor, MinPathNum is the minimum path number, and PathNum is the successful path number; if not: execute ε′ = ε − eSize × (i/eCycle) and then let ε = ε′, where i is the iteration count and eCycle is the stored change period of the exploration factor.
S3: generating a random number sigma belonged to (0, 1), comparing the value of the random number sigma with the value of the exploration factor epsilon, and selecting an action command atAccording to the action command atCalculating and generating an operation state parameter s after the robot executes the action commandt+1And a reward function rt+1(ii) a Judging an operating state parameter st+1Whether it is equal to the termination state parameter, if so: judging the operating state parameter st+1If the number of successful path numbers is not equal to the target state parameter, executing step S2, executing step S2, and executing step S; if not: the next step is performed.
If the value of the random number sigma is larger than the exploration factor epsilon, selecting an action a according to a prestored probability modelt(ii) a If the value of the random number sigma is less than or equal to the exploration factor epsilon, randomly selecting the action from the action set Aat(ii) a Selecting action command a by probability modeltIs of the formula
Figure GDA0002900973720000091
Wherein P (s | a)k) Selecting action command a under selected state parameter SkProbability of (a), Q (s, a)k) Is an action command a under the state parameter SkThe value of Q of (A) is,
Figure GDA0002900973720000092
the sum of the Q values of all the action commands in the state parameter S.
S4: judging whether the updating start time B (s, a) is less than or equal to the current time t, if so: then the reward function rt+1Storing in a reward function storage U (s, a), performing a self-increment of the number of accesses C (s, a) of the state-action pair, and performing the next step; if not: judging whether the access times C (s, a) of the state-action pairs are equal to a maximum counting threshold h, if so, updating an action value function Q (s, a), and executing the next step, if not: the next step is performed.
The formula for updating the motion value function is Q (s, a) ═ U (s, a)/h, where Q (s, a) is the motion value function, U (s, a) is the stored reward function, and h is the maximum count threshold.
S5: running state parameter st+1Stores the path into the success path PathList, performs self-addition of one at the current time t, and performs step S3.
S6: acquiring an action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and repeating: and executing the action command to generate a state parameter, selecting the action command according to the state parameter, and obtaining the optimal path of the single robot system when the generated state parameter is equal to a preset target state parameter.
In order to better introduce the path planning process, the embodiment also discloses a flow chart of the single robot path planning method based on the Q-Learning algorithm.
As shown in fig. 2 and 3: the single robot path planning process based on the Q-Learning algorithm comprises the following steps:
the method comprises the following steps: initializing an action value function Q (S, a), an action set A, a state set S, a maximum iteration number n, a maximum searching step number m, a minimum path number MinPathNum, a maximum successful path searching number MaxSuccessfum, a searching factor epsilon, a searching factor single updating step size eSize, a searching factor changing period eCycle, a state action pair accessed number C (S, a), a starting updating time B (S, a), a finishing updating time E (S, a) reward function storage U (S, a), whether L (S, a) is learned or not, a maximum counting threshold h, a successful path searching number Successfum, a successful path number PathNum, a successful path PathList, a successful path storage List, an iteration number i and a current time t.
Wherein Q(s, a) = 0, C(s, a) = 0, U(s, a) = 0, SuccessNum = 0, PathNum = 0, PathList = 0, List = 0, i = 1, and t = 1.
Step two: judging whether i is larger than n, if so: then the learning is ended; if not: if yes, executing step S3, if yes, executing step S, and if no, executing step S, and if yes, executing step S, and otherwise, executing step S.
Wherein, when updating the value of the search factor epsilon: if PathNum is less than the MinPathNum, then the formula ε + eSize x (MinPathNum-PathNyn) is adopted; if PathNum is greater than or equal to MinPathNum, adopting a formula epsilon-eSize x (i/eCycle); in the formula, epsilon is an exploration factor, eZize is a single updating step length of the exploration factor, MinPathNum is the minimum path number, PathNum is the successful path number, i is the iteration number, and eCycle is an exploration factor changing period.
Step three: the state S is initialized, S ∈ S.
Step four: judging whether t is larger than m, if yes: executing i +1 and returning to the step two; if not: generating a random number sigma epsilon (0, 1), judging whether sigma is greater than epsilon, if so, selecting the state s according to a probability expressiontAct a of executiontIf not, then randomly selecting action at,at∈A。
Wherein the action a is selected according to a probabilistic formulatThe formula of (1) is:
Figure GDA0002900973720000101
wherein P (s | a)k) Selecting action command a under selected state parameter SkProbability of (a), Q (s, a)k) Is an action command a under the state parameter SkThe value of Q of (A) is,
Figure GDA0002900973720000102
the sum of the Q values of all the action commands in the state parameter S.
Step five: performing action atObtain the state st+1And a prize rt+1
Step six: judging the state st+1Whether the state is a termination state or not is judged, if so, the method comprises the following steps: then re-determine the state st+1If it is the target state, if it is the state st+1If the target state is reached, executing the following operation that after adding one to the SuccessNum value, judging that the PathList is not in the List at the moment, if the PathList is not in the List, adding the PathList into the List, adding one to the PathNum value, adding one to the value of i, executing the step two, if the target state is not reached, adding one to the value of i, and returning to the step two; if not: step seven is performed.
Step seven: judging whether B (s, a) is less than or equal to t (namely when the updating time of the last action value function Q (s, a) is before the step), if so: l (s, a) ═ true, even if it does learning; if not, executing step eight.
Step eight: determining whether the value of L (s, a) is true, if true: if C (s, a) is equal to 0, then learning is started at this point, that is, B (s, a) is made t, if not equal to 0, no operation is performed, and after the determination for C (s, a) is completed, C (s, a) +1 is executed (the number of accesses is increased once), and U (s, a) + r is increased once)r+1maxQ(st+1A) (store award); if not: step nine is performed.
Step nine: judging whether C (s, a) is equal to h (whether the access times reach the maximum counting threshold), if so: q (s, a) ═ U (s, a)/h (average of the previous h-step reward values), U (s, a) ═ 0 (clear reward), C (s, a) ═ 0 (clear access times), and at the same time, let update time E (s, a) ═ i.
Step ten: judging whether E (s, a) is larger than or equal to E (s, a), if so: let L (s, a) be true, U (s, a) be 0, and C (s, a) be 0; if not: step eleven is executed.
Step eleven: will st+1Put into PathList, s ← st+1Change the current state to st+1. And adding one to the value of t, and executing the step four.
Example two:
the embodiment discloses a simulation experiment for path planning of a single robot.
First, description of simulation experiment
1) For the simulation experiments, the software platform used the Windows 10 operating system, the CPU was an Intel Core i5-8400, and the running memory was 16 GB. The path planning algorithm of the single robot system was implemented with the Python language and the TensorFlow deep learning tool to complete the simulation experiment, and the multi-robot path planning algorithm was written in the MATLAB language on matlab2016a simulation software.
2) The environment will be described using a grid method, which divides the working space of the robot system into small grids, each of which may represent a state of the robot system. In the map, a white grid indicates a safety area, and a black grid indicates the presence of an obstacle.
The target state and the obstacle in the environment are both stationary and the obstacle and border position in the environment are unknown to the robot. In subsequent experiments, the working space of the robot was a 10 × 10 or 20 × 20 grid map, respectively.
In the simulation process, the motion route and the initial state of the robot are represented by 1, and the target state is represented by 2.
3) The MDP quadruplets of a single robot system are defined as follows:
and (3) action set: the actions that each robot can take are defined as four actions, up, down, left, and right. Represented in the grid diagram as: the robot moves along the blank grid, can move from one grid to four adjacent grids, up, down, left and right, can not transfer across grids, and can not transfer along a diagonal line (for example, upper left).
The motion space of the robot system is: a ═ {0, 1,2,3 }. Where 0 represents up, 1 represents down, 2 represents left, and 3 represents right.
State collection: in the grid map, each grid corresponds to a state, and the state space of the system is S = {1, 2, 3, …, 100} or S = {1, 2, 3, …, 400}. The grid state the robot occupies at any time can be expressed as s_t = (x_t, y_t).
The robot reaches the black grid (obstacle) or the yellow grid (target state), i.e. the end state. Once the robot becomes the termination state, the round of training is finished, and the robot returns to the initial state again to perform the next round of training.
Migration function: when the robot selects an action from the action set, it moves to the next grid provided that this grid is not an obstacle or a boundary wall.
The transfer function of each of the four actions maps the current grid (x_t, y_t) to the adjacent grid in the chosen direction (up, down, left or right); otherwise the state does not change. (The four per-action transfer-function equations are given in the original figures.)
the reward function: in a single robot system, each step of movement of the robot is set to obtain an immediate reward of-1, which represents the movement consumption of the robot and forces the robot to reach a target state more quickly; when the robot reaches the target state, namely the robot reaches the yellow grid, an immediate reward of 10 can be obtained; when the robot collides with an obstacle, i.e. the robot enters the black grid, an immediate reward of-10 is obtained. Thus, the stand-alone robotic system reward function may be defined as:
Figure GDA0002900973720000131
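A Python sketch of such a grid environment. The coordinate convention (here, action 0 "up" decreases y), the handling of boundary walls (stay in place with reward −1) and the class interface are assumptions for illustration; the reward values −1, +10 and −10 follow the definition above:

```python
class GridWorld:
    """Sketch of the grid-method environment: white cells are safe, black cells are obstacles."""

    def __init__(self, width, height, obstacles, start, target):
        self.width, self.height = width, height
        self.obstacles = set(obstacles)
        self.start, self.target = start, target

    def reset(self):
        return self.start

    def step(self, state, action):
        """Apply the transfer function; return (next_state, reward, terminal, target_reached)."""
        x, y = state
        dx, dy = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}[action]  # 0 up, 1 down, 2 left, 3 right
        nx, ny = x + dx, y + dy
        if not (0 <= nx < self.width and 0 <= ny < self.height):
            return state, -1, False, False           # boundary wall: stay put (assumed handling)
        next_state = (nx, ny)
        if next_state == self.target:
            return next_state, 10, True, True        # reached the target state: reward 10
        if next_state in self.obstacles:
            return next_state, -10, True, False      # collided with an obstacle: reward -10
        return next_state, -1, False, False          # ordinary move: reward -1
```

An instance such as GridWorld(10, 10, obstacles, (0, 0), (7, 7)) would correspond to the 10 × 10 setting of experiment one below.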
second, parameter initialization
In this embodiment, the parameters are set as follows:
1) The learning rate α takes a value between 0 and 1 (if the learning rate is too small, convergence is slow; if it is too large, the algorithm may fail to converge to the optimal value; here the learning rate α is initialized to 0.01).
2) The discount factor γ takes a value between 0 and 1 and determines whether the robot pays more attention to the immediate benefit or the long-term benefit (if γ approaches 0, the immediate reward matters more; if it approaches 1, the robot pays more attention to the long-term benefit; in the single robot path planning simulation the discount factor γ is set to 0.8).
3) Maximum number of exploration steps m: in this embodiment the maximum number of exploration steps in each round of training is set to 200 (if the robot has explored more than 200 steps and still has not reached the target state, the strategy adopted in that round is unsuitable, so the round is terminated and the next round of training starts directly).
4) Exploration factor ε: the initial value of ε is set to 0.4, the minimum path number MinPathNum is set to 2, the maximum successful path-finding count MaxSuccessNum is set to 10, and the single update step eSize of the exploration factor is set according to the complexity of the environment: a simpler environment allows a larger step, while a more complex environment calls for a smaller step. These values are collected in the configuration sketch below.
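A plain Python dictionary holding the values just listed (the key names are illustrative assumptions, not identifiers from the patent):

```python
# Hyperparameters of the simulation, as stated above; key names are illustrative.
params = {
    "alpha": 0.01,           # learning rate
    "gamma": 0.8,            # discount factor
    "max_steps_m": 200,      # maximum exploration steps per training round
    "epsilon": 0.4,          # initial exploration factor
    "min_path_num": 2,       # MinPathNum
    "max_success_num": 10,   # MaxSuccessNum
    # eSize (single update step of epsilon) is chosen according to environment complexity
}
```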
Experiment one:
experiment one, a 10 × 10 grid map is adopted, the obstacle positions are randomly given, the initial state of the robot is (0,0), and the target state is (7, 7). FIG. 4 shows a planned path obtained by performing a simulation experiment using a conventional Q-Learning algorithm; fig. 5 shows a planned path obtained by a simulation experiment using the modified Q-Learning algorithm.
1) As shown in fig. 4 and 5: the gray grid represents the traveling path. It can be clearly seen from the figures that the conventional path (from the existing path planning method) contains more turning points, while the improved path (from the path planning method of the present invention) is smoother, which shows that the solution obtained with the improved Q-Learning algorithm is better than the solution obtained by the conventional method.
2) As can be seen from fig. 6 and 7: the traditional method (the existing path planning method) first finds a safe, collision-free path reaching the target state at about 700 seconds, while the improved method (the path planning method of the present invention) first finds such a path at about 300 seconds. Thus, in the initial stage of training the conventional Q-Learning algorithm essentially cannot find a path to the target state, whereas the improved algorithm finds one faster. It can also be seen that in both cases the probability of successfully finding a path gradually increases with training time, but after the improvement the speed and the number of successful findings grow noticeably faster. The traditional method does not converge until around 900 seconds, while the improved method converges at around 500 seconds.
These two points show that the improved Q-Learning algorithm greatly increases efficiency compared with the traditional algorithm.
Experiment two:
in the second experiment, a 20 × 20 grid map is taken, the obstacle positions are randomly set, the initial state of the robot is (0,3), and the target state is (15, 15). Compared with the environmental model of the first experiment, the environmental model of the second experiment becomes more complex, the number of randomly given obstacles is more, and a plurality of concave areas exist, so that the difficulty of robot path planning is increased.
1) As shown in fig. 8: the gray grid represents the traveling path of the robot, and the planned path which successfully reaches the target state is displayed in the grid map and is obtained based on a single robot path planning algorithm (the path planning method in the invention) of the improved Q-Learning algorithm.
2) The environment model of experiment two is complex and contains concave regions. As shown in fig. 9 and 10: when the traditional Q-Learning algorithm (the existing path planning method) is used for path planning, a path reaching the target state is still not found after 1000 episodes of training. Observing the training process shows that the traditional Q-Learning algorithm keeps falling into the concave regions, so learning does not succeed. The improved Q-Learning algorithm (the path planning method of the present invention) remains feasible in the complex environment: a safe, collision-free path reaching the target state is successfully found at around episode 500, after which the algorithm gradually converges. In addition, within the training time, although the traditional Q-Learning algorithm does not converge, the improved algorithm consumes less time over 1000 episodes of training, which indicates that its updates are more efficient.
Combining the two comparison results shows that the improved Q-Learning algorithm (the path planning method of the present invention) is still feasible in a complex environment, has a faster update speed, and has practical application significance.
The foregoing is merely an example of the present invention, and common general knowledge in the field of known specific structures and characteristics is not described herein in any greater extent than that known in the art at the filing date or prior to the priority date of the application, so that those skilled in the art can now appreciate that all of the above-described techniques in this field and have the ability to apply routine experimentation before this date can be combined with one or more of the present teachings to complete and implement the present invention, and that certain typical known structures or known methods do not pose any impediments to the implementation of the present invention by those skilled in the art. It should be noted that, for those skilled in the art, without departing from the structure of the present invention, several changes and modifications can be made, which should also be regarded as the protection scope of the present invention, and these will not affect the effect of the implementation of the present invention and the practicability of the patent. The scope of the claims of the present application shall be determined by the contents of the claims, and the description of the embodiments and the like in the specification shall be used to explain the contents of the claims.

Claims (6)

1. The single robot path planning method based on the Q-Learning algorithm is characterized by comprising the following steps of:
s1: initializing exploration factors, maximum iteration times, termination state parameters, target state parameters, maximum counting threshold values, update starting time, iteration times, current time, action value functions, access times of state-action pairs, successful paths, successful path storage tables, successful path searching times, maximum successful path searching times, successful path numbers and minimum path numbers of the single robot system;
s2: judging whether the iteration times are larger than the maximum iteration times, if so: step S6 is executed; if not: firstly, judging whether the successful path searching times are greater than the prestored maximum successful path searching times, if so: updating the value of the exploration factor and executing the next step; if not: executing the next step;
s3: generating a random number, comparing the random number with the exploration factor, selecting an action instruction, and calculating and generating an operation state parameter and a reward function after the robot executes the action instruction according to the action instruction; then, judging whether the running state parameter is equal to the termination state parameter, if so: continuing to judge whether the running state parameter is equal to the target state parameter, if so, storing the successful path into a successful path storage table, executing the self-adding of the iteration times, returning to the step S2, otherwise, executing the self-adding of the iteration times, and returning to the step S2; if not: executing the next step;
s4: judging whether the updating starting time is less than or equal to the current time, if so: storing a reward function, executing the access times of the state-action pair to be self-increased by one, and executing the next step; if not: judging whether the access times of the state-action pairs are equal to a maximum counting threshold value or not, if so, updating an action value function, executing the next step, and if not, executing the next step;
s5: storing the running state parameter into the success path, performing self-addition of one at the current moment, and returning to the step S3;
s6: acquiring an action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and repeating: and executing the action command to generate a state parameter, selecting the action command according to the state parameter, and obtaining the optimal path of the single robot system when the generated state parameter is equal to a preset target state parameter.
2. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 1, wherein: in step S4, the action value function is updated according to the formula Q(s, a) = U(s, a)/h, where Q(s, a) is the action value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
3. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 1, wherein: in step S3, if the running state parameter is equal to the termination state parameter and equal to the target state parameter, a preset successful path-finding count is incremented by one before step S2 is performed.
4. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 3, wherein: in step S2, when the value of the exploration factor is updated, it is first determined whether the successful path number is smaller than the pre-stored minimum path number; if so: execute ε′ = ε + eSize × (MinPathNum − PathNum) and then let ε = ε′, wherein ε is the exploration factor before updating, ε′ is the exploration factor after updating, eSize is the pre-stored single update step of the exploration factor, MinPathNum is the minimum path number, and PathNum is the successful path number; if not: execute ε′ = ε − eSize × (i/eCycle) and then let ε = ε′, wherein i is the iteration count and eCycle is the pre-stored change period of the exploration factor.
5. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 1, wherein: in step S3, when comparing the random number with the search factor, if the random number is greater than the search factor, selecting an action command according to a pre-stored probability model; and if the random number is less than or equal to the exploration factor, randomly selecting an action command from a pre-stored action set.
6. The Q-Learning algorithm-based single robot path planning method as claimed in claim 5, wherein: in step S3, the probability model selects the action instruction according to the formula

P(s | a_k) = Q(s, a_k) / Σ_j Q(s, a_j)

where P(s | a_k) is the probability of selecting the action instruction a_k under the state parameter s, Q(s, a_k) is the Q value of the action instruction a_k under the state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all action instructions under the state parameter s.
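A minimal sketch of such Q-proportional (roulette-wheel) selection; shifting negative Q values and falling back to a uniform choice when all weights are zero are assumptions added so the sketch stays well defined:

```python
import random

def select_by_probability_model(Q, state, actions):
    """Choose an action with probability proportional to its Q value in `state`.
    `Q` is a dict keyed by (state, action) pairs."""
    q_values = [Q.get((state, a), 0.0) for a in actions]
    shift = min(q_values)
    # Shift so all weights are non-negative (assumption of this sketch).
    weights = [q - shift for q in q_values] if shift < 0 else list(q_values)
    if sum(weights) == 0:
        return random.choice(actions)    # degenerate case: uniform choice
    return random.choices(actions, weights=weights, k=1)[0]
```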
CN201910737476.6A 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm Expired - Fee Related CN110378439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910737476.6A CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Publications (2)

Publication Number Publication Date
CN110378439A CN110378439A (en) 2019-10-25
CN110378439B true CN110378439B (en) 2021-03-30

Family

ID=68258789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910737476.6A Expired - Fee Related CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Country Status (1)

Country Link
CN (1) CN110378439B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859099B (en) * 2019-12-05 2021-08-31 马上消费金融股份有限公司 Recommendation method, device, terminal and storage medium based on reinforcement learning
CN111080013B (en) * 2019-12-18 2022-02-22 南方科技大学 Addressing way-finding prediction method, device, equipment and computer readable storage medium
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN113534826B (en) * 2020-04-15 2024-02-23 苏州宝时得电动工具有限公司 Attitude control method and device of self-mobile device and storage medium
CN111594322B (en) * 2020-06-05 2022-06-03 沈阳航空航天大学 Variable-cycle aero-engine thrust control method based on Q-Learning
CN111649758B (en) * 2020-06-16 2023-09-15 华东师范大学 Path planning method based on reinforcement learning algorithm in dynamic environment
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111857081B (en) * 2020-08-10 2023-05-05 电子科技大学 Chip packaging test production linear energy control method based on Q-learning reinforcement learning
CN112327890A (en) * 2020-11-10 2021-02-05 中国海洋大学 Underwater multi-robot path planning based on WHCA algorithm
CN112595326A (en) * 2020-12-25 2021-04-02 湖北汽车工业学院 Improved Q-learning path planning algorithm with fusion of priori knowledge
CN113062601B (en) * 2021-03-17 2022-05-13 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN114518758B (en) * 2022-02-08 2023-12-12 中建八局第三建设有限公司 Indoor measurement robot multi-target point moving path planning method based on Q learning
CN117634548A (en) * 2024-01-26 2024-03-01 西南科技大学 Unmanned aerial vehicle behavior tree adjustment and optimization method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264B (en) * 2012-07-30 2015-01-21 山东大学 Path planning Q-learning initial method of mobile robot
CN105137967B (en) * 2015-07-16 2018-01-19 北京工业大学 The method for planning path for mobile robot that a kind of depth autocoder is combined with Q learning algorithms
CN107317756A (en) * 2017-07-10 2017-11-03 北京理工大学 A kind of optimal attack paths planning method learnt based on Q
CN108762249B (en) * 2018-04-26 2019-11-08 常熟理工学院 Clean robot optimum path planning method based on the optimization of approximate model multistep
CN109933086B (en) * 2019-03-14 2022-08-30 天津大学 Unmanned aerial vehicle environment perception and autonomous obstacle avoidance method based on deep Q learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180354126A1 (en) * 2017-06-07 2018-12-13 Fanuc Corporation Controller and machine learning device
CN108594803A (en) * 2018-03-06 2018-09-28 吉林大学 Paths planning method based on Q- learning algorithms
CN109445440A (en) * 2018-12-13 2019-03-08 重庆邮电大学 The dynamic obstacle avoidance method with improvement Q learning algorithm is merged based on sensor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Deterministic Improved Q-Learning for Path Planning of a Mobile Robot; Amit Konar et al.; IEEE Transactions on Systems, Man, and Cybernetics: Systems; 2013-09-30; Vol. 43, No. 5; 1141-1153 *
Incremental Multi-Step Q-Learning; Jing Peng and Ronald J. Williams; Machine Learning; 1996-12-31; 283-290 *
Application of Improved Q-Learning Algorithm in Path Planning; Gao Le et al.; Journal of Jilin University (Information Science Edition); 2018-07-31; Vol. 36, No. 4; 439-443 *

Also Published As

Publication number Publication date
CN110378439A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110378439B (en) Single robot path planning method based on Q-Learning algorithm
CN110321666B (en) Multi-robot path planning method based on priori knowledge and DQN algorithm
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
CN102819264B (en) Path planning Q-learning initial method of mobile robot
CN110883776B (en) Robot path planning algorithm for improving DQN under quick search mechanism
CN110632922B (en) Path planning method based on bat algorithm and reinforcement learning
CN111982125A (en) Path planning method based on improved ant colony algorithm
CN113296520A (en) Routing planning method for inspection robot by fusing A and improved Hui wolf algorithm
CN115167478B (en) Robot map-free path planning method and system based on deep reinforcement learning
CN109540163A (en) A kind of obstacle-avoiding route planning algorithm combined based on differential evolution and fuzzy control
CN114756027A (en) Mobile robot path planning method based on improved ant colony algorithm and Bezier curve
CN113848911B (en) Mobile robot global path planning method based on Q-learning and RRT
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN115129064A (en) Path planning method based on fusion of improved firefly algorithm and dynamic window method
CN117471919A (en) Robot path planning method based on improved pelican optimization algorithm
Shill et al. Genetic algorithm based fully automated and adaptive fuzzy logic controller
Du et al. Application of an improved whale optimization algorithm in time-optimal trajectory planning for manipulators
CN116578080A (en) Local path planning method based on deep reinforcement learning
Wang et al. Q-learning-based Collision-free Path Planning for Mobile Robot in Unknown Environment
Sun Path Planning of Mobile Robot Based on Improved Ant Colony Algorithm
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Li et al. A novel path planning algorithm based on Q-learning and adaptive exploration strategy
Zhang et al. Using partial-policy q-learning to plan path for robot navigation in unknown enviroment
Yu et al. An intelligent robot motion planning method and application via lppo in unknown environment
CN113503878B (en) Unmanned ship path planning method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210330