CN110378439B - Single robot path planning method based on Q-Learning algorithm - Google Patents


Info

Publication number
CN110378439B
CN110378439B (application CN201910737476.6A)
Authority
CN
China
Prior art keywords
action
path
state parameter
updating
epsilon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910737476.6A
Other languages
Chinese (zh)
Other versions
CN110378439A (en)
Inventor
Li Bo (李波)
Yi Jie (易洁)
Liang Hongbin (梁宏斌)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Technology
Original Assignee
Chongqing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Technology filed Critical Chongqing University of Technology
Priority to CN201910737476.6A priority Critical patent/CN110378439B/en
Publication of CN110378439A publication Critical patent/CN110378439A/en
Application granted granted Critical
Publication of CN110378439B publication Critical patent/CN110378439B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/29Graphical models, e.g. Bayesian networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047Optimisation of routes or paths, e.g. travelling salesman problem

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Economics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Marketing (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to the technical field of robot path planning, in particular to a single robot path planning method based on the Q-Learning algorithm, which comprises the following steps: initializing the parameters of the algorithm; selecting an action instruction, and calculating the running state parameter and the reward function generated by that instruction; if the running state parameter is equal to the termination state parameter and equal to the target state parameter, storing the successful path into a successful path storage table; otherwise, when the update start time is less than or equal to the current time and the access count of the state-action pair is equal to the maximum count threshold, updating the action value function and storing the running state parameter into the successful path; repeating these steps until the maximum number of iterations is reached; and finally repeating action instruction selection and state parameter generation according to the action value function to obtain the optimal path of the single robot. When the Q-Learning algorithm is used for single robot path planning, the method improves both the update and learning speed of the learning system and the path planning effect.

Description

Single robot path planning method based on Q-Learning algorithm
Technical Field
The invention relates to the technical field of robot path planning, in particular to a single robot path planning method based on a Q-Learning algorithm.
Background
Mobile robots have a wide range of applications and can be found in fields such as the home, agriculture, industry and the military. The three core topics in the research on controlling robot motion are robot positioning, task assignment, and path planning. Path planning is a precondition for a mobile robot to reach its task target and complete the task content. For example, a household cleaning robot needs reasonable path planning in the indoor environment to complete the cleaning task; an agricultural picking robot needs path planning to pass through the crops and complete the picking task; and industrial robots likewise require path planning to accomplish a given task in a shared workspace.
Single robot systems are widely applied in home service, agricultural assistance, industrial environments, military assistance and similar areas. In these applications, path planning for the single robot system is particularly important: it refers to finding a path in the working environment from the initial state to the target state that avoids all obstacles. This requires machine learning, and in the related art the most common learning method is reinforcement learning.
The Q-Learning algorithm is an important reinforcement Learning algorithm, and in the related art it has been applied to path planning for robot systems. The Learning process of the Q-Learning algorithm is an iterative process that gradually updates the Q-value table (the action value function) through constant trial and error and action selection. Namely: a reward function is set; the robot selects an action instruction according to an ε-greedy strategy (epsilon-greedy, where ε is the exploration factor, 0 ≤ ε ≤ 1), executes the action instruction, updates the Q-value table according to the reward function, and then generates a state parameter; the next action is selected according to the state parameter and the ε-greedy strategy, and action execution and Q-value table updating continue until the final Q-value table is obtained; the optimal path is then read off from the Q-value table.
The existing Q-Learning algorithm has to select actions and update the Q value continuously to improve action selection, that is, the Q value is updated every time the environment is explored, which makes the update and learning speed of the learning system slow. In the related art, to maintain the update and learning speed, the value of the exploration factor ε is continuously reduced as the training time of the algorithm increases, i.e. the currently best action is performed more often so that the algorithm converges quickly to a corresponding solution. The cost of this approach is that the learning system, having explored the environment insufficiently (the value of the exploration factor ε is too small), misses the optimal solution and can only converge to a suboptimal solution, or more likely an ordinary solution, and this defect degrades the path planning effect.
Disclosure of Invention
Aiming at the defects of the prior art, the technical problems to be solved by the invention are as follows: when the Q-Learning algorithm is used for path planning of a single robot, the updating Learning speed and the path planning effect of the Learning system are improved better.
In order to solve the technical problems, the invention adopts the following technical scheme:
the single robot path planning method based on the Q-Learning algorithm comprises the following steps:
s1: initializing an exploration factor, the maximum iteration number, a termination state parameter, a target state parameter, a maximum counting threshold value, an update starting time, an iteration number, a current time, an action value function, the access number of a state-action pair, a successful path and a successful path storage table of the single robot system;
s2: judging whether the iteration times are larger than the maximum iteration times, if so: step S6 is executed; if not: initializing the current state parameters and executing the next step;
s3: generating a random number, comparing the random number with the exploration factor, and selecting an action instruction; according to the action instruction, calculating the running state parameter and the reward function generated after the robot executes the action instruction; then judging whether the running state parameter is equal to the termination state parameter; if so: continuing to judge whether the running state parameter is equal to the target state parameter; if it is, storing the successful path into the successful path storage table, incrementing the iteration count by one, and returning to step S2; otherwise, incrementing the iteration count by one and returning to step S2; if not: executing the next step;
s4: judging whether the update start time is less than or equal to the current time; if so: storing the reward function, incrementing the access count of the state-action pair by one, and executing the next step; if not: judging whether the access count of the state-action pair is equal to the maximum count threshold; if so, updating the action value function and executing the next step; if not, executing the next step;
s5: storing the running state parameter into the successful path, incrementing the current time by one, and returning to step S3;
s6: acquiring the action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and then repeating the following: executing the action instruction to generate a state parameter and selecting the next action instruction according to that state parameter; when the generated state parameter is equal to the preset target state parameter, the optimal path of the single robot system is obtained.
In this way, whether to update the action value function (the Q value) is decided by setting a maximum count threshold in the Q-Learning algorithm and comparing the number of accesses of each state-action pair with that threshold; namely, the action value function is updated only when the access count of a state-action pair reaches the maximum count threshold. First, this scheme does not reduce exploration of the environment, so the path planning effect is preserved. Second, it reduces the amount of computation and greatly improves the update and learning speed. Third, this way of updating the action value function has a multi-step prediction capability, since it takes into account the influence of several future state-action pairs on the action value function, so the learned control strategy is more reasonable. Finally, because the access count of the state-action pair is used as the basis for updating the action value function, the preceding steps are not affected while the update and learning speed is maintained; that is, the value of the exploration factor does not need to be reduced, which avoids the problem of the learning system missing the optimal solution through insufficient exploration of the environment.
Therefore, when the Q-Learning algorithm is used for path planning of the single robot, the updating Learning speed and the path planning effect of the Learning system can be better improved.
Preferably, in step S4, the action value function is updated according to the formula Q(s, a) = U(s, a)/h, where Q(s, a) is the action value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
Therefore, the updated action value function is the average of all the stored rewards: on the one hand, the averaging reduces the amount of computation and improves the accuracy of the action value function; on the other hand, this way of computing no longer requires an eligibility trace matrix, which further reduces the computational complexity.
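A minimal Python sketch of this averaged update (the dictionary-based tables and the function name are illustrative assumptions rather than identifiers from the patent):

```python
from collections import defaultdict

# Q value table, stored-reward table U(s, a) and access counter C(s, a)
Q = defaultdict(float)
U = defaultdict(float)
C = defaultdict(int)

def record_and_maybe_update(state, action, reward_estimate, h):
    """Accumulate a reward estimate for (state, action); rewrite Q only after h visits."""
    U[(state, action)] += reward_estimate              # store the reward into U(s, a)
    C[(state, action)] += 1                            # one more access of the state-action pair
    if C[(state, action)] == h:                        # access count reached the maximum count threshold h
        Q[(state, action)] = U[(state, action)] / h    # Q(s, a) = U(s, a) / h, the average of h stored values
        U[(state, action)] = 0.0                       # clear the stored rewards
        C[(state, action)] = 0                         # clear the access count
```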
Preferably, in step S3, if the running state parameter is equal to the termination state parameter and equal to the target state parameter, a preset successful path-finding count is incremented by one before step S2 is executed.
Therefore, recording and updating the successful path-finding count feeds back the learning effect of the learning system in time and can assist in deciding whether to update the exploration factor, which helps address the balance between exploration and utilization in the Q-Learning algorithm.
Preferably, in step S2, if the iteration count is not greater than the maximum iteration count, it is first judged whether the successful path-finding count is greater than a pre-stored maximum successful path-finding count; if so: the value of the exploration factor is updated and the next step is executed; if not: the next step is executed.
Therefore, the successful path-finding count feeds back the learning effect of the learning system, and the value of the exploration factor is continuously updated according to that effect, so the ε-greedy strategy is more adaptive and better matches the operating conditions.
Preferably, in step S3, if the running state parameter is equal to the termination state parameter and equal to the target state parameter, the successful path number is incremented by one before step S2 is executed.
Therefore, recording and updating the successful path number also feeds back the learning effect of the learning system in time and helps decide whether to update the exploration factor subsequently, which likewise assists in addressing the balance between exploration and utilization in the Q-Learning algorithm.
Preferably, in step S2, when the value of the exploration factor is updated, it is first determined whether the successful path number is smaller than the pre-stored minimum path number; if so: execute ε′ = ε + eSize × (MinPathNum − PathNum) and then let ε = ε′, where ε is the exploration factor before updating, ε′ is the exploration factor after updating, eSize is the pre-stored single update step of the exploration factor, MinPathNum is the minimum path number, and PathNum is the successful path number; if not: execute ε′ = ε − eSize × (i/eCycle) and then let ε = ε′, where i is the iteration count and eCycle is the stored change period of the exploration factor.
Therefore, the combination of the successful path number and the successful path-finding count feeds back the learning effect of the learning system more accurately and in time, so that the value of the exploration factor is continuously updated according to that effect, the ε-greedy strategy is more adaptive, and the operating conditions are better matched.
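A small Python sketch of this feedback-driven update of the exploration factor, under the assumption (not stated in the patent) that the result is clamped to the valid range 0 ≤ ε ≤ 1:

```python
def update_exploration_factor(epsilon, path_num, min_path_num, e_size, i, e_cycle):
    """Update epsilon from the learning feedback: PathNum, MinPathNum and the iteration i."""
    if path_num < min_path_num:
        # too few distinct successful paths: enlarge epsilon to explore more
        epsilon = epsilon + e_size * (min_path_num - path_num)
    else:
        # enough successful paths: shrink epsilon as the iteration count i grows
        epsilon = epsilon - e_size * (i / e_cycle)
    return min(max(epsilon, 0.0), 1.0)   # clamp to [0, 1] (an added assumption)
```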
Preferably, in step S3, when comparing the random number with the search factor, if the random number is greater than the search factor, selecting an action command according to a pre-stored probability model; and if the random number is less than or equal to the exploration factor, randomly selecting an action command from a pre-stored action set.
Therefore, a probability-based action selection strategy is realized through the comparison of the random number with the exploration factor and through the probability model, so that actions with larger action-value estimates are more likely to be selected, and the maximization bias that noise introduces when always selecting the maximum value is alleviated.
Preferably, in step S3, the probability model selects the action command according to the formula
P(s | a_k) = Q(s, a_k) / Σ_j Q(s, a_j),
where P(s | a_k) is the probability of selecting action command a_k under the state parameter s, Q(s, a_k) is the Q value of action command a_k under the state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all the action commands under the state parameter s.
Therefore, with pre-training and learning, the probability model makes actions with larger action-value estimates more likely to be selected, which helps solve the maximization bias problem.
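A short Python sketch of this probability-proportional selection; clamping negative Q values to zero is an added assumption that the patent does not spell out:

```python
import random

def select_action_by_probability(Q, state, actions):
    """Choose an action with probability proportional to its Q value in the given state,
    i.e. P(s | a_k) = Q(s, a_k) / sum_j Q(s, a_j)."""
    weights = [max(Q.get((state, a), 0.0), 0.0) for a in actions]
    total = sum(weights)
    if total <= 0.0:
        return random.choice(actions)       # degenerate case: fall back to a uniform choice
    return random.choices(actions, weights=weights, k=1)[0]
```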
Drawings
For purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made in detail to the present invention as illustrated in the accompanying drawings, in which:
FIG. 1 is a logic diagram of a single robot path planning method based on a Q-Learning algorithm according to an embodiment;
FIG. 2 is a flowchart illustrating a method for planning a path of a single robot based on a Q-Learning algorithm according to an embodiment;
FIG. 3 is a flowchart illustrating a method for updating a Q value table according to an embodiment;
FIG. 4 is a schematic diagram of a path planned by a conventional Q-Learning algorithm in experiment one of the second embodiment;
FIG. 5 is a schematic diagram of the path planned by the improved Q-Learning algorithm of the first experiment of the second embodiment of the present invention;
FIG. 6 is a line graph of the time taken for the conventional Q-Learning algorithm to converge in experiment one of example two;
FIG. 7 is a line graph of the time taken for the improved Q-Learning algorithm to converge in experiment one of example two;
FIG. 8 is a schematic diagram of a path planned by the improved Q-Learning algorithm in experiment two of the second embodiment;
FIG. 9 is a line graph of the number of training times required for the convergence of the conventional Q-Learning algorithm in experiment two of example two;
FIG. 10 is a line graph of the number of training passes required for convergence of the improved Q-Learning algorithm in experiment two of example two.
Detailed Description
The following is further detailed by the specific embodiments:
the Q-Learning algorithm was proposed by Watkins in 1989, and is an important algorithm in the reinforcement Learning algorithm.
1. Update rule of the Q-Learning algorithm
The robot under the Q-Learning algorithm does not know the overall environment, only knows the selectable action set in the current state, and usually needs to construct an immediate reward return matrix R for representing the action reward value from the state s to the next state s'. And constructing and calculating a Q value table (or called a Q matrix) for guiding the action of the robot through R.
Thus, each state-action pair is written as <s, a>, and the Q-learning algorithm estimates the value function Q(s, a) of the state-action pair to obtain the control strategy. One of the simplest forms of Q-learning is single-step Q-learning, in which the Q value satisfies:
Q_t(s_t, a_t) = r_{t+1} + γ max_a Q_t(s_{t+1}, a)    (1-1)
Equation (1-1) holds only under an optimal strategy; at the beginning of learning the two sides of (1-1) are not equal, and the error is:
ΔQ_t(s_t, a_t) = r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)    (1-2)
The resulting update rule is:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α ΔQ_t(s_t, a_t)    (1-3)
namely:
Q_t(s_t, a_t) ← Q_t(s_t, a_t) + α [r_{t+1} + γ max_a Q_t(s_{t+1}, a) − Q_t(s_t, a_t)]    (1-4)
where: s_t is the current state; a_t is the action selected in the current state; s_{t+1} is the next state after performing action a_t; r_{t+1} is the immediate reward after performing action a_t; Q(s_t, a_t) is the cumulative weighted reward obtained when the robot performs action a_t in state s_t, i.e. the value function of the state-action pair.
α is the learning rate controlling convergence, 0 < α < 1; by continually trying to explore the space, the Q value gradually approaches its optimal value.
γ is the discount factor, 0 ≤ γ < 1: γ = 0 means only the immediate reward matters, while γ approaching 1 means future rewards weigh more; that is, γ determines how strongly time affects the reward and how much of the current reward is sacrificed in exchange for long-term benefit.
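As a brief numeric illustration with assumed values α = 0.1, γ = 0.8, r_{t+1} = −1, Q_t(s_t, a_t) = 0 and max_a Q_t(s_{t+1}, a) = 5 (numbers chosen only for illustration), one application of (1-4) gives Q(s_t, a_t) ← 0 + 0.1 × (−1 + 0.8 × 5 − 0) = 0.3.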
2. Steps of the Q-Learning algorithm
Firstly, a Q(s, a) value is initialized; in a state s, the robot selects an action a according to the action selection strategy π, obtains the next state s′ and a reward value r, and then corrects the Q(s, a) value according to the update rule; the action selection and the correction of the Q(s, a) value are repeated until learning is finished.
A typical Q-Learning algorithm is completed as follows:
(The figure lists the standard tabular Q-Learning procedure: initialize Q(s, a); for each training episode, initialize the state s and repeat the following until a termination state is reached: select an action a with the ε-greedy strategy, execute it, observe the reward r and the next state s′, update Q(s, a) by rule (1-4), and set s ← s′.)
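A runnable Python sketch of one episode of this standard procedure; the environment interface env.reset() and env.step(state, action) -> (next_state, reward, done) is an assumption used only for illustration:

```python
import random

def q_learning_episode(Q, env, actions, alpha, gamma, epsilon, max_steps):
    """One training episode of the standard tabular Q-Learning procedure sketched above."""
    state = env.reset()
    for _ in range(max_steps):
        # epsilon-greedy action selection, formula (1-5)
        if random.random() <= epsilon:
            action = random.choice(actions)                               # explore
        else:
            action = max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit
        next_state, reward, done = env.step(state, action)
        # update rule (1-4): bootstrap from the best next action unless the episode ended
        best_next = 0.0 if done else max(Q.get((next_state, a), 0.0) for a in actions)
        old_q = Q.get((state, action), 0.0)
        Q[(state, action)] = old_q + alpha * (reward + gamma * best_next - old_q)
        state = next_state
        if done:
            break
    return Q
```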
convergence of the three, Q-Learning algorithm
After the following four convergence conditions are satisfied, Q (s, a) can converge on Q with a hundred percent probability*(s,a):
1) The environment has the property of Markov decision process;
2) expressing the value function by using a lookup table, namely using the table to store Q (s, a) values (Q matrix);
3) Each state-action pair <s, a> is visited infinitely often, so that the Q(s, a) update formula can be applied to it an unlimited number of times;
4) a reasonable learning rate alpha.
4. The balance problem of exploration and utilization
The balance between exploration and utilization is a very fundamental concept in reinforcement learning: every time a selection is made, the learner must decide when to exploit the best action that has already been found and when to explore an untried action; this is the exploration-utilization balance problem.
The epsilon greedy strategy (epsilon-greedy) is the most commonly used method for solving the exploration-utilization balance problem, and is also the exploration strategy mu adopted in the Q-Learning algorithm. The epsilon-greedy strategy formula is as follows:
a_t = argmax_{a∈A} Q(s_t, a) if σ > ε, and a_t = a random action from A if σ ≤ ε    (1-5)
where ε is the exploration factor (0 ≤ ε ≤ 1), and σ is a number between 0 and 1 generated randomly by the algorithm at each step.
As can be seen from the formula (1-5), when the exploration factor epsilon is larger, the learning system is biased to explore the environment and try random actions; when the exploration factor epsilon is small, the learning system then tends to choose to perform the best known action. Therefore, the choice of the value of epsilon is very important.
5. Defects of the Q-Learning algorithm
Through analysis of the existing Q-Learning algorithm, the following problems are found when the Q-Learning algorithm is applied to path planning of a single robot:
1) balancing the problem of exploration and exploitation
In order to ensure the convergence speed of the existing Q-Learning algorithm, the value of the search factor epsilon is usually continuously decreased with the increase of the training time of the algorithm, i.e. more optimal actions are performed to quickly converge into the corresponding solution.
However, the learning system may miss the optimal solution and converge to a sub-optimal solution, or more likely to converge to a normal solution, because the environment is not fully explored.
2) The maximization bias problem
In the Q-Learning algorithm, the strategy updating method is that
of formula (1-4) above, where the max operator is used: when updating the policy, the action a with the largest Q(s, a) is always chosen. However, because of noise, this selection method can bias the final result; this is the maximization bias problem.
Although repeatedly applying the action with the maximum Q(s, a) yields an action strategy that maximizes the cumulative reward, the performance of the Q-Learning algorithm deteriorates when the Q(s, a) estimates are not accurate. Moreover, when learning keeps taking the action with the maximum Q(s, a), it is likely to latch onto state-actions that happened to have high Q(s, a) values early in training, so that learning converges too fast and some potentially optimal strategies are missed.
3) Slow update speed problem
The Learning process of the Q-Learning algorithm is an iterative process: the mapping strategy from states to actions is improved gradually through continuous trial and error and action selection, which requires the learning system to carry out many trials and corrections on the feedback information for each possible state-action in order to obtain a suitable control strategy.
Aiming at the problems, the invention provides a single robot path planning method based on a Q-Learning algorithm, which comprises the following steps:
s1: initializing an exploration factor, the maximum iteration number, a termination state parameter, a target state parameter, a maximum counting threshold value, an update starting time, an iteration number, a current time, an action value function, the access number of a state-action pair, a successful path and a successful path storage table of the single robot system;
s2: judging whether the iteration times are larger than the maximum iteration times, if so: step S6 is executed; if not: initializing the current state parameters and executing the next step;
s3: generating a random number, comparing the random number with the exploration factor, and selecting an action instruction; according to the action instruction, calculating the running state parameter and the reward function generated after the robot executes the action instruction; then judging whether the running state parameter is equal to the termination state parameter; if so: continuing to judge whether the running state parameter is equal to the target state parameter; if it is, storing the successful path into the successful path storage table, incrementing the iteration count by one, and returning to step S2; otherwise, incrementing the iteration count by one and returning to step S2; if not: executing the next step;
s4: judging whether the update start time is less than or equal to the current time; if so: storing the reward function, incrementing the access count of the state-action pair by one, and executing the next step; if not: judging whether the access count of the state-action pair is equal to the maximum count threshold; if so, updating the action value function and executing the next step; if not, executing the next step;
s5: storing the running state parameter into the successful path, incrementing the current time by one, and returning to step S3;
s6: acquiring the action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and then repeating the following: executing the action instruction to generate a state parameter and selecting the next action instruction according to that state parameter; when the generated state parameter is equal to the preset target state parameter, the optimal path of the single robot system is obtained.
In this way, whether to update the action value function (the Q value) is decided by setting a maximum count threshold in the Q-Learning algorithm and comparing the number of accesses of each state-action pair with that threshold; namely, the action value function is updated only when the access count of a state-action pair reaches the maximum count threshold. First, this scheme does not reduce exploration of the environment, so the path planning effect is preserved. Second, it reduces the amount of computation and greatly improves the update and learning speed. Third, this way of updating the action value function has a multi-step prediction capability, since it takes into account the influence of several future state-action pairs on the action value function, so the learned control strategy is more reasonable. Finally, because the access count of the state-action pair is used as the basis for updating the action value function, the preceding steps are not affected while the update and learning speed is maintained; that is, the value of the exploration factor does not need to be reduced, which avoids the problem of the learning system missing the optimal solution through insufficient exploration of the environment.
The first embodiment is as follows:
as shown in fig. 1: the single robot path planning method based on the Q-Learning algorithm comprises the following steps:
s1: the method comprises the steps of initializing an action set A, a state set S, a maximum iteration number n, a maximum searching step number m, a minimum path number MinPathNum, a maximum successful path searching number MaxSuccessNum, a searching factor epsilon, a searching factor single updating step size eSize, a searching factor changing period eCycle, a maximum counting threshold h, an updating starting time B (S, a), an updating finishing time, an action value function Q (S, a), an access number C (S, a) of a state action pair, a reward function storage U (S, a), a successful path searching number SuccessNum, a successful path number PathNum, a successful path PathList, a successful path storage List, an iteration number i, a current time t and a target state parameter.
S2: judging whether the iteration times i are larger than the maximum iteration times n, if so: step S6 is executed; if not: judging whether the successful path searching times Successfum are greater than the prestored maximum successful path searching times MaxSuccessfum or not, if so, judging that: updating the value of the exploration factor epsilon and executing the next step; if not: the next step is performed.
When the value of the exploration factor is updated, it is judged whether the successful path number PathNum is smaller than the pre-stored minimum path number MinPathNum; if so: execute ε′ = ε + eSize × (MinPathNum − PathNum) and then let ε = ε′, where ε is the exploration factor before updating, ε′ is the exploration factor after updating, eSize is the pre-stored single update step of the exploration factor, MinPathNum is the minimum path number, and PathNum is the successful path number; if not: execute ε′ = ε − eSize × (i/eCycle) and then let ε = ε′, where i is the iteration count and eCycle is the stored change period of the exploration factor.
S3: generating a random number sigma belonged to (0, 1), comparing the value of the random number sigma with the value of the exploration factor epsilon, and selecting an action command atAccording to the action command atCalculating and generating an operation state parameter s after the robot executes the action commandt+1And a reward function rt+1(ii) a Judging an operating state parameter st+1Whether it is equal to the termination state parameter, if so: judging the operating state parameter st+1If the number of successful path numbers is not equal to the target state parameter, executing step S2, executing step S2, and executing step S; if not: the next step is performed.
If the value of the random number sigma is larger than the exploration factor epsilon, selecting an action a according to a prestored probability modelt(ii) a If the value of the random number sigma is less than or equal to the exploration factor epsilon, randomly selecting the action from the action set Aat(ii) a Selecting action command a by probability modeltIs of the formula
Figure GDA0002900973720000091
Wherein P (s | a)k) Selecting action command a under selected state parameter SkProbability of (a), Q (s, a)k) Is an action command a under the state parameter SkThe value of Q of (A) is,
Figure GDA0002900973720000092
the sum of the Q values of all the action commands in the state parameter S.
S4: judging whether the updating start time B (s, a) is less than or equal to the current time t, if so: then the reward function rt+1Storing in a reward function storage U (s, a), performing a self-increment of the number of accesses C (s, a) of the state-action pair, and performing the next step; if not: judging whether the access times C (s, a) of the state-action pairs are equal to a maximum counting threshold h, if so, updating an action value function Q (s, a), and executing the next step, if not: the next step is performed.
The formula for updating the motion value function is Q (s, a) ═ U (s, a)/h, where Q (s, a) is the motion value function, U (s, a) is the stored reward function, and h is the maximum count threshold.
S5: running state parameter st+1Stores the path into the success path PathList, performs self-addition of one at the current time t, and performs step S3.
S6: acquiring an action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and repeating: and executing the action command to generate a state parameter, selecting the action command according to the state parameter, and obtaining the optimal path of the single robot system when the generated state parameter is equal to a preset target state parameter.
In order to better introduce the path planning process, the embodiment also discloses a flow chart of the single robot path planning method based on the Q-Learning algorithm.
As shown in fig. 2 and 3: the single robot path planning process based on the Q-Learning algorithm comprises the following steps:
the method comprises the following steps: initializing an action value function Q (S, a), an action set A, a state set S, a maximum iteration number n, a maximum searching step number m, a minimum path number MinPathNum, a maximum successful path searching number MaxSuccessfum, a searching factor epsilon, a searching factor single updating step size eSize, a searching factor changing period eCycle, a state action pair accessed number C (S, a), a starting updating time B (S, a), a finishing updating time E (S, a) reward function storage U (S, a), whether L (S, a) is learned or not, a maximum counting threshold h, a successful path searching number Successfum, a successful path number PathNum, a successful path PathList, a successful path storage List, an iteration number i and a current time t.
Wherein Q(s, a) = 0, C(s, a) = 0, U(s, a) = 0, SuccessNum = 0, PathNum = 0, PathList = 0, List = 0, i = 1, and t = 1.
Step two: judging whether i is larger than n, if so: then the learning is ended; if not: if yes, executing step S3, if yes, executing step S, and if no, executing step S, and if yes, executing step S, and otherwise, executing step S.
Wherein, when updating the value of the search factor epsilon: if PathNum is less than the MinPathNum, then the formula ε + eSize x (MinPathNum-PathNyn) is adopted; if PathNum is greater than or equal to MinPathNum, adopting a formula epsilon-eSize x (i/eCycle); in the formula, epsilon is an exploration factor, eZize is a single updating step length of the exploration factor, MinPathNum is the minimum path number, PathNum is the successful path number, i is the iteration number, and eCycle is an exploration factor changing period.
Step three: the state S is initialized, S ∈ S.
Step four: judging whether t is larger than m, if yes: executing i +1 and returning to the step two; if not: generating a random number sigma epsilon (0, 1), judging whether sigma is greater than epsilon, if so, selecting the state s according to a probability expressiontAct a of executiontIf not, then randomly selecting action at,at∈A。
Wherein the action a is selected according to a probabilistic formulatThe formula of (1) is:
Figure GDA0002900973720000101
wherein P (s | a)k) Selecting action command a under selected state parameter SkProbability of (a), Q (s, a)k) Is an action command a under the state parameter SkThe value of Q of (A) is,
Figure GDA0002900973720000102
the sum of the Q values of all the action commands in the state parameter S.
Step five: performing action atObtain the state st+1And a prize rt+1
Step six: judging the state st+1Whether the state is a termination state or not is judged, if so, the method comprises the following steps: then re-determine the state st+1If it is the target state, if it is the state st+1If the target state is reached, executing the following operation that after adding one to the SuccessNum value, judging that the PathList is not in the List at the moment, if the PathList is not in the List, adding the PathList into the List, adding one to the PathNum value, adding one to the value of i, executing the step two, if the target state is not reached, adding one to the value of i, and returning to the step two; if not: step seven is performed.
Step seven: judging whether B (s, a) is less than or equal to t (namely when the updating time of the last action value function Q (s, a) is before the step), if so: l (s, a) ═ true, even if it does learning; if not, executing step eight.
Step eight: determining whether the value of L (s, a) is true, if true: if C (s, a) is equal to 0, then learning is started at this point, that is, B (s, a) is made t, if not equal to 0, no operation is performed, and after the determination for C (s, a) is completed, C (s, a) +1 is executed (the number of accesses is increased once), and U (s, a) + r is increased once)r+1maxQ(st+1A) (store award); if not: step nine is performed.
Step nine: judging whether C (s, a) is equal to h (whether the access times reach the maximum counting threshold), if so: q (s, a) ═ U (s, a)/h (average of the previous h-step reward values), U (s, a) ═ 0 (clear reward), C (s, a) ═ 0 (clear access times), and at the same time, let update time E (s, a) ═ i.
Step ten: judging whether E (s, a) is larger than or equal to E (s, a), if so: let L (s, a) be true, U (s, a) be 0, and C (s, a) be 0; if not: step eleven is executed.
Step eleven: will st+1Put into PathList, s ← st+1Change the current state to st+1. And adding one to the value of t, and executing the step four.
Example two:
the embodiment discloses a simulation experiment for path planning of a single robot.
First, description of simulation experiment
1) For the simulation experiments, the software platform used the Windows 10 operating system, the CPU was an Intel Core i5-8400, and the running memory was 16 GB. The path planning algorithm of the single robot system was implemented with the Python language and the TensorFlow deep learning tool to complete the simulation experiment, and the multi-robot path planning algorithm was written in the MATLAB language on matlab2016a simulation software.
2) The environment will be described using a grid method, which divides the working space of the robot system into small grids, each of which may represent a state of the robot system. In the map, a white grid indicates a safety area, and a black grid indicates the presence of an obstacle.
The target state and the obstacle in the environment are both stationary and the obstacle and border position in the environment are unknown to the robot. In subsequent experiments, the working space of the robot was a 10 × 10 or 20 × 20 grid map, respectively.
In the simulation process, the motion route and the initial state of the robot are represented by 1, and the target state is represented by 2.
3) The MDP quadruplets of a single robot system are defined as follows:
and (3) action set: the actions that each robot can take are defined as four actions, up, down, left, and right. Represented in the grid diagram as: the robot moves along the blank grid, can move from one grid to four adjacent grids, up, down, left and right, can not transfer across grids, and can not transfer along a diagonal line (for example, upper left).
The motion space of the robot system is: a ═ {0, 1,2,3 }. Where 0 represents up, 1 represents down, 2 represents left, and 3 represents right.
State collection: in the grid map, each grid corresponds to a state, and the state space of the system is S = {1, 2, 3, …, 100} or S = {1, 2, 3, …, 400}. The grid state the robot occupies at any time can be expressed as s_t = (x_t, y_t).
The robot reaches the black grid (obstacle) or the yellow grid (target state), i.e. the end state. Once the robot becomes the termination state, the round of training is finished, and the robot returns to the initial state again to perform the next round of training.
Migration function: when the robot selects an action from the action set, it moves to the next grid provided that this grid is not an obstacle or a boundary wall.
The transfer function of each of the four actions maps the current grid (x_t, y_t) to the adjacent grid in the chosen direction (up, down, left or right); otherwise the state does not change. (The four per-action transfer-function equations are given in the original figures.)
the reward function: in a single robot system, each step of movement of the robot is set to obtain an immediate reward of-1, which represents the movement consumption of the robot and forces the robot to reach a target state more quickly; when the robot reaches the target state, namely the robot reaches the yellow grid, an immediate reward of 10 can be obtained; when the robot collides with an obstacle, i.e. the robot enters the black grid, an immediate reward of-10 is obtained. Thus, the stand-alone robotic system reward function may be defined as:
Figure GDA0002900973720000131
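A Python sketch of such a grid environment. The coordinate convention (here, action 0 "up" decreases y), the handling of boundary walls (stay in place with reward −1) and the class interface are assumptions for illustration; the reward values −1, +10 and −10 follow the definition above:

```python
class GridWorld:
    """Sketch of the grid-method environment: white cells are safe, black cells are obstacles."""

    def __init__(self, width, height, obstacles, start, target):
        self.width, self.height = width, height
        self.obstacles = set(obstacles)
        self.start, self.target = start, target

    def reset(self):
        return self.start

    def step(self, state, action):
        """Apply the transfer function; return (next_state, reward, terminal, target_reached)."""
        x, y = state
        dx, dy = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0)}[action]  # 0 up, 1 down, 2 left, 3 right
        nx, ny = x + dx, y + dy
        if not (0 <= nx < self.width and 0 <= ny < self.height):
            return state, -1, False, False           # boundary wall: stay put (assumed handling)
        next_state = (nx, ny)
        if next_state == self.target:
            return next_state, 10, True, True        # reached the target state: reward 10
        if next_state in self.obstacles:
            return next_state, -10, True, False      # collided with an obstacle: reward -10
        return next_state, -1, False, False          # ordinary move: reward -1
```

An instance such as GridWorld(10, 10, obstacles, (0, 0), (7, 7)) would correspond to the 10 × 10 setting of experiment one below.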
second, parameter initialization
In this embodiment, the parameters are set as follows:
1) The learning rate α takes a value between 0 and 1 (if the learning rate is too small, convergence is slow; if it is too large, the algorithm may fail to converge to the optimal value; here the learning rate α is initialized to 0.01).
2) The discount factor γ takes a value between 0 and 1 and determines whether the robot pays more attention to the immediate benefit or the long-term benefit (if γ approaches 0, the immediate reward matters more; if it approaches 1, the robot pays more attention to the long-term benefit; in the single robot path planning simulation the discount factor γ is set to 0.8).
3) Maximum number of exploration steps m: in this embodiment the maximum number of exploration steps in each round of training is set to 200 (if the robot has explored more than 200 steps and still has not reached the target state, the strategy adopted in that round is unsuitable, so the round is terminated and the next round of training starts directly).
4) Exploration factor ε: the initial value of ε is set to 0.4, the minimum path number MinPathNum is set to 2, the maximum successful path-finding count MaxSuccessNum is set to 10, and the single update step eSize of the exploration factor is set according to the complexity of the environment: a simpler environment allows a larger step, while a more complex environment calls for a smaller step. These values are collected in the configuration sketch below.
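A plain Python dictionary holding the values just listed (the key names are illustrative assumptions, not identifiers from the patent):

```python
# Hyperparameters of the simulation, as stated above; key names are illustrative.
params = {
    "alpha": 0.01,           # learning rate
    "gamma": 0.8,            # discount factor
    "max_steps_m": 200,      # maximum exploration steps per training round
    "epsilon": 0.4,          # initial exploration factor
    "min_path_num": 2,       # MinPathNum
    "max_success_num": 10,   # MaxSuccessNum
    # eSize (single update step of epsilon) is chosen according to environment complexity
}
```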
Experiment one:
experiment one, a 10 × 10 grid map is adopted, the obstacle positions are randomly given, the initial state of the robot is (0,0), and the target state is (7, 7). FIG. 4 shows a planned path obtained by performing a simulation experiment using a conventional Q-Learning algorithm; fig. 5 shows a planned path obtained by a simulation experiment using the modified Q-Learning algorithm.
1) As shown in fig. 4 and 5: the gray grid represents the traveling path. It can be clearly seen from the figures that the conventional path (from the existing path planning method) contains more turning points, while the improved path (from the path planning method of the present invention) is smoother, which shows that the solution obtained with the improved Q-Learning algorithm is better than the solution obtained by the conventional method.
2) As can be seen from fig. 6 and 7: the traditional method (the existing path planning method) first finds a safe, collision-free path reaching the target state at about 700 seconds, while the improved method (the path planning method of the present invention) first finds such a path at about 300 seconds. Thus, in the initial stage of training the conventional Q-Learning algorithm essentially cannot find a path to the target state, whereas the improved algorithm finds one faster. It can also be seen that in both cases the probability of successfully finding a path gradually increases with training time, but after the improvement the speed and the number of successful findings grow noticeably faster. The traditional method does not converge until around 900 seconds, while the improved method converges at around 500 seconds.
These two points show that the improved Q-Learning algorithm greatly increases efficiency compared with the traditional algorithm.
Experiment two:
in the second experiment, a 20 × 20 grid map is taken, the obstacle positions are randomly set, the initial state of the robot is (0,3), and the target state is (15, 15). Compared with the environmental model of the first experiment, the environmental model of the second experiment becomes more complex, the number of randomly given obstacles is more, and a plurality of concave areas exist, so that the difficulty of robot path planning is increased.
1) As shown in fig. 8: the gray grid represents the traveling path of the robot, and the planned path which successfully reaches the target state is displayed in the grid map and is obtained based on a single robot path planning algorithm (the path planning method in the invention) of the improved Q-Learning algorithm.
2) The environment model of experiment two is complex and contains concave regions. As shown in fig. 9 and 10: when the traditional Q-Learning algorithm (the existing path planning method) is used for path planning, a path reaching the target state is still not found after 1000 episodes of training. Observing the training process shows that the traditional Q-Learning algorithm keeps falling into the concave regions, so learning does not succeed. The improved Q-Learning algorithm (the path planning method of the present invention) remains feasible in the complex environment: a safe, collision-free path reaching the target state is successfully found at around episode 500, after which the algorithm gradually converges. In addition, within the training time, although the traditional Q-Learning algorithm does not converge, the improved algorithm consumes less time over 1000 episodes of training, which indicates that its updates are more efficient.
Combining the two comparison results shows that the improved Q-Learning algorithm (the path planning method of the present invention) is still feasible in a complex environment, has a faster update speed, and has practical application significance.
The foregoing is merely an example of the present invention, and common general knowledge in the field of known specific structures and characteristics is not described herein in any greater extent than that known in the art at the filing date or prior to the priority date of the application, so that those skilled in the art can now appreciate that all of the above-described techniques in this field and have the ability to apply routine experimentation before this date can be combined with one or more of the present teachings to complete and implement the present invention, and that certain typical known structures or known methods do not pose any impediments to the implementation of the present invention by those skilled in the art. It should be noted that, for those skilled in the art, without departing from the structure of the present invention, several changes and modifications can be made, which should also be regarded as the protection scope of the present invention, and these will not affect the effect of the implementation of the present invention and the practicability of the patent. The scope of the claims of the present application shall be determined by the contents of the claims, and the description of the embodiments and the like in the specification shall be used to explain the contents of the claims.

Claims (6)

1. The single robot path planning method based on the Q-Learning algorithm is characterized by comprising the following steps of:
s1: initializing exploration factors, maximum iteration times, termination state parameters, target state parameters, maximum counting threshold values, update starting time, iteration times, current time, action value functions, access times of state-action pairs, successful paths, successful path storage tables, successful path searching times, maximum successful path searching times, successful path numbers and minimum path numbers of the single robot system;
s2: judging whether the iteration times are larger than the maximum iteration times, if so: step S6 is executed; if not: firstly, judging whether the successful path searching times are greater than the prestored maximum successful path searching times, if so: updating the value of the exploration factor and executing the next step; if not: executing the next step;
s3: generating a random number, comparing the random number with the exploration factor, selecting an action instruction, and calculating and generating an operation state parameter and a reward function after the robot executes the action instruction according to the action instruction; then, judging whether the running state parameter is equal to the termination state parameter, if so: continuing to judge whether the running state parameter is equal to the target state parameter, if so, storing the successful path into a successful path storage table, executing the self-adding of the iteration times, returning to the step S2, otherwise, executing the self-adding of the iteration times, and returning to the step S2; if not: executing the next step;
s4: judging whether the updating starting time is less than or equal to the current time, if so: storing a reward function, executing the access times of the state-action pair to be self-increased by one, and executing the next step; if not: judging whether the access times of the state-action pairs are equal to a maximum counting threshold value or not, if so, updating an action value function, executing the next step, and if not, executing the next step;
s5: storing the running state parameter into the success path, performing self-addition of one at the current moment, and returning to the step S3;
s6: acquiring an action value function, selecting an action instruction from the action value function according to a preset initial state parameter, and repeating: and executing the action command to generate a state parameter, selecting the action command according to the state parameter, and obtaining the optimal path of the single robot system when the generated state parameter is equal to a preset target state parameter.
2. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 1, wherein: in step S4, the action value function is updated according to the formula Q(s, a) = U(s, a)/h, where Q(s, a) is the action value function, U(s, a) is the stored reward function, and h is the maximum count threshold.
3. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 1, wherein: in step S3, if the running state parameter is equal to the termination state parameter and equal to the target state parameter, a preset successful path-finding count is incremented by one before step S2 is performed.
4. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 3, wherein: in step S2, when the value of the exploration factor is updated, it is first determined whether the successful path number is smaller than the pre-stored minimum path number; if so: execute ε′ = ε + eSize × (MinPathNum − PathNum) and then let ε = ε′, wherein ε is the exploration factor before updating, ε′ is the exploration factor after updating, eSize is the pre-stored single update step of the exploration factor, MinPathNum is the minimum path number, and PathNum is the successful path number; if not: execute ε′ = ε − eSize × (i/eCycle) and then let ε = ε′, wherein i is the iteration count and eCycle is the pre-stored change period of the exploration factor.
5. The method for Q-Learning algorithm-based standalone robot path planning as claimed in claim 1, wherein: in step S3, when comparing the random number with the search factor, if the random number is greater than the search factor, selecting an action command according to a pre-stored probability model; and if the random number is less than or equal to the exploration factor, randomly selecting an action command from a pre-stored action set.
6. The Q-Learning algorithm-based single robot path planning method as claimed in claim 5, wherein: in step S3, the probability model selects the action instruction according to the formula

P(s | a_k) = Q(s, a_k) / Σ_j Q(s, a_j)

where P(s | a_k) is the probability of selecting the action instruction a_k under the state parameter s, Q(s, a_k) is the Q value of the action instruction a_k under the state parameter s, and Σ_j Q(s, a_j) is the sum of the Q values of all action instructions under the state parameter s.
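A minimal sketch of such Q-proportional (roulette-wheel) selection; shifting negative Q values and falling back to a uniform choice when all weights are zero are assumptions added so the sketch stays well defined:

```python
import random

def select_by_probability_model(Q, state, actions):
    """Choose an action with probability proportional to its Q value in `state`.
    `Q` is a dict keyed by (state, action) pairs."""
    q_values = [Q.get((state, a), 0.0) for a in actions]
    shift = min(q_values)
    # Shift so all weights are non-negative (assumption of this sketch).
    weights = [q - shift for q in q_values] if shift < 0 else list(q_values)
    if sum(weights) == 0:
        return random.choice(actions)    # degenerate case: uniform choice
    return random.choices(actions, weights=weights, k=1)[0]
```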
CN201910737476.6A 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm Expired - Fee Related CN110378439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910737476.6A CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Publications (2)

Publication Number Publication Date
CN110378439A CN110378439A (en) 2019-10-25
CN110378439B true CN110378439B (en) 2021-03-30

Family

ID=68258789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910737476.6A Expired - Fee Related CN110378439B (en) 2019-08-09 2019-08-09 Single robot path planning method based on Q-Learning algorithm

Country Status (1)

Country Link
CN (1) CN110378439B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859099B (en) * 2019-12-05 2021-08-31 马上消费金融股份有限公司 Recommendation method, device, terminal and storage medium based on reinforcement learning
CN111080013B (en) * 2019-12-18 2022-02-22 南方科技大学 Addressing way-finding prediction method, device, equipment and computer readable storage medium
CN113111296A (en) * 2019-12-24 2021-07-13 浙江吉利汽车研究院有限公司 Vehicle path planning method and device, electronic equipment and storage medium
CN113534826B (en) * 2020-04-15 2024-02-23 苏州宝时得电动工具有限公司 Attitude control method and device of self-mobile device and storage medium
CN111594322B (en) * 2020-06-05 2022-06-03 沈阳航空航天大学 Variable-cycle aero-engine thrust control method based on Q-Learning
CN111649758B (en) * 2020-06-16 2023-09-15 华东师范大学 Path planning method based on reinforcement learning algorithm in dynamic environment
CN112015174B (en) * 2020-07-10 2022-06-28 歌尔股份有限公司 Multi-AGV motion planning method, device and system
CN111857081B (en) * 2020-08-10 2023-05-05 电子科技大学 Chip packaging test production linear energy control method based on Q-learning reinforcement learning
CN112327890A (en) * 2020-11-10 2021-02-05 中国海洋大学 Underwater multi-robot path planning based on WHCA algorithm
CN112595326A (en) * 2020-12-25 2021-04-02 湖北汽车工业学院 Improved Q-learning path planning algorithm with fusion of priori knowledge
CN113062601B (en) * 2021-03-17 2022-05-13 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN114518758B (en) * 2022-02-08 2023-12-12 中建八局第三建设有限公司 Indoor measurement robot multi-target point moving path planning method based on Q learning
CN117634548A (en) * 2024-01-26 2024-03-01 西南科技大学 Unmanned aerial vehicle behavior tree adjustment and optimization method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819264B (en) * 2012-07-30 2015-01-21 山东大学 Path planning Q-learning initial method of mobile robot
CN105137967B (en) * 2015-07-16 2018-01-19 北京工业大学 The method for planning path for mobile robot that a kind of depth autocoder is combined with Q learning algorithms
CN107317756A (en) * 2017-07-10 2017-11-03 北京理工大学 A kind of optimal attack paths planning method learnt based on Q
CN108762249B (en) * 2018-04-26 2019-11-08 常熟理工学院 Clean robot optimum path planning method based on the optimization of approximate model multistep
CN109933086B (en) * 2019-03-14 2022-08-30 天津大学 Unmanned aerial vehicle environment perception and autonomous obstacle avoidance method based on deep Q learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180354126A1 (en) * 2017-06-07 2018-12-13 Fanuc Corporation Controller and machine learning device
CN108594803A (en) * 2018-03-06 2018-09-28 吉林大学 Paths planning method based on Q- learning algorithms
CN109445440A (en) * 2018-12-13 2019-03-08 重庆邮电大学 The dynamic obstacle avoidance method with improvement Q learning algorithm is merged based on sensor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Deterministic Improved Q-Learning for Path Planning of a Mobile Robot; Amit Konar et al.; IEEE Transactions on Systems, Man, and Cybernetics: Systems; 2013-09-30; Vol. 43, No. 5; 1141-1153 *
Incremental Multi-Step Q-Learning; Jing Peng and Ronald J. Williams; Machine Learning; 1996-12-31; 283-290 *
Application of Improved Q-Learning Algorithm in Path Planning; Gao Le et al.; Journal of Jilin University (Information Science Edition); 2018-07-31; Vol. 36, No. 4; 439-443 *

Also Published As

Publication number Publication date
CN110378439A (en) 2019-10-25

Similar Documents

Publication Publication Date Title
CN110378439B (en) Single robot path planning method based on Q-Learning algorithm
CN110321666B (en) Multi-robot path planning method based on priori knowledge and DQN algorithm
Jiang et al. Path planning for intelligent robots based on deep Q-learning with experience replay and heuristic knowledge
CN102819264B (en) Path planning Q-learning initial method of mobile robot
CN110883776B (en) Robot path planning algorithm for improving DQN under quick search mechanism
CN110632922B (en) Path planning method based on bat algorithm and reinforcement learning
CN111982125A (en) Path planning method based on improved ant colony algorithm
CN113296520A (en) Routing planning method for inspection robot by fusing A and improved Hui wolf algorithm
CN115167478B (en) Robot map-free path planning method and system based on deep reinforcement learning
CN109540163A (en) A kind of obstacle-avoiding route planning algorithm combined based on differential evolution and fuzzy control
CN114756027A (en) Mobile robot path planning method based on improved ant colony algorithm and Bezier curve
CN113848911B (en) Mobile robot global path planning method based on Q-learning and RRT
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN115129064A (en) Path planning method based on fusion of improved firefly algorithm and dynamic window method
CN117471919A (en) Robot path planning method based on improved pelican optimization algorithm
Shill et al. Genetic algorithm based fully automated and adaptive fuzzy logic controller
Du et al. Application of an improved whale optimization algorithm in time-optimal trajectory planning for manipulators
CN116578080A (en) Local path planning method based on deep reinforcement learning
Wang et al. Q-learning-based Collision-free Path Planning for Mobile Robot in Unknown Environment
Sun Path Planning of Mobile Robot Based on Improved Ant Colony Algorithm
Tang et al. Reinforcement learning for robots path planning with rule-based shallow-trial
Li et al. A novel path planning algorithm based on Q-learning and adaptive exploration strategy
Zhang et al. Using partial-policy q-learning to plan path for robot navigation in unknown enviroment
Yu et al. An intelligent robot motion planning method and application via lppo in unknown environment
CN113503878B (en) Unmanned ship path planning method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210330