LU101606B1 - Path planning method and system based on combination of safety evacuation signs and reinforcement learning - Google Patents

Path planning method and system based on combination of safety evacuation signs and reinforcement learning

Info

Publication number
LU101606B1
Authority
LU
Luxembourg
Prior art keywords
agent
safety evacuation
agents
path planning
value table
Prior art date
Application number
LU101606A
Other languages
French (fr)
Inventor
Lei Lv
Limei Zhou
Xiukai Zhao
Chen Lv
Guijuan Zhang
Hong Liu
Original Assignee
Univ Shandong
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Shandong filed Critical Univ Shandong
Application granted granted Critical
Publication of LU101606B1 publication Critical patent/LU101606B1/en

Links

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 Instruments for performing navigational calculations
    • G01C21/206 Instruments for performing navigational calculations specially adapted for indoor navigation
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01C MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/26 Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
    • G01C21/34 Route searching; Route guidance
    • G01C21/3446 Details of route searching algorithms, e.g. Dijkstra, A*, arc-flags, using precalculated routes

Landscapes

  • Engineering & Computer Science (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Automation & Control Theory (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Alarm Systems (AREA)

Abstract

The present disclosure provides a path planning method and system based on a combination of safety evacuation signs and reinforcement learning. The path planning method comprises: establishing and rasterizing a two-dimensional simulation scenario model, and initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model; and performing path planning in combination with the safety evacuation signs and a Q-Learning algorithm, specifically: initializing Q values corresponding to respective agents in a Q value table to 0; acquiring state information of each agent at the current moment, calculating a corresponding reward, and selecting the action with the largest corresponding Q value to move each agent; calculating an instant reward for each agent moved to its new location, updating the Q value table, and judging whether the Q value table converges; if so, obtaining an optimal path sequence; otherwise, receiving and aggregating the input environmental information sent by each agent and its corresponding state, action, reward and output environmental information, then distributing the aggregated information to each agent, and continuing to move each agent.

Description

PATH PLANNING METHOD AND SYSTEM BASED ON COMBINATION OF SAFETY EVACUATION SIGNS AND REINFORCEMENT LEARNING

Field of the Invention

The present disclosure belongs to the field of path planning, and particularly relates to a path planning method and system based on a combination of safety evacuation signs and reinforcement learning.

Background of the Invention

This section merely provides background information related to the present disclosure and does not necessarily constitute the prior art.

In recent years, with the rapid development of China's urbanization, the number and scale of buildings in urban public places have continuously expanded, and the safety pressure that must be borne has increased accordingly. How to realistically and quickly simulate an evacuation path for crowds when an accident occurs in a public place has therefore become an important issue that needs to be solved urgently. A simulated crowd evacuation path can assist the safety department in predicting the crowd evacuation process when an accident occurs, and in proposing an effective motion planning solution that shortens the evacuation time and reduces the number of casualties.

The inventors found that the existing, relatively mature motion planning algorithms, including the A-star algorithm, the artificial potential field algorithm, cellular automata, the simulated annealing algorithm, the genetic algorithm and reinforcement learning algorithms, cannot quickly adapt to and learn intricate environments and make timely responses, which results in low path planning efficiency and poor accuracy.

Summary of the Invention

In order to solve the above problems, a first aspect of the present disclosure provides a path planning method based on a combination of safety evacuation signs and reinforcement learning, where safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.
In order to achieve the above objective, the present disclosure adopts the following technical solution:

A path planning method based on a combination of safety evacuation signs and reinforcement learning, including:

step 1: establishing and rasterizing a two-dimensional simulation scenario model, and initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model; and

step 2: performing path planning in combination with the safety evacuation signs and a Q-Learning algorithm;

wherein the specific process of step 2 is:

step 2.1: initializing the Q values corresponding to the respective agents in a Q value table to 0;

step 2.2: acquiring the state information of each agent at the current moment, calculating a corresponding reward, and selecting the action with the largest corresponding Q value to move each agent;

step 2.3: calculating an instant reward for each agent moved to its new location, updating the Q value table, and judging whether the Q value table converges; if so, obtaining an optimal path sequence; otherwise, proceeding to the next step; and

step 2.4: receiving and aggregating the input environmental information sent by each agent together with its corresponding state, action, reward and output environmental information, then distributing the aggregated information to each agent to achieve information sharing, and returning to step 2.2.
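For illustration only, the iterative procedure of step 2 can be sketched as the following loop. This is a minimal sketch: the environment interface (observe_state, step, converged, greedy_path, share_information), the agent attributes and the learning-rate value are assumptions for readability, not part of the claimed method.

```python
import numpy as np

def plan_paths(agents, env, episodes=500, alpha=0.1, gamma=0.8):
    """Sketch of step 2: tabular Q-Learning guided by safety evacuation signs."""
    # Step 2.1: one Q table per agent, every entry initialized to 0.
    q_tables = {a.id: np.zeros((env.num_states, env.num_actions)) for a in agents}

    for _ in range(episodes):
        for agent in agents:
            s = env.observe_state(agent)                # step 2.2: state at the current moment
            a = int(np.argmax(q_tables[agent.id][s]))   # action with the largest Q value
            s_next, r = env.step(agent, a)              # step 2.3: instant reward at the new location
            q = q_tables[agent.id]
            q[s, a] += alpha * (r + gamma * q[s_next].max() - q[s, a])
        if all(env.converged(q_tables[a.id]) for a in agents):
            return [env.greedy_path(q_tables[a.id]) for a in agents]  # optimal path sequences
        env.share_information(agents, q_tables)         # step 2.4: aggregate and redistribute
    return None
```

The convergence test and the environment interface are placeholders; the update inside the loop follows the Q-Learning formula given in the detailed description, with gamma = 0.8 as used in the embodiments.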
In order to solve the above problems, a second aspect of the present disclosure provides a path planning system based on a combination of safety evacuation signs and reinforcement learning, where safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.
In order to achieve the above objective, the present disclosure adopts the following technical solution:

A path planning system based on a combination of safety evacuation signs and reinforcement learning, including:

a two-dimensional simulation scenario model initializing module, configured to establish and rasterize a two-dimensional simulation scenario model, and initialize obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model; and

a path planning module, configured to perform path planning in combination with the safety evacuation signs and a Q-Learning algorithm;

wherein the path planning module includes:

a Q value table initializing module, configured to initialize the Q values corresponding to the respective agents in a Q value table to 0;

an agent moving module, configured to acquire the state information of each agent at the current moment, calculate a corresponding reward, and select the action with the largest corresponding Q value to move each agent;

a Q value table convergence judging module, configured to calculate an instant reward for each agent moved to its new location, update the Q value table, judge whether the Q value table converges, and obtain an optimal path sequence when the Q value table converges; and

an information sharing module, configured to receive and aggregate, when the Q value table does not converge, the input environmental information sent by each agent together with its corresponding state, action, reward and output environmental information, then distribute the aggregated information to each agent to achieve information sharing, continue to move each agent according to the Q values to update the Q value table, and judge whether the updated Q value table converges.

In order to solve the above problems, a third aspect of the present disclosure provides a computer-readable storage medium, where safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.
In order to achieve the above objective, the present disclosure adopts the following technical solution:

A computer-readable storage medium, storing a computer program thereon, wherein when the program is executed by a processor, the steps in the path planning method based on a combination of safety evacuation signs and reinforcement learning are implemented.
In order to solve the above problems, a fourth aspect of the present disclosure provides a computer device, where safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.
In order to achieve the above objective, the present disclosure adopts the following technical solution:

A computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the program, the steps in the path planning method based on a combination of safety evacuation signs and reinforcement learning are implemented.
Beneficial effects of the present disclosure:

(1) In the present disclosure, safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.

(2) Due to the lack of prior knowledge, the path found by reinforcement learning in the initial iterations is often not optimal. To solve this problem, multi-agent information sharing is used to enlarge the region of the environment for which information is grasped, improve the search efficiency and reduce the time needed to arrive at the destination.
Brief Description of the Drawings

The accompanying drawings constituting a part of the present disclosure are used for providing a further understanding of the present disclosure, and the schematic embodiments of the present disclosure and the descriptions thereof are used for interpreting the present disclosure, rather than constituting improper limitations to the present disclosure.

Fig. 1 is a flowchart of a path planning method based on a combination of safety evacuation signs and reinforcement learning according to an embodiment of the present disclosure.

Fig. 2 is a two-dimensional modeling effect diagram according to an embodiment of the present disclosure.

Fig. 3 is a schematic diagram of setting the locations of safety evacuation signs according to an embodiment of the present disclosure.

Fig. 4 is a diagram of a path planning process combining safety evacuation signs and a Q-Learning algorithm according to an embodiment of the present disclosure.

Fig. 5 is a diagram of the interaction process between an agent and the environment according to an embodiment of the present disclosure.

Fig. 6 is a schematic diagram of agent information sharing according to an embodiment of the present disclosure.

Fig. 7 is a schematic structural diagram of a path planning system based on a combination of safety evacuation signs and reinforcement learning according to an embodiment of the present disclosure.

Fig. 8 is a schematic structural diagram of a path planning module according to an embodiment of the present disclosure.

Fig. 9 is a principle diagram of an information sharing module according to an embodiment of the present disclosure.

Detailed Description of Embodiments

The present disclosure will be further illustrated below in conjunction with the accompanying drawings and embodiments.
It should be pointed out that the following detailed descriptions are all exemplary and aim to further illustrate the present disclosure. Unless otherwise specified, all technical and scientific terms used here have the same meanings generally understood by those of ordinary skill in the art of the present disclosure.

It should be noted that the terms used herein are merely for describing specific embodiments, and are not intended to limit exemplary embodiments according to the present disclosure. As used herein, unless otherwise explicitly pointed out by the context, the singular form is also intended to include the plural form. In addition, it should also be understood that when the terms "include" and/or "comprise" are used in the Description, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Embodiment 1

As shown in Fig. 1, a path planning method based on a combination of safety evacuation signs and reinforcement learning in this embodiment includes:

Step 1: establishing and rasterizing a two-dimensional simulation scenario model, and initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model.

In order to improve the authenticity, the virtual environment is established based on real scenario data of a shopping mall.
The virtual environment is defined as a region of M*N; the region is then rasterized, and each grid is numbered. Each grid is represented by (x_i, y_i), where x_i represents the row in which the grid is located and y_i represents the column in which the grid is located. M and N are both positive integers.

In step 1, the process of initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model includes:

defining the agents as mass points having mass but no volume, and setting a circular region of a preset radius centered on each agent as its collision detection region;

setting the number, location and region size of the obstacles; and

setting the number, location, region size and indicated content of the safety evacuation signs.
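As an illustration only, the initialization of step 1 might be organized as below. This is a minimal sketch under assumed data structures; the class names, fields and the example values are not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    # Agents are mass points with a circular collision detection region.
    agent_id: int
    row: int
    col: int
    radius: float = 0.5           # preset collision radius (illustrative value)

@dataclass
class Scenario:
    M: int                        # number of grid rows
    N: int                        # number of grid columns
    obstacles: set = field(default_factory=set)    # set of (row, col) grids occupied by obstacles
    signs: dict = field(default_factory=dict)       # (row, col) -> indicated evacuation command

    def add_obstacle(self, row, col):
        self.obstacles.add((row, col))

    def add_sign(self, row, col, command):
        # command is one of the evacuation instructions, e.g. "go straight" or "no passing"
        self.signs[(row, col)] = command

# Example: a 20 x 30 mall scenario with one obstacle, one sign and two agents.
scenario = Scenario(M=20, N=30)
scenario.add_obstacle(5, 7)
scenario.add_sign(0, 15, "go straight")
agents = [Agent(agent_id=i, row=10, col=0) for i in range(2)]
```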
The two-dimensional modeling effect diagram is shown in Fig. 2.

Setting rules for the safety evacuation signs include:

Command setting rules for evacuation instructions are as follows: an up arrow (↑) indicates going straight; a left arrow (←) indicates going left; a right arrow (→) indicates going right; a cross (✕) indicates no passing; an up-down arrow (↕) indicates going forward or backward; a left-right arrow (↔) indicates going left or right; a turn-left arrow (↰) indicates turning left; a turn-right arrow (↱) indicates turning right; and the safety evacuation signs and their commands are stored into a database correspondingly.

Location setting rules for safety evacuation signs are as follows:

placing a preset number of safety evacuation signs at crowded regions, mall entrances and exits, and mall corners to prevent congestion;

placing a preset number of safety evacuation signs in remote regions to prevent people from being trapped; and

placing no-passing signs in room centers and in no-entry regions where potential safety hazards exist.

The placement in other regions should comply with the general rules for setting safety signs. For example: straight, left-turning or right-turning safety evacuation signs are mostly set at crowded regions, entrances, exits and corners, so that people can quickly make a choice there and avoid congestion; straight, left-turning or right-turning safety evacuation signs are mostly set in remote regions to prevent people unfamiliar with the paths from being trapped and unable to flee the scene; no-passing safety evacuation signs are set at special locations where potential safety hazards exist and which are not open to the public, to prevent accidents; in other places of the scene, safety evacuation signs are reasonably set based on the real scene in accordance with the general rules for setting safety signs. The position setting is shown in Fig. 3, where, in addition to the basic safety evacuation indication directions mentioned in the figure, superposed combinations of the basic directions are also included; details are not described herein.

The crowded regions and the remote regions both simulate real scenarios: a crowded region means that the people flow p exceeds a preset flow pt1, and a remote region means that the people flow p is less than a preset flow pt2 and its distance from the boundary of the two-dimensional simulation scenario model does not exceed a preset distance, where pt2 is smaller than pt1.

Step 2: performing path planning in combination with the safety evacuation signs and a Q-Learning algorithm.

Reinforcement learning mainly means that the agents continuously try and make mistakes in a virtual environment, and the learning strategy is adjusted using the reward value fed back by the environment to maximize the cumulative reward value obtained during the learning process, so as to achieve the goal of optimizing each step of action. Naturally, the final output path is an optimal path. When the reward value fed back by the environment after an agent performs an action is positive, the tendency to perform that action is large; otherwise, the tendency to perform it is small.

In the initial state, because the agents know nothing about the environmental information, they need to learn independently.
The initial action of each agent is selected randomly.
When a round of the reinforcement learning iteration combined with the safety evacuation signs is completed, if the agents have accumulated some experience, they share their resource information, and the information obtained is then used by each agent as its own experience for learning. When encountering the same state as in the obtained information in a subsequent iteration, the agents may perform the action having the maximum reward value, and update their Q values.

As shown in Fig. 4, in step 2, the specific process of performing path planning in combination with the safety evacuation signs and a Q-Learning algorithm is as follows:

step 2.1: initializing the Q values corresponding to the respective agents in a Q value table to 0;
step 2.2: acquiring the state information of each agent at the current moment, calculating a corresponding reward, and selecting the action with the largest corresponding Q value to move each agent;

step 2.3: calculating an instant reward for each agent moved to its new location, updating the Q value table, and judging whether the Q value table converges; if so, obtaining an optimal path sequence; otherwise, proceeding to the next step; and

step 2.4: receiving and aggregating the input environmental information sent by each agent together with its corresponding state, action, reward and output environmental information, then distributing the aggregated information to each agent to achieve information sharing, and returning to step 2.2.

The reinforcement learning algorithm is an on-line learning method different from supervised learning and unsupervised learning. It uses the agents to interact with the environment through state awareness, action selection, and reward reception; the process is shown in Fig. 5. At each step, an agent observes the state of the environment, then selects and performs an action to change its state and receive a reward. Each exploration of an agent from the starting point to the end point is referred to as an iteration. After many iterations, the learning ability of the agent becomes stronger and stronger, so the final result is the optimal strategy.

The Q-Learning algorithm, as one of the reinforcement learning algorithms, is defined as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where r_t + γ·max_a Q(s_{t+1}, a) is the real Q value, denoted by Q_real(s_t, a_{t+1}); Q(s_t, a_t) is the estimated Q value, denoted by Q_est(s_t, a_{t+1}); γ is the attenuation value of the future reward, 0 < γ < 1; α is the learning efficiency, 0 < α < 1, determining how much of the error is to be learned this time; s_t is the output state information at time t, a_t is the action performed at time t, r_t is the reward obtained at time t, s_{t+1} is the output state information at time t+1, and a_{t+1} is the action performed at time t+1.

The above formula can thus be written as:

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α·(Q_real(s_t, a_{t+1}) − Q_old(s_t, a_t))

where Q_old(s_t, a_t) represents the old Q value, and Q_new(s_t, a_t) represents the new Q value.

This embodiment applies the safety evacuation signs and the reinforcement learning algorithm to path planning. In this process, the action set A of the agents is divided into three parts: basic actions A1, group actions A2, and optimal actions A3, denoted by A = (A1, A2, A3).

The basic actions A1 are the eight short actions of each agent, denoted by A1 = (up, down, left, right, ul, dl, ur, dr), where up, down, left, right, ul, dl, ur and dr indicate the up, down, left, right, upper-left, lower-left, upper-right and lower-right motions, respectively.

Group actions A2 are the long actions of the agents following the group.

Optimal actions A3 are the eight long actions of the agents following the basic indications of the safety evacuation signs, denoted by A3 = (forward, go-l, go-r, stop, fwd or dwbk, go-l or go-r, turn-l, turn-r), where forward, go-l, go-r, stop, fwd or dwbk, go-l or go-r, turn-l and turn-r indicate going forward, going left, going right, stopping, going forward or backward, going left or right, turning left and turning right, respectively. A state set S records each step of the agents.

The learning process of motion planning in combination with the safety evacuation signs and a Q-Learning algorithm is as follows:

1) initializing Q(s, a) to 0, where ∀s ∈ S, a ∈ A(s);

2) observing, by the agents, the state information s_t at time t;

3) selecting, by the agents, the action a_t having the maximum Q value according to the current state and the reward value r_t, and moving;

4) determining that, when the action selected by the agents acts on the environment, the state of the environment changes, that is, the current position changes to the next new position s_{t+1}, and an instant reward r_t is given, where r_t is defined as follows:

r_t = 2, arrive at the exit;
      1, optimal action;
      0, group action;
     −1, basic action (negative so that the agent quickly finds a path without lingering);
     −2, collide with obstacles or other agents;

5) updating the Q table: Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where the value of γ given here is 0.8; judging whether the Q value table converges, and if so, stopping the cycle to obtain an optimal path sequence; otherwise, performing the next step;

6) receiving and aggregating the input environmental information sent by each agent together with its corresponding state, action, reward and output environmental information, then distributing the aggregated information to each agent to achieve information sharing, and returning to step 2).

Since this embodiment simulates the real motion of crowds in a shopping mall, the crowds consist of numerous agents.
The agents cannot exist independently, because in an evacuation scenario individual motions do not conform to human group characteristics. In addition, a single agent cannot complete tasks efficiently: the limited scenario resources mastered by a single agent may slow its learning process and prolong the time needed to output the optimal path, and in the worst case the target task cannot be completed at all. Accordingly, before the next reinforcement learning iteration, the agents output the environmental information obtained by their own reinforcement learning to a headquarters information processor, and the headquarters information processor then sends the aggregated information to each agent. In this way, information sharing among multiple agents is completed, where the shared information includes strategy, experience, and environmental state. Each agent then updates its own resources according to the information obtained from the headquarters information processor, and determines its action strategy in the next iteration by considering its own Q value and historical strategy, as shown in Fig. 6.

In this embodiment, safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.

By adopting multi-agent information sharing in this embodiment, the region of the environment for which information is grasped is enlarged, the search efficiency is improved, and the time to arrive at the destination is reduced.

Embodiment 2

As shown in Fig. 7, this embodiment provides a path planning system based on a combination of safety evacuation signs and reinforcement learning, which includes:
(1) A two-dimensional simulation scenario model initializing module, configured to establish and rasterize a two-dimensional simulation scenario model, and initialize obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model.
In order to improve the authenticity, the virtual environment is established based on real scenario data of a shopping mall. The virtual environment is defined as a region of M*N; the region is then rasterized, and each grid is numbered. Each grid is represented by (x_i, y_i), where x_i represents the row in which the grid is located and y_i represents the column in which the grid is located. M and N are both positive integers.

In step 1, the process of initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model includes:

defining the agents as mass points having mass but no volume, and setting a circular region of a preset radius centered on each agent as its collision detection region;

setting the number, location and region size of the obstacles; and

setting the number, location, region size and indicated content of the safety evacuation signs.

The two-dimensional modeling effect diagram is shown in Fig. 2.

Setting rules for the safety evacuation signs include:

Command setting rules for evacuation instructions are as follows: an up arrow (↑) indicates going straight; a left arrow (←) indicates going left; a right arrow (→) indicates going right; a cross (✕) indicates no passing; an up-down arrow (↕) indicates going forward or backward; a left-right arrow (↔) indicates going left or right; a turn-left arrow (↰) indicates turning left; a turn-right arrow (↱) indicates turning right; and the safety evacuation signs and their commands are stored into a database correspondingly.

Location setting rules for safety evacuation signs are as follows:

placing a preset number of safety evacuation signs at crowded regions, mall entrances and exits, and mall corners to prevent congestion;

placing a preset number of safety evacuation signs in remote regions to prevent people from being trapped; and

placing no-passing signs in room centers and in no-entry regions where potential safety hazards exist.
The placement in other regions should comply with the general rules for setting safety signs. For example: straight, left-turning or right-turning safety evacuation signs are mostly set at crowded regions, entrances, exits and corners, so that people can quickly make a choice there and avoid congestion; straight, left-turning or right-turning safety evacuation signs are mostly set in remote regions to prevent people unfamiliar with the paths from being trapped and unable to flee the scene; no-passing safety evacuation signs are set at special locations where potential safety hazards exist and which are not open to the public, to prevent accidents; in other places of the scene, safety evacuation signs are reasonably set based on the real scene in accordance with the general rules for setting safety signs. The position setting is shown in Fig. 3, where, in addition to the basic safety evacuation indication directions mentioned in the figure, superposed combinations of the basic directions are also included; details are not described herein.

The crowded regions and the remote regions both simulate real scenarios: a crowded region means that the people flow p exceeds a preset flow pt1, and a remote region means that the people flow p is less than a preset flow pt2 and its distance from the boundary of the two-dimensional simulation scenario model does not exceed a preset distance, where pt2 is smaller than pt1.

(2) A path planning module, configured to perform path planning in combination with the safety evacuation signs and a Q-Learning algorithm.

Reinforcement learning mainly means that the agents continuously try and make mistakes in a virtual environment, and the learning strategy is adjusted using the reward value fed back by the environment to maximize the cumulative reward value obtained during the learning process, so as to achieve the goal of optimizing each step of action. Naturally, the final output path is an optimal path. When the reward value fed back by the environment after an agent performs an action is positive, the tendency to perform that action is large; otherwise, the tendency to perform it is small.

In the initial state, because the agents know nothing about the environmental information, they need to learn independently.
The initial action of each agent is selected randomly.
When a round of the reinforcement learning iteration combined with the safety evacuation signs is completed, if the agents have accumulated some experience, they share their resource information, and the information obtained is then used by each agent as its own experience for learning. When encountering the same state as in the obtained information in a subsequent iteration, the agents may perform the action having the maximum reward value, and update their Q values.
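Purely for illustration, this "random at first, greedy once experience exists" behaviour can be sketched as follows; the function name and the Q-table layout are assumptions for readability, not part of the disclosed system.

```python
import random
import numpy as np

def select_action(q_row, rng=random):
    """Pick the action for one state: random while nothing has been learned,
    otherwise the action with the maximum Q value."""
    if np.allclose(q_row, 0.0):            # no experience for this state yet
        return rng.randrange(len(q_row))   # initial action is selected randomly
    return int(np.argmax(q_row))           # greedy action once experience exists

# Usage: q_table has one row per state and one column per action.
q_table = np.zeros((100, 8))
action = select_action(q_table[42])
```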
As shown in Fig. 8, the path planning module includes:

(2.1) a Q value table initializing module, configured to initialize the Q values corresponding to the respective agents in a Q value table to 0;

(2.2) an agent moving module, configured to acquire the state information of each agent at the current moment, calculate a corresponding reward, and select the action with the largest corresponding Q value to move each agent;

(2.3) a Q value table convergence judging module, configured to calculate an instant reward for each agent moved to its new location, update the Q value table, judge whether the Q value table converges, and obtain an optimal path sequence when the Q value table converges; and

(2.4) an information sharing module, configured to receive and aggregate, when the Q value table does not converge, the input environmental information sent by each agent together with its corresponding state, action, reward and output environmental information, then distribute the aggregated information to each agent to achieve information sharing, continue to move each agent according to the Q values to update the Q value table, and judge whether the updated Q value table converges.

The reinforcement learning algorithm is an on-line learning method different from supervised learning and unsupervised learning. It uses the agents to interact with the environment through state awareness, action selection, and reward reception; the process is shown in Fig. 5. At each step, an agent observes the state of the environment, then selects and performs an action to change its state and receive a reward. Each exploration of an agent from the starting point to the end point is referred to as an iteration. After many iterations, the learning ability of the agent becomes stronger and stronger, so the final result is the optimal strategy.

The Q-Learning algorithm, as one of the reinforcement learning algorithms, is defined as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]

where r_t + γ·max_a Q(s_{t+1}, a) is the real Q value, denoted by Q_real(s_t, a_{t+1}); Q(s_t, a_t) is the estimated Q value, denoted by Q_est(s_t, a_{t+1}); γ is the attenuation value of the future reward, 0 < γ < 1; α is the learning efficiency, 0 < α < 1, determining how much of the error is to be learned this time; s_t is the output state information at time t, a_t is the action performed at time t, r_t is the reward obtained at time t, s_{t+1} is the output state information at time t+1, and a_{t+1} is the action performed at time t+1.

The above formula can thus be written as:

Q_new(s_t, a_t) = Q_old(s_t, a_t) + α·(Q_real(s_t, a_{t+1}) − Q_old(s_t, a_t))

where Q_old(s_t, a_t) represents the old Q value, and Q_new(s_t, a_t) represents the new Q value.
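Purely as an illustration of this update rule, a tabular implementation might look as follows. This is a minimal sketch: the function name, the state/action indexing and the value of alpha are assumptions; gamma = 0.8 is the attenuation value used in the embodiments.

```python
import numpy as np

def q_update(q_table, s_t, a_t, r_t, s_next, alpha=0.1, gamma=0.8):
    """One tabular Q-Learning update, following the formula above.

    q_table is a 2-D array indexed as q_table[state, action].
    """
    q_real = r_t + gamma * np.max(q_table[s_next])         # real Q value
    q_old = q_table[s_t, a_t]                               # estimated (old) Q value
    q_table[s_t, a_t] = q_old + alpha * (q_real - q_old)    # new Q value
    return q_table[s_t, a_t]

# Usage: a small Q table with 100 states and 8 actions.
q = np.zeros((100, 8))
q_update(q, s_t=3, a_t=1, r_t=-1, s_next=4)
```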
This embodiment applies the safety evacuation signs and the reinforcement learning algorithm to path planning. In this process, the action set A of the agents is divided into three parts: basic actions A1, group actions A2, and optimal actions A3, denoted by A = (A1, A2, A3).

The basic actions A1 are the eight short actions of each agent, denoted by A1 = (up, down, left, right, ul, dl, ur, dr), where up, down, left, right, ul, dl, ur and dr indicate the up, down, left, right, upper-left, lower-left, upper-right and lower-right motions, respectively.

Group actions A2 are the long actions of the agents following the group.

Optimal actions A3 are the eight long actions of the agents following the basic indications of the safety evacuation signs, denoted by A3 = (forward, go-l, go-r, stop, fwd or dwbk, go-l or go-r, turn-l, turn-r), where forward, go-l, go-r, stop, fwd or dwbk, go-l or go-r, turn-l and turn-r indicate going forward, going left, going right, stopping, going forward or backward, going left or right, turning left and turning right, respectively. A state set S records each step of the agents.
The learning process of motion planning in combination with the safety evacuation signs and a Q-Learning algorithm is as follows:

1) initializing Q(s, a) to 0, where ∀s ∈ S, a ∈ A(s);

2) observing, by the agents, the state information s_t at time t;

3) selecting, by the agents, the action a_t having the maximum Q value according to the current state and the reward value r_t, and moving;

4) determining that, when the action selected by the agents acts on the environment, the state of the environment changes, that is, the current position changes to the next new position s_{t+1}, and an instant reward r_t is given, where r_t is defined as follows:

r_t = 2, arrive at the exit;
      1, optimal action;
      0, group action;
     −1, basic action (negative so that the agent quickly finds a path without lingering);
     −2, collide with obstacles or other agents;

5) updating the Q table: Q(s_t, a_t) ← Q(s_t, a_t) + α[r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)], where the value of γ given here is 0.8; judging whether the Q value table converges, and if so, stopping the cycle to obtain an optimal path sequence; otherwise, performing the next step;

6) receiving and aggregating the input environmental information sent by each agent together with its corresponding state, action, reward and output environmental information, then distributing the aggregated information to each agent to achieve information sharing, and returning to step 2).

Since this embodiment simulates the real motion of crowds in a shopping mall, the crowds consist of numerous agents.
The agents cannot exist independently, because in an evacuation scenario individual motions do not conform to human group characteristics. In addition, a single agent cannot complete tasks efficiently: the limited scenario resources mastered by a single agent may slow its learning process and prolong the time needed to output the optimal path, and in the worst case the target task cannot be completed at all. Accordingly, before the next reinforcement learning iteration, the agents output the environmental information obtained by their own reinforcement learning to a headquarters information processor, and the headquarters information processor then sends the aggregated information to each agent. In this way, information sharing among multiple agents is completed, where the shared information includes strategy, experience, and environmental state. Each agent then updates its own resources according to the information obtained from the headquarters information processor, and determines its action strategy in the next iteration by considering its own Q value and historical strategy, as shown in Fig. 6.

In the specific implementation process, the information sharing module includes a main processor for the agents and a headquarters information processor. The main processor for the agents is configured to input environmental information (such as the distances and angles between the agents and the obstacles and between the agents and the safety evacuation signs in the current state, and the content information of the safety evacuation signs), output the state s_t, action a_t, reward r_t and environmental information, and manage its own information. The headquarters information processor is configured to aggregate the information shared by each agent, and then distribute the information to each agent so as to achieve information sharing for performing the next iteration quickly, as shown in Fig. 9.
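As an illustration only, the aggregation-and-redistribution step could be organized along these lines. This is a minimal sketch; the class name HeadquartersProcessor, the record structure and the agent method absorb are assumptions, not the disclosed implementation.

```python
class HeadquartersProcessor:
    """Collects the information reported by every agent after an iteration and
    redistributes the aggregated pool so that all agents share it."""

    def __init__(self):
        self.pool = []   # aggregated (state, action, reward, environment info) records

    def collect(self, agent_id, state, action, reward, env_info):
        # Each agent reports its own experience for this iteration.
        self.pool.append({"agent": agent_id, "state": state, "action": action,
                          "reward": reward, "env": env_info})

    def distribute(self, agents):
        # Every agent receives the same aggregated information and updates its own
        # resources (e.g. its Q table) before the next iteration.
        for agent in agents:
            agent.absorb(self.pool)
        self.pool = []   # start a fresh pool for the next iteration
```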
In this embodiment, safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.

By adopting multi-agent information sharing in this embodiment, the region of the environment for which information is grasped is enlarged, the search efficiency is improved, and the time to arrive at the destination is reduced.

Embodiment 3

This embodiment provides a computer-readable storage medium, storing a computer program thereon. When the program is executed by a processor, the steps in the path planning method based on a combination of safety evacuation signs and reinforcement learning as shown in Fig. 1 are implemented.

In this embodiment, safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.

By adopting multi-agent information sharing in this embodiment, the region of the environment for which information is grasped is enlarged, the search efficiency is improved, and the time to arrive at the destination is reduced.

Embodiment 4

This embodiment provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor. When the processor executes the program, the steps in the path planning method based on a combination of safety evacuation signs and reinforcement learning as shown in Fig. 1 are implemented.

In this embodiment, safety evacuation signs and reinforcement learning are combined, no environmental model is required, agents continuously learn and perceive the state of an environment through a trial and error mechanism of reinforcement learning, and the safety evacuation signs provide guidance, so that an optimal path in a complex environment can be quickly found.

By adopting multi-agent information sharing in this embodiment, the region of the environment for which information is grasped is enlarged, the search efficiency is improved, and the time to arrive at the destination is reduced.

It should be understood by those skilled in the art that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Therefore, the present disclosure may be in the form of hardware embodiments, software embodiments, or embodiments combining software and hardware. In addition, the present disclosure may be in the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to a disk memory, an optical memory, and the like) containing computer-usable program codes.

The present disclosure is described with reference to the flowcharts and/or block diagrams of the method, the device (system), and the computer program product in the embodiments of the present disclosure. It should be understood that computer program instructions can implement each process and/or block in the flowcharts and/or block diagrams, and a combination of processes and/or blocks in the flowcharts and/or block diagrams.
These computer program instructions may be provided to a general-purpose computer, a dedicated computer, an embedded processor, or a processor of another programmable data processing device to generate a machine, so that an apparatus configured to implement the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams is generated by the instructions executed by the general-purpose computer or the processor of the other programmable data processing device.

These computer program instructions may also be stored in a computer-readable memory that can guide a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory generate a product including an instruction apparatus, where the instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded into a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or the other programmable data processing device to generate processing implemented by the computer, and the instructions executed on the computer or the other programmable data processing device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.

A person of ordinary skill in the art may understand that all or some of the flows of the methods in the above embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable storage medium. The program, when executed, may include the flows of the embodiments of each method described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), etc.

Described above are merely preferred embodiments of the present disclosure, and the present disclosure is not limited thereto. Various modifications and variations may be made to the present disclosure by those skilled in the art. Any modification, equivalent substitution or improvement made within the spirit and principle of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (10)

Claims
1. A path planning method based on a combination of safety evacuation signs and reinforcement learning, comprising:
step 1: establishing and rasterizing a two-dimensional simulation scenario model, and initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model; and
step 2: performing path planning in combination with the safety evacuation signs and a Q-Learning algorithm;
wherein the specific process of step 2 is:
step 2.1: initializing Q values corresponding to respective agents in a Q value table to 0;
step 2.2: acquiring state information of each agent at the current moment, calculating a corresponding reward, and selecting the action with the largest corresponding Q value to move each agent;
step 2.3: calculating an instant reward of each agent moved to the new location, updating the Q value table, judging whether the Q value table converges, and if so, obtaining an optimal path sequence; otherwise, proceeding to the next step; and
step 2.4: receiving and aggregating input environmental information sent by each agent and its corresponding state, action, reward and output environmental information, then distributing the aggregated information to each agent to achieve information sharing, and turning to step 2.2.
2. The path planning method based on a combination of safety evacuation signs and reinforcement learning according to claim 1, wherein in step 2.3, the instant reward of each agent moved to the new location is set as r_t:
r_t = 2, arrive at the exit;
      1, optimal action;
      0, group action;
     −1, basic action (negative so that the agent quickly finds a path without lingering);
     −2, collide with obstacles or other agents.
3. The path planning method based on a combination of safety evacuation signs and reinforcement learning according to claim 1, wherein in step 1, the process of rasterizing a two-dimensional simulation scenario model is:
defining the two-dimensional simulation scenario model as a region of M*N, rasterizing the region, and numbering each grid, where M and N are both positive integers.

4. The path planning method based on a combination of safety evacuation signs and reinforcement learning according to claim 1, wherein in step 1, the process of initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model comprises:
defining the agents as mass points having mass but no volume, and setting a circular region of a preset radius centered on the agents as a collision detection region;
setting the number, location and region size of the obstacles; and
setting the number, location, region size and indicated content of the safety evacuation signs.
5. A path planning system based on a combination of safety evacuation signs and reinforcement learning, comprising:
a two-dimensional simulation scenario model initializing module, configured to establish and rasterize a two-dimensional simulation scenario model, and initialize obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model; and
a path planning module, configured to perform path planning in combination with the safety evacuation signs and a Q-Learning algorithm;
wherein the path planning module comprises:
a Q value table initializing module, configured to initialize Q values corresponding to respective agents in a Q value table to 0;
an agent moving module, configured to acquire state information of each agent at the current moment, calculate a corresponding reward, and select the action with the largest corresponding Q value to move each agent;
a Q value table convergence judging module, configured to calculate an instant reward of each agent moved to the new location, update the Q value table, judge whether the Q value table converges, and obtain an optimal path sequence when the Q value table converges; and
an information sharing module, configured to receive and aggregate, when the Q value table does not converge, input environmental information sent by each agent and its corresponding state, action, reward and output environmental information, then distribute the aggregated information to each agent to achieve information sharing, continue to move each agent according to the Q value to update the Q value table, and judge whether the updated Q value table converges.
6. The path planning system based on a combination of safety evacuation signs and reinforcement learning according to claim 5, wherein in the Q value table convergence judging module, the instant reward of each agent moved to the new location is set as r_t:
r_t = 2, arrive at the exit;
      1, optimal action;
      0, group action;
     −1, basic action (negative so that the agent quickly finds a path without lingering);
     −2, collide with obstacles or other agents.
7. The path planning system based on a combination of safety evacuation signs and reinforcement learning according to claim 5, wherein in the two-dimensional simulation scenario model initializing module, the process of rasterizing a two-dimensional simulation scenario model is:
defining the two-dimensional simulation scenario model as a region of M*N, rasterizing the region, and numbering each grid, where M and N are both positive integers.

8. The path planning system based on a combination of safety evacuation signs and reinforcement learning according to claim 5, wherein in the two-dimensional simulation scenario model initializing module, the process of initializing obstacles, agents and safety evacuation signs in the two-dimensional simulation scenario model comprises:
defining the agents as mass points having mass but no volume, and setting a circular region of a preset radius centered on the agents as a collision detection region;
setting the number, location and region size of the obstacles; and
setting the number, location, region size and indicated content of the safety evacuation signs.
9. A computer-readable storage medium, storing a computer program thereon, wherein when the program is executed by a processor, the steps in the path planning method based on a combination of safety evacuation signs and reinforcement learning according to any one of claims 1-4 are implemented.
10. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein when the processor executes the program, the steps in the path planning method based on a combination of safety evacuation signs and reinforcement learning according to any one of claims 1-4 are implemented.
LU101606A 2019-04-11 2020-01-27 Path planning method and system based on combination of safety evacuation signs and reinforcement learning LU101606B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910289774.3A CN109974737B (en) 2019-04-11 2019-04-11 Route planning method and system based on combination of safety evacuation signs and reinforcement learning

Publications (1)

Publication Number Publication Date
LU101606B1 true LU101606B1 (en) 2020-05-27

Family

ID=67084173

Family Applications (1)

Application Number Title Priority Date Filing Date
LU101606A LU101606B1 (en) 2019-04-11 2020-01-27 Path planning method and system based on combination of safety evacuation signs and reinforcement learning

Country Status (2)

Country Link
CN (1) CN109974737B (en)
LU (1) LU101606B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113218400A (en) * 2021-05-17 2021-08-06 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110673637B (en) * 2019-10-08 2022-05-13 福建工程学院 Unmanned aerial vehicle pseudo path planning method based on deep reinforcement learning
CN110726416A (en) * 2019-10-23 2020-01-24 西安工程大学 Reinforced learning path planning method based on obstacle area expansion strategy
CN111026272B (en) * 2019-12-09 2023-10-31 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN111353260B (en) * 2020-03-08 2023-01-10 苏州浪潮智能科技有限公司 Computational grid parallel region division method and device based on reinforcement learning
CN111523731A (en) * 2020-04-24 2020-08-11 山东师范大学 Crowd evacuation movement path planning method and system based on Actor-Critic algorithm
CN112215328B (en) * 2020-10-29 2024-04-05 腾讯科技(深圳)有限公司 Training of intelligent agent, action control method and device based on intelligent agent
CN112558601B (en) * 2020-11-09 2024-04-02 广东电网有限责任公司广州供电局 Robot real-time scheduling method and system based on Q-learning algorithm and water drop algorithm
CN112327890A (en) * 2020-11-10 2021-02-05 中国海洋大学 Underwater multi-robot path planning based on WHCA algorithm
CN113050641B (en) * 2021-03-18 2023-02-28 香港中文大学(深圳) Path planning method and related equipment
CN113448425B (en) * 2021-07-19 2022-09-09 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN113639755A (en) * 2021-08-20 2021-11-12 江苏科技大学苏州理工学院 Fire scene escape-rescue combined system based on deep reinforcement learning
CN113946428B (en) * 2021-11-02 2024-06-07 Oppo广东移动通信有限公司 Processor dynamic control method, electronic equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10142909B2 (en) * 2015-10-13 2018-11-27 The Board Of Trustees Of The University Of Alabama Artificial intelligence-augmented, ripple-diamond-chain shaped rateless routing in wireless mesh networks with multi-beam directional antennas
CN107403049B (en) * 2017-07-31 2019-03-19 山东师范大学 A kind of Q-Learning pedestrian's evacuation emulation method and system based on artificial neural network
CN107464021B (en) * 2017-08-07 2019-07-23 山东师范大学 A kind of crowd evacuation emulation method based on intensified learning, device
CN109540151B (en) * 2018-03-25 2020-01-17 哈尔滨工程大学 AUV three-dimensional path planning method based on reinforcement learning
CN109101694B (en) * 2018-07-16 2019-05-28 山东师范大学 A kind of the crowd behaviour emulation mode and system of the guidance of safe escape mark
CN109214065B (en) * 2018-08-14 2019-05-28 山东师范大学 The crowd evacuation emulation method and system of Q table are shared based on multi-Agent
CN109086550B (en) * 2018-08-27 2019-05-28 山东师范大学 The evacuation emulation method and system of Q study are shared based on multi-Agent
CN109543285B (en) * 2018-11-20 2023-05-09 山东师范大学 Crowd evacuation simulation method and system integrating data driving and reinforcement learning

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113218400A (en) * 2021-05-17 2021-08-06 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning
CN113218400B (en) * 2021-05-17 2022-04-19 太原科技大学 Multi-agent navigation algorithm based on deep reinforcement learning

Also Published As

Publication number Publication date
CN109974737B (en) 2020-01-31
CN109974737A (en) 2019-07-05

Similar Documents

Publication Publication Date Title
LU101606B1 (en) Path planning method and system based on combination of safety evacuation signs and reinforcement learning
Fang et al. Autonomous robotic exploration based on frontier point optimization and multistep path planning
CN110955242A (en) Robot navigation method, system, robot and storage medium
CN109690576A (en) The training machine learning model in multiple machine learning tasks
CN104317297A (en) Robot obstacle avoidance method under unknown environment
Theocharous et al. Approximate planning with hierarchical partially observable Markov decision process models for robot navigation
CN110415521A (en) Prediction technique, device and the computer readable storage medium of traffic data
CN110795833B (en) Crowd evacuation simulation method, system, medium and equipment based on cat swarm algorithm
JP7324893B2 (en) Vehicle running control method and device
CN111881625A (en) Crowd evacuation simulation method and system based on deep reinforcement learning
CN109901578A (en) A kind of method, apparatus and terminal device controlling multirobot
KR20140137068A (en) Evacuation simulation system and providing method thereof
US20190026949A1 (en) Personnel movement simulation and control
EP3907679A1 (en) Enhanced robot fleet navigation and sequencing
CN109731338A (en) Artificial intelligence training method and device, storage medium and electronic device in game
Li et al. Robot path planning using improved artificial bee colony algorithm
CN113790729B (en) Unmanned overhead traveling crane path planning method and device based on reinforcement learning algorithm
Martins et al. Heuristically-accelerated reinforcement learning: A comparative analysis of performance
Rodriguez et al. Utilizing roadmaps in evacuation planning
Ünal et al. Generating emergency evacuation route directions based on crowd simulations with reinforcement learning
Wang et al. Object behavior simulation based on behavior tree and multi-agent model
Yıldırım et al. A Comparative Study of Optimization Algorithms for Global Path Planning of Mobile Robots
Costa et al. Data Mining applied to the navigation task in autonomous robots
Godoy et al. Online learning for multi-agent local navigation
Othman et al. Implementing game artificial intelligence to decision making of agents in emergency egress

Legal Events

Date Code Title Description
FG Patent granted

Effective date: 20200527