CN115879315A - Crowd emergency evacuation robot model based on adversarial reinforcement learning - Google Patents

Crowd emergency evacuation robot model based on adversarial reinforcement learning

Info

Publication number
CN115879315A
CN115879315A (application CN202211705931.2A)
Authority
CN
China
Prior art keywords
evacuation
robot
environment
fire
reinforcement learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211705931.2A
Other languages
Chinese (zh)
Inventor
赵涵韬
马天行
梁志豪
施晓蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202211705931.2A priority Critical patent/CN115879315A/en
Publication of CN115879315A publication Critical patent/CN115879315A/en
Pending legal-status Critical Current

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A10/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A10/40Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Manipulator (AREA)

Abstract

The invention addresses path planning and intelligent decision making for a crowd evacuation robot. A multi-storey building fire environment is modelled with a complete building structure and a fire-development pattern. Combined with crowd behaviour, an agent behaviour model is designed that simulates humans with a limited field of view and no prior knowledge of the building layout, and the evacuation robot is trained with a multi-agent posthumous credit assignment reinforcement learning algorithm. At the same time, an adversarial fire-source generator agent is implemented to decide where the next fire source appears in the fire scene. The robot behaviour is finally modelled and optimized with an adversarial reinforcement learning method, forming a generator-solver pattern, and repeated iterative training yields the agent decision model with the best crowd-evacuation performance.

Description

Crowd emergency evacuation robot model based on adversarial reinforcement learning
Technical Field
The invention belongs to the field of software control, and particularly relates to a crowd emergency evacuation robot model based on adversarial reinforcement learning.
Background
The crowd evacuation methods currently in practical use fall into traditional evacuation modes and modes that realize algorithms by simulating the real environment. The traditional modes suffer from many problems that are hard to solve, and simulated evacuation algorithms also have limitations and low efficiency, so a new crowd emergency evacuation solution that breaks through the limits of the traditional modes is urgently needed.
Conventional evacuation modes mainly comprise manual guidance and evacuation signs, both of which have obvious limitations. In the former, one or more workers who are sufficiently familiar with the building structure guide the flow of people, but a limited number of workers can hardly play a significant role in a large building with a dense crowd. In the latter, signs placed at designated positions in the building tell people the evacuation direction at their current position, but static signs have limited effect on a disordered crowd, and their fixed routes can even harm the evacuation when the environment changes dynamically. To make up for these shortcomings, researchers simulate evacuation scenarios in a building with various human behaviour models and modelling tools and apply optimization algorithms to dynamic guiding agents in the virtual scene, producing crowd evacuation solutions that are more efficient and less labour-intensive than the traditional methods. Implementing such solutions mainly involves human behaviour model construction and evacuation organization algorithms. Mainstream human behaviour models (Crowd Simulation Models) can be divided into macroscopic and microscopic models. Macroscopic models focus on the overall movement of the crowd from a global viewpoint. Popular microscopic models include the Cellular Automata-Based Model and the Social Force Model: the cellular automata model is a discrete model that discretizes space-time information and describes the distribution of cells at each moment through elements such as states, cells and rules, while the social force model is a continuous model with a clear physical meaning that expresses environmental influences on pedestrians as explicit forces. Current research on crowd evacuation algorithms can be divided into two categories according to whether evacuation guidance devices are present. Unguided evacuation algorithms are deployed mainly on the human units in the environment and are used more to evaluate the safety of a building's interior design. Guided evacuation algorithms deploy the algorithm on guidance devices in the environment, mainly either evacuation signs with variable guiding directions or movable evacuation robots.
Compared with common dynamic path planning schemes, such a crowd evacuation scheme offers faster execution, more accurate guidance, and better robustness and generality. Traditional evacuation methods have many shortcomings, and existing simulated evacuation algorithms still need better generality and evacuation efficiency. Developing a dynamic evacuation robot based on an optimization algorithm, which combines the abilities of an evacuation guide familiar with the environment and capable of dynamic route planning with the mass-deployability of evacuation signs, is therefore of great significance both to the people inside a building and to the building managers responsible for their safety.
Disclosure of Invention
To solve the above problems, the invention discloses an algorithm involving adversarial reinforcement learning, agent simulation, evacuation organization, and multi-agent posthumous credit assignment reinforcement learning. The key technology is the application that fuses a human behaviour model with an adversarial reinforcement learning framework. At the same time, recent machine learning methods are improved to solve the current problems in training the optimal behaviour decision process of an intelligent robot for the crowd evacuation task.
The specific technical scheme is as follows:
S1, model a multi-storey building fire environment with the Unity engine, giving it a complete building structure and a fire-development pattern;
S2, design, in combination with a crowd movement model, a human behaviour model that simulates a limited field of view and no prior knowledge of the building layout;
S3, train the evacuation robot based on an agent machine learning framework and the multi-agent posthumous credit assignment (MA-POCA) reinforcement learning algorithm;
S4, implement an adversarial fire-source generator agent that decides where the next fire source appears in the fire scene;
S5, model the task with an adversarial reinforcement learning method, forming a generator-solver pattern;
and S6, run multiple rounds of training and data statistics with the Unity engine, finally obtaining the robot model with the best crowd-evacuation performance (see the training-loop sketch below).
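The alternating optimization of steps S4-S6 can be illustrated with the following sketch of an adversarial generator-solver training loop. The objects and methods used here (env, solver, generator and their act/store/update methods) are hypothetical placeholders standing in for the Unity simulation and the trainers, not the actual implementation.

```python
# Sketch of the adversarial "generator-solver" training loop (steps S4-S6).
# env, solver and generator are illustrative placeholders, not the real Unity/ML-Agents objects.

def adversarial_training(env, solver, generator, iterations=100, episodes_per_phase=50):
    """Alternate between training the evacuation robots (solver, MA-POCA)
    and the fire-source generator (adversary)."""
    for _ in range(iterations):
        # Solver phase: robots learn to evacuate under the current fire-placement policy.
        run_phase(env, solver, generator, learner=solver, episodes=episodes_per_phase)
        # Generator phase: the adversary learns fire placements that hinder evacuation.
        run_phase(env, solver, generator, learner=generator, episodes=episodes_per_phase)
    return solver, generator

def run_phase(env, solver, generator, learner, episodes):
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            robot_actions = solver.act(obs["robots"])   # MA-POCA policies of the three robots
            fire_action = generator.act(obs["fire"])    # where the next fire source appears
            obs, rewards, done = env.step(robot_actions, fire_action)
            learner.store(obs, robot_actions if learner is solver else fire_action,
                          rewards["robots"] if learner is solver else rewards["fire"])
        learner.update()  # one centralized training step on the collected episode
```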
The specific modeling process of step S1 includes the following steps:
s1-1, constructing basic barriers, the ground and other interactive blocks and manufacturing corresponding static component elements.
S1-2, setting penetrability of corresponding component elements and parameters in a NavMesh navigation system.
S1-3, building the construction elements, building the building main body and dividing it into three storeys, and adjusting its extent for the experiments.
S1-4, constructing an independent auxiliary structure, namely a staircase, and connecting each floor of the building main body by using two staircases.
S1-5, dividing each floor into a plurality of rooms and connecting the rooms through blue common doors.
S1-6, varying the room layout of each floor in the simulated escape environment so that the number and complexity of the rooms decrease as the floor number increases.
The construction of the concrete structure of the staircase in the step S1-4 comprises the following steps:
s1-4-1, constructing a small room as the transition from a building main body to a staircase, and connecting the small room with the staircase;
S1-4-2, arranging a landing platform between storeys to keep the horizontal position of each storey consistent;
and S1-4-3, setting a recording function so that each stair opening records its own stair direction and the information of the stair opening at the other end.
The human behavior model building process of step S2 includes the following steps:
S2-1, based on the team concept, construct a plurality of teams, each guided by a leader;
S2-2, a leader takes the various door structures within its field of view as movement targets and keeps exploring the surroundings; a follower keeps moving towards its leader to avoid falling behind;
S2-3, an evacuation robot can also act as a team leader: a human leader that encounters it becomes a follower of the evacuation robot, which thereby becomes the team leader;
S2-4, when a leader observes a team with more people, it becomes a follower of a member of that team;
S2-5, when a follower cannot find its leader within its field of view, e.g. because it has fallen behind or no other people are near its spawn position, it switches to leader mode and actively explores the surroundings;
and S2-6, a tree-structured team with the top leader as the root node gradually forms (the switching logic is sketched below).
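A minimal sketch of the leader/follower switching logic described in steps S2-1 to S2-6 follows; the class, field and method names are assumptions for illustration, and the actual movement is left to the navigation system.

```python
# Illustrative sketch of the leader/follower behaviour model (steps S2-1 to S2-6).
# Class and method names are assumptions; movement itself is delegated to the NavMesh system.
from enum import Enum

class Role(Enum):
    LEADER = 1
    FOLLOWER = 2

class Human:
    def __init__(self):
        self.role = Role.LEADER   # a human spawned with nobody around starts as a leader
        self.leader = None        # the agent this human follows (None while leading)

    def update(self, visible_doors, visible_robots, visible_teams, leader_visible):
        if self.role == Role.LEADER:
            if visible_robots:                                   # S2-3: robots take over leadership
                self.leader, self.role = visible_robots[0], Role.FOLLOWER
            elif visible_teams and max(map(len, visible_teams)) > 1:
                bigger = max(visible_teams, key=len)             # S2-4: join a larger team
                self.leader, self.role = bigger[0], Role.FOLLOWER
            elif visible_doors:
                self.move_towards(visible_doors[0])              # S2-2: explore through doors
        else:
            if not leader_visible:                               # S2-5: leader lost, explore again
                self.leader, self.role = None, Role.LEADER
            else:
                self.move_towards(self.leader)                   # stay close to the leader

    def move_towards(self, target):
        pass  # pathing handled by the navigation system (e.g. Unity NavMesh)
```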
The use of the MA-POCA reinforcement learning algorithm of step S3 includes the steps of:
s3-1, loading the simulation training scene constructed in the step S1 by using an MA-POCA algorithm;
S3-2, during training, each agent transmits its observations, actions and rewards to a central controller, which trains the policy network and the value network in a centralized way;
S3-3, adjusting the batch size, learning rate and other hyperparameters according to the training feedback to correct the model (an illustrative configuration is sketched below);
and S3-4, storing the trained model, using it for simulation and evaluating the results.
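For concreteness, a trainer configuration of the kind tuned in step S3-3 might look as follows. The keys follow the Unity ML-Agents trainer-configuration schema (normally written in YAML); the behaviour name and every numeric value are assumed examples rather than the settings actually used.

```python
# Illustrative MA-POCA trainer configuration (cf. step S3-3); all values are assumptions.
ma_poca_config = {
    "EvacuationRobot": {                       # assumed behaviour name of the robot agents
        "trainer_type": "poca",                # MA-POCA trainer in Unity ML-Agents
        "hyperparameters": {
            "batch_size": 1024,                # tuned from training feedback
            "buffer_size": 10240,
            "learning_rate": 3.0e-4,           # tuned from training feedback
            "beta": 5.0e-3,                    # entropy regularization strength
            "epsilon": 0.2,                    # policy-update clipping range
            "num_epoch": 3,
        },
        "network_settings": {"hidden_units": 256, "num_layers": 2},
        "reward_signals": {"extrinsic": {"gamma": 0.99, "strength": 1.0}},
        "max_steps": 5_000_000,
    }
}
```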
The specific operating logic of the adversarial fire-source generator of step S4 is as follows:
S4-1, acquiring environment parameter information through the observation-collection method;
S4-2, selecting a flame-generation position according to the chosen policy and action;
and S4-3, judging whether the corresponding conditions are met and, if so, generating the flame.
The concrete modeling implementation operation of the step S5 comprises the following steps:
S5-1, calling the observation-collection method to gather environment parameter information and feeding it into the algorithm;
S5-2, taking the evacuation robot as the solver;
S5-3, constructing and setting an action-space function and a reward function of the evacuation robot according to the relevant parameters;
S5-4, taking the fire-source agent as the generator;
S5-5, constructing and setting an action-space function and a reward function of the fire-source agent according to the relevant parameters;
S5-6, generating through simulation a group of parameters integrated into the environment for evaluating the effect;
and S5-7, using the evaluation parameters to perform feedback adjustment on the solver and the generator.
the specific operation of the step S5-1 comprises the following steps:
s5-1-1, acquiring three-dimensional position coordinates of all human beings in the environment;
s5-1-2, acquiring three-dimensional position coordinates of all evacuation robots in the environment;
s5-1-3, acquiring the scale of the evacuation robot team on the layer;
s5-1-4, acquiring a time lapse value after evacuation starts;
S5-1-5, combining all data from S5-1-1 to S5-1-4 into one observation vector, collecting this vector four times in succession and using the stacked result as input;
S5-1-6, encoding each grid cell in the experimental range as a vector representing the entity types contained in that cell;
S5-1-7, packing the vectors obtained in step S5-1-6 into a PNG-format image whose number of channels equals the vector length, and feeding this image into a two-layer convolutional neural network for feature extraction;
and S5-1-8, using the features extracted in step S5-1-7 as the observation input for static elements (a sketch of this encoding follows).
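The grid encoding of steps S5-1-6 to S5-1-8 can be sketched as below. The grid size, the set of entity types and the convolutional layer widths are assumptions for illustration; only the overall scheme (one channel per entry of the per-cell vector, followed by a two-layer CNN) follows the description above.

```python
# Sketch of the static-environment observation encoding (steps S5-1-6 to S5-1-8).
# ENTITY_TYPES, the 40x40 grid size and the CNN widths are assumed for illustration.
import numpy as np
import torch.nn as nn

ENTITY_TYPES = ["floor", "wall", "door", "stair", "fire", "human", "robot"]  # assumed set

def encode_grid(grid_entities, size=40):
    """One-hot encode each grid cell by entity type into a (channels, H, W) tensor,
    i.e. an image whose channel count equals the length of the per-cell vector."""
    obs = np.zeros((len(ENTITY_TYPES), size, size), dtype=np.float32)
    for (x, y), entities in grid_entities.items():          # entities: names present in the cell
        for name in entities:
            obs[ENTITY_TYPES.index(name), x, y] = 1.0
    return obs

# Two-layer convolutional feature extractor fed with the encoded image (step S5-1-7).
cnn_encoder = nn.Sequential(
    nn.Conv2d(len(ENTITY_TYPES), 16, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
    nn.Flatten(),
)
```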
The specific operation of step S5-3 includes the following steps:
S5-3-1, assigning the three evacuation robots to the same MA-POCA agent group;
and S5-3-2, respectively setting group rewards and individual rewards for the three evacuation robots.
S5-3-3, taking continuous action obtained by model decision as action space
S5-3-4, processing the model decision result as a mobile coordinate;
the specific operation of step S5-5 includes the following steps:
S5-5-1, setting a reward function for the fire-source agent;
S5-5-2, composing the action space of one discrete action and two continuous actions;
S5-5-3, dividing the discrete action into three branch options that represent the floor on which the flame is produced;
and S5-5-4, setting continuous actions, consistent with those of the evacuation robot, to represent the coordinates of the flame-generation position.
The specific operation of step S5-6 includes the following steps:
S5-6-1, integrating evacuation time, mortality, remaining health and residence time into the environment as evaluation parameters;
S5-6-2, updating the evaluation parameters along with the execution of the algorithm;
and S5-6-3, sending the updated parameters to the solver and the generator.
the specific operation of step S5-7 includes the following steps:
S5-7-1, passing the relevant parameters into the evacuation robot's reward function so that the robot updates its action decisions;
and S5-7-2, passing the relevant parameters into the fire-source agent's reward function so that the agent updates its action decisions.
Step S6, carrying out adversarial reinforcement learning training and data statistics with the Unity engine and finally obtaining the robot model with the best crowd-evacuation performance, comprises the following steps:
S6-1, constructing a simulation scene from the three-storey building scene, the human agents, the evacuation robot agents and the fire-source generation agent set up in the previous steps;
S6-2, using the adversarial reinforcement learning fire-source agent as the generator to generate a relatively simple environment at first;
S6-3, the evacuation robot, acting as the solver, quickly and safely guides the humans in the building to evacuate according to a reward-and-punishment mechanism;
S6-4, the generator receives feedback on the training result and creates a more challenging fire environment for the evacuation robot;
and S6-5, repeating the training process of S6-3 and S6-4 lets the fire-source generator learn to place fire sources where they make the environment more dangerous, while the evacuation robot improves its generalization ability and robustness across various fire environments.
The specific reward and punishment mechanism process of the step S6-3 comprises the following steps:
S6-3-1, the group reward is set as the common target of the three evacuation robots: evacuate all humans as quickly as possible and avoid any human death in the environment.
S6-3-2, whenever a human is successfully evacuated through an exit, the human's remaining life value at the moment of evacuation is added to the group reward. The shorter a human stays in the fire environment, the more life value remains and the larger the reward given to the group of robot agents.
S6-3-3, if a human dies from staying in the fire environment too long or from being trapped by flames, a deduction is applied to the group reward; in this way the evacuation robot agents learn to rescue humans in more dangerous states first.
S6-3-4, as long as un-evacuated people remain on a floor, the evacuation robot in charge of that floor receives a penalty over time, and the more un-evacuated people remain on the floor, the higher the per-frame penalty.
S6-3-5, the individual reward function is set to encourage each robot to clear the humans on its floor as soon as possible and to guide human teams towards stair openings that do not reduce the overall evacuation efficiency (a sketch of this reward computation follows).
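The reward-and-punishment mechanism of steps S6-3-1 to S6-3-5 can be summarized as follows; the numeric coefficients are assumed values, and only the structure of the rewards follows the description above.

```python
# Sketch of the group and individual rewards (steps S6-3-1 to S6-3-5).
# DEATH_PENALTY and PER_FRAME_PENALTY are assumed coefficients, not the values used in the patent.
DEATH_PENALTY = 1.0        # deducted from the group reward for each death (S6-3-3)
PER_FRAME_PENALTY = 0.01   # per un-evacuated human on the floor, per frame (S6-3-4)

def group_reward(evacuated_this_frame, deaths_this_frame):
    """Shared reward of the three evacuation robots (S6-3-1 to S6-3-3)."""
    reward = sum(h.remaining_life for h in evacuated_this_frame)  # more health left -> larger reward
    reward -= DEATH_PENALTY * deaths_this_frame                   # reversed deduction on death
    return reward

def individual_reward(humans_left_on_floor):
    """Per-robot penalty that grows with the number of un-evacuated humans on its floor (S6-3-4)."""
    return -PER_FRAME_PENALTY * humans_left_on_floor
```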
The invention has the beneficial effects that:
1. The main effect of the invention lies in the way the agent behaviour is trained. In a fire scene the fire keeps evolving over time, and the limitations of conventional algorithms restrict the evacuation efficiency of the evacuation robot. The method introduces a dynamic flame-change pattern that simulates how, in reality, one or a few small fire sources gradually spread and develop to engulf the whole building, and uses an adversarial reinforcement learning method to realize an adversarial fire-source generator agent that decides where the next fire source appears in the fire scene.
2. The fire-source generator and the evacuation robot algorithm are trained together with neural networks. During training, the fire-source generator keeps producing challenging fire-development patterns and thereby helps the evacuation robot agents improve their adaptability and robustness to different environments. The evacuation robot agents in the virtual environment are trained with the multi-agent posthumous credit assignment (MA-POCA) reinforcement learning algorithm, so that all humans can be guided to an exit before the fire spreads completely and the evacuation task is finished smoothly. Combining these technologies and methods overcomes the limitations and low efficiency of existing robot evacuation behaviour models and realizes efficient emergency crowd evacuation by robots.
Drawings
FIG. 1 is a schematic technical route of the model of the present invention;
FIG. 2 is an oblique top plan view of a three-storey building simulation environment constructed in embodiment 1 of the present disclosure;
FIG. 3 is an overview of the Navmesh navigation system in a top view obliquely above the simulation environment in embodiment 1 of the present disclosure;
FIG. 4 is a top plan view of a first floor of a simulation environment in accordance with embodiment 1 of the present disclosure;
FIG. 5 is a top view of the second floor of the simulation environment in embodiment 1 of the present disclosure;
fig. 6 is a top view of the third floor of the simulation environment in embodiment 1 of the present disclosure;
FIG. 7 is a schematic structural diagram of an independent staircase in a simulation environment according to embodiment 1 of the present disclosure;
FIG. 8 is a diagram of an implementation of the model characteristics of a crowd cellular automata in embodiment 2 of the present disclosure;
FIG. 9 is a schematic diagram of the operating logic of each frame of the flame entity;
fig. 10 is a schematic view of an operation logic of the evacuation robot per frame;
fig. 11 is an evacuation route planning diagram;
FIG. 12 is a logical diagram of the behavior of a human (leader) per frame;
FIG. 13 is a logic diagram of human (follower) behavior per frame;
FIG. 14 is a schematic diagram of the simulation of the present invention.
Detailed Description
The present invention will be further illustrated with reference to the accompanying drawings and specific embodiments, which are to be understood as merely illustrative of the invention and not as limiting the scope of the invention. It should be noted that the terms "front," "back," "left," "right," "upper" and "lower" used in the following description refer to directions in the drawings, and the terms "inner" and "outer" refer to directions toward and away from, respectively, the geometric center of a particular component.
Referring to fig. 1, the present embodiment describes a crowd emergency evacuation robot model based on adversarial reinforcement learning as follows:
1) Scene simulation modelling. The Unity engine is used to model the fire environment of a multi-storey building with a complete building structure and a fire-development pattern.
1-1) constructing basic obstacles, ground and other interactive blocks and making corresponding static composition elements.
1-2) setting the penetrability of the corresponding constituent element, and parameters in the NavMesh navigation system.
1-3) Build the construction elements, build the building body and divide it into three storeys, adjusting its extent for the experiment.
1-4) building an independent auxiliary structure, a staircase, using two staircases to connect each storey of the building body.
1-4-1) constructing a cubicle such that it is first necessary to enter the cubicle to travel from the building body to the stairwell;
1-4-2) the small rooms on each floor are connected by stairs;
1-4-3) Place a landing platform halfway along the stairs between floors so that the stairwell rooms of each floor share the same horizontal position;
1-4-4) arranging a purple stair opening mark at both ends of each stair section;
1-4-5) Write a function so that each stair opening records its own stair direction and the information of the stair opening at the other end.
1-5) each floor is divided into a plurality of rooms by walls, and the rooms are connected with each other by blue common doors.
1-6) the floors are connected by two independent staircase structures, and the staircase platforms and the staircases of each floor are connected with each other by purple stairway openings.
1-7) Vary the room layout of each floor in the simulated escape environment so that the number and complexity of the rooms decrease as the floor number increases.
2) Construct the human behaviour model. It is designed in combination with a crowd movement model and simulates humans with a limited field of view and no prior knowledge of the building structure.
2-1) constructing a plurality of teams which are led by the leader based on the team concept;
2-2) the leader can take various gate structures in the visual field as a moving target and continuously explore the surrounding environment; the follower continuously moves to the vicinity of the leader to avoid falling behind;
2-3) An evacuation robot can also act as a team leader: a human leader that encounters it becomes a follower of the evacuation robot, which thereby becomes the team leader;
2-4) When a leader observes a team with more people, it becomes a follower of a member of that team;
2-5) When a follower cannot find its leader within its field of view, e.g. because it has fallen behind or no other people are near its spawn position, it switches to leader mode and actively explores the surroundings;
2-6) finally forming a tree structure team taking the top leader as a root node.
3) Train the evacuation robot. An agent-based machine learning framework and the multi-agent posthumous credit assignment (MA-POCA) reinforcement learning algorithm are used.
3-1) loading the simulation training scene constructed in the step S1 by using an MA-POCA algorithm;
the MA-POCA algorithm adopts the idea of centralized training-distributed execution, and the strategy of each independent agent is represented by pi i, i is more than or equal to 1 and less than or equal to N. Given an environmental state s t And corresponding joint observation o t And action a t Since agents are independent on local observations, the union strategy is π (a) t |o t ) Can be factored into
Figure BDA0004026407900000121
State s of t Is defined as a centralized state cost function
Figure BDA0004026407900000122
A centralized state-action cost function of
Figure BDA0004026407900000131
3-2) Each agent transmits its observations, actions and rewards to the central controller to train the policy network and the value network;
3-3) Adjust the batch size, learning rate and other hyperparameters according to the training feedback to correct the model;
3-4) saving and using the trained model to simulate and evaluate the result.
4) Implementation of the adversarial fire-source generator agent. An adversarial algorithm is used to decide the generation position of the next fire source in the fire scene;
4-1) Acquire the ground information of the four surrounding grid cells;
4-2) Judge whether any of the four surrounding cells can hold a flame; if so, go to step 4-3), otherwise end;
4-3) Judge whether the spread timer has reached zero; if so, go to step 4-4), otherwise decrement the spread timer by 1;
4-4) Reset the spread timer to a random value between 200 and 600;
4-5) Judge whether a 20% random check passes; if so, go to step 4-6), otherwise end;
4-6) Generate a new fire source on the corresponding ground cell and end the algorithm (see the sketch below).
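The per-frame spreading logic of steps 4-1) to 4-6) can be sketched as follows; the grid interface used here (ground_neighbours, can_ignite, spawn_fire) is an illustrative placeholder for the actual Unity scene queries.

```python
# Sketch of the per-frame flame-spreading logic (steps 4-1 to 4-6, cf. Fig. 9).
# The grid interface used here is an illustrative placeholder for the Unity scene queries.
import random

class Flame:
    def __init__(self, cell):
        self.cell = cell
        self.spread_timer = random.randint(200, 600)   # frames until the next spread attempt

    def step(self, grid):
        neighbours = grid.ground_neighbours(self.cell)              # 4-1: four surrounding cells
        candidates = [c for c in neighbours if grid.can_ignite(c)]  # 4-2: positions that can burn
        if not candidates:
            return                                                  # 4-2: nothing to ignite, end
        if self.spread_timer > 0:
            self.spread_timer -= 1                                  # 4-3: timer not yet cleared
            return
        self.spread_timer = random.randint(200, 600)                # 4-4: reset the spread timer
        if random.random() < 0.2:                                   # 4-5: 20% random check
            grid.spawn_fire(random.choice(candidates))              # 4-6: generate a new fire source
```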
5) Model the task with an adversarial reinforcement learning method, forming a generator-solver pattern;
5-1) arranging an evacuation robot solver;
5-2) acquiring an observed value by the evacuation robot;
5-2-1) obtaining three-dimensional position coordinates of all human beings in the environment;
5-2-2) obtaining three-dimensional position coordinates of all evacuation robots in the environment;
5-2-3) obtaining the scale of the evacuation robot team on the layer;
5-2-4) obtaining a time lapse value from the start of evacuation;
5-2-5) Combine all data from 5-2-1) to 5-2-4) into one observation vector, collect this vector four times in succession and use the stacked result as input;
5-2-6) Encode each grid cell in the 40 x 40 range as a vector representing the entity types contained in that cell;
5-2-7) Pack the vectors obtained in step 5-2-6) into a PNG-format image whose number of channels equals the vector length, and feed this image into a two-layer convolutional neural network for feature extraction;
5-2-8) Use the features extracted in step 5-2-7) as the observation input for static elements.
5-3) setting an action space function of the evacuation robot;
5-3-1) taking continuous actions obtained by model decision as action spaces;
5-3-2) processing the model decision result as a mobile coordinate;
5-3-3) The resulting action-space function (given as an equation in the original publication) maps the decision to the movement target, where (X_r, Y_r) denotes the horizontal position the robot plans to move to next, (X_a, Y_a) and (X_b, Y_b) denote the centre positions of the south and north stairwells of the floor respectively, and (X_r′, Y_r′) denotes the actual movement target position of the evacuation robot.
5-4) setting a reward function of the evacuation robot;
5-4-1) Assign the three evacuation robots to the same MA-POCA agent group;
5-4-2) Set a group reward and individual rewards for the three evacuation robots respectively;
5-4-3) The resulting group reward function is given as an equation in the original publication; it accumulates the remaining life value of each evacuated human and deducts a penalty for each death (cf. steps S6-3-2 and S6-3-3);
5-4-4) The resulting individual reward function is:
individual reward = −0.01 × number of humans not yet evacuated on the floor, settled every frame.
5-5) Set the fire-source agent as the generator;
the fire source agent is used as a generator, and a near-end Policy Optimization algorithm (PPO) is used. The PPO algorithm is a novel Policy Gradient algorithm which is very sensitive to step size but difficult to select proper step size, and the change difference of new strategies and old strategies in the training process is not beneficial to learning if the change difference is too large. PPO provides a new target function, and a plurality of training steps can be performed to realize small-batch updating, so that the problem that the step size in the Policy Gradient algorithm is difficult to determine is solved.
PPO is also divided into an Actor portion and a Critic portion.
First, the merit function is defined:
Figure BDA0004026407900000152
part of updating Actor we set the reward function to:
Figure BDA0004026407900000153
herein, the
Figure BDA0004026407900000154
It should sample out the calculated Advantage function with the new policy, but can use the ^ or ^ of the old policy because the parameters do not change much>
Figure BDA0004026407900000155
To approximate the new strategy->
Figure BDA0004026407900000156
The former is set because we can only sample from the old policy and not from the new policy, important Sampling. While doing so may shift the policy from on-policy to off-policy, old policies may collect a lot of data, and then train the network many times with this data, and then re-sample. Upon starting a new or old strategy so>
Figure BDA0004026407900000161
The value will be more and more different after a number of iterations of the formula. Thus, when the value of the new strategy is far larger than the old probability, the updating is faster, but due to the existence of the latter KL divergence, the new strategy and the old strategy with the overlarge probability distribution difference are not updated too fast,the magnitude of the update of the new strategy is limited. For the Critic part, a network of output state values is constructed, and a training network is close to a preset value:
Figure BDA0004026407900000162
this Loss is trained to be 0 by a gradient decrement.
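A compact sketch of the Actor/Critic update described above follows; the distributions returned by the policy, the advantage estimates and the KL coefficient β are assumed inputs, and PyTorch is used purely for illustration.

```python
# Sketch of one PPO update with importance sampling and a KL penalty (the variant described above).
# policy/value_net are placeholders; beta and the advantage estimates are assumed inputs.
import torch

def ppo_update(policy, value_net, optimizer, states, actions,
               advantages, returns, old_log_probs, old_dist, beta=0.01):
    new_dist = policy(states)                            # distribution of the new policy
    new_log_probs = new_dist.log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)     # importance-sampling ratio pi_new / pi_old

    actor_loss = -(ratio * advantages).mean()            # surrogate objective for the Actor
    kl_penalty = torch.distributions.kl_divergence(old_dist, new_dist).mean()
    critic_loss = (value_net(states).squeeze(-1) - returns).pow(2).mean()  # driven towards 0

    loss = actor_loss + beta * kl_penalty + 0.5 * critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```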
5-6) The fire-source agent obtains its observations;
5-6-1) obtaining three-dimensional position coordinates of all human beings in the environment;
5-6-2) obtaining three-dimensional position coordinates of all evacuation robots in the environment;
5-6-3) obtaining the scale of the evacuation robot team on the layer;
5-6-4) obtaining a time lapse value after the evacuation starts;
5-6-5) Combine all data from 5-6-1) to 5-6-4) into one observation vector, collect this vector four times in succession and use the stacked result as input;
5-6-6) Encode each grid cell in the 40 x 40 range as a vector representing the entity types contained in that cell;
5-6-7) Pack the vectors obtained in step 5-6-6) into a PNG-format image whose number of channels equals the vector length, and feed this image into a two-layer convolutional neural network for feature extraction;
5-6-8) Use the features extracted in step 5-6-7) as the observation input for static elements.
5-7) Set the action-space function of the fire-source agent;
5-7-1) Compose the action space of one discrete action and two continuous actions;
5-7-2) Divide the discrete action into three branch options representing the floor on which the flame is produced;
5-7-3) Set the continuous actions, consistent with those of the evacuation robot, to represent the coordinates of the flame-generation position.
5-7-4) The resulting action-space function (given as equations in the original publication) maps the agent's output to the spawn position, where (X_f, Y_f) denotes the horizontal coordinates of the fire-source location and (X_f′, Y_f′) denotes the actual generation location of the next fire source controlled by the fire-source generator; a decoding sketch follows.
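Decoding the fire-source agent's action (one discrete branch plus two continuous values) into a spawn cell could look as follows; the coordinate scaling and the snap-to-valid-cell step are assumptions, since the original mapping is only given as an equation image.

```python
# Sketch of decoding the fire-source agent's action into a spawn position (steps 5-7-1 to 5-7-3).
# The [-1, 1] scaling and the nearest_ignitable fallback are assumptions for illustration.
def decode_fire_action(discrete_branch, cont_x, cont_y, grid, size=40):
    floor = int(discrete_branch)                  # 0, 1 or 2: which storey receives the flame
    x = int((cont_x + 1.0) * 0.5 * (size - 1))    # map a continuous action in [-1, 1] to a grid index
    y = int((cont_y + 1.0) * 0.5 * (size - 1))
    cell = (floor, x, y)
    if grid.can_ignite(cell):                     # only valid ground cells can actually ignite
        return cell
    return grid.nearest_ignitable(cell)           # otherwise snap to the closest valid cell (assumed)
```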
5-8) Set the reward function of the fire-source agent (given as an equation in the original publication).
6) Train multiple times with the Unity engine, collect statistics, and finally obtain the robot model with the best crowd-evacuation performance.
6-1) Construct a simulation scene from the three-storey building scene, the human agents, the evacuation robot agents and the fire-source generation agent set up above;
6-2) Use the adversarial reinforcement learning fire-source agent as the generator to create a relatively simple environment at first;
6-3) The evacuation robot, acting as the solver, quickly and safely guides the humans in the building to evacuate according to the reward-and-punishment mechanism;
6-4) The generator receives feedback on the training result and creates a more challenging fire environment for the evacuation robot;
6-5) Repeating the training process of 6-3) and 6-4) lets the fire-source generator learn to place fire sources where they make the environment more dangerous, while the evacuation robot improves its generalization ability and robustness across various fire environments.
In this embodiment the fire-source generator and the evacuation robot model are trained together with neural networks; during training the fire-source generator keeps producing challenging fire-development patterns, which helps the evacuation robot agents improve their adaptability and robustness to different environments, finally yielding the results summarized below. The evacuation robot agents in the virtual environment are trained with the multi-agent posthumous credit assignment (MA-POCA) reinforcement learning algorithm, so that all humans can be guided to an exit before the fire spreads completely and the evacuation task is finished smoothly. Combining these technologies and methods overcomes the limitations and low efficiency of existing robot evacuation behaviour models and realizes efficient emergency crowd evacuation by robots. Compared with using no evacuation robot, a greedy robot reduces the total evacuation time by 38.46%, the average human residence time by 40.32% and the human mortality by 90.14%. The evacuation robot based on adversarial reinforcement learning outperforms the greedy robot on all three indicators: relative to using no evacuation robot the reductions are 45.69%, 47.97% and 96.34% respectively, and relative to the greedy robot they are 11.76%, 12.82% and 63.01% respectively.
Table: evaluation results with randomly generated fire sources for the no-robot, greedy-robot and reinforcement-learning-robot settings
The technical means disclosed by the invention are not limited to those disclosed in the above embodiments, but also include technical solutions formed by any combination of the above technical features.

Claims (10)

1. A crowd emergency evacuation robot model based on adversarial reinforcement learning, characterized in that the method comprises the following steps:
S1, model a multi-storey building fire environment with the Unity engine, giving it a complete building structure and a fire-development pattern;
S2, design, in combination with a crowd movement model, a human behaviour model that simulates a limited field of view and no prior knowledge of the building layout;
S3, train the evacuation robot based on an agent machine learning framework and the multi-agent posthumous credit assignment (MA-POCA) reinforcement learning algorithm;
S4, implement an adversarial fire-source generator agent that decides where the next fire source appears in the fire scene;
S5, model the task with an adversarial reinforcement learning method, forming a "generator-solver" pattern;
and S6, run multiple rounds of training and data statistics with the Unity engine, finally obtaining the robot model with the best crowd-evacuation performance.
2. The crowd emergency evacuation robot model based on adversarial reinforcement learning of claim 1, wherein the detailed operation of step S1 comprises the following steps: S1-1, constructing basic obstacles, the ground and other interactive blocks and making the corresponding static component elements;
s1-2, setting penetrability of corresponding component elements and parameters in a NavMesh navigation system;
S1-3, building the construction elements, building the building main body and dividing it into three storeys, and adjusting its extent for the experiments;
s1-4, constructing an independent auxiliary structure, namely a staircase, and connecting each layer of the building main body by using two staircases; the specific construction process of the step S1-4 comprises the following steps:
S1-4-1, constructing a small room as the transition from the building main body to the staircase, and connecting the small room with the staircase; S1-4-2, arranging a landing platform between storeys to keep the horizontal position of each storey consistent;
S1-4-3, setting a recording function so that each stair opening records its own stair direction and the information of the stair opening at the other end; S1-5, dividing each floor into a plurality of rooms, the rooms being connected through blue common doors;
and S1-6, varying the room layout of each floor in the simulated escape environment so that the number and complexity of the rooms decrease as the floor number increases.
3. The crowd emergency evacuation robot model based on adversarial reinforcement learning of claim 1, wherein the specific processing procedure of step S2 comprises the following steps: S2-1, based on the team concept, constructing a plurality of teams each guided by a leader;
s2-2, the leader can continuously explore the surrounding environment by taking various gate structures in the visual field as moving targets; the follower continuously moves to the vicinity of the leader to avoid falling behind;
S2-3, an evacuation robot can also act as a team leader, and a human leader that encounters it becomes a follower of the evacuation robot, which thereby becomes the team leader;
S2-4, when a leader observes a team with more people, it becomes a follower of a member of that team;
S2-5, when a follower cannot find its leader within its field of view, e.g. because it has fallen behind or no other people are near its spawn position, it switches to leader mode and actively explores the surroundings;
and S2-6, finally, gradually forming a tree structure team with the top leader as a root node.
4. The crowd emergency evacuation robot model based on adversarial reinforcement learning of claim 1, wherein the specific processing procedure of step S3 comprises the following steps: S3-1, loading the simulation training scene constructed in step S1 with the MA-POCA algorithm;
S3-2, during training, each agent transmitting its observations, actions and rewards to a central controller, which trains the policy network and the value network in a centralized way;
S3-3, adjusting hyperparameters such as the batch size and learning rate according to the training feedback to correct the model;
and S3-4, storing, simulating by using the trained model and evaluating the result.
5. The crowd emergency evacuation robot model based on adversarial reinforcement learning according to claim 1, wherein the construction method of the adversarial fire-source generator agent of step S4 comprises the following steps:
S4-1, acquiring environment parameter information through the observation-collection method;
S4-2, selecting a flame-generation position according to the chosen policy and action;
and S4-3, judging whether the corresponding conditions are met and, if so, generating the flame.
6. The crowd emergency evacuation robot model based on adversarial reinforcement learning according to claim 1, wherein the adversarial reinforcement learning method of step S5 models the task to form a generator-solver process, comprising the following steps: S5-1, calling the observation-collection method to gather environment parameter information and feeding it into the algorithm;
the process of collecting environmental parameter information and sending the environmental parameter information into an algorithm in the step S5-1 comprises the following steps: s5-1-1, acquiring three-dimensional position coordinates of all human beings in the environment;
s5-1-2, acquiring three-dimensional position coordinates of all evacuation robots in the environment;
s5-1-3, acquiring the scale of the evacuation robot team on the layer;
s5-1-4, acquiring a time lapse value after evacuation starts;
S5-1-5, combining all data from S5-1-1 to S5-1-4 into one observation vector, collecting it four times in succession and using the stacked result as input;
S5-1-6, encoding each grid cell in the experimental range as a vector representing the entity types contained in that cell;
S5-1-7, packing the vectors obtained in step S5-1-6 into a PNG-format image whose number of channels equals the vector length, and feeding this image into a two-layer convolutional neural network for feature extraction; and S5-1-8, using the features extracted in step S5-1-7 as the observation input for static elements.
S5-2, taking the evacuation robot as the solver;
S5-3, constructing and setting an action-space function and a reward function of the evacuation robot according to the relevant parameters; S5-4, taking the fire-source agent as the generator;
S5-5, constructing and setting an action-space function and a reward function of the fire-source agent according to the relevant parameters; S5-6, generating through simulation a group of parameters integrated into the environment for evaluating the effect;
the process of step S5-6 of generating environment-integrated parameters for evaluating effects, comprising the steps of:
S5-6-1, integrating evacuation time, mortality, remaining health and residence time into the environment as evaluation parameters;
S5-6-2, updating the evaluation parameters along with the execution of the algorithm;
S5-6-3, sending the updated parameters to the solver and the generator;
and S5-7, performing feedback adjustment on the solver and the generator by using the evaluation parameters.
7. The adversarial-reinforcement-learning-based crowd emergency evacuation robot model of claim 6, wherein step S5-3 comprises the following steps:
S5-3-1, assigning the three evacuation robots to the same MA-POCA agent group; and S5-3-2, setting a group reward and individual rewards for the three evacuation robots respectively;
S5-3-3, taking the continuous actions produced by the model's decision as the action space; and S5-3-4, processing the model's decision result as a movement coordinate.
8. The crowd emergency evacuation robot model based on adversarial reinforcement learning as claimed in claim 6, wherein the action-space function and reward function setting process of the fire-source agent in step S5-5 comprises the following steps:
S5-5-1, setting a reward function for the fire-source agent;
S5-5-2, composing the action space of one discrete action and two continuous actions;
S5-5-3, dividing the discrete action into three branch options that represent the floor on which the flame is produced;
and S5-5-4, setting continuous actions, consistent with those of the evacuation robot, to represent the coordinates of the flame-generation position; and S5-7, performing a feedback-adjustment process on the solver and the generator with the evaluation parameters, the feedback-adjustment process comprising the following steps:
S5-7-1, passing the relevant parameters into the evacuation robot's reward function to update its action decision; and S5-7-2, passing the relevant parameters into the fire-source agent's reward function to update its action decision.
9. The crowd emergency evacuation robot model based on adversarial reinforcement learning of claim 1, wherein the specific process of step S6 comprises the following steps:
S6-1, constructing a simulation scene from the three-storey building scene, the human agents, the evacuation robot agents and the fire-source generation agent set up in the previous steps;
S6-2, using the adversarial reinforcement learning fire-source agent as the generator to generate a relatively simple environment at first;
S6-3, the evacuation robot, acting as the solver, quickly and safely guiding the humans in the building to evacuate according to a reward-and-punishment mechanism;
S6-4, the generator receiving feedback on the training result and creating a more challenging fire environment for the evacuation robot;
and S6-5, repeating the training process of S6-3 and S6-4, so that the fire-source generator learns to place fire sources where they make the environment more dangerous, while the evacuation robot improves its generalization ability and robustness across various fire environments.
10. The crowd emergency evacuation robot model based on adversarial reinforcement learning according to claim 8, wherein the concrete reward-and-punishment mechanism of step S6-3 includes the following aspects:
S6-3-1, setting the group reward as the common target of the three evacuation robots: evacuate all people as quickly as possible and avoid any human death in the environment;
S6-3-2, whenever a human is successfully evacuated through an exit, the human's remaining life value at the moment of evacuation being added to the group reward, so that the shorter the time a human stays in the fire environment, the more life value remains and the larger the reward given to the robot agent group;
S6-3-3, if a human dies from staying in the fire environment too long or from being trapped by flames, a deduction being applied to the group reward, whereby the evacuation robot agents learn to rescue humans in more dangerous states first;
S6-3-4, as long as un-evacuated people remain on a floor, the evacuation robot in charge of that floor receiving a penalty over time, the more un-evacuated people remaining on the floor the higher the per-frame penalty;
and S6-3-5, the individual reward function being set to encourage each robot to clear the humans on its floor as soon as possible and to guide human teams towards stair openings that do not reduce the overall evacuation efficiency.
CN202211705931.2A 2022-12-29 2022-12-29 Crowd emergency evacuation robot model based on confrontation reinforcement learning Pending CN115879315A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211705931.2A CN115879315A (en) 2022-12-29 2022-12-29 Crowd emergency evacuation robot model based on confrontation reinforcement learning


Publications (1)

Publication Number Publication Date
CN115879315A true CN115879315A (en) 2023-03-31

Family

ID=85757071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211705931.2A Pending CN115879315A (en) 2022-12-29 2022-12-29 Crowd emergency evacuation robot model based on confrontation reinforcement learning

Country Status (1)

Country Link
CN (1) CN115879315A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113743605A (en) * 2021-06-16 2021-12-03 温州大学 Method for searching smoke and fire detection network architecture based on evolution method
CN117367435A (en) * 2023-12-06 2024-01-09 深圳大学 Evacuation path planning method, device, equipment and storage medium
CN117367435B (en) * 2023-12-06 2024-02-09 深圳大学 Evacuation path planning method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN115879315A (en) Crowd emergency evacuation robot model based on confrontation reinforcement learning
Gu et al. A metaverse-based teaching building evacuation training system with deep reinforcement learning
CN104933661B (en) A kind of non-crowd's equilibrium evacuation method for claiming escape way of public building
van Toll et al. Towards believable crowds: A generic multi-level framework for agent navigation
CN109584667A (en) A kind of subway large passenger flow rehearsal simulation training system and method
Zhu et al. Deep reinforcement learning for real-time assembly planning in robot-based prefabricated construction
Muñoz-la Rivera et al. Virtual reality stories for construction training scenarios: The case of social distancing at the construction site
Kim et al. Human behavioral simulation using affordance-based agent model
Saeed et al. Simulating crowd behaviour combining both microscopic and macroscopic rules
Deng et al. Can information sharing among evacuees improve indoor emergency evacuation? An exploration study based on BIM and agent-based simulation
Liu et al. Coordinated robot-assisted human crowd evacuation
Gonzalez et al. A Sketch-based Interface for Real-time Control of Crowd Simulations that Use Navigation Meshes.
Kontovourkis Design of circulation diagrams in macro-scale level based on human movement behavior modeling
Rodriguez et al. Utilizing roadmaps in evacuation planning
Tang From agent to avatar
Patel et al. Agent tools, techniques and methods for macro and microscopic simulation
McIlveen et al. PED: Pedestrian Environment Designer.
Koutamanis Multilevel analysis of fire escape routes in a virtual environment
Van der Zee et al. Design by computation
Haddad et al. An agent-based geosimulation multidisciplinary approach to support scenarios evaluation in dynamic virtual geographic environments
Calanca et al. Cognitive navigation in presto
Kammoun et al. A road traffic MultiAgent simulation using TurtleKit under MadKit
Saeed Path Planning for Robot and Pedestrian Simulations
Yan et al. Deep Learning Improved Social Force Model to Simulate Crowd in Varying Scenes
Kontovourkis Computer-generated circulation diagrams in macro-scale design investigation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination