CN116307655B - Evacuation elevator dispatching strategy optimization method, device, equipment and readable storage medium - Google Patents


Info

Publication number: CN116307655B (application CN202310596462.3A)
Authority: CN (China)
Prior art keywords: evacuation, elevator, total, state, instruction
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other versions: CN116307655A
Other languages: Chinese (zh)
Inventors: 马剑, 宋丹丹, 陈娟, 王巧, 何姗姗, 曾益萍, 吴国华, 丁恒
Current and original assignee: Southwest Jiaotong University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Events: application CN202310596462.3A filed by Southwest Jiaotong University; publication of CN116307655A; application granted; publication of CN116307655B; anticipated expiration pending

Classifications

    • G06Q10/06312 Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G06Q10/06313 Resource planning in a project environment
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06F30/20 Design optimisation, verification or simulation
    • G06Q50/265 Personal security, identity or safety
    • Y02B50/00 Energy efficient technologies in elevators, escalators and moving walkways, e.g. energy saving or recuperation technologies


Abstract

The invention provides an evacuation elevator dispatching strategy optimization method, device, equipment, and readable storage medium, relating to the technical field of artificial intelligence. The method comprises: constructing a building simulation scene containing an elevator, constructing a personnel evacuation scene within it, and determining the initial environmental state of the personnel evacuation scene; carrying out evacuation simulation training on the building simulation scene; calculating the reward corresponding to each instruction in the evacuation simulation training, and summing the rewards of all instructions to obtain the total reward value of the training; and repeating the evacuation simulation training on the building simulation scene, calculating the total reward value of each simulation, until several consecutively calculated total reward values reach a convergence state, thereby obtaining the trained evacuation elevator dispatching model. The method and device address the technical problem that existing fixed scheduling strategies struggle to cope with real-time, changeable emergency situations, resulting in low evacuation efficiency.

Description

Evacuation elevator dispatching strategy optimization method, device, equipment and readable storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to an evacuation elevator dispatching strategy optimization method, device, equipment, and readable storage medium.
Background
Pedestrian evacuation in emergencies is an important branch of the safety field and has been widely studied by scholars in China and abroad for many years. As the number of building floors increases, evacuating people from multi-storey buildings in emergency situations becomes a difficult problem. Because stair evacuation is crowded and slow, elevator evacuation has gradually become an important auxiliary mode supplementing stair evacuation in multi-storey buildings. Traditional elevator dispatching mainly uses first-come-first-served, scanning (SCAN), LOOK, and similar algorithms programmed for specific requirements; however, these algorithms require a large amount of computation with high time cost, the results obtained cannot guarantee an optimal effect, and they struggle to cope with changes in the evacuation scene in real time.
Disclosure of Invention
The invention aims to provide an evacuation elevator dispatching strategy optimization method, device, equipment, and computer-readable storage medium to solve the above problems. To achieve this purpose, the technical scheme adopted by the invention is as follows:
in a first aspect, the present application provides a method for optimizing an evacuation elevator dispatching policy, including: building a building simulation scene containing an elevator, building a personnel evacuation scene in the building simulation scene, and determining an initial environment state in the personnel evacuation scene;
evacuation simulation training is carried out on the building simulation scene: a learning network is established, and the initial environmental state is input into the learning network to generate a current instruction; the elevator executes the current instruction and enters the next environmental state, which is in turn input into the learning network to obtain the next instruction; this repeats until the termination condition is reached, at which point instruction generation stops;
calculating the reward corresponding to each instruction in the evacuation simulation training, and summing the rewards corresponding to all instructions to obtain the total reward value of the evacuation simulation training;
repeating the evacuation simulation training on the building simulation scene and calculating the total reward value corresponding to each evacuation simulation, stopping the repetition when several consecutively calculated total reward values reach a convergence state, thereby obtaining a trained evacuation elevator dispatching model, wherein the evacuation elevator dispatching model is used for generating an evacuation elevator dispatching strategy.
In a second aspect, the present application further provides an evacuation elevator dispatching strategy optimization device, including:
a scene building module, used for constructing a building simulation scene containing an elevator, constructing a personnel evacuation scene in the building simulation scene, and determining an initial environmental state in the personnel evacuation scene;
a first simulation module, used for carrying out evacuation simulation training on the building simulation scene: establishing a learning network, and inputting the initial environmental state into the learning network to generate a current instruction; the elevator executes the current instruction and enters the next environmental state, which is repeatedly input into the learning network to obtain the next instruction, until the termination condition is reached and instruction generation stops;
a calculation module, used for calculating the reward corresponding to each instruction in the evacuation simulation training, and summing the rewards corresponding to all instructions to obtain the total reward value of the evacuation simulation training;
a second simulation module, used for repeating the evacuation simulation training on the building simulation scene, calculating the total reward value corresponding to each evacuation simulation, and stopping the repetition when several consecutively calculated total reward values reach a convergence state, thereby obtaining a trained evacuation elevator dispatching model, wherein the evacuation elevator dispatching model is used for generating an evacuation elevator dispatching strategy.
In a third aspect, the present application further provides an evacuation elevator dispatching policy optimization device, including:
a memory for storing a computer program;
and the processor is used for realizing the steps of the evacuation elevator dispatching strategy optimization method when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the steps of the above-described evacuation elevator dispatching strategy optimization method.
The beneficial effects of the invention are as follows:
according to the elevator dispatching optimization method for reducing the elevator evacuation time, the elevator dispatching optimization method for reducing the elevator evacuation time is provided by utilizing the deep reinforcement learning algorithm for the first time, the elevator operation logic in the evacuation scene is determined, and the dynamic dispatching is carried out according to the real-time environment condition, so that the rapid and efficient evacuation of the personnel in the multi-story building is realized under the emergency condition, the elevator evacuation efficiency is improved, and the personnel safety is ensured.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be apparent from the description, or may be learned by practice of the embodiments of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of an evacuation elevator dispatching strategy optimization method according to an embodiment of the present invention;
fig. 2 is a logic diagram of an elevator in an embodiment of the present invention;
FIG. 3 is a diagram of a learning logic framework of a deep Q network in an embodiment of the invention;
fig. 4 is a schematic structural diagram of an evacuation elevator dispatching policy optimization device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an evacuation elevator dispatching policy optimization device according to an embodiment of the present invention.
The marks in the figure:
800. evacuation elevator dispatching strategy optimizing equipment; 801. a processor; 802. a memory; 803. a multimedia component; 804. an I/O interface; 805. a communication component.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. Meanwhile, in the description of the present invention, the terms "first", "second", and the like are used only to distinguish the description, and are not to be construed as indicating or implying relative importance.
Example 1:
the embodiment provides an evacuation elevator dispatching strategy optimization method.
Referring to fig. 1, the method is shown to include:
s1, building a building simulation scene containing an elevator, building a personnel evacuation scene in the building simulation scene, and determining an initial environmental state in the personnel evacuation scene; the building containing the elevator may be a house, an office building or a transportation hub, such as a subway station, a high-speed rail station, etc.
Firstly, parameters such as the number of building floors, the number of evacuees on each floor, the floors served by the elevator, the elevator position, and the number of persons on each floor choosing to use the elevator are obtained, and a personnel evacuation scene is constructed from these parameters;
specifically, the number of evacuees on each floor is set according to the fixed occupants of each floor of the building: for a residential building, it can be determined from the permanent residents of the corresponding floor; for an office building or a multi-storey station such as a transportation hub, it can be estimated from daily passenger flow. The number of persons in the evacuation scene who need or tend to use the elevator to evacuate can be obtained by means such as questionnaires.
Based on the above embodiment, the method further includes:
s2, carrying out evacuation simulation training on the building simulation scene;
specifically, step S2 includes the steps of:
s2, establishing a learning network, specifically, establishing a Deep Q Network (DQN), wherein the deep Q network comprises two learning networks and target networks which are identical in structure.
The parameters θ of the deep Q network, including the parameters of the learning network and of the target network, are initialized. It should be noted that the deep Q network is the preferred choice of the invention; the elevator dispatching optimization of the invention can also be achieved with other reinforcement learning algorithms;
s21, inputting an initial environment state into a learning network to generate a current instruction, and specifically:
s211, from an initial environment state, comprising:
wherein:
1)
in the method, in the process of the invention,representing each buildingThe number of people waiting for the elevator at a floor, m represents the total floor number of the building,/->Representing the number of people waiting for an elevator on the first floor, +.>Representing the number of people waiting for the elevator at the m-th floor;
2)
in the method, in the process of the invention,indicating the total waiting time of the personnel on each floor +.>Representing the waiting time of the person at layer m, in particular:
in the method, in the process of the invention,indicating the current moment +.>Indicating the moment when the jth person arrives at the elevator door.
3)
In the method, in the process of the invention,indicating whether the elevator is fully loaded, when c=1 indicates that the elevator is fully loaded, c=0 indicates that the elevator is not fully loaded;
4)
in the method, in the process of the invention,indicating the floor at which the elevator is located, where p 1 … p m Each element has a value of 0 or 1,0 represents that the elevator position is not on the floor at the moment, and 1 represents the floor where the elevator is on at the moment;
5)indicating evacuation completion time, when personnel evacuation is not completed +.>
6)Representing the number of people needing to be evacuated in the building, representing the number of people needing to be evacuated in the building at the current moment, and starting evacuation +.>The value of (2) is equal to the total number of people in the building, and the evacuation is ended and the evacuation is stopped>
According to s 1 Constructing a first state feature vector Φ 1
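The state components above can be flattened into a feature vector; the following sketch assumes a simple concatenation (the patent does not specify the exact encoding, so the function name and ordering are illustrative):

```python
def state_to_feature_vector(waiting, wait_times, full, elevator_floor,
                            evac_time, eva_num, m):
    """Flatten the state s = (N, W, c, P, T, EvaNum) into a feature vector Phi."""
    # One-hot encoding P of the elevator's current floor (floors are 1-based)
    p = [1 if f == elevator_floor else 0 for f in range(1, m + 1)]
    return list(waiting) + list(wait_times) + [int(full)] + p + [evac_time, eva_num]


# Example: 3-floor building, elevator on floor 2, 3 people left to evacuate
phi = state_to_feature_vector([2, 0, 1], [5.0, 0.0, 3.0], False, 2, 0, 3, m=3)
# Resulting vector has length 3m + 3 = 12
```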
S212, referring to FIG. 3, inputting the first state feature vector into the deep Q network to generate all actions executable by the elevator and the values (Q values) corresponding to those actions;
specifically, the elevator action is defined according to the elevator operation logic as:
wherein the method comprises the steps ofIndicating the destination floor to stay +.>Destination floor rangeAnd controlling the elevator to go to the target floor to stop according to the instruction for all floors from the bottom floor to the top floor.
The elevator stops only at the destination floor. Stopping at too many floors increases door open/close time and prolongs evacuation, while too few stops increase the number of carrying trips and extend the elevator's running time; hence both the priority of destination floors and the number of floors traveled affect the evacuation time.
S213, determining one action among all actions as the current instruction a_1 using the ε-greedy method, according to the values corresponding to the actions;
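A minimal sketch of the ε-greedy selection step, assuming actions are 1-based destination floors and the Q values are stand-ins for network outputs:

```python
import random


def epsilon_greedy(q_values, epsilon):
    """Pick a destination floor: random with probability epsilon, else argmax Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values)) + 1   # explore: random floor (1-based)
    best = max(range(len(q_values)), key=lambda i: q_values[i])
    return best + 1                                  # exploit: floor with highest Q


# With epsilon = 0 the choice is purely greedy:
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # 2
```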
S22, the elevator executes the current instruction and enters the next environmental state s_2;
Referring to fig. 2, specifically, the step S22 includes:
acquiring, from the initial environmental state, the floor at which the elevator is currently located:
if the elevator is not at the destination floor at the current moment, it travels to the destination floor and stops; after stopping, whether the elevator is fully loaded is judged: if fully loaded, the elevator runs down to the bottom floor; otherwise, the elevator continues to run downward;
if the elevator is at the destination floor at the current moment and is not fully loaded, it stops at the destination floor; after stopping, whether the elevator is fully loaded is judged: if fully loaded, the elevator runs down to the bottom floor; otherwise, the elevator continues to run downward;
after the elevator executes the action in the current instruction a_1, it enters the next environmental state s_2.
S23, acquiring the first state feature vector and the second state feature vector, judging whether the termination condition is met, and obtaining a judgment result;
the first state feature vector, the current instruction, the reward value of the current instruction, the second state feature vector, and the judgment result are stored together as one sample in a preset experience replay set;
s24, repeating the next environmental state s 2 And inputting a next instruction into the learning network, stopping generating the instruction until the termination condition is reached, and completing the simulation training of the first round.
Specifically, the step S24 includes:
constructing a second state feature vector Φ_2 according to the next environmental state s_2;
inputting the second state feature vector Φ_2 into the learning network to generate all actions executable by the elevator and the values corresponding to those actions;
determining one action among all actions as the next instruction using the ε-greedy method, according to the values corresponding to the actions;
judging whether a termination condition is reached:
if yes, taking the sum of the rewards of the current instruction and all the previous instructions as the total rewards of the training;
otherwise, the instruction is generated again until a termination condition is reached.
The termination condition is that all persons have been evacuated or that the number of elevator executions has reached the first preset number; by setting a maximum number of iterations for each round of training, the maximum duration of each simulation round can be controlled, reducing the total time required for training.
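A direct sketch of this termination test (parameter names are illustrative):

```python
def is_terminated(eva_num, step_count, max_steps):
    """Termination condition: everyone has been evacuated (EvaNum == 0), or the
    elevator has executed the first preset number of instructions (max_steps)."""
    return eva_num == 0 or step_count >= max_steps
```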
S25, updating the parameters of the deep Q network, i.e. updating the parameters of the learning network and the target network, specifically comprising the following steps:
s251, carrying out current state characteristic vector phi after each instruction execution of the elevator t Current instruction a t Prize value r for current instruction t Next state feature vector Φ t+1 Terminal as a sample data D t ={Φ t , a t , r t , Φ t+1 Terminal is stored in an experience playback set D;
specifically, the terminal indicates a termination state, which is a state in which the time reaches the termination time Te or the time does not reach the termination time but all the floor persons are evacuated to completion (evanum=0).
S252, randomly extracting a plurality of sample data from the experience playback set when the execution times of the elevator reach a second preset times; the second preset times are less than the first preset times;
specifically, m sample data are sampled from the experience playback set D;
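The experience replay set D and the random sampling of step S252 can be sketched as follows (the capacity is an assumed hyperparameter):

```python
import random
from collections import deque


class ReplayBuffer:
    """Experience replay set D storing samples (Phi_t, a_t, r_t, Phi_{t+1}, terminal)."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest samples dropped when full

    def push(self, phi, action, reward, next_phi, terminal):
        self.buffer.append((phi, action, reward, next_phi, terminal))

    def sample(self, m):
        """Draw m distinct random samples, as in step S252."""
        return random.sample(list(self.buffer), m)

    def __len__(self):
        return len(self.buffer)


buf = ReplayBuffer(capacity=100)
for i in range(5):
    buf.push([i], i, 0.0, [i + 1], False)
batch = buf.sample(3)
```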
s253, extracting each extracted sample data D j J=1, 2,3, …, m is input into a learning network and a target network, and the target value and the actual value of the sample data are obtained through calculation;
s254, calculating a value difference value of each sample data in the sample set according to the target value and the actual value, and specifically:
state characteristic vector phi of current environment state in sample data j Inputting into learning network to obtain actual value of current environment state
In the method, in the process of the invention,instruction representing the current environmental state, +.>Representing parameters of the learning network.
State feature vector for next environmental state in sample dataInputting into a target network, calculating to obtain the target value of the current environment state>
;(1)
Wherein r is j Indicating the value of the prize,represents the attenuation factor, the value range is (0, 1),representing the maximum value of the next environmental state, < +.>An instruction indicating the next environmental state,indicating that the parameters of the target network are updated after a time step.
Calculating the difference between the actual value and the target value of the current environmental state to obtain a value difference of the sample data
S255, constructing a mean square error loss function from the value differences of all samples in the sample set:
L(θ) = (1/m) · Σ_{j=1}^{m} ( y_j − Q(Φ_j, a_j; θ) )² ;(2)
s256, according to the mean square error loss function, updating parameters of the deep Q network by gradient back propagation of the deep Q networkSpecifically, the learning network updates the parameters of the learning network by learning the value difference value, and the target network makes the target network parameters equal to the learning network parameters after the learning network is updated for a certain number of times, so that the parameter update of the target network is realized.
Based on the above embodiment, the method further includes:
s3, respectively calculating the rewards corresponding to each instruction in the evacuation simulation training, and summing the rewards corresponding to all the instructions to obtain a total rewards of the evacuation simulation training;
specifically, the step S3 includes:
s31, acquiring first total waiting time of each floor person in an initial environment state and second total waiting time of each floor person in a next environment state;
s32, calculating to obtain the total waiting change time of the personnel according to the first total waiting time and the second total waiting time;
s33, judging whether people in the next environmental state are completely evacuated:
if the evacuation is finished, acquiring total evacuation time, and calculating to obtain a reward value corresponding to the instruction according to the total evacuation time and the total waiting change time;
otherwise, calculating the reward value corresponding to the instruction according to the total waiting change time alone.
Specifically, the reward value is calculated using a reward function. The formulation of the reward function must satisfy the dispatching optimization objective of the evacuation elevator: the purpose of dispatching optimization is to minimize the elevator evacuation time and the personnel waiting time, which is converted into maximizing the reward value. The reward function is:
r = α · Δw + β · y^T ;(3)
where r is the reward value; α is the weight of Δw and β the weight of the evacuation-time term, both with value range (0, 1), and the weights can be assigned independently according to the emphasis of the optimization objective; Δw is the normalized expression of the total waiting change time of the personnel, set as the immediate reward and proportional to W_{t_i} − W_{t_{i+1}}, where W_{t_i} and W_{t_{i+1}} are the total waiting times of the persons on floors 1 to m of the building at times t_i and t_{i+1}; y is a positive constant with value range (0, 1); T is the final evacuation time, and the term y^T can be obtained only when evacuation ends; T_e is the simulation expiration time, which can be set to the time required for all persons in the building to evacuate using the stairs.
When a person enters the elevator during the time interval (t_i, t_{i+1}), the number of people waiting for the elevator decreases and the waiting time of those persons becomes 0; the total waiting time at t_{i+1} therefore decreases by more than the total evacuation time grows over the interval, so Δw > 0 and a reward is obtained.
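A sketch of the reward of Eq. (3); the normalization constant w_norm and the exact form of the terminal bonus are assumptions, since the original formula did not survive extraction:

```python
def reward(w_prev, w_next, alpha, beta, y, final_time=None, w_norm=1.0):
    """Immediate reward: alpha * normalized decrease in total waiting time,
    plus a terminal bonus beta * y**T granted only once evacuation has finished.
    w_norm is an assumed normalization constant."""
    delta_w = (w_prev - w_next) / w_norm   # positive when waiting time decreased
    r = alpha * delta_w
    if final_time is not None:             # evacuation finished at time T
        r += beta * (y ** final_time)      # shorter T (with y in (0,1)) -> larger bonus
    return r


# Waiting time fell from 120 s to 90 s -> positive immediate reward:
r = reward(120.0, 90.0, alpha=0.5, beta=0.5, y=0.99, w_norm=100.0)  # 0.15
```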
It should be noted that the purpose of the elevator dispatching optimization of the invention is to minimize evacuation time and waiting time; other reward functions added, substituted, or improved on this basis are likewise included in the protection scope of the invention.
Assuming the elevator executes T instructions in the current evacuation simulation training, the reward values of the T instructions are summed to obtain the total reward value:
R = Σ_{t=1}^{T} r_t ;(4)
S4, evacuation simulation training is repeated on the building simulation scene and the total reward value corresponding to each evacuation simulation is calculated; the repetition stops once several consecutively calculated total reward values reach a convergence state, yielding a trained evacuation elevator dispatching model used to generate the evacuation elevator dispatching strategy;
specifically, following steps S2 and S3, evacuation simulation training is performed on the building simulation scene repeatedly until the total reward value reaches a convergence state, yielding a converged policy Q-network, i.e., the evacuation elevator dispatching model.
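The outer training loop described above can be sketched as follows. The plateau test and the stub episode are illustrative assumptions: the patent requires convergence of consecutive total reward values but does not fix a specific criterion.

```python
def converged(totals, window=5, tol=1.0):
    """Plateau test: training is treated as converged once the last
    `window` total reward values vary by no more than `tol` (an assumed
    criterion; the patent leaves the convergence rule open)."""
    if len(totals) < window:
        return False
    recent = totals[-window:]
    return max(recent) - min(recent) <= tol

def run_episode(n):
    """Stub standing in for one evacuation simulation (steps S2-S3):
    total rewards improve over early episodes, then plateau."""
    return min(10.0 + 2.0 * n, 20.0)

# Repeat evacuation simulation training until the totals converge.
totals = []
episode = 0
while not converged(totals):
    episode += 1
    totals.append(run_episode(episode))
```

With this stub, the totals reach 20.0 at episode 5 and the loop stops once five consecutive totals sit within the tolerance.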
Finally, the evacuation elevator dispatching model is applied to elevator control in a real building:
the current real-time environment state of the building is obtained through Internet-of-Things technology: monitoring cameras in front of the elevator doors capture the personnel situation, and data extraction, fusion and processing yield information such as the number of people waiting in front of each elevator door at the current moment and their current waiting times; the elevator's pressure sensor gives the number of people currently in the car; and the floor where the elevator is currently located is likewise obtained;
the real-time environment state is input into the evacuation elevator dispatching model, which generates an evacuation elevator dispatching strategy for adaptive elevator regulation and control, so that personnel are evacuated safely in the shortest time.
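A deployment-time sketch of this step is given below, under an assumed field ordering for the state vector and a hypothetical trained network `q_net` (any callable mapping the state vector to one Q-value per candidate floor):

```python
def build_state(waiting, wait_times, elevator_full, elevator_floor,
                floor_demand, elapsed):
    """Assemble the real-time state vector from IoT readings: per-floor
    queue lengths (door cameras), per-floor cumulative waiting times,
    the car's full-load flag (pressure sensor), its current floor, the
    evacuation demand at that floor, and elapsed evacuation time. The
    field order is an assumption consistent with the first state
    feature vector described above."""
    return list(waiting) + list(wait_times) + [
        int(elevator_full), elevator_floor, floor_demand, elapsed]

def dispatch(q_net, state):
    """Greedy inference: the trained policy network scores every
    candidate target floor and the car is sent to the argmax
    (no exploration at deployment time)."""
    q_values = q_net(state)  # one value per candidate floor
    return max(range(len(q_values)), key=q_values.__getitem__)
```

At deployment the ε-greedy exploration used in training is dropped: the model's highest-valued action is executed directly.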
It should be noted that the evacuation scene in the present invention is set as downward evacuation; underground spaces such as subways evacuate upward, and dispatch optimization for upward evacuation can likewise be realized with this technique, which is also included in the protection scope of the present invention.
Example 2:
As shown in fig. 4, the present embodiment provides an evacuation elevator dispatching policy optimization device, which includes:
a scene building module, configured to construct a building simulation scene containing an elevator, construct a personnel evacuation scene in the building simulation scene, and determine an initial environment state in the personnel evacuation scene;
a first simulation module, configured to perform evacuation simulation training on the building simulation scene: establishing a learning network, and inputting the initial environment state into the learning network to generate a current instruction; the elevator executes the current instruction to enter a next environment state, the next environment state is repeatedly input into the learning network to obtain the next instruction, and instruction generation stops once a termination condition is reached;
a calculation module, configured to calculate the reward value corresponding to each instruction in the evacuation simulation training, and to sum the reward values corresponding to all instructions to obtain the total reward value of the evacuation simulation training;
a second simulation module, configured to repeat the evacuation simulation training on the building simulation scene and calculate the total reward value corresponding to each evacuation simulation, the repetition stopping once several consecutively calculated total reward values reach a convergence state, so as to obtain a trained evacuation elevator dispatching model used for generating an evacuation elevator dispatching strategy.
Based on the above embodiment, the first simulation module includes:
an establishing unit, configured to establish a deep Q network and initialize the parameters in the deep Q network;
a first construction unit, configured to obtain, from the initial environment state, the number of people waiting for the elevator on each floor at the current moment, the total waiting time of personnel on each floor, whether the elevator is fully loaded, the floor where the elevator is located and the number of people to be evacuated there, and the total evacuation time, so as to construct a first state feature vector;
a first generation unit, configured to input the first state feature vector into the deep Q network to generate all actions executable by the elevator and the values corresponding to all the actions;
a first selection unit, configured to determine one action from all the actions as the current instruction by the ε-greedy method according to the values corresponding to all the actions.
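The ε-greedy selection performed by the first selection unit can be sketched as follows (ε = 0.1 is an illustrative value; the patent does not fix ε):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """ε-greedy selection over the Q-values the network produced for
    all executable actions: with probability ε pick a random action
    (exploration), otherwise pick the highest-valued one (exploitation)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

During training ε is typically decayed toward 0 so that early episodes explore the action space while later episodes exploit the learned values; the decay schedule is a common design choice, not specified here.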
Based on the above embodiment, the first simulation module further includes:
a first acquisition unit, configured to acquire the floor where the elevator is currently located from the initial environment state; if the elevator is at the bottom floor, the elevator rises to the target floor and stays there;
a first judgment unit, configured to judge whether the floor where the elevator is currently located is the target floor:
if yes, the elevator stays at that floor;
otherwise, the elevator runs to the target floor;
and the next environment state is obtained after the elevator finishes executing.
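A minimal transition sketch for these units is given below, folding in the full-load rule stated in the claims (a fully loaded car runs down to the bottom floor); boarding details are simplified away.

```python
def step_elevator(floor, target, full_after_boarding, bottom=1):
    """Execute one dispatch instruction and return the car's next floor:
    travel to (or dwell at) the commanded target floor, board evacuees,
    then head for the bottom (exit) floor if the car is full, else stay
    to continue serving. A simplification of the claimed behaviour."""
    if floor != target:
        floor = target  # run (or rise from the bottom) to the target floor and stop
    # After dwelling and boarding, a fully loaded car heads for the exit floor.
    return bottom if full_after_boarding else floor
```

The "action" chosen by the network is thus simply a target floor; the environment advances by simulating this movement plus the boarding that happens during the dwell.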
Based on the above embodiment, the first simulation module further includes:
a second construction unit, configured to construct a second state feature vector according to the next environment state;
a second generation unit, configured to input the second state feature vector into the learning network to generate all actions executable by the elevator and the values corresponding to all the actions;
a second selection unit, configured to determine one action from all the actions as the next instruction by the ε-greedy method according to the values corresponding to all the actions;
a second judgment unit, configured to judge whether the termination condition is reached:
if yes, the sum of the reward values of the current instruction and all previous instructions is taken as the total reward value of this training;
otherwise, instructions continue to be generated until the termination condition is reached.
The termination condition is that personnel evacuation is completed or that the number of instructions executed by the elevator reaches a first preset number.
Based on the above embodiments, the calculation module includes:
a third acquisition unit, configured to acquire the first total waiting time of personnel on each floor in the initial environment state and the second total waiting time of personnel on each floor in the next environment state;
a first calculation unit, configured to calculate the total waiting change time of personnel according to the first total waiting time and the second total waiting time;
a third judgment unit, configured to judge whether all personnel have been evacuated in the next environment state:
if evacuation has ended, the total evacuation time is acquired and the reward value corresponding to the instruction is calculated from the total evacuation time and the total waiting change time;
otherwise, the reward value corresponding to the instruction is calculated from the total waiting change time alone.
Based on the above embodiment, the first simulation module further includes:
a second acquisition unit, configured to acquire the first state feature vector and the second state feature vector;
a fourth judgment unit, configured to judge whether the termination condition is reached and obtain a judgment result;
a first storage unit, configured to store the first state feature vector, the current instruction, the reward value of the current instruction, the second state feature vector and the judgment result together as one sample datum in a preset experience playback set.
Based on the above embodiment, the device further includes an updating module, specifically comprising:
a second storage unit, configured to store, after the elevator executes each instruction, the current state feature vector, the current instruction, the reward value of the current instruction, the next state feature vector and the current judgment result together as one sample datum in the experience playback set;
an extraction unit, configured to randomly extract several sample data from the experience playback set to form a sample set when the number of instructions executed by the elevator reaches a second preset number;
a learning calculation unit, configured to input each sample datum in the sample set into the deep Q network to calculate the target value and the actual value of each sample datum;
a second calculation unit, configured to calculate the value difference of each sample datum in the sample set according to the target value and the actual value;
a third construction unit, configured to construct a mean square error loss function from the value differences of all sample data in the sample set;
an updating unit, configured to back-propagate gradients through the deep Q network according to the mean square error loss function so as to update the parameters of the deep Q network.
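The replay-based update performed by these units can be sketched as follows. A tabular Q (a dict keyed by (state, action)) stands in for the deep Q network, so the "gradient step" is a direct move toward the target value rather than true backpropagation; a real implementation would backpropagate the MSE loss through the network.

```python
import random

def dqn_update(q, q_target, replay, n_actions, batch_size=4, gamma=0.99, lr=0.1):
    """Mini-batch update sketch: sample the experience playback set,
    build the target value r + γ·max_a' Q_target(s', a') (just r at
    terminal transitions), take the value difference against the online
    estimate, accumulate the mean squared error loss, and step the
    online estimate toward the target."""
    batch = random.sample(replay, min(batch_size, len(replay)))
    mse = 0.0
    for s, a, r, s2, done in batch:
        target = r if done else r + gamma * max(
            q_target.get((s2, a2), 0.0) for a2 in range(n_actions))
        diff = target - q.get((s, a), 0.0)         # the "value difference"
        mse += diff * diff / len(batch)            # mean squared error loss
        q[(s, a)] = q.get((s, a), 0.0) + lr * diff # gradient-style step
    return mse
```

The separate `q_target` table mirrors the target-network trick in deep Q-learning: it is synchronized with the online network only periodically, which stabilizes the regression targets.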
It should be noted that, for the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and will not be repeated here.
Example 3:
Corresponding to the above method embodiment, this embodiment further provides an evacuation elevator dispatching policy optimization device; the device described below and the evacuation elevator dispatching policy optimization method described above may be cross-referenced.
Fig. 5 is a block diagram of an evacuation elevator dispatch strategy optimization device 800 shown in accordance with an exemplary embodiment. As shown in fig. 5, the evacuation elevator dispatching policy optimization device 800 may include: a processor 801, a memory 802. The evacuation elevator dispatch policy optimization device 800 may also include one or more of a multimedia component 803, an I/O interface 804, and a communication component 805.
The processor 801 is configured to control the overall operation of the evacuation elevator dispatching policy optimization device 800, so as to complete all or part of the steps in the above-mentioned evacuation elevator dispatching policy optimization method. The memory 802 is used to store various types of data to support the operation of the device 800; such data may include, for example, instructions for any application or method operating on the device 800, as well as application-related data such as contact data, messages, pictures, audio and video. The memory 802 may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as static random access memory (Static Random Access Memory, SRAM for short), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), erasable programmable read-only memory (Erasable Programmable Read-Only Memory, EPROM for short), programmable read-only memory (Programmable Read-Only Memory, PROM for short), read-only memory (Read-Only Memory, ROM for short), magnetic memory, flash memory, magnetic disk or optical disk. The multimedia component 803 may include a screen and an audio component. The screen may be, for example, a touch screen; the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals; the received audio signals may be further stored in the memory 802 or transmitted through the communication component 805. The audio component further comprises at least one speaker for outputting audio signals. The I/O interface 804 provides an interface between the processor 801 and other interface modules such as a keyboard, a mouse or buttons, where the buttons may be virtual or physical.
The communication component 805 is used for wired or wireless communication between the evacuation elevator dispatching policy optimization device 800 and other devices. Wireless communication includes, for example, Wi-Fi, Bluetooth, near field communication (Near Field Communication, NFC for short), 2G, 3G or 4G, or a combination of one or more of them; the corresponding communication component 805 may therefore include a Wi-Fi module, a Bluetooth module and an NFC module.
In an exemplary embodiment, the evacuation elevator dispatching policy optimization device 800 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), digital signal processors (Digital Signal Processor, abbreviated as DSP), digital signal processing devices (Digital Signal Processing Device, abbreviated as DSPD), programmable logic devices (Programmable Logic Device, abbreviated as PLD), field programmable gate arrays (Field Programmable Gate Array, abbreviated as FPGA), controllers, microcontrollers, microprocessors, or other electronic components for performing the evacuation elevator dispatching policy optimization method described above.
In another exemplary embodiment, a computer readable storage medium is also provided comprising program instructions which, when executed by a processor, implement the steps of the above-described evacuation elevator scheduling policy optimization method. For example, the computer readable storage medium may be the above-described memory 802 comprising program instructions executable by the processor 801 of the evacuation elevator dispatching policy optimization device 800 to perform the above-described evacuation elevator dispatching policy optimization method.
Example 4:
corresponding to the above method embodiment, a computer readable storage medium is also provided in this embodiment, and a computer readable storage medium described below and an evacuation elevator dispatching policy optimization method described above may be referred to correspondingly.
A computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the evacuation elevator dispatching policy optimization method of the above method embodiment.
The computer readable storage medium may be a USB flash disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk capable of storing program code.
The above description covers only the preferred embodiments of the present invention and is not intended to limit it; those skilled in the art can make various modifications and variations, and will readily recognize substitutions within its scope. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in its protection scope, which is subject to the protection scope of the claims.

Claims (8)

1. An evacuation elevator dispatch strategy optimization method, comprising:
building a building simulation scene containing an elevator, building a personnel evacuation scene in the building simulation scene, and determining an initial environment state in the personnel evacuation scene;
evacuation simulation training is performed on the building simulation scene: a learning network is established, and the initial environment state is input into the learning network to generate a current instruction; the elevator executes the current instruction to enter a next environment state, the next environment state is repeatedly input into the learning network to obtain the next instruction, and instruction generation stops once a termination condition is reached, the termination condition being that personnel evacuation is completed or that the number of instructions executed by the elevator reaches a first preset number, comprising:
establishing a deep Q network, and initializing parameters in the deep Q network;
acquiring, from the initial environment state, the number of people waiting for the elevator on each floor at the current moment, the total waiting time of personnel on each floor, whether the elevator is fully loaded, the floor where the elevator is located and the number of people to be evacuated there, and the total evacuation time, to construct a first state feature vector;
inputting the first state feature vector into the deep Q network to generate all actions executable by the elevator and the values corresponding to all the actions;
determining one action from all the actions as the current instruction by the ε-greedy method according to the values corresponding to all the actions;
calculating the reward value corresponding to each instruction in the evacuation simulation training, and summing the reward values corresponding to all the instructions to obtain a total reward value of the evacuation simulation training, wherein the method comprises the following steps:
acquiring a first total waiting time of each floor person in an initial environment state and a second total waiting time of each floor person in a next environment state;
calculating to obtain the total waiting change time of the personnel according to the first total waiting time and the second total waiting time;
judging whether people in the next environmental state are completely evacuated or not:
if the evacuation is finished, acquiring total evacuation time, and calculating to obtain a reward value corresponding to the instruction according to the total evacuation time and the total waiting change time;
otherwise, calculating to obtain a reward value corresponding to the instruction according to the total waiting change time;
repeating the evacuation simulation training on the building simulation scene and calculating the total reward value corresponding to each evacuation simulation, the repetition stopping once several consecutively calculated total reward values reach a convergence state, so as to obtain a trained evacuation elevator dispatching model, wherein the evacuation elevator dispatching model is used for generating an evacuation elevator dispatching strategy.
2. An evacuation elevator dispatch strategy optimization method according to claim 1, wherein the elevator executing a current instruction to enter a next environmental state, the current instruction being a destination floor to which the elevator is traveling, comprising:
acquiring the floor where the elevator at the current moment is located from the initial environment state:
if the elevator at the current moment is not at the target floor, the elevator is lifted to the target floor and stopped, and after stopping, whether the elevator is fully loaded is judged: if the elevator is fully loaded, the elevator runs downwards to the bottom layer; otherwise, the elevator continues to run downwards;
if the elevator at the current moment is positioned at the target floor and is not fully loaded, stopping at the target floor, and judging whether the elevator is fully loaded or not after stopping: if the elevator is fully loaded, the elevator runs downwards to the bottom layer; otherwise, the elevator continues to run downwards;
and obtaining the next environmental state after the elevator is executed.
3. The evacuation elevator dispatching policy optimization method of claim 1, wherein inputting the next environment state into the learning network to obtain the next instruction, with instruction generation stopping once the termination condition is reached, comprises:
constructing a second state feature vector according to the next environmental state;
inputting the second state feature vector into a learning network to generate all actions executable by the elevator and values corresponding to all actions;
determining one action from all actions by adopting an epsilon-greedy method as a next instruction according to the values corresponding to all actions;
judging whether a termination condition is reached:
if yes, taking the sum of the rewards of the current instruction and all the previous instructions as the total rewards of the training;
otherwise, the instruction is generated again until a termination condition is reached.
4. The evacuation elevator dispatching policy optimization method of claim 1, wherein after inputting a next environmental state into the learning network to obtain a next instruction, further comprising:
acquiring a first state feature vector and a second state feature vector;
judging whether a termination condition is met or not, and obtaining a judgment result;
and storing the first state characteristic vector, the current instruction, the rewarding value of the current instruction, the second state characteristic vector and the judging result as one sample data into a preset experience playback set.
5. The evacuation elevator dispatching policy optimization method of claim 4, wherein after inputting a next environmental state into the learning network to obtain a next instruction, further comprising updating parameters of the deep Q network, specifically:
storing the current state characteristic vector, the current instruction, the rewarding value of the current instruction, the next state characteristic vector and the current judging result after each instruction execution of the elevator as one sample data into an experience playback set;
when the execution times of the elevator reach a second preset times, randomly extracting a plurality of sample data from the experience playback set to form a sample set;
inputting each sample data in the sample set into a depth Q network to calculate to obtain a target value and an actual value of each sample data; calculating a value difference value of each sample data in the sample set according to the target value and the actual value;
constructing a mean square error loss function by using the value difference values of all sample data in the sample set;
the gradient back-propagation through the deep Q network updates the parameters of the deep Q network according to the mean square error loss function.
6. An evacuation elevator dispatch strategy optimization device, comprising:
a scene building module, configured to construct a building simulation scene containing an elevator, construct a personnel evacuation scene in the building simulation scene, and determine an initial environment state in the personnel evacuation scene;
a first simulation module, configured to perform evacuation simulation training on the building simulation scene: establishing a learning network, and inputting the initial environment state into the learning network to generate a current instruction; the elevator executes the current instruction to enter a next environment state, the next environment state is repeatedly input into the learning network to obtain the next instruction, and instruction generation stops once a termination condition is reached, the termination condition being that personnel evacuation is completed or that the number of instructions executed by the elevator reaches a first preset number, including:
establishing a deep Q network, and initializing parameters in the deep Q network;
acquiring, from the initial environment state, the number of people waiting for the elevator on each floor at the current moment, the total waiting time of personnel on each floor, whether the elevator is fully loaded, the floor where the elevator is located and the number of people to be evacuated there, and the total evacuation time, to construct a first state feature vector;
inputting the first state feature vector into a depth Q network to generate all actions executable by the elevator and values corresponding to all actions;
determining one action from all actions by adopting an epsilon-greedy method as a current instruction according to the values corresponding to all the actions;
a calculation module, configured to calculate the reward value corresponding to each instruction in the evacuation simulation training and to sum the reward values corresponding to all instructions to obtain the total reward value of the evacuation simulation training, including:
acquiring a first total waiting time of each floor person in an initial environment state and a second total waiting time of each floor person in a next environment state;
calculating to obtain the total waiting change time of the personnel according to the first total waiting time and the second total waiting time;
judging whether people in the next environmental state are completely evacuated or not:
if the evacuation is finished, acquiring total evacuation time, and calculating to obtain a reward value corresponding to the instruction according to the total evacuation time and the total waiting change time;
otherwise, calculating to obtain a reward value corresponding to the instruction according to the total waiting change time;
a second simulation module, configured to repeat the evacuation simulation training on the building simulation scene and calculate the total reward value corresponding to each evacuation simulation, the repetition stopping once several consecutively calculated total reward values reach a convergence state, so as to obtain a trained evacuation elevator dispatching model used for generating an evacuation elevator dispatching strategy.
7. An evacuation elevator dispatch strategy optimization device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the evacuation elevator dispatching policy optimization method according to any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the evacuation elevator dispatching policy optimization method of any of claims 1 to 5.
CN202310596462.3A 2023-05-25 2023-05-25 Evacuation elevator dispatching strategy optimization method, device, equipment and readable storage medium Active CN116307655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310596462.3A CN116307655B (en) 2023-05-25 2023-05-25 Evacuation elevator dispatching strategy optimization method, device, equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN116307655A CN116307655A (en) 2023-06-23
CN116307655B true CN116307655B (en) 2023-08-08

Family

ID=86787377

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310596462.3A Active CN116307655B (en) 2023-05-25 2023-05-25 Evacuation elevator dispatching strategy optimization method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116307655B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118036345B (en) * 2024-04-11 2024-06-11 西南交通大学 Subway station elevator configuration scheme optimization method and system

Citations (8)

Publication number Priority date Publication date Assignee Title
CN109271650A (en) * 2018-07-06 2019-01-25 沧州子芩信息科技有限公司 Building disaster evacuation analogy method based on evacuation object social force Evolutionary Game Model
CN111598211A (en) * 2020-04-13 2020-08-28 北京百度网讯科技有限公司 Elevator dispatching model training method and device, electronic equipment and storage medium
CN111881625A (en) * 2020-07-30 2020-11-03 青岛理工大学 Crowd evacuation simulation method and system based on deep reinforcement learning
CN112231968A (en) * 2020-09-09 2021-01-15 山东师范大学 Crowd evacuation simulation method and system based on deep reinforcement learning algorithm
CN114647958A (en) * 2022-05-20 2022-06-21 深圳市城市交通规划设计研究中心股份有限公司 Elevator scene simulation system and method, electronic device and storage medium
CN114862070A (en) * 2022-07-07 2022-08-05 西南交通大学 Method, device, equipment and storage medium for predicting crowd evacuation capacity bottleneck
CN115676539A (en) * 2023-01-03 2023-02-03 常熟理工学院 High-rise elevator cooperative dispatching method based on Internet of things
CN115775055A (en) * 2023-02-10 2023-03-10 西南交通大学 Method, device, equipment and medium for predicting personnel evacuation time of multi-story building

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US20180086598A1 (en) * 2016-09-29 2018-03-29 Otis Elevator Company Group coordination of elevators within a building for occupant evacuation
US10933891B2 (en) * 2018-10-16 2021-03-02 International Business Machines Corporation Railway station platform enhancement


Non-Patent Citations (1)

Title
Comparative study of evacuation strategies for medical buildings based on simulation; Chen Juan et al.; Journal of Safety and Environment; 1-13 *

Also Published As

Publication number Publication date
CN116307655A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
TWI732438B (en) Multi-robot collaborative service method, device, system, robot control equipment, robot, computer readable storage medium and computer program product
CN107879206A (en) Elevator scheduling method, device, equipment and storage medium
CN111625361B (en) Joint learning framework based on cooperation of cloud server and IoT (Internet of things) equipment
CN106315319B (en) Intelligent elevator pre-scheduling method and system
CN116307655B (en) Evacuation elevator dispatching strategy optimization method, device, equipment and readable storage medium
CN110633553B (en) Automatic generation method and system for residential floor plan
CN110929378A (en) High-rise building emergency evacuation method and system based on digital twins and electronic equipment
CN108205265A (en) Elevator scenario simulation method, apparatus, computer equipment and storage medium
CN106182027A (en) Open service robot system
WO2020199690A1 (en) Cloud platform-based sharing learning system and method, sharing platform and method, and medium
CN108351707A (en) Man-machine interaction method and device, terminal equipment and computer readable storage medium
Wang et al. Research on indoor positioning algorithm based on SAGA-BP neural network
Wang et al. Scene mover: Automatic move planning for scene arrangement by deep reinforcement learning
Cortés et al. A viral system algorithm to optimize the car dispatching in elevator group control systems of tall buildings
CN113194493B (en) Wireless network data missing attribute recovery method and device based on graph neural network
CN112016811A (en) AGV intelligent scheduling system and method based on reinforcement learning
CN110458351A (en) People-flow-based area management method, device, equipment and readable storage medium
CN112711200B (en) Intelligent building management system based on big data
Cui et al. Improved genetic algorithm to optimize the Wi-Fi indoor positioning based on artificial neural network
CN111325340A (en) Information network relation prediction method and system
CN117311188A (en) Control method, system and equipment for crowd diversion railings in fixed places
CN117590766A (en) Control method and device for angle adjustment of channel inlet guide rail
CN115775055B (en) Method, device, equipment and medium for predicting personnel evacuation time of multi-storey building
CN109543225A (en) Control program generation method, device, storage medium and the electronic equipment of vehicle
CN107902503A (en) Call method, apparatus, equipment and the storage medium of elevator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant