CN111376954B - Train autonomous scheduling method and system - Google Patents


Info

Publication number
CN111376954B
CN111376954B CN202010481714.4A
Authority
CN
China
Prior art keywords
train
module
simulation
data
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010481714.4A
Other languages
Chinese (zh)
Other versions
CN111376954A (en)
Inventor
韦伟
刘岭
张波
白光禹
张晚秋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CRSC Research and Design Institute Group Co Ltd
Original Assignee
CRSC Research and Design Institute Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CRSC Research and Design Institute Group Co Ltd
Priority to CN202010481714.4A
Publication of CN111376954A
Application granted
Publication of CN111376954B
Legal status: Active

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B61: RAILWAYS
    • B61L: GUIDING RAILWAY TRAFFIC; ENSURING THE SAFETY OF RAILWAY TRAFFIC
    • B61L27/00: Central railway traffic control systems; Trackside control; Communication systems specially adapted therefor
    • B61L27/60: Testing or simulation
    • B61L27/04: Automatic systems, e.g. controlled by train; Change-over to manual control

Landscapes

  • Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Train Traffic Observation, Control, And Security (AREA)

Abstract

The invention provides a train autonomous dispatching method and system. A simulation module receives rail transit data and simulates the actual rail transit system; the simulation module and a deep reinforcement learning module are trained interactively, so that the deep reinforcement learning module obtains a trained dispatching decision model and transmits it to a dispatching scheme module. The simulation module simulates the running state of the train and outputs it to the dispatching scheme module, which generates a dispatching scheme based on that running state and transmits it to the actual rail transit system. Because the simulation module mirrors the actual rail transit system and the deep reinforcement learning module trains the dispatching model, each train adjusts its own operation control strategy according to its running environment. Train operation energy consumption and passenger waiting time are thus reduced while train operation safety and punctuality are guaranteed, and the dispatching offers high real-time performance and flexibility.

Description

Train autonomous scheduling method and system
Technical Field
The invention belongs to the field of rail transit, and particularly relates to an autonomous train scheduling method and system.
Background
In existing transportation organization models, transportation plans are typically compiled from staged passenger flow demand forecasts. Over short horizons, real-time passenger demand fluctuates, so transport supply and transport demand are partly mismatched and the service level of the transport system falls. Meanwhile, various external factors acting on a running train cause its operation to drift away from the timetable and the preset energy-saving control curve, making the punctuality and energy efficiency of train operation difficult to guarantee.
Existing research on train operation scheduling focuses mainly on centralized scheduling, and within that mode most work is based on optimization theory: a train operation scheduling scheme is generated by establishing an optimization model of train scheduling and solving it. In addition, some researchers have studied train operation adjustment strategies from the perspective of simulation.
However, the surrounding environment and transport demand that must be considered while a train runs are extremely complex and strongly time-varying, and centralized scheduling struggles with application scenarios that demand high real-time performance and flexibility. Given the nonlinear, real-time nature of rail transit scheduling tasks, optimization-based scheduling methods not only take too long to solve but are also limited in flexibility under dynamic scheduling scenarios. Meanwhile, most simulation-based methods provide only strategy support or simple strategy generation logic for train operation scheduling schemes, and are difficult to apply to train operation scheduling in complex scenes.
Disclosure of Invention
Aiming at the above problems, the invention provides an autonomous train dispatching method. A simulation module receives rail transit data and simulates an actual rail transit system;
the simulation module and the deep reinforcement learning module are interactively trained, the deep reinforcement learning module obtains a trained scheduling decision model, and the deep reinforcement learning module transmits the trained scheduling decision model to the scheduling scheme module;
the simulation module simulates the current train running state and outputs the current train running state to the scheduling scheme module;
the scheduling scheme module generates a scheduling scheme based on the current train running state;
and the scheduling scheme module transmits the scheduling scheme to the actual rail transit system.
Preferably, the rail transit data includes train dynamics data, train operation plan data, fixed equipment state data, external environment data, passenger flow data, and train-to-train relationship data.
Preferably, the interactive training of the simulation module and the deep reinforcement learning module comprises:
the simulation module generates a train running state S;
generating a train action A based on an ε-greedy policy;
the deep reinforcement learning module calculates a return R (S, A) based on the current train running state S and the train action A;
the simulation module generates a subsequent train operation state S' based on the current train operation state S and the train action A;
constructing N train quadruples (S, A, R, S');
updating the value function neural network parameters with the quadruples (S, A, R, S') until the parameters satisfy a predetermined condition.
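As an illustration of this interactive training loop, the following minimal Python sketch collects such quadruples; sim, agent and their methods are assumed interfaces used for illustration only, not part of the claimed system:

import random

def collect_quadruples(sim, agent, n_samples: int, epsilon: float = 0.1):
    """Collect N train quadruples (S, A, R, S') by interacting with the simulation module."""
    buffer = []
    state = sim.reset()                        # simulation generates train running state S
    while len(buffer) < n_samples:
        if random.random() < epsilon:          # epsilon-greedy: explore a random action
            action = random.choice(sim.possible_actions(state))
        else:                                  # exploit the current value function
            action = agent.best_action(state)
        reward = agent.reward(state, action)   # deep RL module computes the return R(S, A)
        next_state = sim.step(state, action)   # simulation generates subsequent state S'
        buffer.append((state, action, reward, next_state))
        state = next_state
    return buffer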
Preferably, the scheduling scheme module further transmits the scheduling scheme to the simulation module, and the simulation module simulates and forms the train running state based on the scheduling scheme.
Preferably, the simulation module evaluates the resulting train operating state.
Preferably, the evaluation indexes of the simulation module on the train running state comprise train punctuality rate, running safety, train running energy consumption and passenger waiting time.
Preferably, the simulation module calculates the return based on the train scheduling scheme, and feeds the return back to the deep reinforcement learning module, and the deep reinforcement learning module adjusts the scheduling decision model based on the return.
Preferably, the reward calculated by the simulation module based on the train scheduling scheme comprises a train punctual reward, a train safety reward, a train energy consumption reward and a waiting time reward.
Preferably, the deep reinforcement learning module receives rail transit data;
the deep reinforcement learning module performs off-line training based on the rail transit data.
The invention also provides an autonomous train dispatching system, which comprises
The simulation module is used for receiving the rail transit data, simulating an actual rail transit system, performing interactive training with the deep reinforcement learning module, generating a current train running state, and outputting the current train running state to the scheduling scheme module;
the deep reinforcement learning module is used for interactive training with the simulation module and transmitting the trained scheduling decision model to the scheduling scheme module;
and the scheduling scheme module is used for receiving the scheduling decision model trained by the deep reinforcement learning module, generating a scheduling scheme based on the current train running state and transmitting the scheduling scheme to the actual rail transit system.
Preferably, the train autonomous dispatching system further comprises a data acquisition interface, wherein the data acquisition interface is used for acquiring data of the actual rail transit system and sending the acquired data to the simulation module.
Preferably, the data acquisition interface is further configured to send the acquired data to the deep reinforcement learning module.
Preferably, the simulation module comprises:
the rail transit system simulation kernel is used for simulating an actual rail transit system based on rail transit data, receiving a scheduling scheme and generating a train running state based on the scheduling scheme;
the simulation data acquisition and monitoring unit is used for acquiring data in a simulation kernel of the rail transit system;
the technical index counting and evaluating unit is used for evaluating the train running state formed by the rail transit system simulation kernel based on the data collected by the simulation data collecting and monitoring unit;
and the scheduling strategy return evaluation unit is used for calculating a return for the train scheduling scheme based on the data acquired by the simulation data acquisition and monitoring unit and feeding the return back to the deep reinforcement learning module.
According to the train autonomous scheduling method and system, the simulation module simulates the actual rail transit system and the deep reinforcement learning module trains the scheduling model, so each train can adjust its operation control strategy according to its operating environment. Train operation energy consumption and passenger waiting time are thus reduced while train operation safety and punctuality are guaranteed. Scheduling is highly real-time and flexible, adapts to scheduling tasks in complex scenes, and at the same time helps simplify the train scheduling system and reduce its construction cost. Because the train operation strategy is generated directly from the train operating environment, generation and implementation of the train operation scheduling strategy are tightly coupled, intermediate steps are reduced, and the reliability of train operation scheduling improves.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 shows a schematic diagram of train autonomous scheduling principle;
FIG. 2 shows a deep reinforcement learning module and a simulation module interaction diagram;
FIG. 3 illustrates a schematic diagram of train autonomous dispatch;
FIG. 4 shows a hub/station simulation content relationship diagram;
FIG. 5 illustrates a wire mesh transportation process simulation content relationship diagram;
FIG. 6 shows a schematic diagram of the passenger latency cost calculation based on OD-SpaseSTnet;
FIG. 7 shows a schematic diagram of the energy consumption cost calculation when the train takes action A_t;
FIG. 8 illustrates a safety interval overrun cost calculation diagram;
FIG. 9 shows a diagram of a quasi-point overrun cost calculation;
FIG. 10 shows a value function neural network architecture diagram;
FIG. 11 shows a detailed flow chart of a DDQN considering prioritized empirical playback;
FIG. 12 is a schematic diagram of an autonomous train dispatching system;
FIG. 13 shows a schematic structural diagram of a simulation module;
FIG. 14 illustrates a track transportation network train autonomous dispatch distributed implementation architecture diagram;
FIG. 15 is a partial schematic structural diagram of a simulation module and a deep reinforcement learning module;
FIG. 16 is a partial structural diagram of the simulation module and the deep reinforcement learning module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
To solve the problems of the centralized scheduling mode, an embodiment of the present invention provides a method for autonomous scheduling of trains, and the principle of the method is shown in fig. 1. In the train autonomous dispatching mode, each train adjusts the train operation strategy according to the train state, the established transportation plan, the relation with other trains, the train operation energy consumption and the station passenger flow, so that the train operation energy consumption and the passenger waiting time are reduced on the premise of ensuring the train operation safety and accuracy. Compared with the traditional centralized scheduling, the distributed train scheduling has the following advantages:
the scheduling decision is carried out aiming at a single train, the scheduling instantaneity and flexibility are high, the scheduling method can adapt to scheduling tasks under complex scenes, and meanwhile, the scheduling method is beneficial to simplifying a train scheduling system and reducing the system construction cost;
the train operation strategy is directly generated according to the train operation environment, high coupling of generation and implementation of the train operation scheduling strategy can be achieved, intermediate processes are reduced, and reliability of train operation scheduling is improved.
According to the train autonomous scheduling principle, a train intelligent scheduling simulation module must first be constructed; the simulation module simulates the actual rail transit system, and a DDQN algorithm (a deep reinforcement learning algorithm) is used to train the train autonomous scheduling decision agent so as to guarantee train operation safety and punctuality while reducing train operation energy consumption and passenger waiting time. The train autonomous scheduling decision model trained by deep reinforcement learning can then dynamically generate train autonomous scheduling schemes in the actual operating environment.
In reinforcement learning, the deep reinforcement learning module learns by trial and error, guiding its behavior with the returns obtained from interacting with the simulation module, and thereby realizes a decision-making agent for a given target task. The interaction between the deep reinforcement learning module and the simulation module is shown in fig. 2. To complete a task, the deep reinforcement learning module first acts on the simulation module; under that action and the environment dynamics the simulation module produces a new state and simultaneously gives an immediate return. Cycling in this way, the deep reinforcement learning module continuously interacts with the environment and generates a large amount of data. The deep reinforcement learning algorithm uses the generated data to revise its action strategy, interacts with the simulation module again to produce new data, and uses the new data to improve its behavior further; after repeated iterative learning, it finally learns the optimal actions for completing the task (that is, the optimal strategy generating those actions).
The train dispatching in the train autonomous dispatching problem is modeled by using a deep reinforcement learning method, and a simulation module and a train dispatching intelligent agent are firstly designed. In the train operation scheduling simulation module, a plurality of entities are involved, and not all the entities need to be abstracted. The train is the most important element in the railway network system and is also the main subject of intelligent scheduling research of train operation. Therefore, in the embodiment, a train is taken as a main unit for train autonomous scheduling, and in the simulation module, other objects such as lines, stations, facility equipment, passengers and the like obtained through simulation interact with the train autonomous scheduling intelligent agent.
The train autonomous dispatching intelligent body is used as a highly autonomous entity, can dynamically acquire external environment information according to a designed rule, and has own knowledge and decision judgment capability under special conditions, so that the train autonomous dispatching intelligent body can adapt to a complex road network environment. The autonomous train dispatching intelligent agent structure is shown in figure 3. The intelligent decision-making module for train dispatching is the core part of the intelligent agent for train autonomous dispatching, and through the module, the intelligent agent for train autonomous dispatching can obtain the relation with other trains through the communication module according to the dynamic characteristics of the trains, the operation plan of the trains and the states of the fixed equipment, and can carry out intelligent decision-making in real time, thereby ensuring the safety and the accuracy of train operation, and further reducing the energy consumption of train operation and the waiting time of passengers.
The train autonomous dispatching model and the dispatching scheme are researched by deep reinforcement learning, and besides the modeling of the train autonomous dispatching intelligent body of the rail transit, the simulation of the whole actual rail transit system is needed. The rail transit system simulation comprises two parts, namely hub/station simulation and wire network transportation process simulation. The simulation is carried out on the actual rail transit system, and the mutual influence relations between trains, between trains and transportation plans and between trains and passenger flows can be considered in the train autonomous dispatching model, so that the safety and the punctuality of train operation are ensured by optimizing the train autonomous dispatching scheme, and the train operation energy consumption and the passenger waiting time are reduced.
1. Hub/station simulation
The hub/station simulation comprises a hub/station model building part, a train in-and-out operation simulation part, a hub/station internal flow line simulation part, a hub/station passenger microscopic behavior simulation part and the like. The main simulation contents of the hub/station simulation are shown in fig. 4.
And (3) environment construction: pivot network construction
The construction of the hub network is mainly realized according to a hub topological network diagram and the layout of main facility equipment in the hub, and the constructed hub network needs to reflect the relative relationship of facility equipment in the hub, the logical relationship among main functions and the logical relationship between the interior of the hub and the whole network.
Train flow simulation: train in-out simulation
The simulation of the train station entering and exiting operation needs to realize that the train in the junction strictly finishes the station entering and exiting operation in the junction according to the train station entering and exiting time information and the train receiving and exiting route plan. The function needs to be based on a train schedule and a train receiving and dispatching operation plan, and the matching of a train receiving and dispatching route plan and a topological structure of a train yard in a junction needs to be realized when a train runs in a station.
Passenger flow simulation: simulation of internal flow line in junction and microscopic trip chain of passenger in junction
According to a hub network structure and main travel links (including station entrance, station exit, security check, ticket check, boarding and descending, waiting, traveling and the like) of passengers in a hub, a main passenger flow streamline in the hub is designed. The function realizes the matching of the passenger flow streamline and the constructed hub network, and realizes the dynamic simulation evaluation of the passenger flow in the hub according to the actual passenger flow demand.
And acquiring and evaluating a complete travel process of the passenger in the hub according to the travel attribute, the hub characteristic and the guiding information of the passenger, wherein the travel process comprises complete travel information and a corresponding relation between each travel link and facility equipment and traffic flow in the hub.
2. Simulation of wire mesh transportation process
The simulation of the wire net transportation process is to carry out comprehensive simulation on the line, the interval and the wire net passenger flow of the train operation. The method mainly comprises the steps of transport network construction, transport network train flow simulation, train section tracking operation simulation, network passenger flow macroscopic situation simulation and network passenger microscopic trip chain simulation. The main simulation content of the simulation of the wire mesh transportation process is shown in fig. 5.
And (3) environment construction: transport network environment set-up
The construction of the transport network environment can realize the construction of a time-space expansion network which meets the research requirements according to the topological structure of a traffic line network and the information of a train schedule. The transport network contains the main attribute information of the nodes and the arc segments, and can clearly express the relationships among the nodes, among the arc segments and among the nodes and the arc segments in the transport network.
Train flow simulation: train in-out operation simulation and train interval tracking operation simulation
The train operation simulation can realize that all trains in the network operate in the network strictly according to the arrival and departure information in the train schedule and the train operation path. The simulation of train operation requires train schedule information and train operation paths, which need to be based on the transport network.
The train section tracking simulation is to realize the safe and efficient operation of a train in a section by taking a train operation control technology as a core. The function can simulate the train tracking operation under different block systems and obtain the minimum tracking train interval time.
Passenger flow simulation: network passenger flow macroscopic situation simulation and network passenger microscopic trip chain simulation
The function takes the real-time passenger flow as input, realizes the space-time matching of the real-time passenger flow with a transport network and a traffic flow, and predicts the distribution state of the passenger flow in the network in the current and future period of time. The realization of the function is based on the construction of a transport network environment and the simulation of transport network train flow.
According to the travel attribute, the transport network characteristics and the external information of the passenger, the complete travel process of the passenger in the network is obtained through simulation, the travel process comprises complete travel information and the corresponding relation between each travel link and the transport network and between each travel link and the traffic flow, and travel chain evaluation is carried out according to the simulation result.
In the process of value function fitting, the traditional DQN algorithm has the defect of over-estimation, namely, the action value function obtained by network approximation is larger than the real action value function. The ddqn (double DQN) algorithm can effectively solve the over-estimation problem occurring in the DQN algorithm. The application of the DDQN algorithm in the rail transit system can further optimize the autonomous scheduling strategy of rail transit. When the deep reinforcement learning module is trained based on deep reinforcement learning, the aim is to reduce the energy consumption of the train and reduce the waiting time of passengers on the premise of ensuring the safety and the punctuality of each train. In order to simplify the studied train autonomous scheduling process and facilitate deep reinforcement learning modeling, the following assumptions are introduced in this embodiment:
the influence of additional acting force such as air, curves, ramps and the like is not considered in the running process of the train, and the train is regarded as a moving entity which runs on a straight track and is not influenced by other external force except traction force.
The path selection of passengers in the rail transit network obeys the shortest path principle, and in the embodiment, the predicted value of the network OD (origin, origin and Destination passenger flow) matrix is distributed on the road network according to the shortest path principle, so that the inbound passenger flow and the inbound passenger flow of each station of each line are obtained, and the obtained inbound passenger flow and inbound passenger flow are used as a decision basis based on the train autonomous scheduling scheme.
In this embodiment, a single train is taken as the research object, and the running state of each train is defined by the following attributes. For a single train, the running state S_t at time t is represented as

S_t = (L_t, Ps_t, Pt_t, T_t, l_t, y_t, v_t, z_t, σ_t)

wherein L_t represents the interval time between the train and the preceding train at time t; Ps_t is the vector of passenger flows generated per unit time at all stations ahead, predicted at time t for the moments the train is scheduled to arrive at them; Pt_t represents the real-time passenger load of the train at time t; T_t represents the total running time of the train from departure to its current position; l_t is a one-hot code of the line the train occupies at time t; y_t indicates the mileage position of the train on the line; v_t represents the running speed of the train at time t; z_t represents the acceleration of the train at time t; and σ_t indicates whether the train is stopped at a station at time t (0 for not stopped, 1 for stopped).

In Ps_t, let Ps_t^h denote the predicted waiting passenger flow (entering plus transfer-in volume) generated per unit time at station h ahead, evaluated at time t for the moment the train arrives at station h; once the train has passed station h, Ps_t^h is 0. Ps_t and Ps_t^h are then related by

Ps_t = (Ps_t^1, Ps_t^2, …, Ps_t^H)

where H is the number of stations along the train's operating path.
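For concreteness, the state tuple S_t can be carried in a small data structure; the Python sketch below is illustrative only, and all field names are assumptions rather than terms from the patent:

from dataclasses import dataclass
from typing import List

@dataclass
class TrainState:
    """Running state S_t of a single train at time t (illustrative field names)."""
    headway: float              # L_t: interval time to the preceding train
    waiting_flows: List[float]  # Ps_t: per-unit-time waiting flow Ps_t^h per station ahead (0 once passed)
    passenger_load: float       # Pt_t: real-time passenger load
    running_time: float         # T_t: total running time since departure
    line_one_hot: List[int]     # l_t: one-hot code of the current line
    mileage: float              # y_t: mileage position on the line
    speed: float                # v_t: running speed
    acceleration: float         # z_t: acceleration
    stopped: int                # sigma_t: 1 if stopped at a station, else 0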
Return function construction
For a train in state S_t at time t, taking action A_t yields the reward function R(S_t, A_t) given below. This embodiment sets the return value of train operation to the negative of the various types of operation cost (penalty values):

R(S_t, A_t) = -(D(S_t) + C(A_t) + F(S_t) + B(S_t))

wherein D(S_t) is the passenger waiting time cost, C(A_t) is the energy consumption cost of the train taking action A_t, F(S_t) is the safety-interval overrun cost of the train in state S_t, and B(S_t) is the train punctuality overrun cost in state S_t.
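Read literally, the return is the negative sum of the four costs; a one-line Python sketch (the individual cost functions are defined in the following subsections and passed here as plain numbers):

def reward(d_wait: float, c_energy: float, f_safety: float, b_punctual: float) -> float:
    """R(S_t, A_t) = -(D(S_t) + C(A_t) + F(S_t) + B(S_t))."""
    return -(d_wait + c_energy + f_safety + b_punctual)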
For the passenger waiting time cost D(S_t), the network OD traffic matrix is first predicted at time t. Then, according to the passenger flow assignment, the waiting passenger flows Ps_t^h (entering plus transfer-in volumes) at all stations along the line are determined for the times the train is scheduled to arrive; the detailed procedure is shown in fig. 6. The passenger waiting time cost function D(S_t) in the state at time t is then given by

[equation rendered as an image in the original: D(S_t) expressed in terms of ω, L_t, R_h and Ps_t^h]

where ω is the economic cost per unit of passenger waiting time and L_t is the real-time interval between the train and the preceding train at time t. R_h indicates that station h is the R_h-th station ahead on the train's running line: if station h is the 1st station ahead of the train, R_h = 1, and so on; if the train has already passed station h, R_h = 0.
Let the action taken by the train in state S_t at time t be denoted A_t. According to the train type and the relevant parameters of Automatic Train Operation (ATO), if the working conditions of train operation (traction or braking) have n levels, then the driving condition the train may adopt at any time is a value in the positive integer set {1, 2, …, n}, and each driving condition i corresponds to a specific traction or braking acceleration. A_t thus denotes the state-transition action of the train from working condition i at time t to working condition j at time t+1, and the energy consumption cost function C(A_t) of this action can be expressed as

[equation rendered as an image in the original: C(A_t) combining the traction energy cost, the fixed condition-switching cost b_ij and the prohibition penalty K·g_ij]

wherein p_t is the traction power of the train at time t, Δt is the discrete time-interval length of the decision process, c is the unit energy consumption cost, and K is an extremely large positive real number (for example 10^9). b_ij is the fixed economic cost of loss caused by the train switching working conditions once, called the fixed condition-switching cost for short. The function g_ij indicates whether, when the train is in working condition i at time t, switching to working condition j is prohibited out of consideration for train operating stability and passenger comfort: if the switch is prohibited, g_ij is 1, otherwise 0.

When the working condition stays unchanged from time t to time t+1 and the train is accelerating, the train operation energy consumption in the discrete period is the traction energy c·p_t·Δt. When the working condition changes from time t to time t+1, the energy consumption cost of train operation is either the condition-switching cost b_ij alone (train decelerating or moving at constant speed) or the sum of the traction energy and the condition-switching cost (train accelerating).
Fig. 7 shows the energy consumption cost setting principle during train operation.
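Collecting the cases of fig. 7, the following hedged Python sketch of C(A_t) uses the symbols p_t, Δt, c, b_ij and g_ij introduced above; since the published formula is only available as an image, this piecewise reading is an interpretation, not the exact claimed expression:

def energy_cost(i: int, j: int, accelerating: bool,
                p_t: float, dt: float, c: float,
                b: dict, g: dict, K: float = 1e9) -> float:
    """C(A_t) for the transition from working condition i (time t) to j (time t+1).

    b[(i, j)] is the fixed condition-switching cost b_ij; g[(i, j)] is 1 if
    the switch is prohibited for stability/comfort reasons, else 0.
    """
    if g.get((i, j), 0) == 1:          # prohibited transition: huge penalty K
        return K
    traction = c * p_t * dt if accelerating else 0.0   # traction energy cost
    switching = b.get((i, j), 0.0) if i != j else 0.0  # condition-switching cost
    return traction + switching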
The safety-interval overrun cost F(S_t) of the studied train in state S_t is given below, where L_t is the interval time between the studied train and the preceding train at time t, Md is the minimum safe interval time between trains, ρ is the unit economic cost coefficient of train interval-time overrun, and K is an extremely large positive real number:

[equation rendered as an image in the original: F(S_t) as a piecewise function of L_t, Md, ρ and K]

The setting of the safety-interval overrun cost during train tracking operation is shown in fig. 8. When the train interval is less than or equal to the minimum safe interval time Md, the safety-interval overrun cost of train operation equals the very large value K. When the interval between trains exceeds Md, the cost decreases as the margin L_t - Md grows.
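A sketch consistent with the fig. 8 description follows; note the decay shape for L_t > Md is not recoverable from the text, so the inverse-margin form used below is purely an assumption:

def safety_cost(L_t: float, Md: float, rho: float, K: float = 1e9) -> float:
    """F(S_t): K when the headway is at or below the minimum safe interval Md,
    otherwise a cost that decreases as the margin (L_t - Md) grows.
    The 1/(L_t - Md) shape is an assumption, not the published formula."""
    if L_t <= Md:
        return K
    return rho / (L_t - Md)   # assumed decreasing function of the margin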
The punctuality overrun cost B(S_t) of the studied train in state S_t is calculated as follows, where T_t represents the total running time of the train from departure to its current position, T_min is the minimum time allowed for the train to reach the current location, T_max is the maximum time allowed for the train to reach the current location, and μ is the unit economic cost coefficient of the train missing its schedule:

B(S_t) =
    μ (T_min - T_t),   if T_t < T_min
    0,                 if T_min ≤ T_t ≤ T_max
    μ (T_t - T_max),   if T_t > T_max

When the train's arrival time lies between the shortest time T_min and the maximum time T_max, the punctuality cost of the train is 0. When the train arrives earlier than T_min, the punctuality cost increases linearly with the lead time T_min - T_t; when the train arrives later than T_max, the cost increases linearly with the lag time T_t - T_max. In this way the train's arrival time can be constrained to an acceptable range. The punctuality cost setting principle of train operation is shown in fig. 9.
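The piecewise-linear reading above translates directly into code; the names T_min and T_max are illustrative stand-ins for the allowed arrival window:

def punctuality_cost(T_t: float, T_min: float, T_max: float, mu: float) -> float:
    """B(S_t): zero inside the allowed arrival window, linear outside it."""
    if T_t < T_min:                 # early arrival: cost grows with the lead time
        return mu * (T_min - T_t)
    if T_t > T_max:                 # late arrival: cost grows with the lag time
        return mu * (T_t - T_max)
    return 0.0                      # within [T_min, T_max]: punctual, no cost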
The simulation module is initialized, i.e. the running state of the train is initialized, and an ε-greedy strategy is used to explore and collect the system states S_t continuously generated by the simulation module, the actions A_t taken by the studied train, the reward function R(S_t, A_t) composed from the state and action, and the state S_{t+1} reached after the train acts. Dropping the time information from these four items gives the i-th quadruple of current state S, current action A, current reward R and subsequent state S':

(S_i, A_i, R_i, S'_i)

Here the ε-greedy strategy randomly generates a number in the interval [0,1]. If the random number is less than the preset value, an action is selected at random from all possible actions, executed in the simulation module, and its return value and next state are collected; if the random number is not less than the preset value, the train running state is fed into the current value function neural network and the action maximizing the value function,

A = argmax_A Q(S, A; θ),

is taken as the current action.
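A minimal Python sketch of this selection rule, assuming the value network is a callable returning a per-action value mapping:

import random

def epsilon_greedy(state, q_network, actions, epsilon: float):
    """Pick a random action with probability epsilon, otherwise the argmax of Q."""
    if random.random() < epsilon:
        return random.choice(actions)                   # explore
    q_values = q_network(state)                         # Q(S, A; theta) per action
    return max(actions, key=lambda a: q_values[a])      # exploit: argmax_A Q(S, A)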
In DDQN, different value function neural networks are used for action selection and action evaluation: action selection uses the current value function network Q, while action evaluation uses the target value function network Q', as shown below, where γ, a positive real number in the interval (0,1), is the discount coefficient of the return function. This embodiment selects the optimal action with the current parameters θ_t of the current value function network Q, then evaluates the temporal-difference target with the parameters θ'_t of the target value function network Q':

Y_t^DDQN = R_{t+1} + γ · Q'(S_{t+1}, argmax_A Q(S_{t+1}, A; θ_t); θ'_t)
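As a sketch, the target computation separates selection (online network) from evaluation (target network); the two-network interface here is hypothetical:

def ddqn_target(r_next: float, s_next, q_net, target_net, actions, gamma: float) -> float:
    """Y_t^DDQN: select the action with the online network, evaluate it with the target network."""
    q_online = q_net(s_next)                              # Q(S_{t+1}, .; theta_t)
    best_a = max(actions, key=lambda a: q_online[a])      # argmax over possible actions
    return r_next + gamma * target_net(s_next)[best_a]    # evaluate with theta'_t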
When training the deep neural network of the DDQN, training data are generally assumed to be independently and identically distributed; however, data acquired through reinforcement learning are strongly correlated, and training the network on them sequentially is very unstable. Quadruple records (S_i, A_i, R_i, S'_i) are therefore sampled from the explored experience set using experience replay.

Prioritized Experience Replay is a common sampling method. It improves the utilization efficiency of experience data by giving larger sampling weights to samples from which more can be learned. The sampling weights used for prioritized replay are determined from the temporal-difference error: letting the temporal-difference error at sample i be δ_i, the sampling probability of that sample is

P_i = p_i^α / Σ_k p_k^α

where α is the priority replay factor (a value of 0 indicates no prioritized replay, and 1 indicates fully prioritized replay), and n is the current size of the experience data queue, the sum running over k = 1, …, n. P_i is determined by δ_i; in this embodiment, proportional prioritized experience replay (Proportional PER) is used. In the following formula, u is a parameter added to prevent division by zero:

p_i = |δ_i| + u
When sampling with prioritized replay, the probability distribution of the experience data and the probability distribution of the action value function are two completely different distributions. To compensate for this estimation bias, importance sampling coefficients (importance sampling weights) are used, defined as

w_i = ((1/N) · (1/P_i))^β

wherein N is the size of the experience replay queue and β is the importance sampling compensation coefficient: a value of 0 means the importance sampling compensation is not applied, and 1 means it is fully applied.
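The two formulas combine into a small proportional-PER sampler sketch; normalizing w_i by its maximum, common in practice, is omitted because the text does not mention it:

import random

def per_sample(td_errors, alpha: float, beta: float, u: float = 1e-6):
    """Sample one index with P_i = p_i^alpha / sum_k p_k^alpha, p_i = |delta_i| + u,
    and return it with its importance sampling weight w_i = (1/(N*P_i))^beta."""
    n = len(td_errors)
    priorities = [(abs(d) + u) ** alpha for d in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    i = random.choices(range(n), weights=probs)[0]
    w_i = (1.0 / (n * probs[i])) ** beta
    return i, w_i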
The value function of the studied problem is described by a deep neural network. As shown in fig. 10, the network comprises an input layer, an output layer and several hidden layers, the number of which can be configured flexibly according to actual needs; the input is the current state S, and the output is the set of value functions Q(S, A_i) over all currently possible train actions, A_i being the i-th currently possible train action. As noted above, the targets (labels) used to train the value function neural network in DDQN differ substantially from DQN: the target is the evaluation Y_t^DDQN that the target value function network assigns to the optimal action selected by the current value function network (one-hot encoded during training). From the output value of the value function deep neural network and this evaluation value, the loss function of the value function network is

Loss(S_t, A_t) = (Y_t^DDQN - Q(S_t, A_t; θ))^2
Let the parameter set of the value function neural network be θ. By the chain rule, the gradient of the loss function with respect to θ follows from the network structure; its specific form depends on the number of layers and the architecture of the value function network. With experience replay breaking the temporal ordering of the training samples, θ can be updated with the i-th sample as

θ ← θ + η · (Y_i^DDQN - Q(S_i, A_i; θ)) · ∇_θ Q(S_i, A_i; θ)

where η denotes the learning rate. When prioritized experience replay is adopted, the update of the value function network parameters θ must be corrected by the importance sampling weight:

θ ← θ + η · w_i · (Y_i^DDQN - Q(S_i, A_i; θ)) · ∇_θ Q(S_i, A_i; θ)
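One corrected update step, written with an externally supplied gradient of Q to keep the sketch network-agnostic (the helper signature is hypothetical):

import numpy as np

def per_update(theta: np.ndarray, w_i: float, y_i: float, q_i: float,
               grad_q_i: np.ndarray, lr: float) -> np.ndarray:
    """One PER-corrected DDQN update of the value-network parameters theta.

    y_i      : DDQN target Y_i^DDQN for sample i
    q_i      : current estimate Q(S_i, A_i; theta)
    grad_q_i : gradient of Q(S_i, A_i; theta) with respect to theta
    w_i      : importance sampling weight of sample i
    """
    td_error = y_i - q_i                          # temporal-difference error delta_i
    return theta + lr * w_i * td_error * grad_q_i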
in summary, with reference to fig. 11, a detailed flow of the DDQN algorithm considering prior experience playback can be obtained, and after training of the train autonomous scheduling value function neural network based on deep reinforcement learning is completed according to the DDQN algorithm, a trained scheduling decision model is obtained, that is, a train autonomous scheduling scheme can be generated by using the scheduling decision model.
The accuracy of the simulation module is very important for model training. In the digital twins, the simulation module is always kept highly consistent with the actual rail transit system, so that the actual rail transit system state can be conveniently predicted and analyzed by utilizing simulation.
Based on the concept of digital twinning, an autonomous scheduling system as shown in fig. 12-13 is designed. The system mainly comprises a real-time data acquisition interface, a simulation module, a deep reinforcement learning module (comprising a learning type intelligent agent, a deep neural network, a cache playback memory, a return function unit and a data regularization processing unit) and a scheduling scheme module.
The real-time data acquisition interface is mainly used for acquiring real-time train operation data from an actual rail transit system and is used as a data base of the simulation module. And the simulation module is used for carrying out simulation on the actual rail transit system according to the actual system operation data acquired in real time. Because modeling and operation parameter calibration are carried out based on actual operation data, higher consistency exists between the simulation module and an actual rail transit system. The high consistency embodies the concept of digital twinning, and the simulation module is convenient to carry out prediction analysis on the actual rail transit system. The deep reinforcement learning module comprises a learning intelligent agent and other training auxiliary functions, and the learning intelligent agent and the simulation module perform interactive training to obtain a train autonomous dispatching decision model. The deep reinforcement learning module outputs the trained train autonomous scheduling decision model to the scheduling scheme module, so that an autonomous scheduling scheme is automatically generated in the running process of the train.
The actual rail transit system comprises a train running state, a facility equipment state along the line, station/hub station entrance and exit passenger flow volume and a station/hub passenger flow gathering state; the real-time data acquisition interface is mainly used for acquiring real-time train equipment monitoring data, real-time station passenger flow gathering data and real-time station in-out station flow from an actual rail transit system. The train equipment state data is used for providing a foundation for simulation of influence of a fault process of facility equipment on train operation, real-time station passenger flow gathering data provides data support for simulation of passenger flow situations, and real-time train operation actual results transmit actual train operation conditions (including information of position, speed, acceleration, position relation with other trains, schedules and the like) to the simulation module for train operation simulation.
The simulation module is an important support of the whole train autonomous dispatching system and mainly simulates the state of an actual rail transit system and the like. The system comprehensive database stores historical data, real-time data, equipment data, model data, geographic information data and a wire network three-dimensional model. The system sees the facility devices, trains and passengers as agents with independent behavior and attributes. The facility equipment is the basis of the operation of the whole rail transit system, and the simulation of the state evolution of the facility equipment realizes the simulation of the train operation condition caused by the fault of the facility equipment, including the simulation of the behavior functions of the facility equipment such as vehicles, machines, electricity, workers and systems and the simulation of the behavior states of the facility equipment such as vehicles, machines, electricity, workers and systems, so as to facilitate the training of a train dispatching model under a complex operation scene. The train dynamic operation simulation realizes the simulation of a train operation schedule, train dynamics, a driving control process and the like. The simulation of the passenger flow of the station/hub transportation realizes the simulation of the processes of the passenger flow entering and exiting the station, the passenger flow of the platform, the macroscopic passenger flow of the line and the like. The detailed design of the simulation module is described later.
The core of the deep reinforcement learning module is a learning type intelligent agent which can carry out virtual interactive training through the simulation module to realize continuous training and perfection of the train autonomous dispatching model. In order to facilitate the deep reinforcement learning training, the detailed construction and design thereof will be described later.
The scheduling scheme module mainly comprises a train autonomous scheduling scheme generation module and a train autonomous scheduling scheme transmission module. The train autonomous dispatching method comprises the steps that a train autonomous dispatching model trained and completed by a deep reinforcement learning module is based, a train autonomous dispatching scheme generating module generates a real-time train autonomous dispatching scheme, a train autonomous dispatching scheme transmission module transmits the dispatching scheme to an actual rail transit system to implement operation dispatching, and under the premise that the operation safety and the accuracy of a train are guaranteed, the train operation energy consumption and the passenger waiting time are reduced.
In the main working process of the train autonomous dispatching system, firstly, the real-time data acquisition interface acquires real-time train operation data from the actual rail transit system, and the real-time train operation data is used as a data base of the simulation module so as to ensure the high consistency of the simulation module and the actual rail transit system. And secondly, performing continuous interactive training by using the simulation module and the deep reinforcement learning module, and continuously improving the decision-making capability of the train autonomous scheduling model. Meanwhile, the trained train autonomous dispatching model can be evaluated by utilizing the simulation module. And finally, outputting the model trained by the deep reinforcement learning module to a scheduling scheme module, and transmitting the scheduling scheme generated by the scheduling scheme module based on the train autonomous scheduling scheme decision model to an actual rail transit system for implementing the scheduling scheme.
In addition, different from the traditional centralized scheduling mechanism, the intelligent scheduling method and system provided by the embodiment are mainly performed based on an autonomous scheduling mode of the train. A distributed implementation architecture for train autonomous dispatch is shown in fig. 14. In the autonomous dispatching mode, the dispatching of the trains is completed by the autonomous dispatching intelligent agent of each train. Each vehicle-mounted autonomous dispatching intelligent agent is a set of train autonomous dispatching system based on digital twins, and the system has the autonomous evolution capability of a dispatching algorithm. In the autonomous scheduling distributed implementation architecture, the central function of the original centralized scheduling is further weakened, and only the global information sharing function is assumed. The global information sharing service integrates various information such as transportation schemes, facility equipment states, station/hub passenger flow states, operation environments, faults, emergencies, passenger services and the like in the range of the rail transit network to form an information sharing resource pool, and shares all trains in the range of the rail transit network as the basis for perfecting and training a digital twin simulation model. In the line range, real-time information interaction can be carried out among multiple trains of vehicles through the Internet of vehicles, and the real-time performance and accuracy of local information in the line range are further improved.
In addition, the data acquisition interface not only sends acquired data to the simulation module, but also directly sends the data to the deep reinforcement learning module, the deep reinforcement learning module carries out off-line training based on real-time data sent by the data acquisition interface, and the off-line training, the virtual interaction training between the deep reinforcement learning module and the simulation module are synchronously carried out, so that the continuous evolution of the train autonomous dispatching model is realized.
The data acquisition interface further comprises a real-time system transportation situation prediction module. The simulation module sends the new train operation state obtained from interaction with the deep reinforcement learning module to this prediction module, and after prediction the result is sent on to the actual rail transit system.
The simulation module is an important support of the train autonomous dispatching system and mainly comprises a simulation engine, a rail transit system simulation kernel, a simulation data acquisition and monitoring unit, a train dispatching scheme simulation realization interface, a technical index statistics and evaluation unit, a dispatching strategy return evaluation unit, a three-dimensional display unit and the like. The detailed structure of the simulation module is shown in fig. 15-16. In order to illustrate the relationship between the simulation module and the deep reinforcement learning module, the deep reinforcement learning module and the interaction interface and relationship between the deep reinforcement learning module and the deep reinforcement learning module are also included in fig. 15 to 16.
The simulation engine is a bottom support for the operation of the simulation module and mainly comprises simulation module operation control, interface interaction and basic data. The operation control of the simulation module mainly comprises resource allocation, communication management, rhythm control and scene introduction during system operation, and the operation standard of the simulation module is formulated. The interface interaction mainly comprises parameter adjustment, event input and system editing, and is mainly controlled by a simulation worker. The basic data comprises composite network three-dimensional model data, composite network topology data, facility equipment attribute data, evaluation and analysis scene data and macroscopic real-time passenger flow demand data.
The rail transit system simulation kernel mainly comprises a transportation and passenger flow operation evolution simulation module and a facility equipment state evolution simulation module, frequent interaction exists between the two parts, and the two parts are continuously influenced mutually in the simulation operation process so as to simulate the actual rail transit system operation process. The simulation of the transportation and passenger flow operation evolution mainly aims at the simulation of train operation, real-time passenger flow and stations, and comprises network passenger flow macroscopic situation simulation, hub interior passenger flow simulation, individual microscopic trip chain simulation, train tracking operation simulation, train operation schedule simulation and train stop and take-off simulation. The facility equipment state evolution simulation module mainly comprises a train, machine, electricity, power and system facility global function behavior simulation and a state evolution process.
The simulation data acquisition and monitoring unit is used for carrying out omnibearing data acquisition on a simulated rail transit system in a rail transit system simulation kernel, monitoring the train running state, the passenger travel chain, the facility equipment state and the station/hub passenger flow, collecting the acquired data, supporting the functions of technical index statistics and evaluation and dispatching strategy return evaluation, and serving as a training data support of a deep reinforcement learning module.
The simulation of the train dispatching scheme realizes that the interface carries the dispatching scheme explored by the learning type intelligent agent in the deep reinforcement learning module, and the dispatching scheme is implemented in the simulation kernel of the rail transit system.
The technical index counting and evaluating unit counts and evaluates the technical indexes of the train punctuality rate, the operation safety, the train operation energy consumption, the passenger waiting time and the like according to the operation state data of the rail transit system simulated in the simulation kernel provided by the data acquisition and monitoring unit.
The train dispatching strategy return evaluation unit extracts the relevant data from the data acquisition and monitoring unit according to the implementation results of the train dispatching scheme injected through the train dispatching scheme simulation realization interface, and calculates the punctuality return, safety return, energy consumption return and passenger waiting time return of the scheme. These returns then enter the return function calculation unit in the deep reinforcement learning module, which computes the train punctuality overrun cost, the safety-interval overrun cost, the energy consumption cost and the passenger waiting time cost.
The three-dimensional display unit is directly connected with the simulation engine and the rail transit simulation kernel. Through a three-dimensional model it displays, in real time, the states and behaviors of facilities and equipment, the behavior of passenger flows and passengers in stations and hubs, and the train running process, making it convenient for researchers to observe and analyze the simulation intuitively.
The building and running of the simulation module cannot proceed without the support of the simulation engine. The simulation module can use simulation software such as AnyLogic as its platform basis. On top of the software platform, realizing the rail transit system simulation requires the simulation engine to provide input and management functions for large volumes of basic data such as the composite network, passenger flows and facility/equipment attributes; the run control must make explicit the communication management and resource scheduling mechanisms among the simulation module's functions; and the engine should offer a friendly, convenient interactive interface so that researchers can easily edit and modify the simulation model.
The rail transit system simulation kernel is mainly used to simulate the operating conditions of the actual rail transit system, such as the train operation process, the function and state evolution of facilities and equipment, passengers' microscopic trip chains, the macroscopic passenger flow situation, and passenger flow organization within stations or hubs. The simulation of facility and equipment states and functions is the basis of the transportation and passenger flow simulation and determines how well the rail transit system can perform; conversely, the state of transportation and passenger flow affects the load on facilities and equipment and thereby their state and function. The two therefore influence and constrain each other throughout the simulation.
During simulation, the data acquisition and monitoring unit collects comprehensive data on the operating state of the rail transit system simulated in the simulation kernel. After the data are aggregated, they support technical index statistics and evaluation and dispatching strategy return evaluation on one hand, and serve as training input for the learning agent in the deep reinforcement learning module on the other. This process in the simulation module, comprising data acquisition, index and return evaluation, and training of the autonomous train dispatching model, forms the uplink loop of the simulation system: data acquisition, data aggregation, index and return evaluation, and autonomous dispatching model training.
In the simulation module, data acquisition, data aggregation, index and return evaluation and autonomous dispatching model training thus form an uplink loop. Meanwhile, the learning agent in the deep reinforcement learning module, the scheduling scheme module and the train dispatching scheme simulation realization interface form the downlink loop of the simulation module. The dispatching scheme realization interface is the core of this downlink loop; its main task is to feed the train dispatching schemes explored by deep reinforcement learning into the simulation module and to carry out the corresponding evaluation and analysis of their operational effects.
Together, the uplink and downlink loops of the simulation module form the framework for simulation-based training of the whole deep reinforcement learning autonomous dispatching model. First, the data acquisition and monitoring unit provides training data input for the deep reinforcement learning train dispatching model, allowing the rail transit transportation situation to be understood more deeply and targeted dispatching decisions to be developed. Second, the train dispatching schemes explored by deep reinforcement learning are fed into the simulation module for implementation, and their effects are simulated there. Finally, by acquiring the operating condition data of the rail transit system in the simulation module, the simulation data acquisition and monitoring unit evaluates the return of each train dispatching scheme and obtains a return feedback signal, thereby supporting the iterative loop training and optimizing evolution of the autonomous train dispatching model.
The deep reinforcement learning module is the core of the whole autonomous train dispatching system. It mainly comprises a learning agent, a deep neural network, a cache replay memory, a return function unit, a data regularization processing unit and a data transmission unit. The core of the deep reinforcement learning module is the learning agent.
During training of the deep neural network, the learning agent first explores in the simulation module; the data set collected by the simulation data acquisition and monitoring unit (comprising the current state S, current action A, current return R and successor state S') is regularized and automatically loaded into the cache replay memory. Data are then drawn at random from the replay memory to train the deep neural network; whether the value function neural network parameters have reached a predetermined condition is checked, and if so the parameter updates stop, otherwise updating continues, improving the decision-making capability of the learning agent. The agent subsequently performs a new round of exploration, data acquisition and training under the updated value function network, continuously optimizing its decision-making capability. Throughout this process, the communication framework established between the learning agent and the simulation module supports the exchange of states, actions and returns between them.
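To make this loop concrete, below is a minimal DQN-style sketch of the explore, store, sample and update cycle described above. It is an illustration under stated assumptions, not the patented implementation: PyTorch, the ten-feature state, the five dispatch actions, the network width and the ε-greedy constant are all choices made here for the example.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, N_ACTIONS = 10, 5   # assumed: 10 regularized state features, 5 dispatch actions
GAMMA, EPSILON = 0.99, 0.1

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)  # the cache replay memory

def select_action(state):
    """Explore with an epsilon-greedy policy over the value network."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return int(q_net(torch.as_tensor(state, dtype=torch.float32)).argmax())

def train_step(batch_size=32):
    """Randomly sample quadruples (S, A, R, S') and update the value network."""
    if len(memory) < batch_size:
        return None
    s, a, r, s2 = map(np.array, zip(*random.sample(memory, batch_size)))
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64).unsqueeze(1)
    r = torch.as_tensor(r, dtype=torch.float32)
    s2 = torch.as_tensor(s2, dtype=torch.float32)
    q = q_net(s).gather(1, a).squeeze(1)           # Q(S, A)
    with torch.no_grad():                          # TD target from the successor state
        target = r + GAMMA * q_net(s2).max(1).values
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)  # updating stops once a predetermined condition is met

# One illustrative cycle with a fake transition standing in for the simulation module:
s = np.random.rand(STATE_DIM)
a = select_action(s)
memory.append((s, a, -1.0, np.random.rand(STATE_DIM)))
```

In the patent's terms, select_action corresponds to the agent's exploration in the simulation module, and train_step to one update of the value function neural network parameters.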
The learning agent interacts with the simulation module, and this interaction is what trains the autonomous train dispatching algorithm. How the learning agent evolves and which actions it takes are tied to the final training objective; in other words, the action taken now should optimize the whole task sequence. Optimizing the whole task sequence requires the learning agent to interact with the simulation module continuously and keep trying, because at the outset the agent does not know which action in the current state helps achieve the goal. In this embodiment, the objective of the learning agent is to reduce train operation energy consumption and passenger waiting time while ensuring train operation safety and punctuality, thereby lowering the unit cost of operation while improving the user experience.
The deep neural network is an important component of deep reinforcement learning and the principal means of fitting the value function: n quadruples (S, A, R, S') are selected, their gradient values are computed, and the value function neural network parameters are updated with these gradients. Deep reinforcement learning is the product of deep learning and reinforcement learning. In the deep reinforcement learning module, the deep neural network stores the value function network structure and the associated parameter state; the effective representations learned by the learning agent are stored in the value function deep neural network.
Training a deep neural network for the value function presupposes that the training data are independent and identically distributed, whereas data acquired through reinforcement learning are correlated, and training the network on them sequentially is unstable. Therefore the learning agent stores observed data in the database of the cache replay memory and, during training, extracts data from the replay memory by random sampling and trains the deep neural network on the extracted data. This breaks the correlation between data and effectively improves the stability and descriptive capability of the deep neural network.
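A cache replay memory of the kind described can be sketched in a few lines; the class name and the capacity below are illustrative, not taken from the patent:

```python
import random
from collections import deque

class ReplayMemory:
    """Stores observed quadruples (S, A, R, S')."""

    def __init__(self, capacity=50_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling: consecutive (hence correlated) transitions
        # rarely land in the same training batch, which is what breaks the
        # temporal correlation of sequentially collected data.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```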
The return function unit defines the specific task that the learning agent in deep reinforcement learning must complete. The optimal strategy learned through reinforcement learning therefore corresponds to a specific task, and the setting of the return function unit also determines the specific behavior and decision mode of the learning agent. In this embodiment, the return function unit comprises the calculation of the passenger waiting time cost, the energy consumption cost of train actions, the safety interval overrun cost and the punctuality overrun cost, finally obtaining the total return R(S, A); the return function unit thus makes explicit that the training objective of the learning agent is to reduce train operation energy consumption and passenger waiting time while ensuring train operation safety and punctuality.
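The patent specifies only that the total return R(S, A) is assembled from these four cost terms. One natural composition is a negated weighted sum, so that the agent maximizes return by driving all four costs down; in the sketch below the weights and the sign convention are assumptions:

```python
def total_return(wait_cost, energy_cost, headway_overrun_cost, punctuality_overrun_cost,
                 weights=(1.0, 1.0, 10.0, 5.0)):
    """Assumed composition: R(S, A) = -(w1*wait + w2*energy + w3*headway + w4*punctuality).
    Safety (headway) and punctuality are weighted more heavily here, reflecting the
    'under the premise of safety and punctuality' priority stated in the text."""
    w1, w2, w3, w4 = weights
    return -(w1 * wait_cost + w2 * energy_cost
             + w3 * headway_overrun_cost + w4 * punctuality_overrun_cost)

# Example: R = -(12.5 + 8.0 + 10*0.0 + 5*0.3) = -22.0
print(total_return(12.5, 8.0, 0.0, 0.3))
```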
In deep reinforcement learning, the data regularization processing unit mainly regularizes the input training data. Because the value function deep learning network requires the input variables to be regularized (their values and dimensions must meet certain requirements), the input data must undergo regularization processing (including standardization, dimensionality reduction and completion), which improves the training and descriptive performance of the deep neural network.
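The concrete operations are not pinned down beyond "standardization, dimensionality reduction, completion"; the sketch below shows one plausible combination (mean imputation for completion plus z-score standardization), with all names illustrative and dimensionality reduction omitted for brevity:

```python
import numpy as np

def regularize(batch, mean=None, std=None):
    """Mean-impute missing values, then z-score standardize each feature column."""
    x = np.asarray(batch, dtype=np.float64)
    col_mean = np.nanmean(x, axis=0) if mean is None else mean
    x = np.where(np.isnan(x), col_mean, x)          # completion of missing readings
    col_std = x.std(axis=0) if std is None else std
    col_std = np.where(col_std == 0, 1.0, col_std)  # guard against constant features
    return (x - col_mean) / col_std

# Example: three raw state vectors with one missing sensor reading.
raw = [[120.0, 30.0, 5.2], [90.0, np.nan, 4.8], [150.0, 45.0, 6.0]]
print(regularize(raw))
```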
In addition, real-time communication is needed among the functional units of the whole deep reinforcement learning module. The data transmission unit therefore supports real-time communication between the learning agent and the simulation module, between the learning agent and the cache replay memory, between the deep reinforcement learning components and the return function unit, and so on. With this guarantee, the learning agent can interact efficiently with the simulation module while training in real time and storing data and parameters, thereby realizing the continuous training and evolutionary improvement of the autonomous train dispatching model.
In the digital-twin-based autonomous train dispatching system, the deep-reinforcement-learning-based autonomous train dispatching model is the training target of the deep reinforcement learning module and the core of the system. Its training rests mainly on the dynamic interaction between the simulation module and the deep reinforcement learning module. During model training, the train operation process must first be modeled in simulation, or the train operation of the actual rail transit system must be monitored, and the operating state data relevant to dispatching decisions must be collected in real time as the decision basis of the autonomous train dispatching model.
The train running state data mainly comprise the train number, the distance to the preceding train, the number of passengers waiting at the station ahead, the onboard passenger load, the total running time, the train's line, the running mileage, the running speed, the acceleration, and whether the train is dwelling at a station. These data come mainly from the train operation data of the simulation module or of the actual rail transit system, and are preprocessed by the regularization processing unit in the deep reinforcement learning module.
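As an illustration, the listed fields could be packed into a single observation record before regularization; every field name, unit and ordering below is an assumption made for the example:

```python
from dataclasses import dataclass

import numpy as np

@dataclass
class TrainState:
    """One observation S, mirroring the fields listed above."""
    train_no: int
    distance_to_leader_m: float     # distance to the preceding train
    waiting_at_next_station: int    # passengers waiting at the station ahead
    onboard_passengers: int
    total_runtime_s: float
    line_id: int
    mileage_m: float
    speed_mps: float
    acceleration_mps2: float
    dwelling: bool                  # whether the train is stopped at a station

    def to_vector(self) -> np.ndarray:
        """Flatten to the numeric vector fed to the regularization unit."""
        return np.array([
            self.train_no, self.distance_to_leader_m, self.waiting_at_next_station,
            self.onboard_passengers, self.total_runtime_s, self.line_id,
            self.mileage_m, self.speed_mps, self.acceleration_mps2,
            float(self.dwelling),
        ], dtype=np.float64)

obs = TrainState(7, 850.0, 42, 310, 1260.0, 2, 15400.0, 18.5, -0.3, False)
print(obs.to_vector().shape)  # (10,)
```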
The return-function-related data of the autonomous train dispatching model describe the model's training objective and are calculated, stored and managed by the return function unit in the deep reinforcement learning module. These data comprise the passenger waiting time cost, the energy consumption cost of train actions, the safety interval overrun cost and the punctuality overrun cost. They originate from the dispatching scheme return evaluation function module in the simulation module, which collects the data and performs a preliminary calculation; the return function unit in the deep reinforcement learning module performs the final calculation.
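Using the waiting-time term as an example: the claims define a unit waiting cost, a real-time headway and per-station waiting flows, but the original formula image is not recoverable, so the aggregation used below (unit cost times headway times the waiting flow summed from the train's current station onward) is a reconstruction, not the patented formula. A toy calculation under that assumption:

```python
def waiting_time_cost(unit_cost, headway_s, waiting_flow, current_station_idx):
    """Assumed form: C_w^t = c_w * tau_t * sum of waiting passengers q_k^t
    at the stations from the train's current position to the end of the line."""
    return unit_cost * headway_s * sum(waiting_flow[current_station_idx:])

# Example: 0.05 yuan per passenger-second, 120 s headway, waiting flows at
# 5 stations, train currently at station index 1 (i.e. h = 2 on the line).
cost = waiting_time_cost(0.05, 120.0, [30, 42, 18, 25, 11], 1)
print(round(cost, 1))  # 576.0
```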
The data related to the autonomous train dispatching model mainly store the training data and the model parameter set of the deep-reinforcement-learning-based value function neural network. As described above, the input data for training the value function neural network is a quadruple data set consisting of the current system state, the action taken, the total return value and the next state. The value function neural network parameter set dynamically stores the model parameter values during training and is the key to the model's generation of autonomous dispatching decision schemes. In the proposed digital-twin-based autonomous train dispatching system, the autonomous train dispatching model is continuously optimized, so the stored model parameters are dynamically updated as well.
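Since the parameter set is updated dynamically and must persist across training rounds, a routine checkpoint step completes the picture. A minimal PyTorch sketch, with the file name and dictionary keys chosen here for illustration:

```python
import torch

def save_checkpoint(q_net, optimizer, step, path="dispatch_model.pt"):
    """Persist the value-network parameter set (and optimizer state) during training."""
    torch.save({"step": step,
                "model": q_net.state_dict(),
                "optim": optimizer.state_dict()}, path)

def load_checkpoint(q_net, optimizer, path="dispatch_model.pt"):
    """Restore a previously saved parameter set to continue training or decide."""
    ckpt = torch.load(path)
    q_net.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optim"])
    return ckpt["step"]
```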
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. An autonomous train dispatching method is characterized in that,
the simulation module receives the rail transit data and simulates an actual rail transit system;
the simulation module and the deep reinforcement learning module are interactively trained, the deep reinforcement learning module obtains a trained scheduling decision model, and the deep reinforcement learning module transmits the trained scheduling decision model to the scheduling scheme module;
the simulation module simulates the train running state and outputs the train running state to the scheduling scheme module;
the scheduling scheme module generates a scheduling scheme based on the train running state;
the scheduling scheme module transmits the scheduling scheme to an actual rail transit system;
the interactive training of the simulation module and the deep reinforcement learning module comprises the following steps:
the simulation module generates a train running state S;
a train action A is generated based on an ε-greedy policy;
the deep reinforcement learning module calculates a return R(S, A) based on the train running state S and the train action A;
the simulation module generates a successor train running state S' based on the train running state S and the train action A;
N train quadruples (S, A, R, S') are generated;
the quadruple data (S, A, R, S') are used to update the value function neural network parameters until the value function neural network parameters reach a predetermined condition;
the return R(S, A) includes a passenger waiting time cost C_w; for the passenger waiting time cost C_w, a network OD passenger flow matrix at time t is first predicted, and then, according to the passenger flow assignment, the waiting passenger flow q_h^t of the train arriving at each station along the line according to the timetable is determined, where q_h^t represents the sum of inbound and transfer-in passengers; the passenger waiting time cost C_w^t at time t is given by the following formula:

$$C_w^t = c_w \cdot \tau_t \cdot \sum_{k \ge h} q_k^t$$

where c_w is the economic cost per unit of passenger waiting time, τ_t is the real-time headway between the train and the preceding train at time t, and h denotes which station of the train operation line the train is at (if the train is at the 1st station of the line, h takes the value 1, and so on); the sum runs over the stations from the train's current station h to the end of the line.
2. The train autonomous scheduling method of claim 1, wherein the rail transit data includes train dynamics data, train operation plan data, fixed equipment state data, external environment data, passenger flow data, and train-to-train relationship data.
3. The train autonomous scheduling method of any of claims 1-2 wherein the scheduling plan module further delivers the scheduling plan to a simulation module, the simulation module forming a train operating status based on the scheduling plan simulation.
4. The train autonomous scheduling method of claim 3, wherein the simulation module evaluates the formed train operation state.
5. The train autonomous scheduling method of claim 4, wherein the indicators of the evaluation of the train operation state by the simulation module include train punctuality rate, operation safety, train operation energy consumption and passenger waiting time.
6. The train autonomous scheduling method of claim 3, wherein the simulation module calculates a return based on the scheduling scheme and feeds the return back to the deep reinforcement learning module, and the deep reinforcement learning module adjusts the scheduling decision model based on the return.
7. The train autonomous scheduling method of claim 6, wherein the returns calculated by the simulation module based on the scheduling scheme include a train punctuality return, a train safety return, a train energy consumption return and a waiting time return.
8. The train autonomous scheduling method of any one of claims 1-2,
the deep reinforcement learning module receives rail transit data;
the deep reinforcement learning module performs off-line training based on the rail transit data.
9. An autonomous train scheduling system, comprising:
the simulation module is used for receiving the rail transit data, simulating an actual rail transit system, performing interactive training with the deep reinforcement learning module, generating a train running state and outputting the train running state to the scheduling scheme module;
the deep reinforcement learning module is used for interactive training with the simulation module and transmitting the trained scheduling decision model to the scheduling scheme module;
the scheduling scheme module is used for receiving the scheduling decision model trained by the deep reinforcement learning module, generating a scheduling scheme based on the train running state and transmitting the scheduling scheme to the actual rail transit system;
the interactive training of the simulation module and the deep reinforcement learning module comprises the following steps:
the simulation module generates a train running state S;
a train action A is generated based on an ε-greedy policy;
the deep reinforcement learning module calculates a return R(S, A) based on the train running state S and the train action A;
the simulation module generates a successor train running state S' based on the train running state S and the train action A;
N train quadruples (S, A, R, S') are generated;
the quadruple data (S, A, R, S') are used to update the value function neural network parameters until the value function neural network parameters reach a predetermined condition;
the return R(S, A) includes a passenger waiting time cost C_w; for the passenger waiting time cost C_w, a network OD passenger flow matrix at time t is first predicted, and then, according to the passenger flow assignment, the waiting passenger flow q_h^t of the train arriving at each station along the line according to the timetable is determined, where q_h^t represents the sum of inbound and transfer-in passengers; the passenger waiting time cost C_w^t at time t is given by the following formula:

$$C_w^t = c_w \cdot \tau_t \cdot \sum_{k \ge h} q_k^t$$

where c_w is the economic cost per unit of passenger waiting time, τ_t is the real-time headway between the train and the preceding train at time t, and h denotes which station of the train operation line the train is at (if the train is at the 1st station of the line, h takes the value 1, and so on); the sum runs over the stations from the train's current station h to the end of the line.
10. The train autonomous dispatching system of claim 9, further comprising a data acquisition interface, wherein the data acquisition interface is configured to acquire data of an actual rail transit system and send the acquired data to the simulation module.
11. The train autonomous dispatching system of claim 10, wherein the data collection interface is further configured to send the collected data to the deep reinforcement learning module.
12. The train autonomous dispatching system of claim 9, wherein the simulation module comprises:
the rail transit system simulation kernel is used for simulating an actual rail transit system based on rail transit data, receiving a scheduling scheme and generating a train running state based on the scheduling scheme;
the simulation data acquisition and monitoring unit is used for acquiring data in a simulation kernel of the rail transit system;
the technical index counting and evaluating unit is used for evaluating the train running state formed by the rail transit system simulation kernel based on the data collected by the simulation data collecting and monitoring unit;
and the scheduling strategy return evaluation unit is used for calculating return on the scheduling scheme based on the data acquired by the simulation data acquisition and monitoring unit and feeding back the return to the deep reinforcement learning module.
CN202010481714.4A 2020-06-01 2020-06-01 Train autonomous scheduling method and system Active CN111376954B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010481714.4A CN111376954B (en) 2020-06-01 2020-06-01 Train autonomous scheduling method and system

Publications (2)

Publication Number Publication Date
CN111376954A CN111376954A (en) 2020-07-07
CN111376954B true CN111376954B (en) 2020-09-29

Family

ID=71222241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010481714.4A Active CN111376954B (en) 2020-06-01 2020-06-01 Train autonomous scheduling method and system

Country Status (1)

Country Link
CN (1) CN111376954B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024009211A1 (en) * 2022-07-05 2024-01-11 Fstechnology S.P.A. A method for improving a flow of railway vehicles in a railway station

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112035251B (en) * 2020-07-14 2023-09-26 中科院计算所西部高等技术研究院 Deep learning training system and method based on reinforcement learning operation layout
CN111737826B (en) * 2020-07-17 2020-11-24 北京全路通信信号研究设计院集团有限公司 Rail transit automatic simulation modeling method and device based on reinforcement learning
CN112396835B (en) * 2020-11-05 2021-10-12 杭州图软科技有限公司 Subway and bus passenger flow analysis system based on big data
CN112286203B (en) * 2020-11-11 2021-10-15 大连理工大学 Multi-agent reinforcement learning path planning method based on ant colony algorithm
CN112766591A (en) * 2021-01-27 2021-05-07 同济大学 Shared bicycle scheduling method
CN113255216B (en) * 2021-05-24 2023-01-31 中冶赛迪信息技术(重庆)有限公司 Steelmaking production scheduling method, system, medium and electronic terminal
CN113177377A (en) * 2021-05-24 2021-07-27 北京交通大学 Intelligent urban rail transit network management system based on digital twins
CN113610336B (en) * 2021-06-16 2024-05-31 物泊科技有限公司 Molten iron cross-car scheduling system
CN113401172A (en) * 2021-06-28 2021-09-17 通号城市轨道交通技术有限公司 Method and device for adjusting operation line with train as core
CN113673836B (en) * 2021-07-29 2023-08-18 清华大学深圳国际研究生院 Reinforced learning-based shared bus line-attaching scheduling method
CN113721477A (en) * 2021-07-30 2021-11-30 中国铁道科学研究院集团有限公司通信信号研究所 Railway signal dispatching centralized control and acquisition asynchronous simulation method
CN113650653B (en) * 2021-08-03 2022-09-20 东北大学 Interactive high-speed railway train operation simulation system
CN113525462B (en) * 2021-08-06 2022-06-28 中国科学院自动化研究所 Method and device for adjusting timetable under delay condition and electronic equipment
CN114117883A (en) * 2021-09-15 2022-03-01 兰州理工大学 Self-adaptive rail transit scheduling method, system and terminal based on reinforcement learning
CN113867354B (en) * 2021-10-11 2023-05-02 电子科技大学 Regional traffic flow guiding method for intelligent cooperation of automatic driving multiple vehicles
EP4166419A1 (en) 2021-10-18 2023-04-19 Tata Consultancy Services Limited System and method for railway network access planning
CN113844507B (en) * 2021-10-22 2023-10-10 暨南大学 Train simulation operation system construction method based on digital twin
CN114524007B (en) * 2022-02-07 2023-03-21 北京交通大学 Magnetic levitation running control digital twin system
CN115195829B (en) * 2022-06-06 2024-04-26 交控科技股份有限公司 Time sequence-based driving simulation method and device
CN115027535A (en) * 2022-06-16 2022-09-09 交控科技股份有限公司 Driving command decision-making system and scheduling method thereof
CN114954588A (en) * 2022-06-20 2022-08-30 北京交通大学 Train operation system simulation verification method based on parallel intelligence
CN115118477B (en) * 2022-06-22 2024-05-24 四川数字经济产业发展研究院 Smart grid state recovery method and system based on deep reinforcement learning
CN114987584B (en) * 2022-06-28 2023-07-18 广州地铁设计研究院股份有限公司 Method and electronic equipment for disassembling communication line of newly-built line of urban rail transit
CN115352502B (en) * 2022-08-30 2023-11-24 东南大学 Train operation scheme adjustment method and device, electronic equipment and storage medium
CN115384586B (en) * 2022-10-28 2023-03-24 中国铁道科学研究院集团有限公司通信信号研究所 Railway parallel scheduling system, method and application thereof
CN116443080B (en) * 2023-05-05 2023-12-29 北京交通大学 Rail transit driving dispatching command method, system, equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633317B (en) * 2017-06-15 2021-09-21 北京百度网讯科技有限公司 Method and device for establishing journey planning model and planning journey
CN107194612B (en) * 2017-06-20 2020-10-13 清华大学 Train operation scheduling method and system based on deep reinforcement learning
CN108764526A (en) * 2018-04-20 2018-11-06 西南交通大学 A kind of Train delay artificial neural network recognition methods based on Analysis of Train Operation Order
CN109835375B (en) * 2019-01-29 2021-05-11 中国铁道科学研究院集团有限公司通信信号研究所 High-speed railway train automatic driving system based on artificial intelligence technology
CN110163409B (en) * 2019-04-08 2021-05-18 华中科技大学 Convolutional neural network scheduling method applied to replacement flow shop

Also Published As

Publication number Publication date
CN111376954A (en) 2020-07-07

Similar Documents

Publication Publication Date Title
CN111376954B (en) Train autonomous scheduling method and system
CN111369181B (en) Train autonomous scheduling deep reinforcement learning method and device
CN109740839B (en) Train dynamic adjustment method and system under emergency
WO2021068602A1 (en) Multi-mode multi-service rail transit analog simulation method and system
Yin et al. Research and development of automatic train operation for railway transportation systems: A survey
Yang et al. Collaborative optimization for train scheduling and train stop planning on high-speed railways
Dakic et al. On the design of an optimal flexible bus dispatching system with modular bus units: Using the three-dimensional macroscopic fundamental diagram
CN109508751B (en) Deep neural network model modeling method for high-speed railway train late time prediction
CN111619624B (en) Tramcar operation control method and system based on deep reinforcement learning
CN111105141B (en) Demand response type bus scheduling method
CN114312926B (en) Method and system for optimizing operation adjustment scheme of urban rail transit train
CN110222924B (en) Multi-mode urban rail transit station passenger flow control system and control method
Guangwei et al. Optimization of adaptive transit signal priority using parallel genetic algorithm
Zhang et al. Real-time optimization strategy for single-track high-speed train rescheduling with disturbance uncertainties: A scenario-based chance-constrained model predictive control approach
Ning et al. ACP-based control and management of urban rail transportation systems
CN115384586B (en) Railway parallel scheduling system, method and application thereof
Li et al. Comprehensive optimization of a metro timetable considering passenger waiting time and energy efficiency
CN114004452A (en) Urban rail scheduling method and device, electronic equipment and storage medium
Zhang et al. Coupling analysis of passenger and train flows for a large-scale urban rail transit system
CN110570656A (en) Method and device for customizing public transport line
CN115860594A (en) Simulation system and method applied to intelligent bus scheduling
Xu et al. Parallel dispatching: An ACP-based high-speed railway intelligent dispatching system
CN115352502A (en) Train operation scheme adjusting method and device, electronic equipment and storage medium
CN114021291A (en) Simulation evaluation modeling method for urban rail transit network current limiting scheme
Liu Optimization of Computer-aided Decision-making System for Railroad Traffic Dispatching Command

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant