CN115731724A - Regional traffic signal timing method and system based on reinforcement learning


Info

Publication number: CN115731724A
Application number: CN202211438816.3A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王海泉, 费云帆
Current and original assignee: Beihang University
Application filed by Beihang University
Prior art keywords: regional traffic, signal timing, regional, traffic signal, intersection

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The invention relates to a regional traffic signal timing method and system based on reinforcement learning, belonging to the technical field of traffic signal timing. The method first extracts regional traffic environment data and vehicle trajectory data to obtain the constituent elements of intersections and signal lights, and determines a regional traffic signal timing task from those elements. A regional traffic waiting model is then constructed for the timing task. Finally, the regional traffic waiting model is trained with a multi-agent reinforcement learning algorithm: an optimized policy-based NAC algorithm carries out the coordinated optimization of all agent policies in the region, and an RNN (recurrent neural network) is introduced to process the coordination results generated by that optimization, yielding a regional traffic signal timing scheme. This solves the curse-of-dimensionality problem in existing traffic signal timing.

Description

Regional traffic signal timing method and system based on reinforcement learning
Technical Field
The invention relates to the technical field of traffic signal timing processing, in particular to a regional traffic signal timing method and system based on reinforcement learning.
Background
Several families of traffic signal timing methods are currently in use worldwide. In terms of control range, the field has developed from single-point control to arterial (trunk) control and, today, regional control. The methods themselves have gradually evolved from early statistical schemes to newer schemes controlled by various artificial intelligence algorithms.
1. Historical data statistical prediction method
The historical-data statistical prediction method has been applied to traffic signal timing for the longest time. On one hand, conventional traffic-flow data collection was limited, so only simple data such as the number of vehicles passing an intersection could be obtained, and few features could be extracted from it. On the other hand, the method is simple to operate: the traffic flow of a given time period is counted, the extreme values for each lane direction are accumulated, a mathematical flow model is established, and the corresponding threshold detection is performed, giving a preliminary signal timing scheme for a single intersection; finally, the timing periods of the intersections in the region are jointly adjusted to obtain a regional traffic signal timing result. Although the historical-data statistical prediction method is simple to apply and performs adequately in practice, prediction from simple traffic data alone no longer suffices to support regional traffic signals as traffic-flow data collection becomes diverse and intelligent: the correlation between intersections is not considered, many traffic-flow features are never extracted, and real traffic conditions are not simulated.
2. Value-based reinforcement learning algorithm
With the development of artificial intelligence in recent years, more and more AI algorithms have been widely applied in the traffic field, and reinforcement-learning-based methods have shown clear advantages in traffic signal control and timing. Reinforcement learning adjusts the actions an agent takes according to environmental feedback, requires little prior knowledge of the environment, can adapt to real-time changes in traffic conditions, and offers good interpretability for traffic applications. It is data-driven, self-learning, and model-free. The key process is a cyclic interaction with the environment: the agent takes an action, its state changes, and it receives a reward. The goal of reinforcement learning is to maximize the long-term future reward, i.e., to solve for and maximize the optimal reward function. By linking multiple agents, local optimization gradually expands to global optimization, finding optimal or near-optimal global solutions across the intersections of a region; multi-agent reinforcement learning is therefore a promising direction.
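The agent-environment loop described above can be sketched with tabular Q-learning on a toy single-intersection task. The states, actions, reward values, and dynamics below are illustrative assumptions, not the patent's design:

```python
import random

def q_learning(episodes=200, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    states = ["low_queue", "high_queue"]
    actions = ["keep_phase", "switch_phase"]
    q = {(s, a): 0.0 for s in states for a in actions}

    def step(state, action):
        # Toy dynamics: switching when the queue is high clears it (reward 1);
        # otherwise the queue tends to grow (reward 0 or -1).
        if state == "high_queue" and action == "switch_phase":
            return "low_queue", 1.0
        if state == "low_queue" and action == "keep_phase":
            return rng.choice(states), 0.0
        return "high_queue", -1.0

    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(20):
            # epsilon-greedy action selection
            a = rng.choice(actions) if rng.random() < eps else max(actions, key=lambda a: q[(s, a)])
            s2, r = step(s, a)
            # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
            q[(s, a)] += alpha * (r + gamma * max(q[(s2, a2)] for a2 in actions) - q[(s, a)])
            s = s2
    return q
```

After training, the learned table prefers switching the phase when the queue is high, which is the locally optimal behavior in this toy environment.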
Early multi-agent reinforcement learning methods for traffic signal control were limited, on one hand, by the data-collection capabilities of the time: traffic data volumes were small, and the results did not attract wide attention. On the other hand, such methods were only suitable for multi-intersection scenes in smaller-scale urban traffic. When an evaluation function is combined with the Q-learning algorithm to control the signals of multiple intersections, intersection traffic-state descriptions are used, and the complexity of urban traffic then produces a typical curse-of-dimensionality problem for regional traffic. This dimension-explosion problem cannot be neglected in regional traffic signal timing: as the number of intersections grows and the route space multiplies, the curse of dimensionality arises, making such methods infeasible for large traffic networks. Some scholars have combined reinforcement learning with nonlinear estimation, function approximation, or neural networks, but then model learning and prediction take too long and convergence may not be guaranteed.
In addition, another problem that arises in traffic signal timing is intersection phase coordination. Many methods study independent agents and apply parallel computing to urban multi-intersection scenes; some apply reinforcement learning directly to multiple intersections to handle the random traffic patterns of urban traffic. In urban traffic signal decisions, information such as the traffic flow and waiting times of adjacent intersections is considered, but a coordination mechanism is lacking and the relationships between intersections are not reasonably exploited. To solve this, multi-agent reinforcement learning is generally used to automatically discover more efficient regional signal controllers: each agent controls one signal light, and the coordination between adjacent signal lights is extended using the Max-Plus algorithm. Max-Plus is simple and considers only the coordination between adjacent intersections. When an external coordination mechanism is studied to control regional multi-intersection multi-agent reinforcement learning, the Max-Plus algorithm is also used for prediction, but it has limitations: its computational complexity is high, and it is only suitable for tree-structured networks.
3. Policy-based reinforcement learning algorithm
The policy gradient algorithm belongs to the other branch of reinforcement learning and mainly addresses the discrete-action-space problem in deep reinforcement learning. The policy gradient algorithm is the more direct method: for a deterministic policy, the neural network directly outputs a policy function, i.e., which action to execute in a given state; for a non-deterministic policy, it outputs the probability of performing each action in that state. Some methods propose traffic signal control based on Q-learning, which stores the value function in a Q-value table; such a method cannot adapt to a complex environment, and if the state space is too large, storage and convergence become slow.
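The two policy types described above can be sketched as follows; the per-action preference scores used as input are an assumed interface, not part of the patent:

```python
import math

def softmax(prefs):
    # numerically stable softmax over a list of preference scores
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def stochastic_policy(state_prefs):
    # non-deterministic policy: one probability per available signal action
    return softmax(state_prefs)

def deterministic_policy(state_prefs):
    # deterministic policy: the single action (index) to execute in this state
    return max(range(len(state_prefs)), key=lambda i: state_prefs[i])
```

The stochastic variant returns a full distribution that can be sampled during training, while the deterministic variant simply commits to the highest-preference action.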
The policy gradient idea was first embodied in the NAC algorithm. The PG method converges more slowly than value-based methods because of the high variance of its gradient estimates; the natural actor-critic (NAC) method improves on this by combining the PG method with natural gradients, value estimation, and least-squares temporal-difference Q-learning. The Deep Deterministic Policy Gradient (DDPG) algorithm, a descendant of this actor-critic line, implements a state-sharing mode in which each intersection considers the state information of the entire road network. The globally optimal Q value is derived by estimating the mutual correlation of the intersections, so that multiple agents can acquire each other's spatio-temporal information from the global state, adjust their policies reasonably, and achieve cooperative optimization. Based on the traffic-flow information of different phases and lanes, the timing period, the phase order, and the duration of each phase are decided intelligently, which addresses the suboptimal timing caused by the discrete action decision space of existing intelligent algorithms.
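A minimal tabular actor-critic loop in the spirit of the NAC description above can be sketched on a toy one-intersection task. It uses the vanilla policy gradient rather than the natural gradient, and the environment, state names, and learning rates are illustrative assumptions:

```python
import math
import random

def actor_critic(episodes=300, alpha_a=0.2, alpha_c=0.2, gamma=0.9, seed=1):
    rng = random.Random(seed)
    states = ["low", "high"]                   # queue level at the intersection
    actions = [0, 1]                           # 0 = keep current phase, 1 = switch phase
    prefs = {s: [0.0, 0.0] for s in states}    # Actor parameters (softmax preferences)
    v = {s: 0.0 for s in states}               # Critic state-value estimates

    def policy(s):
        m = max(prefs[s])
        e = [math.exp(p - m) for p in prefs[s]]
        z = sum(e)
        return [x / z for x in e]

    def step(s, a):
        # toy dynamics: switching clears a long queue; idling lets it build up
        if s == "high" and a == 1:
            return "low", 1.0
        if s == "low" and a == 0:
            return rng.choice(states), 0.0
        return "high", -1.0

    for _ in range(episodes):
        s = rng.choice(states)
        for _ in range(10):
            p = policy(s)
            a = 0 if rng.random() < p[0] else 1
            s2, r = step(s, a)
            td = r + gamma * v[s2] - v[s]      # TD error: the Critic's evaluation
            v[s] += alpha_c * td               # Critic update (shrinking the TD loss)
            for b in actions:                  # Actor update along the policy gradient
                grad = (1.0 if b == a else 0.0) - p[b]
                prefs[s][b] += alpha_a * td * grad
            s = s2
    return prefs
```

The Critic's TD error plays the role of the evaluation signal: positive errors reinforce the chosen action, negative errors suppress it.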
In summary, although reinforcement learning algorithms have advanced rapidly in regional signal timing, there is still room for improvement and optimization in regional phase coordination and multi-agent interaction, and in the curse of dimensionality caused by changes in regional network scale. This is reflected in the following two aspects:
The first is in-depth research on phase coordination and multi-agent interaction between regional intersections. In a divided traffic area, the distance between intersections generally does not exceed 500 meters, and vehicles pass through each intersection in the area along their driving trajectories, so adjacent intersections exhibit spatio-temporal correlation; the problem is one of modeling from points to lines to surfaces. The traffic-flow characteristics of intersections do not exist in isolation, and the spatio-temporal relationships between intersections must be given particular weight: congestion at a single intersection affects the surrounding traffic area, while an optimal improvement at a single intersection may increase congestion at adjacent intersections. In many existing reinforcement learning algorithms, each agent considers only its own intersection environment and completes a local optimization, or multi-phase coordinated control between intersections is not considered deeply. Vehicles then wait briefly at some intersections and pass quickly through green phases, but the waiting times at other intersections in the area become too long, and the learning result is poor. In addition, many traffic simulations use, for example, a random vehicle-arrival model and a transition-delay model, which to some extent ignores the internal characteristics of actual regional traffic-flow data; the internal traffic characteristics are not extracted from real data.
Therefore, the states, actions, and rewards of the multiple agents must be suitably designed so that, through communication, interaction, and updating among the agents, the algorithm lets them obtain each other's spatio-temporal state characteristics from the global state of the region's intersections, adjust their policy functions, achieve cooperative optimization of the intersections in the region, and maximize the global regional reward.
The second is solving the curse of dimensionality caused by changes in regional road-network scale. As the number of intersections in a region increases, the route-selection space of vehicles grows and the spatio-temporal features of the traffic data become higher-dimensional, so the curse of dimensionality must be addressed as a key problem. The curse of dimensionality refers to the phenomenon that an algorithm's performance improves as the number of features grows, but beyond a certain point stops improving or even degrades; model learning and prediction then take too long, and convergence cannot be guaranteed. Existing experimental simulations of regional traffic signal timing are usually carried out on a small number of small-scale regions, such as areas of 2-4 intersections, with few simulated vehicles, so the curse of dimensionality does not arise. Once the regional traffic network becomes complex, many otherwise well-performing algorithms run into it.
Therefore, a new regional traffic signal timing method or system is needed that introduces a recurrent neural network suited to the high-dimensional, multivariate character of the established model and optimizes the agents' network-update process in each round, thereby solving the curse of dimensionality; this is the technical problem to be solved in the field.
Disclosure of Invention
The invention aims to provide a reinforcement learning-based regional traffic signal timing method and system, which can realize a signal timing collaborative optimization process among a plurality of intersections in a region and solve the problem of dimension disaster caused by scale increase of a regional road network.
In order to achieve the purpose, the invention provides the following scheme:
a regional traffic signal timing method based on reinforcement learning comprises the following steps:
extracting data of regional traffic environment data and vehicle track data to obtain intersection and signal lamp constituent elements;
determining a regional traffic signal timing task based on the intersection and signal lamp constituent elements;
constructing a regional traffic waiting model based on the regional traffic signal timing task;
and learning and training the regional traffic waiting model by adopting a multi-agent reinforcement learning algorithm, completing the coordinated optimization of each agent's policy in the region by combining the optimized policy-based NAC algorithm, and introducing an RNN (recurrent neural network) to process the coordination results generated in that optimization to obtain a regional traffic signal timing scheme.
Preferably, the constructing a regional traffic waiting model based on the regional traffic signal timing task specifically includes:
constructing a regional traffic environment simulation model;
extracting vehicle track data characteristics; the vehicle track data characteristics comprise the period of the signal lamp, the green signal ratio of the signal lamp and the phase difference of the signal lamp;
and adding the vehicle track data characteristics into the regional traffic environment simulation model by adopting a Python programming algorithm to generate the regional traffic waiting model.
Preferably, the regional traffic waiting model includes:
the road network environment configuration module is used for building each traffic element in the area to generate a basic environment model;
the data acquisition module is used for collecting vehicle running data in the area and extracting vehicle-trajectory spatio-temporal features;
the signal lamp configuration module is used for carrying out basic configuration on the signal lamp of each intersection in the area; the contents of the basic configuration include: the lamp color, the cycle duration and the phase sequence of the signal lamp;
and the evaluation index output module is used for determining evaluation index data of the regional traffic environment.
Preferably, the learning and training of the regional traffic waiting model by using the multi-agent reinforcement learning algorithm, the coordination optimization process of each agent policy in the region by combining the optimized NAC algorithm based on the policy, and the processing of the optimization coordination result generated in the coordination optimization process by introducing the RNN recurrent neural network to obtain the regional traffic signal timing scheme specifically include:
designing basic elements of the multi-agent; the base element includes: status, actions, and rewards;
setting a multi-agent collaborative optimization model based on the basic elements of the multi-agent;
updating the state information of each intersection lane in the area by adopting the multi-agent collaborative optimization model;
aggregating the updated state information of the lanes at each intersection to obtain aggregated state information;
screening the aggregated state information by adopting an attention mechanism to obtain screening state information;
generating a strategy function from the screening state information by adopting the optimized policy-based NAC algorithm; the strategy function is the mapping from the state set to the action set; the NAC algorithm comprises an Actor network and a Critic network;
and inputting the strategy function into the RNN recurrent neural network to obtain the regional traffic signal timing scheme.
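The lane-state aggregation and attention-screening steps above can be sketched as follows; the mean aggregation rule, the dot-product scoring, and the vector sizes are assumptions, since the text does not specify them:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def aggregate_lanes(lane_states):
    # simple mean aggregation of the updated per-lane state vectors
    dim = len(lane_states[0])
    return [sum(s[k] for s in lane_states) / len(lane_states) for k in range(dim)]

def attention_screen(own_state, neighbour_states):
    # score each neighbour against the agent's own state, softmax the scores,
    # and return the attention-weighted sum of neighbour state vectors
    scores = [dot(own_state, n) for n in neighbour_states]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]
    dim = len(own_state)
    return [sum(w[i] * neighbour_states[i][k] for i in range(len(w))) for k in range(dim)]
```

The attention weights let each agent emphasize the neighbours whose state is most relevant to its own, rather than averaging all intersections equally.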
Preferably, the generating a policy function according to the screening status information by using the optimized NAC algorithm based on the policy includes:
each agent updates parameters of the Actor network by maximizing the reward accumulated in the future according to a corresponding objective function;
when the Actor network with updated parameters outputs the next signal period data after selecting each policy's action, the Critic network evaluates that data, so that the strategy function output by the Actor network is continuously coordinated and optimized; the next signal period data includes: the period, the phase order, and the phase duration factor.
Preferably, the parameters of the Critic network are updated by minimizing a loss function.
Preferably, the feature output at the current time in the RNN recurrent neural network is used as the input feature at the next time.
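The recurrence described here (the feature output at the current time fed back as the input at the next time) can be sketched with a single tanh cell; the fixed weights are illustrative numbers, not trained parameters:

```python
import math

def rnn_step(x_t, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    # one recurrent step: the new feature depends on the current input
    # and the feature produced at the previous time step
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run_rnn(xs, h0=0.0):
    h, hs = h0, []
    for x in xs:
        h = rnn_step(x, h)  # current output becomes the next step's hidden input
        hs.append(h)
    return hs
```

Even with a constant input sequence, the outputs differ from step to step because each step folds in the previous step's feature.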
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a reinforcement learning-based regional traffic signal timing method, which comprises the steps of after regional traffic environment data and vehicle track data are subjected to data extraction to obtain intersection and signal lamp constituent elements, determining a regional traffic signal timing task based on the intersection and signal lamp constituent elements, then constructing a regional traffic waiting model based on the regional traffic signal timing task, finally, learning and training the regional traffic waiting model by adopting a multi-agent reinforcement learning algorithm, completing a coordination optimization process of all intelligent agent strategies in a region by combining an optimized strategy-based NAC algorithm, and introducing an RNN (radio network) to process an optimization coordination result generated in the coordination optimization process to obtain a regional traffic signal timing scheme, thereby solving the problem of dimension disaster in the conventional traffic signal timing process.
In addition, the invention also provides a regional traffic signal timing system based on reinforcement learning, which comprises:
the data acquisition unit is used for acquiring regional traffic environment data and vehicle track data;
the memory is connected with the data acquisition unit and used for storing the regional traffic environment data, the vehicle track data and the software control instruction; the software control instruction is used for implementing the provided reinforcement learning-based regional traffic signal timing method;
and the processor is connected with the memory and used for calling and executing the software control instruction to obtain a regional traffic signal timing scheme.
Preferably, the processor comprises:
the element extraction module is used for extracting data of the regional traffic environment data and the vehicle track data to obtain intersection and signal lamp constituent elements;
the timing task determining module is used for determining a regional traffic signal timing task based on the intersection and signal lamp constituent elements;
the waiting model building module is used for building a regional traffic waiting model based on the regional traffic signal timing task;
and the timing scheme generation module is used for learning and training the regional traffic waiting model by adopting a multi-agent reinforcement learning algorithm, completing the coordinated optimization of each agent's policy in the region by combining the optimized policy-based NAC algorithm, and introducing an RNN (recurrent neural network) to process the coordination results generated in that optimization to obtain a regional traffic signal timing scheme.
Preferably, the memory is a computer readable storage medium.
The technical effect achieved by the reinforcement learning-based regional traffic signal timing system provided by the invention is the same as the technical effect achieved by the reinforcement learning-based regional traffic signal timing method, so the details are not repeated herein.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a reinforcement learning-based regional traffic signal timing method according to the present invention;
FIG. 2 is a schematic diagram of multi-agent cooperative optimization provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating an implementation of a reinforcement learning-based regional traffic signal timing method according to the present invention;
FIG. 4 is a schematic diagram of a 3 × 3 intersection area provided by an embodiment of the present invention;
fig. 5 is a statistical diagram of the flow of the simulated traffic flow at the 3 × 3 intersection area provided by the embodiment of the present invention;
FIG. 6 is a schematic diagram of a 4 × 4 intersection area provided by an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a reinforcement learning-based regional traffic signal timing system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a regional traffic signal timing method and system based on reinforcement learning, which can realize a signal timing collaborative optimization process among a plurality of intersections in a region and solve the problem of dimension disaster caused by scale increase of a regional road network.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example one
As shown in fig. 1, the reinforcement learning-based regional traffic signal timing method provided by the invention comprises the following steps:
s1: and extracting the data of the regional traffic environment data and the vehicle track data to obtain intersection and signal lamp constituent elements. In this step, the regional traffic environment data is infrastructure data composed of intersections and lanes. For example, the traffic areas studied in this embodiment may be a 9-intersection regional road network and a 16-intersection regional road network: the 9-intersection regional road network includes 9 intersections and represents a 3 × 3 rectangular region. The 16-intersection regional road network includes 16 intersections and represents a 4 × 4 rectangular region. Regional road networks are connected by lanes at intersections, and include lanes inside the region and entrance lanes and exit lanes at the edge of the region. Each intersection data record information such as the serial number, the position and the signal lamp identification of the intersection. Each piece of lane data records information such as the number, turning direction, speed limit, length, type, start point, and end point (indicated by intersection type data) of a lane. Trajectory data of vehicles describes the behavior of vehicles controlled by signal lamps, such as passing, stopping and the like in an area, and is an important data source for algorithm training and testing. The vehicle starts from the departure time, and its trajectory passes through each intersection and the relevant lanes in the area, and finally leaves the area. The vehicle track data mainly includes vehicle type, length, current passing intersection, current passing lane, waiting time at the current intersection, whether to pass the current intersection, and start time and end time of the vehicle track.
S2: and determining regional traffic signal timing tasks based on intersection and signal lamp constituent elements. Based on the step S1, it is determined that the task is a given area G = < i, E > (i is an intersection set, E is a lane set), based on the vehicle data track data T, an area vehicle waiting model M is generated, and an algorithm f (·) is implemented. The algorithm f (-) can realize the coordination control of the regional traffic signal lamp, and finally generate a regional traffic signal timing task.
S3: and constructing a regional traffic waiting model based on the regional traffic signal timing task. The step of establishing the regional traffic waiting model is mainly divided into two major works of establishing the regional traffic environment simulation model and extracting vehicle space-time characteristics, and the regional traffic waiting model can be generated by combining the two works. The method specifically comprises the following steps:
s31: and constructing a regional traffic environment simulation model.
The regional traffic environment simulation model is built on the urban traffic simulator SUMO (Simulation of Urban MObility) and Python. The basic environment of the regional road network is built in SUMO, while the regional traffic environment is controlled and optimized in Python. Information is exchanged with SUMO through TraCI, the traffic control interface provided by SUMO, which gives access to the running simulation and allows the values of simulation objects to be retrieved and their behavior to be manipulated online.
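A TraCI-based control loop of the kind described above can be sketched as follows. The traci calls follow SUMO's Python API, but the config file name, loop length, and what is done with the retrieved values are illustrative assumptions; running the loop requires a SUMO installation:

```python
def run_region_simulation(sumocfg="region.sumocfg", steps=3600):
    import traci  # SUMO's TraCI bindings; imported lazily since SUMO may be absent
    traci.start(["sumo", "-c", sumocfg])
    try:
        for _ in range(steps):
            traci.simulationStep()  # advance the running simulation one step
            for tls_id in traci.trafficlight.getIDList():
                # values of simulation objects can be retrieved online ...
                phase = traci.trafficlight.getPhase(tls_id)
                # ... and their behaviour manipulated, e.g. by setting the phase
                traci.trafficlight.setPhase(tls_id, phase)
    finally:
        traci.close()

def green_split(green_s, cycle_s):
    # green ratio (green time / cycle time), one of the extracted signal features
    return green_s / cycle_s
```

The timing algorithm would replace the pass-through `setPhase` call with the phases chosen by the agents each cycle.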
The main contents of the simulation-model design are as follows. First, the model constructs the basic SUMO environment from the regional traffic environment data, generating traffic elements such as intersections, lanes, and signal lights in the region. On this environment, vehicle trajectory data is subsequently added through Python code, the corresponding spatio-temporal features are collected and extracted, the final regional traffic waiting model is generated, and the coordinated control of regional traffic timing is performed. In addition, the basic SUMO environment is encapsulated inside the regional traffic waiting model, so that when subsequent research algorithms are applied to new traffic scenarios, changes to the regional road network do not affect the basic construction work of the modeling algorithm, which facilitates further research. Finally, the regional traffic waiting model is connected to a database and computes the evaluation indices required by the regional traffic signal timing algorithm: average waiting time and average queue length. The regional traffic waiting model is designed around four functional modules: the road-network environment configuration module, the data acquisition module, the signal-light configuration module, and the evaluation-index output module. Each module is introduced below:
1) Road network environment configuration module:
The road-network environment configuration module is the most basic and central module of the regional traffic waiting model. It constructs each basic traffic element in the region to generate the basic environment model. First, the position of each intersection in the area and the distances between intersections are determined; the distance between adjacent intersections does not exceed 600 meters, and the simulated spacing between adjacent intersections in this embodiment is between 200 and 500 meters. Second, the intersections are connected by lanes, with three lanes in each direction between two intersections; the lanes are assigned traffic-flow directions, stop-line positions, and turning directions at the intersections. Finally, the boundary of the simulation area is determined, and the region's entrance and exit lanes are configured.
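The grid construction described above can be sketched as follows: an n × n intersection grid whose adjacent spacings are drawn from the stated 200-500 m range, with links connecting horizontal and vertical neighbours. The layout rules beyond the stated spacing are assumptions:

```python
import random

def build_grid(n=3, min_gap=200.0, max_gap=500.0, seed=0):
    rng = random.Random(seed)
    xs, ys = [0.0], [0.0]
    for _ in range(n - 1):
        # adjacent intersections are spaced 200-500 m apart, per the embodiment
        xs.append(xs[-1] + rng.uniform(min_gap, max_gap))
        ys.append(ys[-1] + rng.uniform(min_gap, max_gap))
    nodes = {(i, j): (xs[i], ys[j]) for i in range(n) for j in range(n)}
    edges = []  # each edge stands for the two-way, three-lane link between neighbours
    for (i, j) in nodes:
        if i + 1 < n:
            edges.append(((i, j), (i + 1, j)))
        if j + 1 < n:
            edges.append(((i, j), (i, j + 1)))
    return nodes, edges
```

For n = 3 this yields the 9-intersection (3 × 3) region used in the embodiment, with 12 neighbour links.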
2) A data acquisition module:
the data acquisition module is responsible for collecting data on vehicles travelling in the region and serves as one of the basic acquisition functions; the subsequent vehicle-trajectory spatio-temporal feature extraction is implemented by calling this module. The module simulates the data acquisition devices of real traffic: detectors placed at each intersection and lane record vehicle behaviour in real time, chiefly the driving speed of vehicles, the traffic flow, and the number of vehicles waiting at the intersection, and store the collected data in the corresponding database.
3) The signal lamp configuration module:
the signal lamp configuration module is used for basically configuring signal lamps of the intersection. And configuring the lamp color, the basic period duration, the basic phase sequence and the like of signal lamps at the intersection according to the designed four-phase signal control. And as a basic signal lamp configuration function, the signal lamp configuration module is called and optimized by a subsequent multi-agent reinforcement learning algorithm. In SUMO, the TraCI interface may be invoked to obtain the signal lamp variables, and return the state of the variables or values queried in the last simulation step, or obtain additional parameters of the signal lamp, including the signal lamp number, the lamp color (0-red, 1-yellow, 2-green), and the like.
4) An evaluation index output module:
the evaluation index output module outputs the collected and calculated related evaluation index data. The evaluation index is used for overall evaluation of the regional traffic environment and also is packaged as a function in the basic environment, the collection and calculation processes of the evaluation index do not change along with the change of the regional traffic condition, and the output evaluation index data is used as important analysis and comparison of the algorithm.
The evaluation index output module is mainly designed for encapsulating the basic environment of regional traffic and facilitating the calling of subsequent algorithms and other functions. The basic environment model constructed in the embodiment is suitable for traffic timing algorithms of areas with different sizes, traffic states of traffic flow passing and areas of different types. When various changes are involved in subsequent researches, the modeling method of the basic environment model is not required to be changed, and the method has good expandability.
S32: and extracting vehicle track data characteristics.
The trajectory data comprise the path of each vehicle from its departure time, through each lane and each intersection of the region under the signal timing scheme, until it leaves the region. They contain rich temporal and spatial features and are an important training element of the subsequent reinforcement learning algorithm. These spatio-temporal features accompany the signal timing control process in the region: a vehicle occupies different positions in the region at different moments, and beyond the fact that each trajectory follows its set route, the features are mainly reflected in three timing parameters: the signal cycle, the green split, and the phase difference.
1) Period of time
The cycle is a key parameter in signal timing design. The cycle duration is the sum of the durations of each phase at the intersection: too long a cycle increases the average vehicle waiting time, while too short a cycle leaves vehicles frequently stopping and waiting. In this embodiment, the cycle setting of each intersection in the region differs and is changed in real time according to the training results, so that the duration and order of each phase within the cycle are adjusted and the signal timing is more flexible. The cycle time is, however, bounded by upper and lower thresholds:
T_min ≤ T_i ≤ T_max, i ∈ I
where i is the intersection number in the region, I is the set of intersection numbers, T_i is the cycle time of the signal lamp at the i-th intersection, T_min is the minimum cycle time, and T_max is the maximum cycle time; T_min and T_max may be taken as 10 seconds and 160 seconds, respectively.
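The threshold constraint above reduces to a one-line helper; a minimal sketch, assuming the embodiment's example bounds of 10 and 160 seconds:

```python
# Cycle-time bounds from the embodiment (seconds).
T_MIN, T_MAX = 10.0, 160.0

def clip_cycle(t_proposed: float) -> float:
    """Clip a proposed cycle time so that T_min <= T_i <= T_max holds."""
    return min(max(t_proposed, T_MIN), T_MAX)
```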
2) Green split
The green split is a design parameter for the different signal phases. In this embodiment, the east-west lanes of the simulated regional road network serve as trunk roads and carry a large traffic volume, while the north-south lanes serve as branches, each with a small traffic flow. This feature is embodied in the design of the green split: the first and second phases (east-west) last longer and have longer green time, while the third and fourth phases (north-south) are shorter with shorter green time. Since the signal timing scheme of this embodiment could, under some computed conditions, let one phase occupy a complete cycle, the green split has a maximum limit:
λ_ij ≤ λ_max ≤ 1, i ∈ I, j ∈ {1, 2, 3, 4}
where j is the phase number of the signal lamp at the intersection (the four phases are numbered 1, 2, 3, 4), λ_ij is the green split of phase j at intersection i, and λ_max is the maximum green split.
3) Phase difference
The phase difference is a comprehensive embodiment of the spatio-temporal features of the vehicle trajectory data and a key parameter in signal timing design. This embodiment discusses the relative phase difference, namely the difference between the green start times of a given phase at adjacent intersections. A signal timing scheme coordinated through phase differences forms green waves for vehicles travelling between adjacent intersections and greatly shortens the average waiting time of traffic in the coordinated direction. Consider a straight-going vehicle from intersection A to intersection B: it reaches intersection A at time t_1 just as the straight-going phase turns green, and reaches intersection B at time t_2, where the coordinated phase lets it meet the green light and pass directly. Then
φ_AB = t_2 − t_1
is the phase difference coordinating the direction from A to B; in the same way there is a phase difference φ_BA for the opposite direction. The two coordinated phase differences satisfy
φ_AB + φ_BA = nT
where A and B are the numbers of the two intersections, T is the common cycle time, and n is a positive integer.
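A small sketch of the green-wave reasoning above: if a constant cruise speed is assumed (the constant-speed assumption is this sketch's, not the patent's), the ideal A-to-B phase difference is simply the travel time t_2 − t_1 taken modulo the common cycle, and the complementary relation between the two directions can be checked numerically:

```python
def ideal_offset(distance_m: float, speed_ms: float, cycle_s: float) -> float:
    """Phase difference phi_AB for the A->B green wave: the travel time
    t2 - t1 between the intersections, taken modulo the common cycle."""
    return (distance_m / speed_ms) % cycle_s

def offsets_consistent(phi_ab: float, phi_ba: float, cycle_s: float) -> bool:
    """Check the relation phi_AB + phi_BA = n * T for a positive integer n."""
    n = (phi_ab + phi_ba) / cycle_s
    return n > 0 and abs(n - round(n)) < 1e-9
```

For the embodiment's 280 m intersection spacing at 14 m/s (roughly the 50 km/h limit) under a 100 s cycle, the A-to-B offset is 20 s, and 20 s + 80 s = 1 × 100 s satisfies the relation.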
A long short term memory network (LSTM) is used to extract spatiotemporal features contained in the vehicle trajectory data. The calculation method of each parameter in the LSTM network is as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
in_t = σ(W_in · [h_{t−1}, x_t] + b_in)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t · c_{t−1} + in_t · c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t · tanh(c_t)
where f is the forget gate, in is the input gate, c̃ is the candidate state of the current cell, c is the updated cell state obtained from the previous cell state and the input of the current time step, o is the output gate, h is the hidden state of the current time step, t is the time step, tanh(·) is the hyperbolic tangent function, W_f is the forget-gate weight, W_in the input-gate weight, W_o the output-gate weight, and W_c the state weight, x_t is the input at time t, b_f, b_in, b_o and b_c are the bias coefficients of the forget gate, input gate, output gate and state respectively, σ is the Sigmoid activation function, c_t carries the long-term information, and h_t the short-term information.
The LSTM network is capable of processing the spatiotemporal characteristics of each data point in the vehicle trajectory and using the learned memory for subsequent feature extraction of the signal timing parameters. The data represented by the feature vectors are input to the LSTM layer, which performs calculations in terms of the timing of the data.
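The gate equations above can be traced step by step with a minimal pure-Python LSTM cell. The scalar toy weights are an assumption made for readability; a real feature extractor would use a trained, multi-dimensional LSTM layer:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w, b):
    """One LSTM step over scalar input and state, following the gate
    equations above. w maps gate name -> (weight on h_prev, weight on x_t),
    b maps gate name -> bias; these toy values are not trained."""
    concat = [h_prev, x_t]                         # [h_{t-1}, x_t]
    dot = lambda W: W[0] * concat[0] + W[1] * concat[1]
    f_t = sigmoid(dot(w["f"]) + b["f"])            # forget gate
    in_t = sigmoid(dot(w["in"]) + b["in"])         # input gate
    c_tilde = math.tanh(dot(w["c"]) + b["c"])      # candidate cell state
    c_t = f_t * c_prev + in_t * c_tilde            # updated cell state
    o_t = sigmoid(dot(w["o"]) + b["o"])            # output gate
    h_t = o_t * math.tanh(c_t)                     # hidden state
    return h_t, c_t
```

Feeding a trajectory point sequence through repeated `lstm_step` calls mirrors how the LSTM layer processes the data in temporal order.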
S33: and generating a regional traffic waiting model.
The regional traffic environment simulation model of step S31 and the vehicle trajectory data features extracted in step S32 are combined to generate the regional traffic waiting model, on which the reinforcement-learning-based regional traffic signal timing algorithm is subsequently implemented. The regional traffic waiting model connects to and exchanges information with the regional traffic environment simulation model through the TraCI interface of SUMO. It is generated as follows:
on the basis of the original modules, a regional signal parameter control module and a signal timing output module are added according to the extracted spatio-temporal data features. In the regional signal parameter control module, the subsequent multi-agent reinforcement learning algorithm is applied, according to the features of the extracted signal timing parameters, to control three parameters, namely the signal cycle of each intersection in the region, the green split of each directional phase, and the phase difference of adjacent intersections, and to iteratively optimize the waiting-time function of the vehicles in the region. The signal timing output module receives the algorithm results from the signal parameter control module and outputs a real-time regional traffic signal timing scheme per intersection and per phase to coordinate the passage of vehicles in the region.
S4: a multi-agent reinforcement learning algorithm is adopted to train the regional traffic waiting model; the coordinated optimization of each agent's strategy in the region is completed with the policy-based optimized NAC algorithm, and an RNN (recurrent neural network) is introduced to process the coordination results produced during the optimization, yielding the regional traffic signal timing scheme.
In step S4, according to the multi-agent reinforcement learning process, the method specifically comprises the following steps:
s41: designing basic elements.
In the regional traffic signal timing problem of this embodiment, an agent is set at each intersection in the region to control that intersection's signal timing; the multiple agents are thus distributed over the intersections of the region, and traffic signal timing optimization for the complete region is achieved through communication among the agents and their interaction with the environment. In reinforcement learning, the three elements of the multi-agent system, namely state, action and reward, are designed first:
1) Status of state
The state is a quantitative representation of the regional road network environment observed by the agent at each intersection, and the state space should reflect the characteristics of the regional road network as completely as possible. In a regional traffic environment, the traffic flow reflects the real traffic condition in real time, and the vehicle queue length reflects the congestion of each direction and each lane in the road network. The local environment state observed by the agent at intersection i at time t is therefore first defined as
ob_{i,t} = {f_lane_1, f_lane_2, …, f_lane_{n_lane}, q_1, q_2, …, q_{n_lane}}
where n_lane is the total number of entrance lanes of intersection i (the subsequent calculations of the algorithm and the experimental statistics are all based on per-lane traffic conditions), f_lane is the traffic flow of the corresponding lane, and q is the vehicle queue length of the corresponding lane. The road network state of the complete region comprises the observations of all agents, given by the observation vector
ob_t = {ob_{1,t}, ob_{2,t}, …, ob_{N,t}}.
2) Movement of
At each moment, the agent receives the state information of its intersection and must choose a corresponding action, i.e. a corresponding timing scheme; the choice of action directly determines the effect of the scheme. Existing multi-agent timing algorithms typically select actions at phase granularity, so that phase changes occur in a discretized action space, which causes quantization errors and reduces timing utilization. To overcome the problems caused by the discrete action space, this embodiment uses action factors tied to the signal cycle: their action space is continuous and the timing is more flexible, which is one of the innovations of this embodiment. The action of agent i at time t is defined as
a_{i,t} = {ω, p_1, …, p_k, d_1, …, d_k}
where ω is a factor determining the duration of the next signal cycle (by the upper and lower cycle-time thresholds above, ωT must lie in [T_min, T_max]), k is the number of phases designed for the intersection, the order of {p_1, …, p_k} determines the order of the phases in the next signal cycle, and the phase factors d_1, …, d_k determine the phase durations in the next signal cycle. This continuous action-space design also avoids overly frequent agent decisions and better exploits the structure of the phase transitions.
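A hedged sketch of how such an action could be turned into a concrete timing scheme. Ordering the phases by ascending p, and normalizing the phase factors d into fractions of the new cycle, are assumptions of this sketch; the patent only states that p determines the order and d the durations:

```python
def action_to_timing(omega, p, d, current_cycle, t_min=10.0, t_max=160.0):
    """Translate an action {omega, p_1..p_k, d_1..d_k} into a timing scheme:
    next cycle length (clipped to [t_min, t_max]), phase order, and
    per-phase durations as shares of the new cycle."""
    next_cycle = min(max(omega * current_cycle, t_min), t_max)
    order = sorted(range(len(p)), key=lambda j: p[j])   # phase order from p
    total = sum(d)
    durations = [next_cycle * dj / total for dj in d]   # shares of the cycle
    return next_cycle, order, durations
```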
3) Reward
The reward is the information the regional traffic environment feeds back to the agent; it defines the direction guiding the agent's learning and ultimately evaluates the effect of the algorithm. The reward in reinforcement learning must not be defined too intricately, or the training process becomes too long and the model fails to converge. The core purpose of regional traffic signal timing is to relieve vehicle congestion in the region, reduce the waiting time of vehicles, and improve traffic efficiency. The vehicle waiting time is therefore chosen as the basis of the reward, set as a negative function of the waiting time: the smaller the waiting time, the greater the reward. The reward of agent i at time t is defined as
r_{i,t} = −ε · Σ_{num=1}^{NI} t_num
where NI is the number of all vehicles waiting at intersection i, t_num is the total time the num-th vehicle has waited at intersection i, and ε is a constant, here ε = 0.2.
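The reward definition reduces to a few lines; a minimal sketch with ε = 0.2 as in the embodiment:

```python
EPSILON = 0.2  # reward scaling constant from the embodiment

def reward(waiting_times):
    """Reward of one agent: the negated, epsilon-scaled sum of the waiting
    times of all vehicles at its intersection (shorter waits, larger reward)."""
    return -EPSILON * sum(waiting_times)
```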
S42: and designing a multi-agent collaborative optimization model. In the method provided by this embodiment, based on the state, action and reward of the agent designed in step S41, the algorithm completes the collaborative optimization process of the multi-agent in the area: the region has N agents, each agent has the mapping from the corresponding state to the action space, namely N strategy functions pi = (pi =) 1 ,π 2 ,…,π N ). The local environment state observed by each agent at the time t is ob i,t The agent receives state information from other agents and selects action output a according to the strategy function i,t Will self-state s i,t Awards r to and from the surroundings feedback i,t . It should be noted here that each agent only obtains local environmental status information, i.e. status information adjacent to the intersection where the agent is located, and does not obtain global status. Therefore, the multi-agent reinforcement learning algorithm can enable a plurality of agents to acquire the space-time state characteristics of each other according to the global state of the area through the communication interaction and update among the agents, adjust the policy functions of the agents, realize the cooperative optimization of the intersection of the area, and complete the maximization of the reward of the global area, wherein the reward is
Figure BDA0003947604420000152
Its expanded formalized definition is:
maxR((ob 1 ,…,ob N ),(s 1 ,…,s N ),(a 1 ,…,a N ),(π 1 ,…,π N ))。
where max is a maximum function, R is a reward, ob is an observed value of the agent, s is a state of the agent, a is an action selected by the agent, pi is a policy function, and N is the number of agents
The multi-agent collaborative optimization model of the present embodiment is shown in fig. 2.
S43: and updating the state.
The multiple agents of this embodiment observe their respective local environment states; the local states are then aggregated so that the global information of the whole region can be considered. The states observed by the agents are based on the individual lanes of each intersection: according to the observed lane traffic flow and queue length, each lane is computed independently. Specifically, in each time step a corresponding observation dictionary is maintained to store lane-vehicle information: the numbers of the vehicles entering the range of a lane detector are recorded, the vehicle states are traversed one by one, the waiting time of each vehicle at the intersection is updated and the updated vehicle information stored, the updated waiting time is examined, and the intersection environment state is updated according to the different cases.
S44: polymerization of the states.
After the lane state updating algorithm has run, the state information of the lanes at each intersection must be aggregated to provide the basis for the subsequent overall information exchange among the agents. This embodiment introduces an attention mechanism to integrate the important state information in the region, screening out the state information of those agents whose interaction importantly influences the policy selection of the current agent. For the attention coefficient α_ij between any two nodes i and j, the product
key_i · key_j
is taken as the similarity between the nodes, and the result is normalized with a softmax function to obtain the weight coefficient s = softmax(key_i · key_j). The weight coefficients are then multiplied with the value of each node and summed to obtain the attention-aggregated state:
h_i = Σ_j α_ij · value_j, with α_ij = softmax_j(key_i · key_j).
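A minimal sketch of this aggregation step, using scalar keys and values for brevity (real agents would use learned key and value vectors):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of similarity scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(key_i, keys, values):
    """Dot-product similarity between node i's key and every neighbour key,
    softmax-normalised into weights alpha_ij, then a weighted sum of the
    neighbour values."""
    weights = softmax([key_i * k_j for k_j in keys])
    return sum(w * v for w, v in zip(weights, values))
```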
S45: and (5) generating a strategy.
After the corresponding states are observed and aggregated, the state set is mapped to actions by a policy function, i.e. the mapping Π: s → a. This embodiment adopts a centralized-training, decentralized-execution framework: the agent at each intersection maintains the Critic network of the NAC algorithm over its local environment, and the Critic network forms joint state-action pairs from the observations and actions of all agents in the region for algorithm learning. The basic state computation of an agent revolves around each lane, and the lane states are first aggregated into the intersection state. The Critic network considers only the action information of each agent within its own basic environment, but the state information integrates global regional information through the spatial aggregation module. The Critic network of each agent therefore fits the global value function of the whole regional road network, and each agent adjusts its local policy according to the policies of the other agents to reach the globally optimal policy.
The basic idea of this embodiment is to use a deterministic policy π_i to select the action of each agent i at the current moment, formally
a_{i,t} = π_i(s_{i,t}; θ_i^π)
where s_{i,t} is the aggregated vector of local environment states produced by the lane-state aggregation algorithm of step S44, and θ_i^π are the parameters of the Actor network of the NAC algorithm that generates the policy. In the Critic network, the cooperative communication among the agents is considered, so the actions of the other agents must be taken into account when the network is updated. The formula for the policy gradient can accordingly be transformed into
∇_{θ_i} J(π_i) = E[∇_{θ_i} π_i(s_i) · ∇_{a_i} Q_i(s, a_1, …, a_n) |_{a_i = π_i(s_i)}]
in which the update of Q_i(s, a_1, …, a_n) is incorporated to embody the multi-agent cooperative communication process.
S46: and (5) training.
The training process of each agent i involves its Actor network π_i, its Critic network Q_i, and the correspondingly initialized target networks π′_i and Q′_i.
Each agent updates the parameters of its Actor network by maximizing the future accumulated reward of the corresponding objective function
J(θ_i) = E[Σ_t γ^t · r_{i,t}]
whose policy gradient is
∇_{θ_i} J = E[∇_{θ_i} π_i(s_i) · ∇_{a_i} Q_i(s, a_1, …, a_n) |_{a_i = π_i(s_i)}].
The Critic network supervises the Actor network: after the Actor network selects an action under its policy, the cycle, phase order and phase-duration factors of the next signal cycle are output and evaluated by the Critic network, so that the policy of the Actor network is continuously coordinated and optimized. According to the Bellman equation
Q(s_t, a_t) = Q(s_t, a_t) + α[r_{t+1} + γ · max_a Q(s_{t+1}, a) − Q(s_t, a_t)],
the parameters of the Critic network are updated by minimizing the loss function L(θ):
L(θ_i) = E[(r_i + γ · Q′_i(s′, a′) − Q_i(s, a))²].
In addition, the parameters of the two target networks are updated with a mixing coefficient τ:
θ′_i^π ← τ · θ_i^π + (1 − τ) · θ′_i^π
θ′_i^Q ← τ · θ_i^Q + (1 − τ) · θ′_i^Q
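This weighted target-network update is standard Polyak averaging and reduces to a one-line helper; a minimal sketch, with the mixing coefficient τ chosen arbitrarily for illustration:

```python
TAU = 0.01  # target-network mixing coefficient (an assumed illustrative value)

def soft_update(target_params, online_params, tau=TAU):
    """Polyak update theta' <- tau * theta + (1 - tau) * theta', applied
    elementwise; the same rule serves the target Actor and target Critic."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```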
further, in step S4, the embodiment combines deep learning and reinforcement learning to solve the problem of dimension disaster in regional traffic signal timing. Because reinforcement learning has the characteristics of strong decision-making capability and poor perception capability. Conversely, depth has the characteristics of poor decision-making capability but strong perception capability. The advantages of the two are combined complementarily, and in the process before the reinforced learning, a deep learning related neural network is used for processing the huge and complicated data specialties. In the early stages of training, more actions need to be explored to expand the many possible outputs of the policy functions. In order to solve the problem of dimension disaster, avoid the possibility of avoiding blind exploration process and bad action generated in the initial exploration stage of reinforcement learning, an RNN (neural network) in deep learning is introduced, so that the strategy can be quickly optimized. The RNN recurrent neural network can better process time series data, the output of each moment is used as the input of the next moment, and the characteristic information of the last time can be fully utilized. Inputting sequence x for time t via Actor network t Sequence h of hidden layer of RNN recurrent neural network t And the action sequence a of the final output layer t =(a 1 ,...,a n ) Is calculated as follows:
h t =f(W i ·x t +W h ·x t-1 )
a t =f(W o ·h t ),
three of which, W i Weight matrix from input layer to hidden layer, W h Is a weight matrix from hidden layer to hidden layer, W o Is a weight matrix from the hidden layer to the output layer.
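The two recurrence equations can be traced with a scalar sketch; taking the activation f as tanh is an assumption of this sketch (the patent does not name f), and the scalar weights stand in for the weight matrices:

```python
import math

def rnn_step(x_t, h_prev, w_i, w_h, w_o):
    """One step of the simple recurrent cell: the previous hidden state is
    fed back in, and tanh plays the role of the activation f."""
    h_t = math.tanh(w_i * x_t + w_h * h_prev)  # hidden state update
    a_t = math.tanh(w_o * h_t)                 # output-layer action value
    return h_t, a_t
```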
Thus, the regional traffic waiting model established in step S3 is loaded into SUMO; agents are set at each intersection of the environment according to the design and implementation of steps S3 and S4, and each agent's Actor network, Critic network, target networks and replay memory are randomly initialized. After initialization, the inner loop is entered: the initial state of the agent is obtained as above and an action is selected according to the current policy, with noise G added to the policy function,
a_{i,t} = π_i(s_{i,t}; θ_i^π) + G
mainly in order to balance, as far as possible, the agent's exploration and exploitation behaviour in reinforcement learning.
In addition, to avoid blind exploration in the huge action space of the multi-agent system, the RNN is introduced to predict flow and prune the action space being optimized. As the agents learn, the degree of exploration is high at first, gradually decreases later, and finally approaches a balance of exploration and exploitation. Each transition (s_t, a_t, r_t, s_{t+1}) is stored; the Critic network parameters are updated by minimizing the loss function, the Actor network is updated by the policy gradient, and the target networks are updated. Finally the regional traffic signal timing output is obtained; the complete method flowchart is shown in fig. 3.
Example two
The embodiment mainly explains the advantages of the regional traffic signal timing method based on reinforcement learning provided by the first embodiment in a simulation experiment mode.
In this embodiment, three regional traffic signal timing methods are selected for comparison with the reinforcement-learning-based regional traffic signal timing method of embodiment one. The comparison methods range from initial fixed-cycle timing, to traditional lane-induction timing, to a DQN-based reinforcement learning algorithm. The specific algorithms are as follows:
1) Fixed-Timing: a timing algorithm with fixed cycle duration and fixed phase order and durations. With a 100-second cycle, the phases east-west straight, east-west left-turn, north-south straight and north-south left-turn last 30, 20, 30 and 20 seconds respectively.
2) Lane-Sensing-Timing: a traditional variable-cycle control method. While vehicles keep passing, the current signal phase retains a certain green time, and the phase changes to red 5 seconds after no further vehicle is sensed.
3) DQN: a deep reinforcement learning algorithm based on a single agent, with one agent governing the strategies of all intersections in the region.
The effect of the algorithms is mainly reflected in relieving regional traffic congestion and improving traffic efficiency; the evaluation indexes used are the average vehicle waiting time and the average queue length. Their specific meanings and calculation formulas are as follows:
A. average waiting time of vehicles in each period of region
This index takes the region as the evaluation target, reflects the analysis of the time characteristics of the entire region, and is a macroscopic evaluation index. The traffic flow differs across time periods, and the algorithm realizes the corresponding timing for the different traffic-flow conditions. The regional traffic flow is divided into a number of periods numbered period_1, period_2, …, period_n, and the average waiting time of every vehicle in the complete region is calculated per period:
avg_wait_{i′} = (1 / vehicle_n_{i′}) · Σ_i Σ_k t_{ik}
where i′ is the period number, i the intersection number, vehicle_n_{i′} the total number of vehicles in the region during period i′, and t_{ik} the waiting time of the vehicle numbered k at the intersection numbered i. The index thus yields n evaluation values, one per period.
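A sketch of the period-level index: sum every vehicle's waiting time over all intersections in the period, then divide by the vehicle count. The dict-of-lists input format (intersection number to per-vehicle waiting times) is an assumption of this sketch:

```python
def avg_wait_per_period(wait_times_by_intersection):
    """Average waiting time over one period: the total waiting time of all
    vehicles at all intersections divided by the number of vehicles."""
    total = sum(sum(ts) for ts in wait_times_by_intersection.values())
    count = sum(len(ts) for ts in wait_times_by_intersection.values())
    return total / count if count else 0.0
```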
B. Average waiting time of vehicles at each phase of intersection
This index takes the intersection phase as the object of study, embodies the analysis of the spatial characteristics of the different directions, and is a microscopic evaluation index. The traffic flow in the four directions of an intersection differs; according to the simulation data the east-west trunk carries more flow and the north-south branches less, so the average waiting time per phase can evaluate how well the timing algorithm coordinates different traffic volumes. For the four phases of intersection i, the average vehicle waiting time of each phase is calculated as
avg_wait_{ij} = (1 / vehicle_n_{ij}) · Σ_k t_{ijk}
where i is the intersection number, j the phase number (j ∈ {1, 2, 3, 4}), vehicle_n_{ij} the number of all vehicles passing through the intersection on the lanes lane_k (k ∈ {0, 1, 2}) served by phase j, and t_{ijk} the waiting time of the k-th such vehicle.
The index thus yields 4 evaluation values for each intersection, one per phase.
C. Average queuing length at intersection
The index takes the area as an evaluation object, and calculates the average queuing length of all vehicles in the whole area at different time periods. The queuing length depends on the number of waiting vehicles and the length of the vehicles, is an evaluation index for visually reflecting the regional traffic jam condition, and has important significance for evaluating a timing algorithm.
For intersection i, there are 12 lanes in the 4 directions, and the average queue-length index must be computed lane by lane and then averaged. The average queue length at the intersection is
avg_queue_i = (1/12) · Σ_{n_lane} Σ_{k=1}^{num_lane_{n_lane}} len_{n_lane,k}
where n_lane is the lane number, num_lane_{n_lane} the number of vehicles queued on lane n_lane, and len_{n_lane,k} the vehicle length of the k-th vehicle on lane n_lane.
The average queue length at each intersection is calculated and averaged again to obtain the average queue length of the complete region during period i′:
avg_queue_{i′} = (1/|I|) · Σ_{i∈I} avg_queue_i.
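Both queue-length averages reduce to simple means; a sketch assuming per-lane lists of queued-vehicle lengths (the input format is this sketch's assumption):

```python
def avg_queue_length(queues_by_lane):
    """Average queue length at one intersection: the queued-vehicle lengths
    are summed per lane, then averaged over the lanes."""
    lanes = list(queues_by_lane.values())
    return sum(sum(vehicle_lengths) for vehicle_lengths in lanes) / len(lanes)

def area_avg_queue(per_intersection_averages):
    """Region-level index: the mean of the per-intersection averages."""
    return sum(per_intersection_averages) / len(per_intersection_averages)
```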
in the experimental environment simulation, a 3 × 3 intersection area of 9 and a 4 × 4 intersection area of 16 are selected respectively, and a day-to-day random traffic flow traffic environment and a sudden congestion traffic environment are simulated respectively. Mainly through a series of technical points in the step S3, the method can complete the improvement of various indexes, and has more obvious optimization compared with other methods.
In the 3 × 3 region of 9 intersections, the intersections are distributed from north to south along three east-west trunk roads and connected by north-south branches, forming a 3 × 3 intersection matrix
A1 A2 A3
B1 B2 B3
C1 C2 C3
A schematic diagram of the 3 × 3 intersection area is shown in fig. 4.
In the simulation area, detectors are arranged at the intersections to count the relevant traffic-flow information, with a detection range within 100 meters. The average distance between adjacent intersections is set to 280 meters; each road is a two-way road with three lanes per direction, configured as a left-turn lane, a straight lane and a right-turn lane, and the speed limit for vehicles in the region is 50 km/h. Vehicles enter the region through the entrance lane of any intersection other than B1 and leave through the exit lane of any other such intersection, and the vehicle trajectories simulate the random traffic conditions of holidays in real life. When the vehicle trajectory data are generated, the traffic flow at the different intersections and lanes is controlled through the probability of trajectories passing them; the east-west trunk carries more simulated trajectories and the north-south direction fewer. A statistical chart of the traffic flow of the simulated area in each period is shown in fig. 5. Since new vehicles keep being added, the traffic flow rises to a peak and then decreases. Ten periods were simulated, each lasting 20 minutes, 200 minutes in total, and the traffic flow of each period is compressed to simulate the random traffic-flow conditions of each period of a holiday.
The experimental results show that, in the simulated 3 × 3 intersection-area scenario, from the perspective of the whole area the method of this embodiment outperforms the comparison methods in every period. At the beginning, few vehicles pass through the area and the two indexes of the other algorithms are close to those of the proposed algorithm, yet the regional average waiting time and average queue length in the first two periods are still the lowest of all algorithms. From the third period on, the indexes of the traditional fixed-timing and lane-induction algorithms rise rapidly, showing that they cannot cope well with heavy traffic. The proposed algorithm and the reinforcement-learning DQN algorithm control the signal-timing scheme more accurately and reasonably as the flow grows. When the flow increases further, however, the single-agent DQN algorithm falls slightly short on average waiting time, and in the fifth and eighth periods it is even worse than the lane-induction algorithm; determining the policy function only from the global environment, without cooperatively optimized agents at each intersection, can lead to an insufficient understanding of the traffic environment. For the average queue length, the three algorithms other than fixed timing differ little, but the method of the first embodiment remains the best.
The lane-induction algorithm performs well on this index: it serves the lane with the longest queue first, and it can be regarded as a partial, simplified version of the method of the first embodiment. Together, the two area-level indexes demonstrate the effectiveness of the proposed algorithm. On the other hand, the average waiting time of the four phases at every intersection in the area is counted. This index shifts the focus of the evaluation from the macroscopic area to the microscopic phase. It reflects the heavier flow on the east-west arterials and the lighter flow on the north-south branches, yet the algorithm coordinates and optimizes the average waiting time of the heavier phases as much as possible and narrows the gap between the east-west and north-south phases, which embodies the per-intersection agent policy optimization and the phase-coordination design idea.
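The lane-induction baseline described above — serving the phase whose lane currently has the longest queue — can be sketched in a few lines. The function and the phase names below are illustrative, not from the patent.

```python
def induced_phase(queues):
    """Lane-induction baseline: return the phase serving the lane
    with the longest current queue.

    queues: dict mapping phase id -> list of queue lengths (vehicles)
            for the lanes that phase releases.
    """
    return max(queues, key=lambda p: max(queues[p]))

# Example: the east-west through phase has the longest queue (12 vehicles),
# so it is served next.
q = {
    "EW-straight": [12, 7],
    "EW-left": [3],
    "NS-straight": [5, 4],
    "NS-left": [2],
}
next_phase = induced_phase(q)   # -> "EW-straight"
```

This greedy rule reacts to the current queues only; unlike the patent's method, it performs no inter-intersection coordination, which is why the description calls it a partial simplification.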
In the 4 × 4 area of 16 intersections, the intersections are distributed on four east-west arterial roads arranged from north to south, the north-south intersections are connected by branch roads, and the 16 intersections form a larger 4 × 4 intersection matrix. A schematic diagram of the 4 × 4 intersection area is shown in fig. 6.
In this area the average distance between adjacent intersections is enlarged to 300 meters, so its extent and lane layout are considerably larger than those of the 3 × 3 intersection area. The 4 × 4 area simulates an emergency in real life, namely a sudden influx of peak traffic: 1500 vehicles are injected into the simulation at once to verify the coping and recovery ability of the regional traffic-signal timing algorithm. The simulation comprises 36 periods of 10 minutes each.
The experimental results show that at the beginning of the traffic burst the regional average waiting time of every algorithm is long, reflecting the congested state of the area under the emergency. After four periods, the coordination effect of the method of the first embodiment on signal timing is the first to appear, so its regional average waiting time begins to fall while those of the other three algorithms remain high. The comparison shows that the traditional fixed-timing and lane-induction algorithms cannot cope well with sudden congestion: queues easily grow too long in the early stage and long phase durations are then needed to discharge them, and in the experiment the indexes of these two algorithms only fall back to normal values after the 20th period. The average vehicle queue length lags behind the waiting time and shows peaks when many vehicles queue in the later periods. The two reinforcement-learning algorithms cope with the sudden congestion better and give the regional traffic system stronger recovery ability; among them the method of the first embodiment performs best, responding fastest, coordinating the signal-timing scheme and effectively relieving the congestion. Compared with the other three algorithms, it improves the regional average-waiting-time index by 49.6%, 33.4% and 5.9% respectively, and the average intersection queue-length index by 51.6%, 29.2% and 19.5% respectively.
On the other hand, analysis of the per-intersection phase waiting times shows that, as the number of intersections grows in the 4 × 4 area, the algorithm still coordinates the phases of every intersection uniformly, and no phase shows an obviously abnormal waiting time. Thus, even when the area scale and the number of agents increase, the agents can still communicate and coordinate, select the optimal policy, and exercise coordinated control over the phase order and duration at different intersections.
Finally, the RNN recurrent neural network adopted in step S4 of the embodiment alleviates the curse of dimensionality caused by the growth of the area scale. When the area grows from 3 × 3 to 4 × 4, the number of signal lights in the area increases, the action space of the coordination policy expands greatly, and the spatio-temporal signal-timing features contained in the vehicle trajectory data multiply, so an algorithm may fail to converge or lose predictive power; an algorithm that works in a small area therefore does not necessarily work in a large one. The method of this embodiment avoids the blind exploration process and the possibly poor actions of the initial exploration stage of reinforcement learning, and the introduced RNN recurrent neural network mitigates the curse of dimensionality to a certain extent. Judging from the convergence behavior, the training process in the 4 × 4 area is iterated 100 times; the reward grows with the iterations, stabilizes at its maximum after about 80 iterations, and the reinforcement-learning algorithm converges. For the waiting-time evaluation index, the iteration curve of the total waiting time is given: it rises with the explosion of vehicle data after the burst and then converges to a small value after training; no dimensionality disaster occurs, which verifies the effectiveness of the algorithm.
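The temporal feedback the embodiment relies on — the features output for the current signal period feeding the next period's input, as claim 7 also states — is the defining property of a vanilla RNN cell. A minimal NumPy sketch, with all dimensions and weight names assumed for illustration:

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Vanilla RNN over a sequence of per-period timing features.

    The hidden state carried from the previous signal period is fed back
    at every step, so the output at period t depends on the whole history
    of periods -- the feedback loop described in the embodiment.
    """
    h = np.zeros(Whh.shape[0])
    outs = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # recurrent state update
        outs.append(Why @ h + by)             # per-period timing output
    return np.stack(outs)

rng = np.random.default_rng(0)
d_in, d_h, d_out = 6, 16, 4          # e.g. 6 timing features -> 4 phase durations
params = (rng.normal(size=(d_h, d_in)) * 0.1,   # Wxh
          rng.normal(size=(d_h, d_h)) * 0.1,    # Whh
          rng.normal(size=(d_out, d_h)) * 0.1,  # Why
          np.zeros(d_h), np.zeros(d_out))       # bh, by
seq = rng.normal(size=(10, d_in))     # ten simulated signal periods
y = rnn_forward(seq, *params)
```

Because the per-step state is a fixed-size vector regardless of how many intersections contribute features, this recurrence is one plausible reading of how the RNN keeps the representation compact as the area scale grows.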
EXAMPLE III
The present invention also provides a reinforcement learning-based regional traffic signal timing system, as shown in fig. 7, the system includes:
and the data acquisition unit 700 is used for acquiring regional traffic environment data and vehicle track data.
And the memory 701 is connected with the data acquisition unit 700 and is used for storing the regional traffic environment data, the vehicle trajectory data and software control instructions. The software control instructions are used for implementing the reinforcement-learning-based regional traffic signal timing method provided above. The memory may be a computer-readable storage medium.
And the processor 702 is connected with the memory 701 and is used for calling and executing the software control instruction to obtain a regional traffic signal timing scheme.
The processor 702 employed in the present invention includes:
and the element extraction module is used for performing data extraction on the regional traffic environment data and the vehicle track data to obtain intersection and signal lamp constituent elements.
And the timing task determining module is used for determining the regional traffic signal timing task based on intersection and signal lamp constituent elements.
And the waiting model building module is used for building a regional traffic waiting model based on the regional traffic signal timing task.
And the timing scheme generation module is used for learning and training the regional traffic waiting model by adopting a multi-agent reinforcement learning algorithm, completing the coordinated optimization process of each agent's policy in the region in combination with the optimized policy-based NAC algorithm, and introducing an RNN recurrent neural network to process the optimized coordination result generated in the coordinated optimization process to obtain a regional traffic signal timing scheme.
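The Actor-Critic interaction this module builds on can be sketched compactly. The sketch below is a plain advantage actor-critic step with linear function approximation; the natural-gradient correction that distinguishes NAC is omitted for brevity, and every name, dimension and learning rate is an assumption, not the patent's implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(theta, w, phi_s, phi_s2, a, r, gamma=0.99,
                      alpha_actor=0.01, alpha_critic=0.05):
    """One plain advantage actor-critic update (NAC's natural-gradient
    refinement omitted).

    theta: actor weights, shape (n_actions, n_features)
    w:     critic weights (linear value function), shape (n_features,)
    phi_s, phi_s2: feature vectors of the current / next traffic state
    a, r:  timing action taken and reward observed (e.g. negative waiting time)
    """
    # Critic: TD error serves as the advantage estimate; w is updated
    # in the direction that reduces the TD loss.
    delta = r + gamma * (w @ phi_s2) - (w @ phi_s)
    w = w + alpha_critic * delta * phi_s
    # Actor: policy-gradient step on log-softmax, weighted by the advantage.
    probs = softmax(theta @ phi_s)
    grad_log = -np.outer(probs, phi_s)
    grad_log[a] += phi_s
    theta = theta + alpha_actor * delta * grad_log
    return theta, w, delta

n_actions, n_feat = 4, 8              # e.g. 4 timing schemes, 8 state features
rng = np.random.default_rng(1)
theta = np.zeros((n_actions, n_feat))
w = np.zeros(n_feat)
phi_s, phi_s2 = rng.normal(size=n_feat), rng.normal(size=n_feat)
theta, w, delta = actor_critic_step(theta, w, phi_s, phi_s2, a=2, r=-1.0)
```

In the patent's scheme each intersection agent would run such an update on its own policy while the Critic's evaluation of the next signal period's output drives the coordination; the sketch shows only the single-agent core of that loop.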
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A regional traffic signal timing method based on reinforcement learning is characterized by comprising the following steps:
extracting data of regional traffic environment data and vehicle track data to obtain intersection and signal lamp constituent elements;
determining a regional traffic signal timing task based on the intersection and signal lamp constituent elements;
constructing a regional traffic waiting model based on the regional traffic signal timing task;
and learning and training the regional traffic waiting model by adopting a multi-agent reinforcement learning algorithm, completing the coordinated optimization process of each agent policy in the region in combination with an optimized policy-based NAC algorithm, and introducing an RNN recurrent neural network to process the optimized coordination result generated in the coordinated optimization process to obtain a regional traffic signal timing scheme.
2. The reinforcement learning-based regional traffic signal timing method according to claim 1, wherein the construction of the regional traffic waiting model based on the regional traffic signal timing task specifically comprises:
constructing a regional traffic environment simulation model;
extracting vehicle track data characteristics; the vehicle track data characteristics comprise the period of the signal lamp, the green signal ratio of the signal lamp and the phase difference of the signal lamp;
and adding the vehicle track data characteristics into the regional traffic environment simulation model by adopting a Python programming algorithm to generate the regional traffic waiting model.
3. The reinforcement learning-based regional traffic signal timing method according to claim 2, wherein the regional traffic waiting model comprises:
the road network environment configuration module is used for building each traffic element in the area to generate a basic environment model;
the data acquisition module is used for collecting vehicle running data in the area and extracting spatio-temporal vehicle-trajectory features;
the signal lamp configuration module is used for carrying out basic configuration on the signal lamp of each intersection in the area; the contents of the basic configuration include: the lamp color, the cycle duration and the phase sequence of the signal lamp;
and the evaluation index output module is used for determining evaluation index data of the regional traffic environment.
4. The reinforcement learning-based regional traffic signal timing method according to claim 1, wherein the learning and training of the regional traffic waiting model by using a multi-agent reinforcement learning algorithm, completing the coordinated optimization process of each agent policy in the region in combination with the optimized policy-based NAC algorithm, and introducing an RNN recurrent neural network to process the optimized coordination result generated in the coordinated optimization process to obtain a regional traffic signal timing scheme specifically comprises:
designing basic elements of the multi-agent; the base element includes: status, actions and rewards;
setting a multi-agent collaborative optimization model based on the basic elements of the multi-agent;
updating the state information of each intersection lane in the area by adopting the multi-agent collaborative optimization model;
aggregating the updated state information of each intersection lane to obtain aggregated state information;
screening the aggregated state information by adopting an attention mechanism to obtain screened state information;
generating a strategy function according to the screening state information by adopting an optimized NAC algorithm based on the strategy; the strategy function is the mapping from a state set to an action set; the NAC algorithm includes: an Actor network and a Critic network;
and inputting the strategy function into the RNN recurrent neural network to obtain the regional traffic signal timing scheme.
5. The reinforcement learning-based regional traffic signal timing method according to claim 4, wherein the generating a policy function according to the screened state information by using the optimized policy-based NAC algorithm specifically comprises:
each agent updates parameters of the Actor network by maximizing the reward accumulated in the future according to a corresponding objective function;
after updating its parameters, when the Actor network selects each policy action and outputs the data of the next signal period, the Critic network evaluates that output, so that the policy function output by the Actor network is continuously coordinated and optimized; the next-signal-period data comprise: the period, the phase order and the phase-duration factor.
6. The reinforcement learning-based regional traffic signal timing method of claim 5, wherein the parameters of the Critic network are updated by minimizing a loss function.
7. The reinforcement learning-based regional traffic signal timing method according to claim 4, wherein the features output at the current moment in the RNN recurrent neural network are used as input features at the next moment.
8. A reinforcement learning-based regional traffic signal timing system, comprising:
the data acquisition unit is used for acquiring regional traffic environment data and vehicle track data;
the memory is connected with the data acquisition unit and used for storing the regional traffic environment data, the vehicle track data and the software control instruction; the software control instructions are used for implementing the reinforcement learning-based regional traffic signal timing method according to any one of claims 1-7;
and the processor is connected with the memory and used for calling and executing the software control instruction to obtain a regional traffic signal timing scheme.
9. The reinforcement learning-based regional traffic signal timing system of claim 8, wherein the processor comprises:
the element extraction module is used for extracting data of the regional traffic environment data and the vehicle track data to obtain intersection and signal lamp constituent elements;
the timing task determining module is used for determining a regional traffic signal timing task based on the intersection and signal lamp constituent elements;
the waiting model building module is used for building a regional traffic waiting model based on the regional traffic signal timing task;
and the timing scheme generation module is used for learning and training the regional traffic waiting model by adopting a multi-agent reinforcement learning algorithm, completing the coordinated optimization process of each agent policy in the region in combination with the optimized policy-based NAC algorithm, and introducing an RNN recurrent neural network to process the optimized coordination result generated in the coordinated optimization process to obtain a regional traffic signal timing scheme.
10. The reinforcement learning-based regional traffic signal timing system of claim 8, wherein the memory is a computer-readable storage medium.
CN202211438816.3A 2022-11-17 2022-11-17 Regional traffic signal timing method and system based on reinforcement learning Pending CN115731724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211438816.3A CN115731724A (en) 2022-11-17 2022-11-17 Regional traffic signal timing method and system based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211438816.3A CN115731724A (en) 2022-11-17 2022-11-17 Regional traffic signal timing method and system based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN115731724A true CN115731724A (en) 2023-03-03

Family

ID=85296195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211438816.3A Pending CN115731724A (en) 2022-11-17 2022-11-17 Regional traffic signal timing method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115731724A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612636A (en) * 2023-05-22 2023-08-18 暨南大学 Signal lamp cooperative control method based on multi-agent reinforcement learning and multi-mode signal sensing
CN116612636B (en) * 2023-05-22 2024-01-23 暨南大学 Signal lamp cooperative control method based on multi-agent reinforcement learning
CN116597672A (en) * 2023-06-14 2023-08-15 南京云创大数据科技股份有限公司 Regional signal lamp control method based on multi-agent near-end strategy optimization algorithm
CN116597672B (en) * 2023-06-14 2024-02-13 南京云创大数据科技股份有限公司 Regional signal lamp control method based on multi-agent near-end strategy optimization algorithm

Similar Documents

Publication Publication Date Title
CN109492814B (en) Urban traffic flow prediction method, system and electronic equipment
CN112216124B (en) Traffic signal control method based on deep reinforcement learning
CN115731724A (en) Regional traffic signal timing method and system based on reinforcement learning
CN110750096B (en) Mobile robot collision avoidance planning method based on deep reinforcement learning in static environment
CN113538910B (en) Self-adaptive full-chain urban area network signal control optimization method
CN110262511A (en) Biped robot&#39;s adaptivity ambulation control method based on deeply study
CN109215355A (en) A kind of single-point intersection signal timing optimization method based on deeply study
WO2019071909A1 (en) Automatic driving system and method based on relative-entropy deep inverse reinforcement learning
CN110570672B (en) Regional traffic signal lamp control method based on graph neural network
CN109726676B (en) Planning method for automatic driving system
Coşkun et al. Deep reinforcement learning for traffic light optimization
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN115376103A (en) Pedestrian trajectory prediction method based on space-time diagram attention network
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
Tan et al. Multi-agent bootstrapped deep q-network for large-scale traffic signal control
CN114572229A (en) Vehicle speed prediction method, device, medium and equipment based on graph neural network
Chong et al. A novel neuro-cognitive approach to modeling traffic control and flow based on fuzzy neural techniques
Mo et al. Attentive differential convolutional neural networks for crowd flow prediction
Jiang et al. A general scenario-agnostic reinforcement learning for traffic signal control
CN115273502B (en) Traffic signal cooperative control method
Miao et al. A queue hybrid neural network with weather weighted factor for traffic flow prediction
CN115512558A (en) Traffic light signal control method based on multi-agent reinforcement learning
CN113393667B (en) Traffic control method based on Categorical-DQN optimistic exploration
Wang et al. Hybrid neural network modeling for multiple intersections along signalized arterials-current situation and some new results
Appl et al. Fuzzy model-based reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination