CN118053311A - Traffic signal control method and device based on multi-agent reinforcement learning model


Info

Publication number
CN118053311A
Authority
CN
China
Prior art keywords
reinforcement learning
target
traffic
data
traffic signal
Prior art date
Legal status
Pending
Application number
CN202410453998.4A
Other languages
Chinese (zh)
Inventor
杨飚
卢兴盛
Current Assignee
Lianyi Yunke Beijing Technology Co ltd
Original Assignee
Lianyi Yunke Beijing Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Lianyi Yunke Beijing Technology Co ltd


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The invention discloses a traffic signal control method and device based on a multi-agent reinforcement learning model. The method comprises the following steps: acquiring traffic state data for each of a plurality of intersections in a target area, where the traffic state data of each intersection comprises position data and speed data of vehicles on the roads directly connected to that intersection; and inputting the traffic state data of each intersection into a plurality of pre-trained reinforcement learning models, which output a plurality of traffic signal data, where the traffic signal data are used to control the signal lamps of the respective intersections and each reinforcement learning model selects its traffic signal data from its action set based on a Nash equilibrium strategy. The invention solves the technical problem that traffic signal control is poor when a plurality of reinforcement learning models control the signals of a plurality of intersections because each reinforcement learning model makes its decisions independently.

Description

Traffic signal control method and device based on multi-agent reinforcement learning model
Technical Field
The invention relates to the field of traffic signal control, in particular to a traffic signal control method and device based on a multi-agent reinforcement learning model.
Background
Traffic signal timing optimization is a principal method for guiding vehicles through an intersection. By improving the timing of the traffic signal lamps, an intersection can achieve higher traffic throughput, effectively reduce congestion, cut vehicle exhaust emissions and improve travel.
With the development of artificial intelligence technology, its application to the design of adaptive traffic signal control systems has become increasingly widespread. However, when a plurality of reinforcement learning models are used to control the signals of a plurality of intersections separately, each reinforcement learning model makes its decisions independently and cannot control the signals in a coordinated, comprehensive way.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a traffic signal control method and a traffic signal control device based on multi-agent reinforcement learning models, which at least solve the technical problem of poor traffic signal control effect caused by independent decision making of each reinforcement learning model when signals of a plurality of intersections are controlled by adopting a plurality of reinforcement learning models.
According to an aspect of the embodiments of the present invention, there is provided a traffic signal control method based on a multi-agent reinforcement learning model, including: acquiring traffic state data of each of a plurality of intersections in a target area, wherein the traffic state data of each intersection comprises position data and speed data of vehicles on the roads directly connected to that intersection; and inputting the traffic state data of each intersection into a plurality of pre-trained reinforcement learning models, the plurality of reinforcement learning models outputting a plurality of traffic signal data, wherein the plurality of traffic signal data are used to control the signal lamps of the plurality of intersections, the reinforcement learning models correspond one-to-one to the intersections, the reinforcement learning models select the traffic signal data from a plurality of action sets based on a Nash equilibrium strategy, the action sets are the respective action sets of the reinforcement learning models, and each action set contains the selectable traffic signal data.
Optionally, the traffic state data of each of the plurality of intersections is respectively input into a plurality of reinforcement learning models trained in advance, and the plurality of reinforcement learning models respectively output a plurality of traffic signal data for controlling the plurality of intersections, including: inputting target traffic state data of a target intersection into a target reinforcement learning model, and determining respective game gain function values of traffic signal data which can be selected in a target action set by the target reinforcement learning model based on a preset game gain function and the target traffic state data, wherein the target action set is traffic signals which can be selected by the target reinforcement learning model; the target reinforcement learning model obtains a target Nash equilibrium solution corresponding to a target intersection according to a game income function value based on Nash equilibrium strategy, wherein the target Nash equilibrium solution is probability distribution of selecting data in a target action set; selecting target traffic signal data in a target action set according to a target Nash equilibrium solution by a target reinforcement learning model, wherein the target traffic signal data is used for controlling signal lamps of a target intersection; by adopting a mode of obtaining target traffic signal data, the traffic state data of each of a plurality of intersections is respectively input into a plurality of reinforcement learning models trained in advance, and the plurality of reinforcement learning models respectively output a plurality of traffic signal data for controlling the plurality of intersections.
Optionally, the target reinforcement learning model obtains a target Nash equilibrium solution corresponding to the target intersection according to the game benefit function value based on Nash equilibrium strategy, and the method comprises the following steps: acquiring adjacent traffic state data of an adjacent intersection of a target intersection and experience distribution of an adjacent reinforcement learning model corresponding to the adjacent intersection, wherein the experience distribution of the adjacent reinforcement learning model is probability distribution of selecting data in an adjacent action set by the adjacent reinforcement learning model under the condition that input data are the adjacent traffic state data; the target reinforcement learning model predicts adjacent Nash equilibrium solutions corresponding to adjacent intersections obtained by the adjacent reinforcement learning model based on Nash equilibrium strategies according to experience distribution of the adjacent reinforcement learning models; the target reinforcement learning model obtains a target Nash equilibrium solution according to the adjacent Nash equilibrium solution and the game income function value.
Optionally, the target reinforcement learning model selects the target traffic signal data from the target action set according to the target Nash equilibrium solution, including: the target reinforcement learning model randomly selects the target traffic signal data from the target action set based on the probability distribution characterized by the target Nash equilibrium solution.
Optionally, the target reinforcement learning model among the plurality of reinforcement learning models is obtained by training in the following manner: acquiring a plurality of sample state data; inputting the plurality of sample state data into a plurality of original reinforcement learning models respectively, the plurality of original reinforcement learning models selecting, based on a Nash equilibrium strategy, a plurality of sample traffic signal data from a plurality of preset action sets and outputting them, wherein the plurality of sample traffic signal data correspond one-to-one to the plurality of sample state data, and the plurality of original reinforcement learning models are the original models of the plurality of reinforcement learning models; while a plurality of sample intersections are controlled to be respectively in the states described by the plurality of sample state data, controlling the signal lamps of the sample intersections with the sample traffic signals corresponding to the sample state data, and taking the traffic state data of the sample intersections after control as a plurality of feedback traffic state data, wherein the feedback traffic state data correspond one-to-one to the sample state data; and training the plurality of original reinforcement learning models according to the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data and a predetermined reward function.
Optionally, training the plurality of original reinforcement learning models according to the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data, and the predetermined reward function, including: determining a plurality of rewards corresponding to the sample state data one by one according to the sample state data, the feedback traffic state data and the rewards function; determining a plurality of learning rates according to a plurality of Nash equilibrium solutions obtained before a plurality of original reinforcement learning models output a plurality of sample traffic signal data, wherein the plurality of learning rates respectively represent the magnitude of changing the parameter of the reinforcement learning model during training; training a plurality of original reinforcement learning models according to the plurality of reward values and the plurality of learning rates.
Optionally, the method further comprises: acquiring position data and speed data of an emergency vehicle in a target intersection and the length of an emergency lane of the target intersection, wherein the length of the emergency lane is determined by the number of lanes of the target intersection and the traffic flow density and the flow rate of the target intersection in the parking, semi-saturated flow and saturated flow states; and under the condition that the emergency vehicle reaches the starting position of the emergency lane, controlling the traffic light of the target intersection to be a green light.
According to another aspect of the embodiments of the present invention, there is also provided a traffic signal control device based on a multi-agent reinforcement learning model, including: the system comprises an acquisition module, a control module and a control module, wherein the acquisition module is used for acquiring traffic state data of each of a plurality of intersections in a target area, wherein the traffic state data of each of the plurality of intersections comprises position data and speed data of vehicles in roads directly connected with the corresponding intersections; the control module is used for respectively inputting traffic state data of each of a plurality of intersections into a plurality of reinforcement learning models trained in advance, respectively outputting a plurality of traffic signal data by the plurality of reinforcement learning models, wherein the plurality of traffic signal data are respectively used for controlling signal lamps of the plurality of intersections, the plurality of reinforcement learning models are in one-to-one correspondence with the plurality of intersections, the plurality of reinforcement learning models respectively select the plurality of traffic signal data from a plurality of action sets based on Nash equilibrium strategies, the plurality of action sets are respectively action sets of the reinforcement learning models, and the plurality of action sets respectively comprise selectable traffic signal data.
According to still another aspect of the embodiments of the present invention, there is further provided a non-volatile storage medium, where the non-volatile storage medium includes a stored program, and when the program runs, the device where the non-volatile storage medium is located is controlled to execute any one of the above traffic signal control methods based on the multi-agent reinforcement learning model.
According to still another aspect of the embodiments of the present invention, there is further provided a computer device, including a processor, configured to run a program, where the program executes any one of the traffic signal control methods based on the multi-agent reinforcement learning model.
In the embodiments of the invention, the traffic state data of each of a plurality of intersections in a target area is obtained, where the traffic state data of each intersection includes position data and speed data of vehicles on the roads directly connected to that intersection. The traffic state data of each intersection is then input into a plurality of pre-trained reinforcement learning models, which output a plurality of traffic signal data used to control the signal lamps of the intersections; the reinforcement learning models correspond one-to-one to the intersections, and each model selects its traffic signal data from its own action set of selectable traffic signal data based on a Nash equilibrium strategy. In this way, the purpose of having the reinforcement learning models perform signal control based on a Nash equilibrium strategy is achieved, the connection between the reinforcement learning models is strengthened, each model takes the other models into account when controlling its signals, and the effect of traffic signal control is improved, thereby solving the technical problem of poor traffic signal control caused by each reinforcement learning model making independent decisions when a plurality of reinforcement learning models control the signals of a plurality of intersections.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 shows a hardware block diagram of a computer terminal for implementing a traffic signal control method based on a multi-agent reinforcement learning model;
FIG. 2 is a flow chart of a traffic signal control method based on a multi-agent reinforcement learning model according to an embodiment of the present invention;
FIG. 3 is a schematic illustration of an intersection provided in accordance with an alternative embodiment of the present invention;
FIG. 4 is a schematic diagram of signal phases provided in accordance with an alternative embodiment of the present invention;
FIG. 5 is a schematic illustration of the variation of signal lights at different actions provided in accordance with an alternative embodiment of the present invention;
FIG. 6 is a schematic illustration of a plurality of intersections included in a target area provided in accordance with an alternative embodiment of the present invention;
FIG. 7 is a schematic illustration of emergency vehicle traffic provided in accordance with an alternative embodiment of the present invention;
fig. 8 is a block diagram of a traffic signal control device based on a multi-agent reinforcement learning model according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a method embodiment of traffic signal control based on a multi-agent reinforcement learning model, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical sequence is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in a different order than that illustrated herein.
The method according to the first embodiment of the present application may be implemented in a mobile terminal, a computer terminal or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal for implementing a traffic signal control method based on a multi-agent reinforcement learning model. As shown in fig. 1, the computer terminal 10 may include one or more (shown as 102a, 102b, … …,102 n) processors (which may include, but are not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data. In addition, the method may further include: a display, an input/output interface (I/O interface), a Universal Serial BUS (USB) port (which may be included as one of the ports of the BUS), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors and/or other data processing circuits described above may be referred to herein generally as "data processing circuits". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination. Furthermore, the data processing circuitry may be a single stand-alone processing module or incorporated, in whole or in part, into any of the other elements in the computer terminal 10. As referred to in embodiments of the application, the data processing circuit acts as a processor control (e.g., selection of the path of the variable resistor termination connected to the interface).
The memory 104 may be used to store software programs and modules of application software, such as a program instruction/data storage device corresponding to a traffic signal control method based on a multi-agent reinforcement learning model in the embodiment of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory 104, that is, implements the traffic signal control method based on the multi-agent reinforcement learning model of the application program. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10.
Fig. 2 is a schematic flow chart of a traffic signal control method based on a multi-agent reinforcement learning model according to an embodiment of the present invention, as shown in fig. 2, the method includes the following steps:
Step S202, acquiring respective traffic state data of a plurality of intersections in a target area, wherein the respective traffic state data of the plurality of intersections comprises position data and speed data of vehicles in roads directly connected with the corresponding intersections.
In this step, the target area may be a traffic network formed by a plurality of interconnected traffic intersections; the target area may include a plurality of intersections, each of which may be provided with signal lamps. Any one of the plurality of intersections may be a typical urban road intersection controlled by traffic lights. Pedestrian movement is not considered separately, since in real traffic pedestrians tend to move at the same time as the accompanying vehicle movements. FIG. 3 is a schematic view of an intersection provided according to an alternative embodiment of the present invention. In the four-approach, three-lane intersection shown in FIG. 3, the distance from the start of a lane to the stop line of the intersection is 750 meters. Vehicles can enter the environment from the east, west, south and north, giving 12 entry lanes and 12 exit lanes in total, three per direction (a left-turn lane, a right-turn lane and a straight lane). In an actual traffic environment, the intersection controls the traffic flow in all directions through 8 different traffic lights, where straight-ahead and right-turn traffic is controlled by the same signal light. The set of signal lights is TL = {tl_N, tl_NL, tl_W, tl_WL, tl_S, tl_SL, tl_E, tl_EL}.
As shown in FIG. 3, tl_N, tl_NL, tl_W, tl_WL, tl_S, tl_SL, tl_E and tl_EL are the traffic lights of the corresponding lanes in the drawing. As in the real traffic environment, each traffic light has three possible states (green, yellow and red).
For the four entry approaches of the intersection, the vehicle information in each cell m of lane n is converted into position information and speed information of the vehicles in that cell. The position information is represented by a Boolean value: a cell is marked 1 when it is occupied by a vehicle and 0 otherwise.
The position information of the whole intersection can then be represented by a matrix P whose dimension is the number of entry lanes by the number of cells per lane.
The speed information of each cell is normalized, the normalized speed being expressed as v_norm = v_avg / v_max,
where v_avg is the average speed of the vehicles in the cell and v_max is the maximum speed allowed on the road. The speed information of the intersection can likewise be represented by a matrix V of the same dimension as P.
In summary, the traffic state data of the intersection consists of the position matrix and the speed matrix, and the traffic state data is expressed as the pair s = (P, V).
that is, traffic state data for any one of the plurality of intersections characterizes the current traffic state of the intersection, particularly in relation to the vehicle position and speed on the road to which the intersection is directly connected.
Step S204, inputting the traffic state data of each of the plurality of intersections into a plurality of pre-trained reinforcement learning models, the plurality of reinforcement learning models outputting a plurality of traffic signal data, wherein the plurality of traffic signal data are used to control the signal lamps of the plurality of intersections, the reinforcement learning models correspond one-to-one to the intersections, and the reinforcement learning models select the traffic signal data from a plurality of action sets based on a Nash equilibrium strategy, the action sets being the respective action sets of the reinforcement learning models, each containing the selectable traffic signal data.
In this step, the reinforcement learning models may be a plurality of agents that correspond one-to-one to the plurality of intersections. Each agent can output, according to the current traffic state of its corresponding intersection, the traffic signal data used to control that intersection, where the traffic signal data are the data controlling how the intersection's signal lamps change. FIG. 4 is a schematic diagram of signal phases provided according to an alternative embodiment of the present invention. As shown in FIG. 4, the signal phases provided by the invention change in the order east-west straight (EWA), east-west left turn (EWLA), north-south straight (NSA), north-south left turn (NSLA), and the phase set is defined as phase = {EWA, EWLA, NSA, NSLA}:
EWA: the east-west straight traffic signal turns green, allowing vehicles in the east-west lanes to go straight or turn right; EWLA: the east-west left-turn signal turns green, allowing vehicles in the east-west lanes to turn left; NSA: the north-south straight traffic signal turns green, allowing vehicles in the north-south lanes to go straight or turn right; NSLA: the north-south left-turn signal turns green, allowing vehicles in the north-south lanes to turn left.
In each time period t, the agent judges from the input state matrix whether the current signal phase needs to be switched, and outputs an action a_t selected from the action set A, i.e., a_t ∈ A. If the action indicates a switch, the signal lamp is switched to the next phase in the phase set; otherwise the current signal phase is kept unchanged. Each phase lasts one unit duration, and when two signal phases are switched there is a yellow-light interval of 3 s between them. The action set A therefore contains two actions: keep the current phase, or switch to the next phase.
As can be seen from the above description, because the plurality of intersections may include T-junctions, crossroads and irregularly shaped intersections, the traffic signal data that can be output at each intersection may differ considerably; that is, each of the plurality of intersections has its own action set.
FIG. 5 is a schematic diagram of the change of the signal lights under different actions according to an alternative embodiment of the present invention. As shown in FIG. 5, the signal phase at the current time is a north-south green. If the action selected in time period t is the same as in the previous time period, i.e., the keep action, the current north-south green phase is continued for another unit duration with no yellow-light stage; conversely, if the switch action is selected in time period t+1, the signal phase is switched to the next phase (north-south left turn) and a 3-second yellow light is inserted between the two actions.
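A minimal sketch of the keep/switch action logic with the 3 s yellow interval; the numeric action encoding and the unit green duration are assumptions:

```python
PHASES = ["EWA", "EWLA", "NSA", "NSLA"]   # phase order from the description
KEEP, SWITCH = 0, 1                        # assumed encoding of the two actions

def apply_action(current_phase_idx, action, unit_duration=10, yellow=3):
    """Return (next_phase_idx, schedule) for one decision period.

    The unit green duration and the 0/1 action encoding are assumptions;
    the 3 s yellow interval between phases follows the description.
    """
    if action == KEEP:
        return current_phase_idx, [(PHASES[current_phase_idx], unit_duration)]
    nxt = (current_phase_idx + 1) % len(PHASES)
    return nxt, [("YELLOW", yellow), (PHASES[nxt], unit_duration)]
```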
It should be noted that, the traffic signal is controlled periodically, and the signal control methods mentioned in the present invention are all signal control methods within one period, that is, the intelligent agent is based on the traffic state data of the intersection controlled by the intelligent agent when a certain period starts (when the previous period ends); then determining the state of the signal lamp of the intersection in the period according to the traffic state data, namely traffic signal data; the signal lights of the intersection can then be controlled in accordance with the traffic signal data.
It should be noted that this step mainly describes the behaviour of the agents in actual deployment. Because the agents have already been trained, their learning rate is set infinitely close to 0; that is, the agents at this stage hardly learn from environmental feedback. The traffic state data of an intersection at the end of the period may still be obtained, the rewards for the traffic signal data output by the agents are determined from that end-of-period traffic state data and a predetermined reward function, and the model parameters are adjusted according to the rewards, but the magnitude of the parameter adjustment is kept as small as possible.
In this step, when each agent selects the traffic signal data to be output from its own action space, that is, when each agent makes its decision for the period, the decision is made based on the Nash equilibrium strategy: the agent's own decision is determined by taking the possible decisions of the surrounding agents into consideration. A Nash equilibrium is an equilibrium of a non-cooperative game. In the invention, the plurality of agents can be regarded as the participants of a game and are in a competitive relationship. Under a Nash equilibrium strategy, in a given round of the game each participant regards the strategies of the other participants as already determined and, on that premise, selects its own optimal strategy. In other words, each participant treats the other participants' strategies as fixed and chooses the strategy that gives it the maximum benefit. The characteristic of a Nash equilibrium strategy is that each participant's optimal strategy is selected given the strategies of the other participants; therefore, if the strategy choices of the others are unchanged, no single participant can increase its own payoff by changing its own strategy. This is the core of a Nash equilibrium: no participant can improve its own situation by changing its strategy while the others keep theirs.
Based on the Nash equilibrium strategy, if a relatively correct action (strategy) is selected in the decision process of action selection, the duration time of the trial-and-error process can be effectively shortened.
Through the steps, the purpose that the reinforcement learning models perform signal control based on Nash equilibrium strategies can be achieved, so that the connection of the reinforcement learning models is enhanced, other learning models are considered when each reinforcement learning model performs signal control, the effect of improving traffic signal control is achieved, and the technical problem that the effect of traffic signal control is poor due to the fact that each reinforcement learning model makes independent decisions when the reinforcement learning models are adopted to control signals of a plurality of intersections is solved.
As an alternative embodiment, inputting traffic state data of each of a plurality of intersections into a plurality of reinforcement learning models trained in advance, respectively, outputting a plurality of traffic signal data for controlling the plurality of intersections by the plurality of reinforcement learning models, respectively, includes: inputting target traffic state data of a target intersection into a target reinforcement learning model, and determining respective game gain function values of traffic signal data which can be selected in a target action set by the target reinforcement learning model based on a preset game gain function and the target traffic state data, wherein the target action set is traffic signals which can be selected by the target reinforcement learning model; the target reinforcement learning model obtains a target Nash equilibrium solution corresponding to a target intersection according to a game income function value based on Nash equilibrium strategy, wherein the target Nash equilibrium solution is probability distribution of selecting data in a target action set; selecting target traffic signal data in a target action set according to a target Nash equilibrium solution by a target reinforcement learning model, wherein the target traffic signal data is used for controlling signal lamps of a target intersection; by adopting a mode of obtaining target traffic signal data, the traffic state data of each of a plurality of intersections is respectively input into a plurality of reinforcement learning models trained in advance, and the plurality of reinforcement learning models respectively output a plurality of traffic signal data for controlling the plurality of intersections.
This alternative embodiment provides the process by which a given target agent (target reinforcement learning model) among the multiple agents selects traffic signal data based on the Nash equilibrium strategy when controlling its road. Optionally, a preset game payoff function exists in the target agent. The game payoff function gives the payoff each participant can obtain, according to the participant's type and the action it selects, when it takes part in the game; based on the game payoff function, the participant's optimal strategy, i.e., the strategy in its strategy set (action space) that maximizes its utility, can be selected. In this alternative embodiment, the game payoff function u_i may be:
The formula expresses the following: in the general-sum game, u_i is the game payoff obtained by agent i when it faces environmental state (traffic state data) s_i and performs action (traffic signal data) a_i while its competing agents (the agents adjacent to agent i) perform the actions a_-i, where the subscript -i means "not i". In the formula, agent j is one of the agents directly adjacent to agent i, j ∈ I_-i, where I_-i denotes the set of all agents in the set I other than i. l_ji is the road segment from the signalized intersection associated with agent j to the signalized intersection associated with agent i (a road directly connecting the intersection corresponding to agent i and the intersection corresponding to agent j); x̂_ji is the expected number of vehicles on road segment l_ji when agent i faces its state and takes action a_i while competitor agent j takes action a_-i; x_ji is the current number of vehicles on road segment l_ji; and the weight w_ji, which expresses the degree to which the state of link l_ji needs improvement, can be further defined by the following equation:
The target intelligent agent can firstly determine respective game gain function values of traffic signal data which can be selected in the target action set according to the input target traffic state data and game gain functions, then predict the strategy of the competing intelligent agent based on Nash equilibrium strategy, consider the current decision of the competing intelligent agent (the traffic signal data which are respectively output by the competing intelligent agent), combine the respective game gain function values of the traffic signal data which can be selected in the target action set to obtain a target Nash equilibrium solution, and select traffic control signals in the target action set to output according to the target Nash equilibrium solution.
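To make the structure of the payoff concrete, here is a minimal Python sketch of one plausible reading of u_i; the additive combination of the expected count x̂_ji, the current count x_ji and the weight w_ji is an assumption for illustration, not the exact formula of the patent, and all function and parameter names are hypothetical:

```python
def game_payoff(i, a_i, a_minus_i, neighbours,
                predict_expected_count, current_count, weight):
    """Hedged sketch of a payoff u_i(s_i, a_i, a_-i).

    For every road segment l_ji from an adjacent intersection j to i, the
    payoff rewards joint actions that are expected to reduce the vehicle
    count on that segment, weighted by how much the segment needs
    improvement. The additive form is an assumption for illustration.
    """
    u = 0.0
    for j in neighbours[i]:
        x_hat = predict_expected_count(j, i, a_i, a_minus_i[j])  # expected vehicles on l_ji
        x_now = current_count(j, i)                              # current vehicles on l_ji
        w = weight(j, i)                                         # need-for-improvement weight
        u += w * (x_now - x_hat)
    return u
```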
It should be noted that the nash equalization solution may not be a specific choice, but rather a probability distribution that directs the final choice, and provides the agent with a probability distribution that selects data in the target action set. The meaning of the obtained Nash equilibrium solution to the target agent is that the Nash equilibrium solution reveals the optimal strategy selection of the target agent in the game of the period.
It should be noted that, in a given control period (a given round of the game), the multiple agents may execute the above steps simultaneously, each obtaining its own Nash equilibrium solution and determining the traffic signal data to output in that period according to it. In this process, no agent knows the selections made by the other agents in the current period; instead, it accounts for the decisions the other agents will make in the current period based on their previous selections. That is, in each period the decisions of the multiple agents are produced simultaneously (with no ordering) and independently of one another.
As an optional embodiment, the target reinforcement learning model obtains a target nash equilibrium solution corresponding to the target intersection according to the game benefit function value based on a nash equilibrium strategy, and includes: acquiring adjacent traffic state data of an adjacent intersection of a target intersection and experience distribution of an adjacent reinforcement learning model corresponding to the adjacent intersection, wherein the experience distribution of the adjacent reinforcement learning model is probability distribution of selecting data in an adjacent action set by the adjacent reinforcement learning model under the condition that input data are the adjacent traffic state data; the target reinforcement learning model predicts adjacent Nash equilibrium solutions corresponding to adjacent intersections obtained by the adjacent reinforcement learning model based on Nash equilibrium strategies according to experience distribution of the adjacent reinforcement learning models; the target reinforcement learning model obtains a target Nash equilibrium solution according to the adjacent Nash equilibrium solution and the game income function value.
This alternative embodiment provides the process by which the target agent finds its corresponding target Nash equilibrium solution. Optionally, when signal control is performed in the target area, the multiple intersections are closely related, but if every intersection considered the traffic signal control of all other intersections when making its decision, the computation would be excessive and inefficient. In the traffic network, the intersections with the greatest influence on a given intersection are those directly adjacent to it, so when traffic signal control is performed at each intersection, only the directly adjacent intersections need be treated as competing agents that assist the signal control. FIG. 6 is a schematic diagram of the plurality of intersections included in a target area according to an alternative embodiment of the present invention. As shown in FIG. 6, the target area contains 7 nodes in total, i.e., 7 intersections, and each edge linking two nodes represents a direct adjacency between two signalized intersections; for example, the edge connecting I1 and I4 indicates that intersections I1 and I4 are directly adjacent upstream and downstream signalized intersections, i.e., I1 and I4 are adjacent to each other.
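For illustration, the adjacency used to pick an agent's competing agents could be stored as a simple mapping; only the adjacency of I1 and I4 is stated explicitly in the text, so the sketch below lists just that pair and the rest of the edge list is left out:

```python
# Direct adjacency of the signalized intersections in FIG. 6 (partial,
# illustrative; only the I1/I4 relationship is stated explicitly).
ADJACENT = {
    "I1": ["I4"],
    "I4": ["I1"],
}

def competing_agents(i):
    """Each agent only games against its directly adjacent intersections."""
    return ADJACENT.get(i, [])
```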
Optionally, to obtain a Nash equilibrium solution, the strategy choices of the other competing agents must be predicted. Because, when the multi-agent system converges stably at the global level, the empirical distribution of an agent is exactly the Nash equilibrium solution that responds to its competing agents, the expected strategy choice of a competing agent can be approximately replaced by the empirical values learned in multi-agent reinforcement learning. That is, the target agent can directly use the learned empirical values of a competing (adjacent) agent as an approximation of the strategy that agent will currently choose. Specifically, the empirical values of multi-agent reinforcement learning are used to approximate the strategy-choice expectation of the competing agents: assuming that when target agent i faces state s_i, competing agent j faces state s_j, and the agents learn with a Q network, the learned Q values can be mapped to a distribution Pr_j that represents the basic components of the mixed strategy:
where A_j is the adjacent action set; the probability distribution with which the adjacent reinforcement learning model selects data from the adjacent action set when its input is the adjacent traffic state data (i.e., the empirical distribution of the adjacent reinforcement learning model corresponding to the adjacent intersection) is abbreviated Pr_j; and Q_j is the Q value learned by competing agent j.
After the empirical distribution Pr_j of the neighboring agent has been obtained, a distribution mapping can be applied to it to solve the adjacent Nash equilibrium solution of the adjacent agent. Based on the mixed strategy the competing agents are expected to adopt and the game payoff function, the mixed-strategy Nash equilibrium solution of agent i can then be solved:
where σ_i is a mixed strategy in the game-theoretic sense, i.e., a probability distribution of participant i ∈ I over its pure-strategy space. The strategy combination jointly chosen by all the agents constitutes a strict mixed-strategy Nash equilibrium solution of the game: by analysing the decision outcomes of the agents, each agent seeks the decision behaviour that maximizes its own payoff, and the resulting decision is no worse than any other pure strategy.
As an alternative embodiment, the target reinforcement learning model selects the target traffic signal data from the target action set according to the target Nash equilibrium solution, as follows: the target reinforcement learning model randomly selects the target traffic signal data from the target action set based on the probability distribution characterized by the target Nash equilibrium solution.
Optionally, the target Nash equilibrium solution is a probability distribution over the data in the target action set, and the target traffic signal data may be selected at random from the target action set according to this distribution. Through the distribution mapping, a decision-sampling probability distribution is obtained, and random sampling from this probability distribution yields the execution action of agent i, i.e., the instantaneous signal control scheme (target traffic signal data) for the intersection associated with agent i.
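A minimal sketch of how the empirical distribution and the final sampling step could look; the softmax (Boltzmann) mapping from Q values to Pr_j and the temperature parameter are assumptions, since the text only states that the learned Q values are mapped to a distribution:

```python
import numpy as np

def empirical_distribution(q_values, temperature=1.0):
    """Map learned Q values to a probability distribution Pr_j.

    A Boltzmann/softmax mapping is assumed here for illustration.
    """
    z = np.asarray(q_values, dtype=float) / temperature
    z = z - z.max()                 # numerical stability
    p = np.exp(z)
    return p / p.sum()

def sample_action(nash_distribution, action_set, rng=None):
    """Randomly draw the traffic signal data according to the Nash solution."""
    rng = rng or np.random.default_rng()
    return rng.choice(action_set, p=nash_distribution)
```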
As an alternative embodiment, the target reinforcement learning model among the plurality of reinforcement learning models is obtained by training in the following manner: acquiring a plurality of sample state data; inputting the plurality of sample state data into a plurality of original reinforcement learning models respectively, the plurality of original reinforcement learning models selecting, based on a Nash equilibrium strategy, a plurality of sample traffic signal data from a plurality of preset action sets and outputting them, wherein the plurality of sample traffic signal data correspond one-to-one to the plurality of sample state data, and the plurality of original reinforcement learning models are the original models of the plurality of reinforcement learning models; while a plurality of sample intersections are controlled to be respectively in the states described by the plurality of sample state data, controlling the signal lamps of the sample intersections with the sample traffic signals corresponding to the sample state data, and taking the traffic state data of the sample intersections after control as a plurality of feedback traffic state data, wherein the feedback traffic state data correspond one-to-one to the sample state data; and training the plurality of original reinforcement learning models according to the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data and a predetermined reward function.
Alternatively, the process of training multiple agents is similar to the process of making decisions by multiple agents during a certain period described above. In a certain training period, a plurality of original agents respectively select the current sample traffic signal data to output in respective action sets based on Nash equalization strategies according to the input sample state data; then, a plurality of sample intersections can be controlled according to the plurality of output sample traffic signals, and feedback of the plurality of sample intersections is collected, namely traffic state data of the plurality of sample intersections after control, namely a plurality of feedback traffic state data; and then, parameter adjustment can be carried out on the plurality of original reinforcement learning models according to the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data and the preset rewarding function, and the plurality of original reinforcement learning models are trained.
When setting the control objective for urban road-network traffic signals, the invention aims to relieve and suppress traffic congestion. According to the fundamental diagram of traffic flow, when the capacity of a road section is fixed and the traffic flow density is low, the traffic flow can travel along the section at a relatively high speed and travel delay is effectively reduced. Therefore, keeping the number of travelling vehicles on each road section of the network at a low level relieves and suppresses congestion in the urban road network. The reward function r_i of agent i is accordingly defined as a function of the change in the number of vehicles on the entry road sections of its associated signalized intersection:
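As a hedged illustration of such a reward, the sketch below simply rewards a decrease in the total number of vehicles on the entry road sections between two decision points; the exact functional form used by the patent is not reproduced:

```python
def reward(prev_vehicle_count, curr_vehicle_count):
    """Hedged sketch of r_i: reward reductions in the number of vehicles
    on the entry road sections of the associated intersection.

    A simple negative change in the total entry-lane vehicle count is
    assumed here for illustration.
    """
    return float(prev_vehicle_count - curr_vehicle_count)
```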
As an alternative embodiment, training the plurality of original reinforcement learning models based on the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data, and the predetermined reward function, comprises: determining a plurality of rewards corresponding to the sample state data one by one according to the sample state data, the feedback traffic state data and the rewards function; determining a plurality of learning rates according to a plurality of Nash equilibrium solutions obtained before a plurality of original reinforcement learning models output a plurality of sample traffic signal data, wherein the plurality of learning rates respectively represent the magnitude of changing the parameter of the reinforcement learning model during training; training a plurality of original reinforcement learning models according to the plurality of reward values and the plurality of learning rates.
Optionally, the learning rate of each agent is adjusted adaptively. The learning rate indicates how strongly the parameters of the reinforcement learning model are changed during training and should be determined according to the task objective of the agent, i.e., learning the signal control scheme.
The game mechanism of the mixed-strategy Nash equilibrium is analysed against the steady-state expectation of distributed traffic signal control in an urban road network. When the steady-state condition is reached, the agent's own empirical distribution coincides with the Nash equilibrium solution, and the empirical distributions of the competing agents likewise coincide with the mixed-strategy Nash equilibrium solution; that is, the empirical distributions of the local agents fully satisfy the condition of the mixed-strategy Nash equilibrium solution. In this case the existing learning experience should be kept without further updates, i.e., the agent's learning rate approaches 0. When the steady-state condition is not reached, the agent should adaptively adjust the rate of its empirical updates, using the difference between its own empirical distribution and the mixed-strategy Nash equilibrium solution as the evaluation criterion; even when that difference is extremely large, the learning rate should remain bounded, approaching at most 1. Thus, the learning rate α_i of agent i can be obtained from the difference between its own empirical distribution Pr_i and the optimal decision distribution (the Nash equilibrium solution):
The function JS is the Jensen-Shannon divergence, which measures the difference between two probability distributions: JS(P‖Q) = ½ KL(P‖M) + ½ KL(Q‖M), with the mixture M = ½(P+Q),
where KL(·) denotes the Kullback-Leibler divergence; the first term on the right is the KL divergence between P and the mixture distribution M.
Using the Jensen-Shannon divergence to construct the learning-rate mechanism has two advantages. Symmetry: unlike the Kullback-Leibler divergence, the JS divergence is symmetric, so the comparison of the two probability distributions does not depend on their order, and the difference between the signal-control agent's empirical distribution and the optimal decision distribution can be measured in a standardized way. Fixed value range: the JS divergence has a bounded range (it lies in [0, 1] when the base-2 logarithm is used), which matches the constraint that the learning rate α_i of agent i lies in [0, 1].
The above analysis shows that this mapping satisfies the task requirements of the signal-control agent. Setting the agent's learning rate through this mapping mechanism improves the sensitivity of multi-agent reinforcement learning within the signal-control agent architecture: on the premise that convergence is still guaranteed, the learning-rate mapping makes the signal-control agent more responsive to dynamic random fluctuations.
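A small sketch of this learning-rate mapping, assuming the learning rate is taken directly as the base-2 Jensen-Shannon divergence between the agent's empirical distribution and its Nash equilibrium solution (the text states only that it is obtained from their difference):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence with base-2 logarithm."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log2(p / q)))

def js(p, q):
    """Jensen-Shannon divergence; base 2 keeps the value in [0, 1]."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def learning_rate(empirical_dist, nash_solution):
    """alpha_i grows with the gap between the empirical distribution Pr_i
    and the mixed-strategy Nash equilibrium solution, and stays in [0, 1]."""
    return js(empirical_dist, nash_solution)
```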
In this optional embodiment, the learning speed of the agent can be adjusted according to the learning rate, and parameters inside the agent are adjusted through the reward value to train the agent.
As an alternative embodiment, the method further comprises: acquiring position data and speed data of an emergency vehicle in a target intersection and the length of an emergency lane of the target intersection, wherein the length of the emergency lane is determined by the number of lanes of the target intersection and the traffic flow density and the flow rate of the target intersection in the parking, semi-saturated flow and saturated flow states; and under the condition that the emergency vehicle reaches the starting position of the emergency lane, controlling the traffic light of the target intersection to be a green light.
Optionally, the invention also proposes signal lamp control for emergency vehicles. The intersection has four entrance approaches, each denoted by its own symbol. An entrance approach can be provided with a reverse (contraflow) left-turn lane that an emergency vehicle is allowed to borrow; this lane can serve the functions of going straight, turning left, reverse left turn and emergency passage, each function denoted by its own symbol.
When an emergency vehicle enters the speed-guidance area, the lane function is changed from reverse left turn to emergency, subject to the following constraint:
In the formula, one term is the number of lanes of the d-direction entrance approach assigned to lane function Ω, and a binary variable indicates whether the lane currently functions as a reverse left-turn lane or as an emergency lane.
The length chosen for the reverse left-turn lane affects the reasonable allocation of space-time resources at the intersection. According to shock-wave theory, the reverse left-turn lane length L_d is designed with the goal that the traffic flow at the intersection reaches the saturation flow rate, as given by the following formula:
where L_d is the reverse left-turn lane length; q_1 is the left-turn traffic volume in one cycle (pcu/h); h_0 is the headway; n_1 and n_2 are the numbers of vehicles turning left and turning left in the reverse lane, respectively; and k_1, k_2, k_3 and q_1, q_2, q_3 are the traffic flow density and flow rate under stopped, semi-saturated and saturated flow conditions, respectively.
FIG. 7 is a schematic view of emergency vehicle traffic provided according to an alternative embodiment of the present invention. As shown in FIG. 7, the emergency vehicle is constrained by the blocking vehicles ahead and by the opening position of the emergency lane, and its minimum travel time in the speed-guidance area is fixed. Meanwhile, the pre-signal control empties the emergency lane, and the higher the speed of the emergency vehicle, the less time it spends in the emergency lane. Therefore, guiding the vehicle to enter the emergency lane at the highest speed and keeping that speed is the optimal strategy. However, if the emergency vehicle keeps the maximum speed while turning left, the turning speed becomes too large and accidents such as rollover easily occur, so speed guidance should instead bring the emergency vehicle to the maximum safe speed. When the emergency vehicle reaches the intersection area, the lane function of the reverse left-turn lane is changed to emergency lane. The control rule of the pre-signal is to ensure that the emergency vehicle passes with priority: whether the pre-signal green light is switched on, and for how long, is determined by the emergency vehicle. Green-light switch-on time: when the emergency vehicle reaches the start position of the emergency lane, the pre-signal green light is switched on.
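A minimal sketch of the pre-signal rule described above; the position convention and the two-state output are assumptions:

```python
def pre_signal_state(emergency_vehicle_pos, emergency_lane_start, lane_is_emergency):
    """Hedged sketch of the pre-signal rule for the emergency lane.

    The pre-signal green is switched on once the emergency vehicle reaches
    the start position of the emergency lane; the distance convention and
    threshold handling are assumptions for illustration.
    """
    if lane_is_emergency and emergency_vehicle_pos >= emergency_lane_start:
        return "GREEN"
    return "RED"
```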
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present invention. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
From the above description of the embodiments, it will be clear to those skilled in the art that the traffic signal control method based on the multi-agent reinforcement learning model according to the above embodiments may be implemented by means of software plus a necessary general hardware platform, and of course may also be implemented by hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (e.g. ROM/RAM, magnetic disk, optical disk) comprising instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the method according to the embodiments of the present invention.
According to an embodiment of the present invention, there is further provided a traffic signal control device based on a multi-agent reinforcement learning model for implementing the above traffic signal control method based on a multi-agent reinforcement learning model. Fig. 8 is a block diagram of a traffic signal control device based on a multi-agent reinforcement learning model according to an embodiment of the present invention. As shown in fig. 8, the device includes an acquisition module 82 and a control module 84, which are described below.
The acquisition module 82 is configured to acquire traffic state data of each of the plurality of intersections in the target area, where the traffic state data of each of the plurality of intersections includes position data and speed data of vehicles in the roads directly connected to the corresponding intersection.
The control module 84 is connected to the acquisition module 82, and is configured to input the traffic state data of each of the plurality of intersections into a plurality of reinforcement learning models trained in advance, the plurality of reinforcement learning models outputting a plurality of traffic signal data, where the plurality of traffic signal data are respectively used to control the signal lamps of the plurality of intersections, the plurality of reinforcement learning models are in one-to-one correspondence with the plurality of intersections, and the plurality of reinforcement learning models respectively select the plurality of traffic signal data from a plurality of action sets based on a Nash equilibrium strategy, where the plurality of action sets are the respective action sets of the plurality of reinforcement learning models and include the selectable traffic signal data.
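For illustration only, a minimal sketch of how the two modules might cooperate, with the control module sampling each intersection's traffic signal data from the probability distribution given by its Nash equilibrium solution; the class and method names, the dummy model, and the use of numpy sampling are assumptions rather than the patented implementation:

```python
import numpy as np

class DummyModel:
    """Stand-in for one pre-trained reinforcement learning model (assumption)."""
    action_set = ["phase_A", "phase_B", "phase_C"]   # selectable traffic signal data

    def nash_equilibrium(self, state):
        # fixed mixed strategy, standing in for the model's Nash equilibrium solution
        return [0.5, 0.3, 0.2]

class AcquisitionModule:
    def get_states(self, intersection_ids):
        # placeholder: would return position and speed data of vehicles on the
        # roads directly connected to each intersection
        return {iid: {"positions": [], "speeds": []} for iid in intersection_ids}

class ControlModule:
    def __init__(self, models):
        self.models = models   # one reinforcement learning model per intersection

    def select_signals(self, states):
        signals = {}
        for iid, state in states.items():
            model = self.models[iid]
            probs = model.nash_equilibrium(state)                   # distribution over the action set
            idx = np.random.choice(len(model.action_set), p=probs)  # random selection per that distribution
            signals[iid] = model.action_set[idx]                    # traffic signal data for this intersection
        return signals

acq, ctrl = AcquisitionModule(), ControlModule({"J1": DummyModel()})
print(ctrl.select_signals(acq.get_states(["J1"])))
```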
It should be noted that the acquisition module 82 and the control module 84 correspond to steps S202 to S204 in the foregoing embodiment; the modules implement the same examples and application scenarios as the corresponding steps, but are not limited to the disclosure of the foregoing embodiment. It should also be noted that the above modules may run as part of the apparatus in the computer terminal 10 provided in the embodiment.
Embodiments of the present invention may provide a computer device. Optionally, in this embodiment, the computer device may be located in at least one of a plurality of network devices in a computer network. The computer device includes a memory and a processor.
The memory may be used to store software programs and modules, such as program instructions/modules corresponding to the traffic signal control method and apparatus based on the multi-agent reinforcement learning model in the embodiments of the present invention, and the processor executes various functional applications and data processing by running the software programs and modules stored in the memory, that is, implements the traffic signal control method based on the multi-agent reinforcement learning model. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory may further include memory remotely located relative to the processor, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor may call the information and the application program stored in the memory through the transmission device to perform the following steps: acquiring traffic state data of each of a plurality of intersections in a target area, wherein the traffic state data of each of the plurality of intersections comprises position data and speed data of vehicles in roads directly connected with the corresponding intersections; the traffic state data of each intersection is respectively input into a plurality of reinforcement learning models trained in advance, the plurality of reinforcement learning models are respectively used for outputting a plurality of traffic signal data, the plurality of traffic signal data are respectively used for controlling signal lamps of the intersections, the reinforcement learning models are in one-to-one correspondence with the intersections, the reinforcement learning models are respectively used for selecting the plurality of traffic signal data from a plurality of action sets based on Nash equilibrium strategies, the action sets are respectively used for the reinforcement learning models, and the action sets respectively comprise selectable traffic signal data.
By adopting the embodiments of the present invention, a traffic signal control scheme based on a multi-agent reinforcement learning model is provided. The scheme acquires traffic state data of each of a plurality of intersections in a target area, where the traffic state data of each intersection includes position data and speed data of vehicles in the roads directly connected to that intersection. The traffic state data of the intersections are respectively input into a plurality of reinforcement learning models trained in advance, and the models output a plurality of traffic signal data, where the traffic signal data are respectively used to control the signal lamps of the intersections, the reinforcement learning models are in one-to-one correspondence with the intersections, and each model selects its traffic signal data from its own action set of selectable traffic signal data based on a Nash equilibrium strategy. Controlling the signals based on the Nash equilibrium strategy strengthens the coupling among the reinforcement learning models, so that each model takes the other models into account when controlling its signals. This improves the effect of traffic signal control and solves the technical problem that traffic signal control is poor when a plurality of reinforcement learning models respectively control the signals of a plurality of intersections but each model makes its decisions independently.
Those skilled in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing the relevant hardware of a terminal device; the program may be stored in a non-volatile storage medium, and the storage medium may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or the like.
Embodiments of the present invention also provide a nonvolatile storage medium. Alternatively, in the present embodiment, the above-described nonvolatile storage medium may be used to store the program code executed by the traffic signal control method based on the multi-agent reinforcement learning model provided in the above-described embodiment.
Alternatively, in this embodiment, the above-mentioned nonvolatile storage medium may be located in any one of the computer terminals in the computer terminal group in the computer network, or in any one of the mobile terminals in the mobile terminal group.
Optionally, in the present embodiment, the non-volatile storage medium is arranged to store program code for performing the steps of: acquiring traffic state data of each of a plurality of intersections in a target area, wherein the traffic state data of each of the plurality of intersections comprises position data and speed data of vehicles in roads directly connected with the corresponding intersections; the traffic state data of each intersection is respectively input into a plurality of reinforcement learning models trained in advance, the plurality of reinforcement learning models are respectively used for outputting a plurality of traffic signal data, the plurality of traffic signal data are respectively used for controlling signal lamps of the intersections, the reinforcement learning models are in one-to-one correspondence with the intersections, the reinforcement learning models are respectively used for selecting the plurality of traffic signal data from a plurality of action sets based on Nash equilibrium strategies, the action sets are respectively used for the reinforcement learning models, and the action sets respectively comprise selectable traffic signal data.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of the units, for example, may be a logic function division, and may be implemented in another manner, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a non-volatile storage medium. Based on such understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a usb disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which are intended to be comprehended within the scope of the present invention.

Claims (10)

1. A traffic signal control method based on a multi-agent reinforcement learning model is characterized by comprising the following steps:
Acquiring traffic state data of each of a plurality of intersections in a target area, wherein the traffic state data of each of the plurality of intersections comprises position data and speed data of vehicles in roads directly connected with the corresponding intersections;
The traffic state data of each intersection is respectively input into a plurality of reinforcement learning models trained in advance, the plurality of reinforcement learning models are respectively used for outputting a plurality of traffic signal data, the plurality of traffic signal data are respectively used for controlling signal lamps of the intersections, the reinforcement learning models are in one-to-one correspondence with the intersections, the reinforcement learning models are respectively used for selecting the traffic signal data from a plurality of action sets based on Nash equilibrium strategies, and the action sets are action sets of each reinforcement learning model and respectively comprise selectable traffic signal data.
2. The method of claim 1, wherein the inputting the traffic state data of each of the plurality of intersections into a plurality of reinforcement learning models trained in advance, outputting the plurality of traffic signal data for controlling the plurality of intersections by the plurality of reinforcement learning models, respectively, comprises:
Inputting target traffic state data of a target intersection into a target reinforcement learning model, and determining, by the target reinforcement learning model, respective game gain function values of the traffic signal data selectable in a target action set based on a preset game gain function and the target traffic state data, wherein the target action set is the set of traffic signal data selectable by the target reinforcement learning model;
the target reinforcement learning model obtains a target Nash equilibrium solution corresponding to the target intersection according to the game gain function values based on the Nash equilibrium strategy, wherein the target Nash equilibrium solution is a probability distribution over the data selectable in the target action set;
The target reinforcement learning model selects target traffic signal data from the target action set according to the target Nash equilibrium solution, wherein the target traffic signal data is used for controlling signal lamps of the target intersection;
And respectively inputting the traffic state data of each of the plurality of intersections into a plurality of reinforcement learning models trained in advance by adopting a mode of obtaining the target traffic signal data, and respectively outputting the plurality of traffic signal data for controlling the plurality of intersections by the plurality of reinforcement learning models.
3. The method of claim 2, wherein the target reinforcement learning model obtaining a target Nash equilibrium solution corresponding to the target intersection based on the Nash equilibrium strategy according to the game gain function values comprises:
acquiring adjacent traffic state data of an adjacent intersection of the target intersection and experience distribution of an adjacent reinforcement learning model corresponding to the adjacent intersection, wherein the experience distribution of the adjacent reinforcement learning model is probability distribution of selecting data in an adjacent action set by the adjacent reinforcement learning model under the condition that input data are the adjacent traffic state data;
The target reinforcement learning model predicts adjacent Nash equilibrium solutions corresponding to the adjacent intersections, which are obtained by the adjacent reinforcement learning model based on the Nash equilibrium strategy, according to the experience distribution of the adjacent reinforcement learning models;
and the target reinforcement learning model obtains the target Nash equilibrium solution according to the adjacent Nash equilibrium solution and the game gain function values.
4. The method of claim 2, wherein the target reinforcement learning model selecting the target traffic signal data from the target action set according to the target Nash equilibrium solution comprises:
The target reinforcement learning model randomly selects the target traffic signal data from the target action set based on the probability distribution characterized by the target Nash equilibrium solution.
5. The method of claim 1, wherein a target reinforcement learning model of the plurality of reinforcement learning models is trained by:
acquiring a plurality of sample state data;
Respectively inputting the plurality of sample state data into a plurality of original reinforcement learning models, the plurality of original reinforcement learning models respectively selecting, based on the Nash equilibrium strategy, a plurality of sample traffic signal data for output from the plurality of preset action sets, wherein the plurality of sample traffic signal data are in one-to-one correspondence with the plurality of sample state data, and the plurality of original reinforcement learning models are the respective original models of the plurality of reinforcement learning models;
under the condition that a plurality of sample intersections are respectively in the states represented by the plurality of sample state data, performing signal lamp control on the plurality of sample intersections using the sample traffic signal data respectively corresponding to the plurality of sample state data, and determining the traffic state data of the plurality of sample intersections after the control as a plurality of feedback traffic state data, wherein the plurality of feedback traffic state data are in one-to-one correspondence with the plurality of sample state data;
Training the plurality of original reinforcement learning models according to the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data and a preset reward function.
6. The method of claim 5, wherein training the plurality of raw reinforcement learning models based on the plurality of sample state data, the plurality of sample traffic signal data, the plurality of feedback traffic state data, and a predetermined reward function comprises:
determining a plurality of reward values corresponding to the sample state data one to one according to the sample state data, the feedback traffic state data and the reward function;
Determining a plurality of learning rates according to a plurality of Nash equilibrium solutions obtained before the plurality of original reinforcement learning models output the plurality of sample traffic signal data, wherein the plurality of learning rates respectively characterize the magnitude by which the parameters of the corresponding reinforcement learning models are changed during training;
training the plurality of original reinforcement learning models according to the plurality of reward values and the plurality of learning rates.
7. The method according to any one of claims 1 to 6, further comprising:
Acquiring position data and speed data of an emergency vehicle at a target intersection and the length of an emergency lane of the target intersection, wherein the length of the emergency lane is determined by the number of lanes of the target intersection and by the traffic flow density and flow rate of the target intersection under the stopped, semi-saturated flow, and saturated flow states;
And controlling the traffic light of the target intersection to be a green light under the condition that the emergency vehicle reaches the starting position of the emergency lane.
8. A traffic signal control device based on a multi-agent reinforcement learning model, comprising:
The system comprises an acquisition module, a control module and a control module, wherein the acquisition module is used for acquiring traffic state data of each of a plurality of intersections in a target area, and the traffic state data of each of the plurality of intersections comprises position data and speed data of vehicles in roads directly connected with the corresponding intersections;
The control module is used for respectively inputting the traffic state data of each intersection into a plurality of reinforcement learning models trained in advance, respectively outputting a plurality of traffic signal data by the reinforcement learning models, wherein the plurality of traffic signal data are respectively used for controlling signal lamps of the intersections, the reinforcement learning models are in one-to-one correspondence with the intersections, and the reinforcement learning models respectively select the traffic signal data from a plurality of action sets based on Nash equilibrium strategies, wherein the action sets are action sets of each reinforcement learning model, and the action sets respectively comprise selectable traffic signal data.
9. A non-volatile storage medium, characterized in that the non-volatile storage medium comprises a stored program, wherein, when the program is run, a device in which the non-volatile storage medium is located is controlled to execute the traffic signal control method based on the multi-agent reinforcement learning model according to any one of claims 1 to 7.
10. A computer device, comprising: a memory and a processor, wherein
the memory stores a computer program; and
The processor is configured to execute a computer program stored in the memory, and the computer program when executed causes the processor to execute the traffic signal control method based on the multi-agent reinforcement learning model according to any one of claims 1 to 7.