GB2607880A - Traffic control system

Publication number
GB2607880A
Authority
GB
United Kingdom
Prior art keywords
agent
junction
traffic control
traffic
machine learning
Legal status
Pending
Application number
GB2108352.2A
Other versions
GB202108352D0 (en)
Inventor
Howell Shaun
Yasin Ahmed
Knutins Maksis
Mooroogen Krishna
Current Assignee
Vivacity Labs Ltd
Original Assignee
Vivacity Labs Ltd
Application filed by Vivacity Labs Ltd
Priority to GB2108352.2A
Publication of GB202108352D0
Priority to PCT/GB2022/051240 (WO2022258943A1)
Publication of GB2607880A


Classifications

    • G06N3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N3/04 Neural networks; architecture, e.g. interconnection topology
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/084 Neural network learning methods; backpropagation, e.g. using gradient descent
    • G08G1/0116 Measuring and analyzing of parameters relative to traffic conditions based on the source of data from roadside infrastructure, e.g. beacons
    • G08G1/0125 Measuring and analyzing of parameters relative to traffic conditions; traffic data processing
    • G08G1/0145 Measuring and analyzing of parameters relative to traffic conditions for specific applications for active traffic flow control
    • G08G1/081 Controlling traffic signals; plural intersections under common control


Abstract

Traffic signals at multiple junctions are controlled using a traffic control agent subsystem to send traffic signal stage requests to traffic signal controllers controlling the traffic signals. The traffic control agent subsystem receives input data from multiple sensors for monitoring vehicles or other road users at each junction. It includes a machine learning agent trained by reinforcement learning. The agent comprises a neural network including: a shared state embedding subnetwork 12 comprising an input layer, one or more connected hidden layers and a shared state output layer; a global value subnetwork 16 comprising an input layer connected to the shared state output layer, one or more connected hidden layers, and a global value output layer; for each junction, an advantage subnetwork 14a, 14b comprising an input layer connected to the shared state output layer, a plurality of connected hidden layers, and a junction advantage output layer; and an aggregation layer for combining the global value output layer with each of the junction advantage layers to output a junction value layer.

Description

TRAFFIC CONTROL SYSTEM
The present invention relates to a traffic control system, in particular a system utilising an intelligent agent trained by reinforcement learning to control traffic by controlling the traffic signals at, for example, multiple junctions in a town or a city.
BACKGROUND TO THE INVENTION
Traffic in a city is controlled primarily by traffic signals at junctions. At a very basic level traffic signals keep junctions safe by ensuring that only vehicles coming from particular lanes are able to enter the junction at one time, reducing the risk of collision. At busy junctions, signal control provides a clear advantage over just setting out rules as to "rights of way" and relying on drivers to comply with them, since the signal control should ensure that all drivers are given access to the junction within a reasonable length of time, reducing frustration and managing fair and safe access to the shared road space.
Traffic signals at junctions are preferably configured as far as possible to keep traffic moving and ensure that the available road space is utilised in the most efficient way. Hence it is common to provide at least some sensors at junctions so that access to the junction is provided taking into account current demand from particular directions, i.e. queues of traffic approaching the junction from a particular lane. Traffic signals at junctions may also be controlled in an attempt to optimise according to certain other goals, for example ensuring that buses run on time by controlling traffic to keep bus routes clear as a priority.
More sophisticated control of traffic signals at a junction has been proposed. Such control can be used to further transportation policies such as reducing congestion, reducing pollution and fuel use, improving road safety, and encouraging use of public transport. The applicant's previous patent application WO2020225523 discloses a machine learning agent, primarily trained by reinforcement learning in a simulation. The agent is optimised by its training to maximise performance against goals which can be set according to current policy objectives. The agent will change its strategy if the goals change, and also continually adapts to changes in traffic patterns caused by various external factors. As such, these reinforcement-learning based agents provide a very flexible traffic control system which avoids the need for manual, expensive and often non-optimal calibration at regular intervals.
The agents of WO2020225523 each control a single junction. Although there may be an element of communication between agents, each junction is essentially controlled by its own trained agent. The extent to which traffic flow through an entire city-wide network can be optimised is therefore limited. An agent controlling a single junction may make what appears to the agent, according to its training, to be a good decision, but which creates a state in the network as a whole which makes things difficult (i.e. reduces the expected reward value of available actions) for other agents.
A single neural network-based agent can be trained using reinforcement learning to control multiple junctions. Controlling two or more junctions is not conceptually very different from controlling one particularly large and complex junction. However, the time taken to train an agent to a point where it will perform well increases as the complexity of the junction increases. The complexity of a single neural network increases exponentially with the number of junctions. Even with the parallelised simulation-based training disclosed, an agent which controls even a few tens of junctions (perhaps the central area of a small town, certainly far short of a major city) takes too long to train. Since one of the key advantages of these machine-learning based systems over manual calibration is the ability to continually re-train and redeploy agents according to changing circumstances and changing priorities, long training times are undesirable and very long training times make the system useless.
The effectiveness of the agent training, even when agents are trained for a very long time, also becomes more variable as complexity increases. This is because as the action space grows, the feedback information given to the agent during training relates to only a small part of the action space.
It is an object of the present invention to provide an intelligent agent which may be trained to control multiple junctions across a town or city to optimise the flow of traffic according to policy goals.
STATEMENT OF INVENTION
According to the present invention, there is provided a traffic control system for use in controlling a road network comprising multiple junctions, the traffic control system comprising: a plurality of sensors for monitoring vehicles and/or other road users at and around each junction; a traffic control agent subsystem and traffic signals including signal outputs for controlling the vehicles and/or other road users at each junction, the sensors providing inputs to the traffic control agent subsystem, and the traffic control agent subsystem controlling the traffic signals to optimise traffic flow in accordance with one or more goals, in which the traffic control agent subsystem includes a machine learning agent trained by reinforcement learning, the machine learning agent comprising a neural network including: a shared state embedding subnetwork comprising an input layer, one or more connected hidden layers and a shared state output layer; a global value subnetwork comprising an input layer connected to the shared state output layer, one or more connected hidden layers, and a global value output layer; for each junction, an advantage subnetwork comprising an input layer connected to the shared state output layer, a plurality of connected hidden layers, and a junction advantage output layer; and an aggregation layer for combining the global value output layer with each of the junction advantage layers to output a junction value layer.
Each junction advantage subnetwork is independent of the other junction advantage subnetworks. The neural network as a whole is "branched", with a branch per junction being controlled. This means that the complexity scales about linearly with the number of junctions, and so networks can be realistically produced to control, for example, all junctions in a city. At the same time the global value subnetwork ensures that the global (city-wide) context is provided when training the network, so that the effects of decisions made on the road network as a whole are taken into account.
The network uses what is known as a "dueling" architecture. When trained, the global value output layer can be thought of as representing the overall value of a state of the roads. The junction advantage output layer associated with a particular junction represents the advantage, or improvement, expected for each alternative action which could be taken at the junction. After aggregating this vector of action-advantages with the global value output layer, a vector representing the expected value of the state of the roads, following taking of each action, is calculated. This is the Q-value vector of the well-known Q-learning algorithm. The network can be updated by substituting elements of the Q-value vector for known observed results, calculating a loss vector, and updating by backpropagation.
Note that the aggregation layer performs a static aggregation of its inputs. The aggregation layer is not a layer with learnable parameters. Hence the Q-value vector is always calculated in a consistent way from the junction advantage output layers and the global value output layer.
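By way of illustration only, the following sketch shows one way such a branched dueling network could be laid out, assuming PyTorch. The layer widths, the number of junctions and actions, and the mean-subtracted form of the aggregation are assumptions made for the example rather than features prescribed by the invention.

```python
# Illustrative sketch (assumptions noted above): a shared state embedding, one
# global value head and one advantage branch per junction, combined by a
# static, non-learnable aggregation into per-junction Q-values.
import torch
import torch.nn as nn


class BranchedDuelingNetwork(nn.Module):
    def __init__(self, state_dim, num_junctions, actions_per_junction, hidden=256):
        super().__init__()
        # Shared state embedding subnetwork
        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Global value subnetwork (one scalar for the whole road network state)
        self.value = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
        # One advantage subnetwork (branch) per junction
        self.advantages = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, actions_per_junction),
            )
            for _ in range(num_junctions)
        ])

    def forward(self, state):
        z = self.shared(state)                     # shared embedding of the road state
        v = self.value(z)                          # (batch, 1)
        q_per_junction = []
        for branch in self.advantages:
            a = branch(z)                          # (batch, actions_per_junction)
            # Static aggregation of global value and per-junction advantages.
            q_per_junction.append(v + a - a.mean(dim=1, keepdim=True))
        return torch.stack(q_per_junction, dim=1)  # (batch, junctions, actions)


# Example use: 50 junctions with 8 candidate stages each, 512-dimensional state.
net = BranchedDuelingNetwork(state_dim=512, num_junctions=50, actions_per_junction=8)
q_values = net(torch.randn(4, 512))               # shape (4, 50, 8)
```

Because each advantage branch only sees the shared embedding, adding a junction adds one branch, which is consistent with the overall size growing roughly linearly with the number of junctions.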
The network is primarily trained using a simulation model. In particular, a road network simulation model may accept inputs of traffic scenarios and inputs of control decisions, and provide outputs of traffic patterns as a result of the control decisions made. The road network simulation model simulates an entire road network, spanning for example at least a substantial area of a town or city. The road network includes multiple junctions. The neural network agent may be trained by applying control decisions made by the agent to the road network simulation model to collect "memories". The control decisions are made as a result of the output of the junction value layer of the network when the network is applied to a particular input state. The output of the junction value layer is an expected value associated with each possible action. When being trained in the simulation, the agent may take actions both in an "exploitation" mode, where the action is taken which is expected to be the best action, i.e. the action with the best expected value according to current learning, and in an "exploration" mode, where the agent deviates from time to time from the "best" action in order to explore the policy space.
Because the network is trained using a simulation model, training may take place at high speed (i.e. anything faster than real-time, but potentially much faster). Also, training may be parallelised, whereby multiple copies of agents train in multiple copies of the simulation model, noting that due to the exploration which may take place, the same agent in the same simulation may make different choices and thus collect different memories. The memories which are built up by operating the agents in simulations form the basis of updating the agents by reinforcement learning.
In a preferable embodiment, and as described in WO2020225523, agents may be continually trained in an agent training system, while a "current best" agent is deployed in a live traffic control system to actually control the traffic in the road network. The agent in the live traffic control system may be replaced as and when a better agent becomes available from the agent training system. In addition, agents may learn from real memories collected by the agent controlling real traffic in the live traffic control system. Such memories may be used to update the agent in the live traffic control system in some embodiments, and/or may be shared with the agent training system to be used in updating models currently being trained, in addition to the use of memories from simulations.
WO2020225523 discusses in more detail the different options available in terms of training agents in simulations and/or in a live system, for deployment of a best agent at a particular time into a live system. The full description of WO2020225523 is incorporated herein by reference.
Although training in simulations allows parallelization and faster-than-real-time learning, the action space across a large number of junctions, and therefore the potential space in which exploration can take place, is large. The action space grows exponentially with the number of junctions. Therefore, embodiments of the invention may use techniques to engineer the actions chosen, to maximise learning and therefore convergence to good control strategies, while ensuring the agent training system can be realistically implemented with available hardware and that training can be completed in a reasonable amount of time.
In some embodiments, exploration, i.e. the agent making a choice other than the best choice according to current learning, may be allowed one junction at a time. The agent is controlling multiple junctions in a simulation, and making control decisions in relation to all of them. A single junction may be nominated for the duration of a training episode (i.e. the agent being run on the simulator on a particular scenario) in relation to which exploration is allowed. In other embodiments, a different junction may be nominated for exploration each time a decision is made. Preferably, the junctions are cycled through so that each junction gets an equal opportunity for exploration. In other embodiments, a junction could be selected at random for exploration every time a decision is made.
In these embodiments, where a particular junction is selected at any one time in which exploration is allowed, there can still be a random aspect as to whether exploration actually happens. In particular, an "exploration temperature", ε, may be used to quantify how likely the agent is to take a random exploration action. The probability of the agent taking the best action according to its current learning is therefore (1 - ε).
The value of ε may be reduced as the training episode progresses, so that random exploration becomes less likely towards the end of the episode, and "exploitation", i.e. using the best learned strategy, becomes more common.
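A minimal sketch of this per-junction exploration scheme is given below, assuming a simple exponential-style decay of ε over the episode; all function and variable names are illustrative rather than taken from the invention.

```python
# Illustrative sketch: only one nominated junction may explore, and the chance
# of a random action at that junction decays from eps_start to eps_end over
# the course of the episode.
import random


def choose_actions(q_values, step, num_steps, exploring_junction,
                   eps_start=1.0, eps_end=0.05):
    """q_values: one list of predicted Q-values per junction."""
    frac = step / max(1, num_steps - 1)
    eps = eps_start * (eps_end / eps_start) ** frac     # decaying exploration temperature
    actions = []
    for j, q in enumerate(q_values):
        if j == exploring_junction and random.random() < eps:
            actions.append(random.randrange(len(q)))                 # explore
        else:
            actions.append(max(range(len(q)), key=q.__getitem__))    # exploit
    return actions


# Cycling the nominated junction episode-by-episode so each junction gets an
# equal opportunity for exploration.
num_junctions = 5
for episode in range(10):
    exploring_junction = episode % num_junctions
    # ... run the simulated episode, calling choose_actions() at each decision point
```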
Other approaches, such as Boltzmann exploration or noisy networks, may be used as well or instead. The object in all cases is to create, in the simulations, a set of transitions which can be usefully used to update the neural network and improve its predictive performance in a reasonable amount of time.
Each memory is a list of transitions. A transition consists of a (state, action(s), reward, next state) tuple. Note that in some embodiments a transition could include plural actions in the sense that signal changes may have been made at multiple junctions. However in many embodiments the action space is considered junction-by-junction as described in more detail below. The state, next state and reward are generally at the level of the whole road network. Each transition describes a situation an agent was faced with, the decision it took, the immediate reward it received, and the state that it then ended up in. The reward is calculated according to a reward function which may be defined according to objectives which the managers of the road network want to achieve. The reward function may take into account different goals with different weights. For example, a higher reward will be given when waiting times are lower, but when configuring the reward function a choice may be made, for example, to apply more weight to the waiting time for buses than for cars, to try to encourage use of public transport.
To update the network when it "learns" from its memories, for each transition the starting state is forward-propagated through the network. This generates, according to the current "knowledge" of the network, a vector of Q-values, or "expected maximum total future rewards", for each action which could be taken at each junction. A "ground truth" is then calculated by substituting into the vector the actual immediate reward for the actual action(s) taken in the transition, plus a weighted estimate of total future rewards in the next state associated with that transition. A loss function is then calculated and backpropagation can take place to update the network.
Preferably, a loss vector is calculated for a plurality of transitions, and the plural loss vectors are aggregated into a loss matrix. The loss matrix is then used to derive one or more scalar loss values and this value, or values, are used to update the network by backpropagation.
When calculating the "ground truth", an estimate of future rewards has to be made for the next state associated with the transition. A bootstrapping technique is used. The "next state" is forward-propagated through (a copy of) the network to obtain an estimate of the value of the next state, i.e. the estimated future rewards. This estimate of future rewards is then used together with the real observation (from the simulation) of the immediate reward of the action, to form the basis of the "ground truth" used to calculate the loss function. However, it has been found that using the current version of the network to estimate future rewards can lead to instability and over-estimation, and so in preferred embodiments an older version (i.e. prior to some update steps) of the network, rather than an exact copy of the network being updated, is used to calculate the expected value of the next state. This is found to lead to significantly better performance.
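The following sketch, which assumes PyTorch and reuses the branched network sketched earlier, shows how such a bootstrapped "ground truth" might be assembled; taking the maximum of the reference model's outputs for the acted junction, and the particular function names, are illustrative assumptions.

```python
# Illustrative sketch: the "ground truth" target substitutes, for the action
# actually taken, the observed immediate reward plus a discounted estimate of
# the next state's value taken from an older snapshot ("reference model").
import torch


def build_target(net, reference_net, state, next_state, junction, action,
                 reward, gamma=0.95):
    """net: network being trained; reference_net: an older snapshot of it."""
    with torch.no_grad():
        target = net(state).clone()            # current Q estimates, (batch, junctions, actions)
        next_q = reference_net(next_state)     # value estimate from the older snapshot
        bootstrap = next_q[:, junction, :].max(dim=1).values
        # Substitute only the element corresponding to the action actually taken.
        target[:, junction, action] = reward + gamma * bootstrap
    return target


# The snapshot may be refreshed every n update steps, for example:
#     if update_step % n == 0:
#         reference_net.load_state_dict(net.state_dict())
```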
The junctions preferably take actions asynchronously, i.e. there is no requirement that the traffic signals at every junction change at the same time. For this reason, in some embodiments transitions can contain a change of state for multiple junctions, but normally only for a subset of the junctions. In some embodiments the action space is considered junction-by-junction and therefore a transition contains exactly one action. When the neural network is updated, the weights on advantage subnetworks are only updated for the subnetworks associated with junction(s) which changed state in the relevant transition(s). This avoids backpropagating zero error vectors.
Where update processes can take place based on transitions having actions affecting varying numbers of junctions, the extent to which weights are updated in the shared state embedding subnetwork may be modulated according to the number of junction actions associated with the transitions used in the update. For example, a transition (or minibatch of transitions) in which the signals change at four junctions can be used to update the advantage subnetworks associated with those four junctions, and to update the shared state embedding subnetwork and global value subnetwork. The advantage subnetworks associated with junctions in which the traffic signals did not change in that transition (or minibatch) will not be updated at all. For a transition (or minibatch) in which the signals change at only one junction, again only the advantage subnetwork associated with that junction, and not any of the other advantage subnetworks, will be updated by backpropagation. The shared state embedding subnetwork and global value subnetwork may be updated as well, but the extent to which the update is allowed to affect the weights in those subnetworks may be reduced, compared with the update based on the transition / minibatch in which an action took place at multiple junctions.
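One possible way to realise this selective updating is sketched below, again assuming PyTorch and the attribute names of the earlier network sketch: the loss is computed only over junctions whose signals changed, so the other advantage branches receive no gradient, and the gradients reaching the shared and global value subnetworks are damped by an assumed factor (here the fraction of junctions that acted).

```python
# Illustrative sketch: update only acted branches; scale shared/global-value
# gradients by an assumed factor based on how many junctions acted.
import torch
import torch.nn.functional as F


def update(net, optimizer, states, targets, acted_junctions):
    """acted_junctions: indices of junctions whose signals changed in this minibatch."""
    q = net(states)                                         # (batch, junctions, actions)
    # Only acted junctions enter the loss, so other branches get zero gradient
    # and their weights are left untouched.
    loss = F.smooth_l1_loss(q[:, acted_junctions, :], targets[:, acted_junctions, :])

    scale = len(acted_junctions) / q.shape[1]               # assumed damping rule
    shared_params = list(net.shared.parameters()) + list(net.value.parameters())
    hooks = [p.register_hook(lambda g, s=scale: g * s) for p in shared_params]

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for h in hooks:
        h.remove()                                          # remove the temporary scaling hooks
    return loss.item()
```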
Update processes may take place only for a subset of the transitions collected as a result of simulation-based training. A key advantage of simulation-based training is the ability to collect large numbers of memories in a short period of time, including by parallelizing the exploration stage. However, the update stage cannot be parallelized to the same extent and is computationally intensive. Therefore, preferably a subset of transitions is chosen for use in updating the network. In some embodiments only a subset of the transitions associated with a particular run of an agent in a simulation may be used in the update stages. Hence parallelizing the exploration stage still has the advantage that the transitions which end up being used in updating come from wide exploration across a diversity of states, but the updating can complete in a reasonable amount of time.
It has been found that training the described dueling Q-network can be unstable.
During training, the agent's performance can improve gradually, then rapidly become worse, then start to improve gradually again. To combat this, multiple agents may be trained using different samples from the store of transitions stored from the simulations. In other words, the whole training process may be repeated multiple times (for example, about 10 times). The multiple trained agents can then be evaluated by testing their performance in the simulation, in order to choose a "best" agent. The best agent may, possibly subject to further tests for suitability, be deployed to control a junction.
Separately, within each training run, the agent being trained may be tested at regular intervals. At the end of the training run, the best of the intermediate agents may be chosen, rather than the agent produced by the final update. The effect is essentially to identify when "overtraining" starts to make the agent worse, and take a copy of the good agent before that happened.
Note that in a case where the machine learning agent is trained in a simulation and then used to control traffic, without being allowed to further learn from "real world" memories, the global value subnetwork can be discarded from the version of the agent which is deployed to control traffic. The global value subnetwork is not required to make decisions as to the next action according to the best strategy currently learned, but is used during training of the neural network to ensure that the overall (e.g. citywide) state of the roads is taken into account when updating weights.
Note also that discarding the global value subnetwork does not prevent memories being saved while the agent runs in the live traffic control system. These memories can still be used to train networks (complete networks which include the global value subnetwork) in the agent training system.
The input data provided to the neural network, at the input layer of the shared state embedding subnetwork, is envisaged to be engineered input data -for example, data indicating queue lengths and types of vehicles waiting at different lanes, etc. The input data may be generated by other neural networks, for example convolutional neural networks trained to recognise features in video feeds from cameras at junctions.
However, the design and training of such networks is outside the scope of this disclosure, and it is envisaged that this would be separate from, and potentially use very different techniques from, the reinforcement learning used to train the decision-making agent. The applicant's previous application WO2018051200 describes identification and tracking of objects in a video feed.
Examples of input data which may be provided include queue length, speed, time of day, blocked junction exits, rolling mean queue length, and time since a pedestrian push button was pushed.
BRIEF DESCRIPTION OF THE DRAWING
For a better understanding of the present invention, and to show more clearly how it may be carried into effect, reference will now be made by way of example only to the accompanying drawing, in which: Figure 1 shows an outline schematic of a neural network traffic control agent according to the invention.
DESCRIPTION OF PREFERRED EMBODIMENTS
Referring to Figure 1, the structure of a neural network traffic control agent, used in the invention, is shown.
The agent is used to control traffic signals at junctions. In a road network in an urban area, for example a town or a city, there will be a large number of junctions. Each junction has traffic signals and a traffic signal controller controls the signals at each junction. A "junction" is not defined exactly in terms of the underlying structure of a road network - there may be borderline cases where a particular set of traffic signals might be controlled as one single complex junction with one traffic signal controller, or alternatively might be controlled by more than one traffic signal controller as multiple, individually more straightforward, junctions. For these purposes therefore a "junction" means the part of the road network controlled by a single traffic signal controller. "City" is used as a shorthand to describe the extent of the wider road network, which comprises multiple junctions and is controlled by the described traffic control agent. Of course, the "city" may be a town, suburb, or any other area which has a road network comprising multiple junctions.
A traffic signal controller can control traffic signals at its junction independently and autonomously. Indeed, it is important that the traffic signal controllers remain able to do this, so that traffic signals at junctions continue to cycle through their stages in the event of a malfunction of, or loss of communication with, the traffic control agent. However, the traffic signal controllers accept external input from the traffic control agent. The input to a traffic signal controller is a requested stage of the traffic signals at the respective junction. A "stage" is defined by which green (go) signals are showing on which lanes coming into the junction. The traffic signal controller will apply rules in order to get to that requested stage, if it can. Typically, a traffic signal controller will accept a request to move to a particular stage if its rules allow it to go to the stage directly, or to go to a stage (referred to as a via stage) from which the requested stage can then be got to directly. Moving from one stage (defined by the green signals) to the next stage may take a period of time. For example, in the UK changing a signal from red to green involves showing red and amber for a few seconds. Also, in a particular junction it may be necessary for example to wait for a length of time after one signal has been changed to red, before another signal can be changed to green.
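Purely as an illustration of stage requests and via stages, the sketch below vets a requested stage against a made-up table of permitted direct moves; it is not a specification of any real traffic signal controller.

```python
# Assumed stage names and permitted direct moves, for illustration only.
DIRECT_MOVES = {
    "stage_1": {"stage_2", "stage_3"},
    "stage_2": {"stage_1"},
    "stage_3": {"stage_1", "stage_2"},
}


def request_accepted(current_stage, requested_stage):
    """Accept a request that is reachable directly, or through a single via stage."""
    if requested_stage in DIRECT_MOVES[current_stage]:
        return True                                   # direct move is allowed
    # Otherwise look for a via stage from which the requested stage is reachable.
    return any(requested_stage in DIRECT_MOVES[via]
               for via in DIRECT_MOVES[current_stage])
```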
Input data 10 to the traffic control agent is encoded into an input layer. The input data 10 defines the current state of the road network, in as much detail as possible. The input data includes information as to the current state of all traffic signals in the network, as well as information as to the current state of traffic - e.g. where there are queues of traffic, how long the queues are, what types of vehicles (cars, buses, vans, lorries etc.) are where, whether there are pedestrians waiting to cross the road at controlled crossings, and so on. Engineering of the input data is outside the scope of this disclosure, but various sensors and techniques will be familiar to the skilled person. In particular, some input data may come from cameras as described in WO2018051200.
The input data is processed by a shared state embedding subnetwork 12. This subnetwork is a neural network comprising one or more connected hidden layers. For example, there may be one or two hidden layers with a few thousand nodes in each layer. The width of the subnetwork, i.e. the number of nodes per layer, is expected to scale at worst about linearly with the number of junctions in the city.
The output layer of the shared state embedding subnetwork 12 may be thought of as being a representation of the state of the network (from the input layer) having been processed to recognise and emphasise pertinent features according to the learned weights in the shared state embedding subnetwork 12. The output layer of the shared state embedding subnetwork 12 is "copied" as the input to each one of the junction advantage subnetworks 14a, 14b, and as the input to the global value subnetwork 16.
The global value subnetwork 16 is a neural network comprising one or more connected hidden layers. Again, for example there may be one or two hidden layers with a few thousand nodes in each layer. Again the width of the subnetwork is expected to scale at worst linearly with the number of junctions in the city. The output layer of the global value subnetwork 16 may be thought of as a representation of the value of the current state, as represented by the input data 10. The value, in the context of this reinforcement learning system, is dependent on the maximum expected future reward available starting at this state. Since the reward function is defined according to traffic management goals such as reducing congestion, reducing pollution, and ensuring public transport services run on time, the "value" of a particular state may be directly connected with how "good" the traffic situation in the city currently is. The global value subnetwork 16, if trained successfully, will learn to accurately predict the expected future reward associated with states, and therefore how good a particular state is.
There is a junction advantage subnetwork 14a, 14b associated with each junction in the city road network. In the diagram, just two junction advantage subnetworks 14a, 14b are shown. However, in a typical embodiment in a city, for example, there could be fifty or more junctions, and the same number of junction advantage subnetworks 14. Although in most cases it will be obvious what constitutes a single junction in a road network, occasionally there may be complexities. For the avoidance of doubt, a junction is simply defined as the area controlled by a group of traffic signals which are controlled together and associated with one of the junction advantage layers 14. In some cases it may be a borderline decision as to whether to treat a particular group of signals as part of one junction or two -for example where there is a pedestrian crossing on a road in a position not far from a signal-controlled junction. This decision may be made as part of designing embodiments of the system.
Each junction advantage subnetwork is a neural network comprising one or more connected hidden layers. For example, there may be one or two hidden layers with about one or two thousand nodes in each layer.
The output layer of each junction advantage subnetwork 14a, 14b represents the expected advantage of each action which could be taken at that junction, given the current state according to input data 10.
The output of each junction advantage subnetwork 14a, 14b is aggregated with the output of the global value subnetwork 16. This is indicated in Figure 1 by the intersections at 18a, 18b. In the literature these are often referred to as "aggregation layers" and this convention is respected here, but they are not like the layers of a neural network because they have no learnable parameters. The aggregation layers 18a, 18b consistently and deterministically calculate the predicted value, i.e. expected future reward, for each action which could be taken at each junction. This includes a component of the estimated value now, in the current state (from the global value subnetwork 16) and a component of the estimated advantage of each action (from the junction advantage subnetworks 14). After aggregation, the output of the whole network is a vector of values associated with each action which could be taken, at the state represented by input data 10. These are known as the Q-values of the possible actions, in accordance with conventional notation in the literature.
It can be seen that, if the neural network is trained successfully, then from the vector of Q-values an intelligent agent can infer the best action(s) to take in the state represented by input data 10. E.g. the best expected future reward may be obtained by changing the traffic signals at one or more junctions, in accordance with the best Q-values. Indeed, if the neural network once trained does not need to be trained further, which could be the case in embodiments where agents once deployed do not learn anything further (but may be replaced at some point by new agents which have learned "offline"), then the global value subnetwork 16 could be omitted from the deployed agent. This is because the best Q-value will be the same as the best advantage value in the output of the junction advantage subnetworks, the output of the global value subnetwork essentially being a fixed offset applied to all advantage values in a particular state.
The neural network is trained by collecting "memories", or "transitions", in training. When the network is being trained, it is presented with scenarios in the form of input values 10. The network then calculates value, advantage, and hence Q-values and makes a decision as to what action to take, i.e. which traffic signals at which junctions will be changed. Once that decision has been taken, it is applied to the junctions (by changing the relevant traffic signals) and the result is observed. The result of the action is the next state of the traffic system. As a result of the action, there will also be a reward value calculated. This is done according to a reward function which may be tailored (and changed from time to time) depending on what policy objectives are being pursued by those managing the traffic network. For example, the reward function may be biased to heavily penalise late-running buses, but take into account to a lesser extent private cars being held in queues.
Each transition is a (state, action, reward, next state) tuple.
Once memories / transitions have been collected, they are used to update the weights in the neural network. To learn from a transition, first the state is forward-propagated through the network. This results in the Q-vector on the output. It can be seen that the global value subnetwork 16, which is not essential when an agent incorporating the network is deployed to make control decisions, is nevertheless an essential part of the trainable network, since it is required to produce a full vector of Q-values on the output. Then, the element(s) of the Q-vector corresponding to the action of the transition are substituted. Since an action in the transition could involve a change of one, two, or more different signals at different junctions, one, two or more element(s) of the Q-vector may be substituted. The substituted element(s) are substituted for a new Q value calculated as:

Q_new = reward + γ × Q_nextstate

where Q_nextstate is the expected value of the next state, and γ is a "discount factor" between 0 and 1, typically around 0.95. The discount factor accounts for uncertainty around the future reward.
The reward is directly obtained from the stored transition, having been calculated according to the reward function. This represents the immediate reward associated with the action taken. The expected value of the next state is obtained by forward-propagating the next state through the global value subnetwork and adjusting according to a discount function. The extent to which the expected value "looks into the future" can be tuned, and generally algorithms are weighted to put more emphasis on rewards which can be expected to be realised sooner. Options for ways of calculating the expected future reward will be known to the skilled person from the literature on Q-learning generally. To avoid overestimation, which is a characteristic problem of Q-learning, double Q-learning may be used, which again will be familiar in general to the skilled person. Double Q-learning involves using a different model (i.e. a different action selection policy) to calculate the expected value of the next state, from the current model. In this case, it has been found advantageous to use an earlier version of the neural network (i.e. a copy of an old version of the model, from before one or more updates were made to the weights). This reduces instability of the learning process, and makes it more likely that the model will consistently improve as it learns from new experiences. In one embodiment, the model used to calculate the expected value of the next state, which may be referred to as a "reference model", is held constant for n update steps. After n update steps, the reference model is replaced by the current model under training. This "snapshot" then remains in place as the reference model for a further n update steps.
In other embodiments, the reference model may be consistently m steps behind the current model, i.e. the reference model is replaced at every update step with an earlier version of the model from before it had the last m updates. In further possible embodiments, other sources of reference model could be used for an implementation of double Q-learning.
The Q-vector with the new elements substituted forms the "ground truth" from which a loss vector can be calculated. The loss can then be backpropagated through the network, to update the weights. In embodiments, a plurality of transitions may be forward propagated and substituted, and loss vectors calculated as described. A loss matrix is thereby created, wherein each column of the matrix corresponds to a loss vector arising from a single transition. From the loss matrix a scalar loss value can be determined which may then be used to update the weights in a single backpropagation.
The set of transitions used to derive a single loss matrix is referred to as a "minibatch".
According to known practice, when calculating the "expected future reward" future steps may be given decreasing weight, i.e. rewards expected to be realised sooner are worth more. A predetermined number of future steps may be included, with rewards further into the future being discounted at an exponentially decaying rate.
Given that most transitions will contain actions which directly change the traffic signals in only a small minority of the junctions in the city, for a given transition many of the junction advantage subnetworks 14 will not have any Q-values associated with them substituted before the loss function is calculated. Even in some minibatches of transitions, it is possible in some embodiments that some junctions will not see an action and therefore will not have Q-values substituted. To avoid backpropagating zero errors, these subnetworks 14 simply do not have their weights updated at all. Only junction advantage subnetworks 14 associated with junctions which took part in the actions included in the minibatch of transitions have their weights updated.
In some embodiments, the gradient of all updates in shared layers (i.e. shared between multiple junctions) may be reduced by a factor. The factor may be chosen according to the number of junctions directly affected by the relevant actions in the minibatch.
In some embodiments, the loss matrix as described may be used to derive a single scalar loss for backpropagation. In other embodiments, the loss matrix may be sliced "horizontally", i.e. a set of rows corresponding to one junction may be treated as a loss matrix associated with that particular junction. A scalar loss may be calculated for that junction, and then backpropagated through the shared layers and the subnetwork associated with that junction. Hence there may be either a single backpropagation update for a single loss, or a per-junction backpropagation update for a per-junction loss.
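A short sketch of the loss matrix idea follows, with assumed tensor shapes; both the single scalar reduction and the per-junction "horizontal" slicing are shown.

```python
# Illustrative sketch: one column per transition, one row per (junction, action)
# output element; reduce either to one scalar or to one scalar per junction.
import torch


def loss_matrix(q_batch, target_batch):
    """q_batch, target_batch: tensors of shape (batch, junctions, actions)."""
    batch, junctions, actions = q_batch.shape
    per_element = (q_batch - target_batch).pow(2)           # squared error per output element
    return per_element.reshape(batch, junctions * actions).t()


def scalar_loss(matrix):
    return matrix.mean()                                     # one scalar, one backpropagation


def per_junction_losses(matrix, junctions, actions):
    # Rows belonging to one junction form that junction's own loss block.
    return matrix.reshape(junctions, actions, -1).mean(dim=(1, 2))
```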
Defining transitions
A transition, as mentioned, is a (state, action, reward, next state) tuple. Conceptually, the "action" could involve changing the traffic signals at one, two, or more of the junctions in the city. An "action" could even be to do nothing at all. Indeed, given that a timestep may be for example less than one second (600ms has been found to be a useful interval in one embodiment), in many timesteps doing nothing may well be the best option, or even the only reasonably good option.
This leads to various difficulties in defining what a transition actually is. If the next state is simply defined as the state at the next timestep, then the very short time window is likely to hamper information from the reward function flowing into the updates, since the immediate rewards in every transition will be very small, reflecting that the state is not likely to change that much in a very short amount of time. This will result in slow or inadequate learning. Intuitively, a transition boundary could be defined when a positive action actually takes place, i.e. "do nothing" actions do not count, and the transition will run from the timestep when the positive action - a signal change - is decided upon to the timestep when the agent decides to take another positive action - another signal change. However, as the system scales to include more and more junctions, this strategy will tend to be about the same as simply defining a transition to be one timestep, since with enough junctions, at any particular timestep the traffic signals are probably changing at least somewhere in the city.
The problem is further complicated by the fact that in real embodiments, the agent is not likely to be in direct control of the traffic signals. This is in the sense that the traffic control agent's actions are to request that a particular signal changes to a particular stage at a particular time. This request is made to a traffic light controller at the relevant junction. The agent is essentially feeding into an external request input which is part of a reasonably standard traffic signal controller. The reason for this architecture is that traffic signals are safety critical systems and must be guaranteed to follow certain rules. Hence the traffic light controller will enforce various rules, for example, once the light is green it must remain green for at least a minimum period of time. A request which does not comply with these rules will just be ignored by the traffic light controller. In some embodiments it may be in theory possible for an agent to request an "illegal" action -although such actions are likely to be penalised in training, it is possible that one could still be requested. In other embodiments the agent is designed never to request an illegal action -the actions which the agent can choose from are masked to give the agent the option only to choose an action which will be accepted.
This "masking" of available actions may be achieved by extra layers which implement static logic, i.e. they are not in the "learnable" part of the neural network. If the agent is considered to include these extra layers then it is simply not capable of requesting an action which will not be accepted.
An "accepted action", which may be used to mark a transition boundary, may be 25 defined as any action which led directly to the requested traffic signal configuration. In some embodiments, the definition of "accepted action" may be extended to include actions which led to a via stage for the requested configuration.
An "accepted action" can be defined in some embodiments to include "do nothing" actions. In practice a "do nothing" action will usually be accepted by a traffic light controller, but possibly not always -a controller is likely to insist on a stage change after a maximum period has elapsed, and if the agent has not requested a stage change within that maximum period then the agent's request to "do nothing" will be overridden by the traffic light controller.
Likewise controllers will insist on remaining in a stage for a minimum period, and so after a stage change there will be a length of time in which the agent cannot affect the signals at all. After this minimum stage length the controller will accept a signal change action, but equally will accept a "do nothing" action until the maximum period has elapsed. Defining a transition as running from a positive signal change action up until a decision to remain at the current stage, at such time as the agent could change the stage if it wanted, is therefore a good option. This means counting "do nothing" actions as accepted actions, but only at timesteps where the controller would have accepted a different (positive change) action if one had been issued.
The time boundaries of transitions may be defined for example as:
* One timestep is one transition - this may work in some embodiments but the per-transition immediate rewards are likely to be low magnitude, resulting in a slow flow of information from the reward function into the update stage; or
* Every timestep with an accepted action at at least one junction represents the start of a transition - this may work well for smaller embodiments, but tends to become in practice very similar to "one timestep is one transition" as the number of junctions in the city increases.
Where the transition is defined in this way, the "action" at the timestep may be defined to include all traffic signal actions the agent tried to make, at all junctions, irrespective of whether they were accepted by the relevant traffic light controllers. Alternatively, rejected actions could be replaced by the last-accepted action at the relevant junction. Where the agent includes masking layers, the problem of rejected actions does not arise. Or:
* Each accepted action marks the beginning of a transition, but the transitions are defined on a per-junction basis. Hence multiple transitions generated by the agent controlling traffic in the whole city may temporally overlap with each other. This strategy has been found to be effective and avoids the problems of the above in larger systems (i.e. embodiments with large numbers of junctions being controlled).
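The sketch below illustrates per-junction transition building of this kind: each accepted action at a junction closes the previous transition for that junction and opens a new one, so transitions from different junctions can overlap in time. The class and method names are illustrative assumptions.

```python
# Illustrative sketch of per-junction (state, action, reward, next state) transitions.
from dataclasses import dataclass, field


@dataclass
class Transition:
    state: object
    action: int
    reward: float
    next_state: object


@dataclass
class JunctionTransitionBuilder:
    open_state: object = None
    open_action: int = None
    reward: float = 0.0
    completed: list = field(default_factory=list)

    def on_accepted_action(self, state, action):
        if self.open_state is not None:
            # The new accepted action closes the previously opened transition.
            self.completed.append(
                Transition(self.open_state, self.open_action, self.reward, state))
        self.open_state, self.open_action, self.reward = state, action, 0.0

    def on_timestep_reward(self, r):
        self.reward += r          # accumulate reward observed while the transition is open
```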
The reward associated with the transition is based on what happens in the traffic network after the action was made, according to the reward function which embodies policy objectives. The reward may be calculated as a result of what happens after the action until:
* The next timestep, or n timesteps into the future (for fixed n); or
* The next action; or
* The desired traffic signal stage is reached (this may be multiple timesteps later, and may include the time taken to go through a via stage for the desired stage); or
* The desired traffic signal stage, or a via stage for the desired stage, is reached (this includes only delays involved in a single transition to a new stage; this may still be multiple timesteps, for example to allow an amber signal to show for the requisite period of time).
The choice of time period used to calculate the reward does not have to be the same as the time period over which the transition occurs. However, care must be taken since the expected future reward of the next state of the transition is taken into account when calculating the loss vector. Therefore a reward period which lasts longer than the transition may result in double-counting of expected rewards and overestimation. This may not be a serious problem as long as the discount factor is chosen well.
A choice also needs to be made as to what geographical area is taken into account when calculating the immediate reward of the action. This could be:
* The whole of the city;
* All parts of the city between a junction and its neighbouring junctions;
* As above, but no further than x metres from the junction;
* Only the parts of the network which the real world sensors observe and which are between a junction and its neighbours; or
* Mainly the area surrounding a junction, but also including some component of reward for the wider region; this includes "weighted reward" schemes whereby the whole of the city may potentially be taken into account to some extent, but parts of the city closer to the junction associated with the action will influence the reward calculation more, and distant parts of the city will have relatively little impact.
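As an illustration of the last, "weighted reward", option, the sketch below applies an exponential decay with distance from the acting junction; the decay form and the length scale are assumptions chosen only for the example.

```python
# Illustrative sketch: distance-weighted reward around the acting junction.
import math


def weighted_reward(local_rewards, distances_m, length_scale_m=500.0):
    """local_rewards[i]: reward component measured on road segment i;
    distances_m[i]: distance of that segment from the acting junction, in metres."""
    weights = [math.exp(-d / length_scale_m) for d in distances_m]
    return sum(w * r for w, r in zip(weights, local_rewards)) / sum(weights)
```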
The reward function may be defined according to policy objectives. A typical reward function will seek to reward traffic travelling at reasonable speeds (but perhaps penalise dangerous speeding), penalise stopped time, especially long stop times which may lead to frustration, and penalise stop-start driving which is liable to lead to high levels of toxic emissions. Further examples of factors which may be taken into account in the reward function may be found in the applicant's previous application W02020225523.
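A toy reward function along these lines might look as follows; the individual weights are arbitrary assumptions, included only to show how several goals can be combined and re-weighted according to policy.

```python
# Illustrative sketch: reward flowing traffic, penalise waiting (buses weighted
# more heavily than cars) and penalise stop-start driving. Weights are assumed.
def reward(mean_speed_kmh, stopped_seconds_cars, stopped_seconds_buses, stop_start_events):
    r = 0.0
    r += min(mean_speed_kmh, 50.0) / 50.0      # reward reasonable speeds, capped at 50 km/h
    r -= 0.01 * stopped_seconds_cars           # penalise stopped time for cars
    r -= 0.03 * stopped_seconds_buses          # weight bus delay more heavily
    r -= 0.05 * stop_start_events              # penalise stop-start driving
    return r
```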
Creating and storing transitions
Transitions are created when an agent is allowed to choose actions, i.e. control traffic lights in a city. In most embodiments, this is done primarily in a simulation. Further details as to how agents may generate transitions / memories in a simulation are found in WO2020225523, which is incorporated herein by reference. In some embodiments, transitions may also be saved when the agent is deployed to control traffic signals in the real world, i.e. in a live traffic control system. This may be done irrespective of whether the agent in the live traffic control system goes through update stages.
The agent, when running (especially in a simulation), is allowed to explore the policy space. In other words, the agent does not always take its currently best-predicted action. At least sometimes, the agent may "explore" by taking an action which it does not predict will be the best action, in order to "find out what happens" and hence learn new information. However, taking completely random (and often very bad) actions is unlikely to result in good performance, since the traffic in the simulation will then be very badly controlled, and the states of the simulated road network will therefore be (hopefully) unrealistic. Therefore there needs to be a balance, when the agent is learning transitions in a simulation, between exploration and exploitation. The goal is for the agent to gain new knowledge, while still controlling traffic reasonably well.
In one embodiment, an episodic approach to training is used. I.e. an agent will be allowed to run in a particular simulation with particular starting conditions, until that simulation finishes after a number of time steps. The length of an episode may be for example about 20 minutes in "real time" (but since the training is in a simulation, the training may take place faster than that). Of course, multiple copies of the agent may run in multiple copies of the simulation, in parallel. In a particular episode, exploration, i.e. the potential for the agent to choose an action other than its predicted best action, may be enabled only for one junction. Good results have been found by cycling through the junctions episode-by-episode. In other embodiments, the junctions can be cycled through one decision point at a time. In yet other embodiments, a junction may be chosen at random for exploration at each decision point.
Within this category of approaches, an "exploration temperature" ε may be defined. Where exploration is allowed, the chance of taking a random action is given by ε. Otherwise, the (expected) best action is chosen - with probability (1 - ε). The value of ε may be decayed, for example exponentially, during the training episode. Therefore as the episode progresses the actions of the agent become in general less random and more likely to be the expected best action.
In other embodiments, Boltzmann exploration may be used: each time the agent is asked to make a decision, it uses the relative difference in expected reward between the available options, along with another temperature parameter, to choose a weighted random action. In other words, the agent could take any action but will be more likely to take better actions. At the beginning of the episode the agent may be for example twice as likely to choose to take an action which is predicted to be twice as good. The temperature parameter, which is increased throughout the training episode, means that it is even more likely to choose the "best" action towards the end of the episode.
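A minimal sketch of Boltzmann (softmax) action selection follows; it uses the usual softmax convention in which a lower temperature makes the choice greedier, and is an illustration of the general idea rather than the exact scheme used.

```python
# Illustrative sketch: sample an action with probability weighted by its
# predicted Q-value; annealing the temperature makes later choices greedier.
import math
import random


def boltzmann_action(q_values, temperature):
    m = max(q_values)                                     # subtract max for numerical stability
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    r = random.random() * total
    cumulative = 0.0
    for action, e in enumerate(exps):
        cumulative += e
        if r <= cumulative:
            return action
    return len(q_values) - 1
```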
In yet further embodiments, a "noisy nets" approach may be used. This involves applying noise functions to the parameters of the neural network during the action selection phase of the training. Hence the predicted Q values have a random aspect. The agent then picks the action relating to the "best" noisy Q value, but that may not always be the same as the action which would have had the best Q value had noise not been applied. The noise functions consist of parameters - for example a Gaussian noise function has mean and variance parameters. These parameters themselves are part of the neural network model and hence are included in the backpropagation in the network update phase. In theory the amount of noise will naturally reduce over update iterations.
In some embodiments, an episode may be terminated if particular conditions are met. The conditions for termination are chosen to indicate essentially that the traffic conditions have become too bad -e.g. congestion is too great. This may happen quite often at the early stages of training, before the agent has really started to learn a good strategy. Although the agent needs to be trained in "difficult" as well as in "easy" scenarios, generally better information will be yielded where agents are trained in scenarios in which they can broadly "succeed". Transitions yielding useful information will be generated from an already fairly well-trained agent in a difficult scenario, but less useful information is embodied in transitions from an untrained agent making essentially random decisions in a scenario where the traffic situation has already been allowed to become hopeless. Hence early termination of episodes helps to improve the efficiency of the training process.
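By way of illustration, an early-termination condition might be expressed as a simple threshold on a congestion measure, as in the sketch below; both the measure (total queue length) and the threshold value are assumptions made for the example only.

```python
def should_terminate_episode(queue_lengths, max_total_queue=500):
    """Terminate the episode early if simulated congestion becomes hopeless.

    queue_lengths: iterable of current queue lengths (in vehicles), one per
                   monitored approach in the simulated network (illustrative).
    max_total_queue: illustrative threshold beyond which further transitions
                     are judged unlikely to be informative.
    """
    return sum(queue_lengths) > max_total_queue
```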
In some embodiments, agents are not allowed to take actions at every stage of the simulation. For example, the simulation may be stepped, for example, 3 or 4 timesteps before the agent is invited to choose an action. This is found to have beneficial effects: it reduces the number of "stage extension" / "do nothing" actions in the transition database, creating a more balanced set of memories for the agent to learn from.
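A sketch of this decision-interval behaviour is given below; the sim and agent interfaces (reset, step, apply, act) are placeholders assumed for illustration and are not drawn from the description above.

```python
def run_episode(sim, agent, decision_interval=4, max_steps=2400):
    """Advance the simulation, asking the agent for an action only every few steps."""
    state = sim.reset()
    for step in range(max_steps):
        if step % decision_interval == 0:
            # Only every `decision_interval` timesteps is the agent invited to act.
            action = agent.act(state)
            sim.apply(action)
        # The simulation advances one timestep regardless of whether an action
        # was taken at this step.
        state = sim.step()
```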
As transitions are generated, they are saved. The transitions may be saved in a database or any other suitable data structure. Transitions may then be sampled from the database in update stages. Generation of the transitions and their use for updating are asynchronous, i.e. transitions do not have to be used in the order that they are generated, some transitions may never be used at all, and the transitions being used to update an agent were not necessarily generated by the latest version of the agent.
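For illustration, one possible shape of a stored transition and of the transition database (here just an in-memory list) is sketched below; the field names are assumptions.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Transition:
    """One stored experience: what was observed, what was done, what followed."""
    state: Any
    action: Any
    reward: float
    next_state: Any
    done: bool

transition_db: List[Transition] = []

def record(state, action, reward, next_state, done):
    # Transitions are simply appended as they are generated; they may later be
    # sampled in any order (or never) by the asynchronous update process.
    transition_db.append(Transition(state, action, reward, next_state, done))
```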
It is commonplace in reinforcement learning generally to continually discard the oldest transitions. However, in embodiments, options include:
* Storing every transition in a suitable database, so that the amount of data keeps growing;
* Keeping only the n most recent transitions and discarding transitions older than that, so that the database acts as a buffer;
* Randomly deleting transitions to maintain the database at roughly constant size, so that the average age of transitions in the database increases; or
* Prioritising transitions based on some evaluation of their potential merit, and removing the least valuable transitions.
The merit of a transition within the database can be evaluated in various ways. For a particular model, a transition with a low magnitude loss vector will not result in much learning and therefore may be considered low value. However, it does not necessarily follow that the transition will always be of low value in this sense, for future versions of the model. Model-independent measures of the value of transitions include attaching large value to transitions with (state, next state) pairs which are unusual - intuitively these transitions may relate to actions which have had an unexpected / surprising result and therefore contain new information from which the model can learn. Weighting transitions by total reward is also an option - large positive or large negative rewards associated with transitions indicate transitions which have had a large good or bad effect, and therefore contain useful information. Preferably, the merit of a particular transition is assessed in the context of its role in the whole database of transitions - in particular it is likely to be desirable to maintain a diverse set of transitions in the database.
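The retention options listed above might be sketched as follows; the buffer is a plain list and the merit function is deliberately left abstract, since the description allows several measures. The strategy names and threshold are illustrative assumptions.

```python
import random

def trim_buffer(buffer, strategy, max_size=100_000, merit=None):
    """Apply one of several retention strategies to a transition buffer.

    strategy: "grow"   - keep everything (no trimming);
              "recent" - keep only the max_size most recent transitions;
              "random" - keep a random subset so the size stays roughly constant;
              "merit"  - drop the transitions judged least valuable by `merit`.
    """
    if strategy == "grow" or len(buffer) <= max_size:
        return buffer
    if strategy == "recent":
        return buffer[-max_size:]
    if strategy == "random":
        # Note: chronological order is not preserved by this simple sketch.
        return random.sample(buffer, max_size)
    if strategy == "merit":
        # `merit` maps a transition to a score, e.g. based on reward magnitude
        # or how unusual its (state, next state) pair is.
        return sorted(buffer, key=merit, reverse=True)[:max_size]
    raise ValueError(f"unknown strategy: {strategy}")
```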
Transitions are used in "batches" and "minibatches". A batch is defined as a set of transitions in the database which is used to update the agent from one version, which was used to create transitions in the simulation, to a new version, which is used to create more transitions in the simulation. A batch update may involve a plurality of loss matrices being calculated, and a loss being backpropagated for each (in some embodiments, there are multiple backpropagations per loss matrix, where the matrix is horizontally sliced and a scalar loss calculated per-junction). Hence a "minibatch" is a subset of a "batch" and is the set of transitions which is used to construct a single loss matrix.
For a particular update, a batch of transitions is sampled from the transition database. The transitions could be sampled randomly, with or without replacement in different embodiments. Alternatively, the sampling could be prioritised based on evaluation of expected merit. Again this prioritised sampling could be done with or without replacement. Evaluations of merit at this stage may be done according to the current model (for example, prioritising transitions which will result in high magnitude loss vectors) or using model-independent measures as discussed above.
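As an illustrative sketch, batch sampling with optional merit-based prioritisation (with or without replacement) might look like the following; the proportional weighting scheme and the function names are assumptions, and merit scores are assumed non-negative.

```python
import random

def sample_batch(buffer, batch_size, merit=None, with_replacement=True):
    """Sample a batch of transitions, uniformly or weighted by estimated merit."""
    if merit is None:
        # Uniform sampling, with or without replacement.
        return (random.choices(buffer, k=batch_size) if with_replacement
                else random.sample(buffer, batch_size))
    weights = [merit(t) for t in buffer]  # non-negative merit scores assumed
    if with_replacement:
        return random.choices(buffer, weights=weights, k=batch_size)
    # Without replacement: repeatedly draw and remove (adequate for a sketch).
    pool, pool_weights, batch = list(buffer), list(weights), []
    for _ in range(batch_size):
        idx = random.choices(range(len(pool)), weights=pool_weights, k=1)[0]
        batch.append(pool.pop(idx))
        pool_weights.pop(idx)
    return batch
```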
In an example embodiment, a batch may contain for example 5000-10000 transitions, and is split into multiple minibatches each containing anything from a single transition to a few hundred. Some splitting of batches into minibatches in this way is found to be preferable, but in other embodiments at the extremes, a batch could contain just one minibatch, and hence a single loss matrix will be calculated per update, or a batch could contain as many minibatches as there are transitions, i.e. one transition per minibatch. In that case a loss matrix (of one column) would be generated for every transition.
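For illustration, splitting a sampled batch into minibatches of a chosen size might be as simple as the following sketch; the commented usage assumes a hypothetical agent.update_on_minibatch method.

```python
def split_into_minibatches(batch, minibatch_size):
    """Split a sampled batch into consecutive minibatches of the given size."""
    return [batch[i:i + minibatch_size] for i in range(0, len(batch), minibatch_size)]

# Illustrative use: each minibatch gives rise to one loss matrix and one
# (or, per junction, several) backpropagation passes.
# for minibatch in split_into_minibatches(batch, 256):
#     agent.update_on_minibatch(minibatch)
```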
The embodiments described above are provided by way of example only, and various changes and modifications will be apparent to persons skilled in the art without departing from the scope of the present invention as defined by the appended claims.

Claims (24)

1. A traffic control system for use in controlling a road network comprising multiple junctions, the traffic control system comprising: a plurality of sensors for monitoring vehicles and/or other road users at and around each junction; a traffic control agent subsystem; and traffic signals including signal outputs for controlling the vehicles and/or other road users at each junction, the sensors providing inputs to the traffic control agent subsystem, and the traffic control agent subsystem controlling the traffic signals to optimise traffic flow in accordance with one or more goals, in which the traffic control agent subsystem includes a machine learning agent trained by reinforcement learning, the machine learning agent comprising a neural network including: a shared state embedding subnetwork comprising an input layer, one or more connected hidden layers and a shared state output layer; a global value subnetwork comprising an input layer connected to the shared state output layer, one or more connected hidden layers, and a global value output layer; for each junction, an advantage subnetwork comprising an input layer connected to the shared state output layer, a plurality of connected hidden layers, and a junction advantage output layer; an aggregation layer for combining the global value output layer with each of the junction advantage layers to output a junction value layer.
2. A traffic control system as claimed in claim 1, further comprising a road network simulation model, the machine learning agent being trained by reinforcement learning by making traffic control decisions in the road network simulation model.
3. A traffic control system as claimed in claim 2, in which a plurality of different agents are trained in the road network simulation model, and in which the performance of trained agents is measured, and in which the machine learning agent in the traffic control agent subsystem is replaced with a better performing agent, when a better performing agent becomes available.
4. A traffic control system as claimed in claim 2 or claim 3, in which the road network simulation model runs faster than real time.
5. A traffic control system as claimed in any of claims 2 to 4, in which multiple copies of the same agent are trained in multiple copies of the road network simulation model, in parallel.
6. A traffic control system as claimed in any of the preceding claims in which an agent being trained takes either an exploitation action according to its current learned strategy, or an exploration action which includes a random aspect, according to a decision process in which the possibility of an exploration action is enabled for only one junction at a time.
7. A traffic control system as claimed in claim 6, in which the junction for which the possibility of an exploration action is enabled is changed each time the agent is able to make a decision.
8. A traffic control system as claimed in claim 6, in which the junction for which the possibility of an exploration action is enabled remains the same for the duration of an episode during which the traffic control agent is running in a particular simulation scenario.
9. A traffic control system as claimed in any of the preceding claims, in which the machine learning agent is trained by Q-learning.
10. A traffic control system as claimed in claim 9, in which a ground truth is calculated during updating of the machine learning agent by estimating the expected future reward associated with an observed action.
11. A traffic control system as claimed in claim 10, in which the expected future reward is estimated using a bootstrapping technique, using an earlier version of the machine learning agent.
12. A traffic control system as claimed in any of the preceding claims, wherein training the machine learning agent includes maintaining a database of transitions, and sampling a subset of the transitions for use in updating the agent.
13. A method of controlling traffic signals at a plurality of junctions in a city, using a plurality of sensors for monitoring vehicles and/or other road users at and around each junction, the method comprising: using a traffic control agent subsystem to send traffic signal stage requests to traffic signal controllers controlling the traffic signals at each junction, the traffic control agent subsystem receiving input data from the sensors, and the traffic control agent subsystem including a machine learning agent trained by reinforcement learning, the machine learning agent comprising a neural network including: a shared state embedding subnetwork comprising an input layer, one or more connected hidden layers and a shared state output layer; a global value subnetwork comprising an input layer connected to the shared state output layer, one or more connected hidden layers, and a global value output layer; for each junction, an advantage subnetwork comprising an input layer connected to the shared state output layer, a plurality of connected hidden layers, and a junction advantage output layer; an aggregation layer for combining the global value output layer with each of the junction advantage layers to output a junction value layer.
14. A method as claimed in claim 13, including training the machine learning agent by reinforcement learning by use of a road network simulation model, the machine learning agent being updated according to actions taken and observed results and rewards in the simulation model.
15. A method as claimed in claim 14, including training a plurality of machine learning agents by reinforcement learning by use of a road network simulation model, each machine learning agent being updated according to actions taken and observed results and rewards in the simulation model, measuring the performance of each of the plurality of trained agents, and selecting the best performing agent for deployment in the traffic control agent subsystem to control the junctions in the city.
16. A method as claimed in claim 15, including training the plurality of machine learning agents in parallel with each other.
17. A method as claimed in any of claims 13 to 16, in which an agent being trained takes either an exploitation action according to its current learned strategy, or an exploration action which includes a random aspect, according to a decision process in which the possibility of an exploration action is enabled for only one junction at a time.
18. A method as claimed in claim 17, in which the junction for which the possibility of an exploration action is enabled is changed each time the agent is able to make a decision.
19. A method as claimed in any of claims 13 to 18, including training the machine learning agent using a Q-learning algorithm.
20. A method as claimed in claim 19, including calculating a ground truth during updating of the machine learning agent by estimating the expected future reward associated with an observed action.
21. A method as claimed in claim 20, including estimating the expected future reward associated with an observed action by using a bootstrapping technique using an earlier version of the machine learning agent.
22. A method as claimed in any of claims 13 to 21, including maintaining a database of transitions and sampling a subset of the transitions for use in updating the agent.
23. A computer readable medium containing instructions which, when executed on a processor, implement a machine learning agent for use in the method of any of claims 13 to 22.
24. Hardware adapted to implement a machine learning agent for use in the method of any of claims 13 to 22.