CN111967712B

CN111967712B - Traffic risk prediction method based on complex network theory

Info

Publication number: CN111967712B
Application number: CN202010649490.3A
Authority: CN
Inventors: 李大庆; 郑参
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2020-07-08
Filing date: 2020-07-08
Publication date: 2023-04-07
Anticipated expiration: 2040-07-08
Also published as: CN111967712A

Abstract

The invention provides a traffic risk prediction method based on complex network theory, comprising the following steps: step A: dividing grids based on empirical data and establishing a double-layer traffic network model; step B: extracting and screening features based on complex network theory; step C: predicting risk based on ensemble learning theory; step D: evaluating and verifying the model. Through these steps, both the functional and the structural dimensions of the traffic system are considered, providing technical and theoretical support for identifying traffic risks and supporting risk diagnosis of the traffic system, the formulation of targeted management and control measures, and the improvement of traffic operation reliability. The method is systematic, highly portable and easy to operate, and addresses the difficulty of identifying and predicting risks in a complex traffic system.

Description

Traffic risk prediction method based on complex network theory

Technical Field

The invention provides a traffic risk prediction method based on a complex network theory, and relates to the technical fields of risk analysis, network science and the like.

Background

Risk refers to the possibility that an event will occur which, if it occurs, can impede the development of a system or even cause it to fail; it is also defined as the uncertainty of whether an event will occur. Risk exists objectively in a system: the losses it causes can be prevented or reduced by precautionary measures, but the risk itself cannot be eliminated. In a complex system, risks tend to arise suddenly, spread widely and cause severe damage, which makes system risk difficult to identify, predict and prevent and poses new challenges for research on risk management and control. Because the losses caused by system risk can strongly affect people's lives and even the operation of society, it is necessary to predict risk in complex systems accurately with scientific and reasonable methods. The traffic system plays an important role in travel and urban operation, and with the rapid development of mobile internet and in-vehicle technology in recent years it has become highly complex in both structure and function. In a complex and changing environment and under changing demand, the traffic system faces man-made and natural risk events such as traffic accidents, road closures for construction, rainstorms and snow disasters. Such events often cause traffic congestion, and because the traffic system evolves in space and time, a risk event can spread through the system after it occurs, adding substantial extra cost to residents' travel and wasting social resources.

In current research on risk identification and prediction for traffic systems, the main approaches include model-based analysis, both qualitative and quantitative. In particular, the structure and function of the system are described with a Process Flow Diagram (PFD) and grey correlation analysis, and risk is identified and predicted by analyzing and quantifying how system deviations arise and how strongly the influencing factors are correlated. In addition, with the arrival of the big-data era and its technologies, knowledge-based analysis methods have been developed, mainly causal models, machine learning models and deep learning models. These rely on empirical data generated by the traffic system, such as traffic flow and vehicle speed, and apply a model to a historical data set to uncover unknown relations and patterns in the data, thereby identifying and predicting the risk state of the traffic system. These methods predict traffic system risk only from the state of the system using known models and data; they do not dynamically consider, at the network level, the correlations among risks in the traffic system and the way they evolve, and they have difficulty explaining the internal mechanism by which traffic system risk forms.
Therefore, for traffic systems with high structural and functional complexity, the invention combines complex network theory with machine learning to identify and predict traffic system risk. It provides a new perspective and a new method for research on risk identification, prediction, management and control in traffic systems, enriches the understanding of risk in traffic systems, and is of great significance for ensuring their healthy and stable operation.

Disclosure of Invention

(I) Objects of the invention

The invention mainly addresses risk identification and prediction in the context of complex systems and network structures. Conventional methods analyze traffic system risk mainly from the system's function and cannot identify and predict that risk well. In view of the high complexity and spatio-temporal evolution of traffic systems, the invention takes the complex network perspective, considers both the functional and the structural dimensions of the traffic system, and provides a traffic risk prediction method based on complex network theory. The method can effectively identify and predict traffic system risk and supports risk diagnosis of the traffic system, the formulation of targeted management and control measures, and the improvement of traffic operation reliability.

(II) technical scheme

In order to achieve the purpose, the method adopts the technical scheme that: a traffic risk prediction method based on a complex network theory is provided.

The invention relates to a traffic risk prediction method based on a complex network theory, which comprises the following steps:

step A: dividing grids based on empirical data and constructing a double-layer traffic network model;

step B: extracting and screening features based on complex network theory;

step C: predicting risk based on ensemble learning theory;

step D: evaluating and verifying the model.

Through the steps, the purpose of risk prediction of the traffic system can be achieved, the method is strong in systematicness, high in transportability and easy to operate, and the problem that risks in a complex traffic system are difficult to identify and predict is solved.

Wherein the "establishing the double-layer traffic network model based on empirical data division grids" of step A is performed as follows: first, basic road information for the study area is acquired, consisting mainly of two parts, the road information of the traffic network and the longitude and latitude of road intersections; according to the area of the study region and the longitude and latitude of road sections and intersections, the region is divided into N × M grid areas, which are labeled. Secondly, at the microscopic level, for each grid area a grid traffic congestion network model is constructed from actual traffic data using complex network theory and methods, with the intersections in the grid as nodes, the road sections as edges and the relative speed of each road section as the edge weight. At the macroscopic level, each grid area is taken as a node, the existence of congested roads between two grids is the criterion for connecting them with an edge, and the number of congested roads between the grids is the edge weight; a grid node traffic network model is constructed using complex network theory and methods. The specific steps are as follows:

step A1: dividing grid areas based on geographic information;

step A2: preprocessing speed data to obtain the relative speed matrix;

step A3: constructing the grid traffic congestion network model $G_1(N_1, L_1)$;

step A4: constructing the grid node traffic network model $G_2(N_2, L_2)$;

Wherein the "dividing grid areas based on geographic information" of step A1 is performed as follows: first, the traffic network model and road information needed for dividing grid areas are extracted from geographic information system (MapInfo) files using Python. The extracted information mainly comprises the vehicle speed on each road at each time, the longitude and latitude of intersections, and the topology of the traffic network under study. To obtain intersection coordinates, the invention uses Python to call the Baidu Maps application programming interface (API) and traverses the intersections sequentially, matching the road network topology with intersection names to obtain the longitude and latitude of each intersection. Roads for which the lookup fails because intersection names differ between Baidu Maps and MapInfo are handled separately, so that an accurate and standardized longitude and latitude data set of the road network is obtained. Secondly, the area S and the longitude and latitude ranges of the study region are calculated from the road information and intersection coordinates, and the number of grids N × M is determined reasonably according to the actual conditions of the study region, so that each grid has area S/(N × M). Finally, for the divided grid areas, the intersections falling in each grid are determined from their longitude and latitude and recorded;
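The final part of step A1, assigning intersections to grid cells by longitude and latitude, can be sketched roughly as follows. This is a minimal illustration under assumptions that are not in the source: the input format (a dictionary of intersection id to latitude and longitude), the function name `assign_to_grid` and the equal-width split of the coordinate range are all hypothetical.

```python
from collections import defaultdict


def assign_to_grid(points, n_rows, n_cols):
    """Map each intersection to a (row, col) cell of an n_rows x n_cols grid.

    points: dict of intersection id -> (latitude, longitude); this input
    format is an assumption made for the sketch.
    """
    lats = [lat for lat, _ in points.values()]
    lons = [lon for _, lon in points.values()]
    lat_min, lon_min = min(lats), min(lons)
    # Cell height and width; fall back to 1.0 if all points share a coordinate.
    lat_step = (max(lats) - lat_min) / n_rows or 1.0
    lon_step = (max(lons) - lon_min) / n_cols or 1.0

    cells = defaultdict(list)
    for node, (lat, lon) in points.items():
        # Points on the upper boundary are clamped into the last row/column.
        row = min(int((lat - lat_min) / lat_step), n_rows - 1)
        col = min(int((lon - lon_min) / lon_step), n_cols - 1)
        cells[(row, col)].append(node)
    return dict(cells)
```

Each key of the returned dictionary is one labeled grid area, and its value is the list of intersections recorded for that grid.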

Wherein the "preprocessing speed data to obtain the relative speed matrix" of step A2 is performed as follows: first, actual traffic operation data are acquired from vehicle-mounted Global Positioning System (GPS) devices. At any time $t_i$, the speeds of all $R$ roads are expressed, in road order, as a vector $V_i = (v_1, v_2, \dots, v_R)$. This is repeated for all $T$ times, and the speed vectors $V_i$ at all times are stacked to generate the initial speed matrix

$$V = \begin{pmatrix} V_1 \\ V_2 \\ \vdots \\ V_T \end{pmatrix} = \begin{pmatrix} v_1(t_1) & v_2(t_1) & \cdots & v_R(t_1) \\ v_1(t_2) & v_2(t_2) & \cdots & v_R(t_2) \\ \vdots & \vdots & \ddots & \vdots \\ v_1(t_T) & v_2(t_T) & \cdots & v_R(t_T) \end{pmatrix}$$

Secondly, when the speed information of the traffic system is collected with floating-car technology, network communication limits and human and natural factors mean that complete speed records cannot be guaranteed for every area at every time. The original speed matrix $V$ therefore contains missing values (recorded as 0), and speed compensation is required: the missing values, i.e. the elements equal to 0, are located in $V$ and compensated. For a road $R_j$ lacking a speed at time $t_i$, the set of neighbor roads $\Gamma_j$ of $R_j$ is first found in the road network $G(N, L)$, and the roads in the set are checked for speed records at that time. If at least one road in the set has a speed record, the mean of the recorded speeds is taken:

$$\hat{v}_j(t_i) = \frac{1}{J} \sum_{R_k \in \Gamma_j,\; v_k(t_i) \neq 0} v_k(t_i)$$

In the above formula, $\hat{v}_j(t_i)$ is the speed compensation value of the speed-missing road $R_j$ at time $t_i$, the sum runs over the non-zero elements of the neighbor road set $\Gamma_j$ of $R_j$, and $J$ is the number of non-zero elements in that set;

if none of the neighbor roads of $R_j$ has a speed record, the speed of $R_j$ is left at 0 for this pass. After each pass the original speed matrix is updated to the compensated matrix, and the process is repeated at each time until all 0 values in the speed matrix have been compensated, giving the completed speed matrix $V'$.

After road speed compensation of the original absolute speed matrix is completed, because roads of different grades have different speed levels, the compensated speed matrix is normalized to relative speeds so that a unified judgment standard applies. For any road $R_j$, its speed vector over all times $\big(v'_j(t_1), v'_j(t_2), \dots, v'_j(t_T)\big)$ is taken from $V'$, the maximum speed limit $v_j^{\max}$ of the road is extracted, and each speed is divided by the maximum speed limit to obtain the normalized speed

$$v_{j\_ratio}(t_i) = \frac{v'_j(t_i)}{v_j^{\max}}$$

which gives the normalized (relative) speed matrix $V^{ratio}$ whose element in row $i$ and column $j$ is $v_{j\_ratio}(t_i)$;
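The compensation and normalization of step A2 can be sketched for a single time step as follows. This is a minimal illustration, not the patented implementation: the dictionary-based data layout and the names `compensate` and `normalize` are assumptions, and roads whose neighbors never obtain a speed simply keep the value 0.

```python
def compensate(speeds, neighbors):
    """Fill missing speeds (recorded as 0) at one time step.

    speeds: dict road id -> speed, with 0 meaning "no record" (assumed layout).
    neighbors: dict road id -> list of neighbouring road ids.
    Each pass replaces a missing speed by the mean of the non-zero speeds of
    its neighbours; passes repeat until no further value can be filled.
    """
    filled = dict(speeds)
    while True:
        updates = {}
        for road, v in filled.items():
            if v != 0:
                continue
            known = [filled[nb] for nb in neighbors.get(road, [])
                     if filled.get(nb, 0) != 0]
            if known:  # J = len(known) non-zero neighbour speeds
                updates[road] = sum(known) / len(known)
        if not updates:
            return filled
        filled.update(updates)  # the matrix row is updated after each pass


def normalize(speeds, speed_limits):
    """Relative speed = compensated speed / maximum speed limit of the road."""
    return {road: v / speed_limits[road] for road, v in speeds.items()}
```

Applying `compensate` and then `normalize` to every row of the speed matrix gives one row of relative speeds per time step.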

Wherein the "constructing the grid traffic congestion network model $G_1(N_1, L_1)$" of step A3 is performed as follows: for each grid area divided in step A1, the structural information among roads and the road intersection information contained in the grid area are first extracted from actual map data using software tools such as Python and MapInfo. Secondly, a suitable geographic coverage is selected according to the needs of the study, for example the traffic network within Beijing's Fifth Ring Road. Then, following the complex network approach, each road intersection in a grid area is abstracted as a node, each road in the grid area as an edge between nodes, and the relative speed of each road is taken as the edge weight, establishing a grid traffic congestion network in each grid area. Because most roads in the traffic network carry traffic in two directions and are therefore directional, the grid traffic congestion network constructed by the invention is a directed weighted network;
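A directed weighted network for one grid area, as described in step A3, can be represented with a plain adjacency dictionary. The sketch below is illustrative only; the input layout (road id mapped to a start and end intersection) and the name `build_grid_network` are assumptions.

```python
def build_grid_network(roads, rel_speed, grid_nodes):
    """Build the directed weighted network of one grid area.

    roads: dict road id -> (start intersection, end intersection).
    rel_speed: dict road id -> relative speed at the chosen time.
    grid_nodes: intersections that fall inside this grid area.
    Returns adjacency[u][v] = relative speed of the road from u to v.
    """
    nodes = set(grid_nodes)
    adjacency = {node: {} for node in nodes}
    for road, (u, v) in roads.items():
        # Keep only roads whose two endpoints both lie inside the grid.
        if u in nodes and v in nodes:
            adjacency[u][v] = rel_speed[road]
    return adjacency
```

Because the two directions of a road are stored as separate entries, opposite directions can carry different relative speeds.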

Wherein the "constructing the grid node traffic network model $G_2(N_2, L_2)$" of step A4 is performed as follows: first, an inter-grid traffic network model is constructed from the intersection information contained in each grid area and the road topology of the traffic network of the whole study area (the whole network), that is, the road topology contained inside the grid areas is deleted from the whole network so that only roads between grids remain. Secondly, the number of congested roads between each pair of grid areas is counted and recorded. Finally, using complex network theory and methods, each grid area is abstracted as a node, the existence of congested roads between two grids determines whether they are connected by an edge, and the number of congested roads between the grids is taken as the edge weight, establishing the grid node traffic network model.
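The counting in step A4 can be sketched as below. Two points are assumptions of this sketch rather than statements of the source: a road counts as congested when its relative speed is at most a threshold `q`, and the grid node network is treated as undirected, so a pair of grids is stored under a sorted key.

```python
from collections import Counter


def build_grid_node_network(roads, rel_speed, node_cell, q):
    """Count congested roads between grid areas.

    roads: dict road id -> (start intersection, end intersection).
    rel_speed: dict road id -> relative speed.
    node_cell: dict intersection -> grid label, e.g. a (row, col) tuple.
    q: congestion threshold on relative speed (assumed criterion).
    Returns dict {(cell_a, cell_b): number of congested roads between them}.
    """
    weights = Counter()
    for road, (u, v) in roads.items():
        cell_u, cell_v = node_cell[u], node_cell[v]
        # Roads inside a single grid belong to the micro-level network only.
        if cell_u != cell_v and rel_speed[road] <= q:
            weights[tuple(sorted((cell_u, cell_v)))] += 1
    return dict(weights)
```

Only pairs of grids with at least one congested road between them appear in the result, which matches the rule that an edge exists only when congested roads exist between two grids.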

Wherein the "extracting and screening features based on complex network theory" of step B is performed as follows: first, for the grid traffic congestion network and the grid node traffic network (together referred to as the double-layer traffic network) at each time $t_i$, a percolation (seepage) threshold q(t) is determined through percolation analysis of the double-layer traffic network. Secondly, for each grid traffic congestion network and each node (grid) of the grid node traffic network under the threshold q(t) at each time, the features of each grid area are extracted using complex network theory and methods. These structural and functional features include the maximum congestion cluster, the mean node betweenness, the mean node degree, the average speed of the grid congestion network and the number of congested roads to first-order neighbors. The extracted features are then screened with a machine learning method to select those that contribute most to traffic risk identification and prediction, constructing a high-quality sample feature set and improving the effect and efficiency of risk identification and prediction as far as possible. Meanwhile, each grid area at time t is labeled according to the proportion of congested roads in that grid area at time t + Δt. The specific steps are as follows:

step B1: percolation (seepage) analysis of the traffic network;

step B2: extracting risk features based on complex networks;

step B3: screening risk features based on machine learning;

Wherein the "percolation (seepage) analysis of the traffic network" of step B1 is performed as follows: percolation theory is applied to the double-layer traffic network. First, a control variable, the percolation threshold q(t), is assigned to the traffic network at each time, so that every road in the network is in one of two states: unblocked ($v_{i\_ratio}(t) > q(t)$) or congested ($v_{i\_ratio}(t) \le q(t)$). The unblocked edges are deleted from the original network and the congested edges are kept; the remaining network is the traffic network in the congested state at time t, referred to as the congested network. Each value of q(t) thus corresponds to one congested network at each time. As q(t) decreases, only more severely congested roads are retained, more edges are removed and the congested network becomes sparser. A suitable threshold q(t) is therefore selected, namely the one at which the urban traffic network carries the richest congestion information, and the traffic congestion risk at the current time is identified and predicted at that threshold;
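The thresholding described in step B1 can be sketched as follows: edges whose relative speed is at most q are kept, and the size of the largest congested cluster is measured. This is only an illustration; the function names are hypothetical, clusters are computed as weakly connected components with a union-find structure, and the rule for choosing the final q(t) is left open because the source does not specify it.

```python
def congested_edges(edges, q):
    """Keep the edges whose relative speed is at most the threshold q.

    edges: dict (u, v) -> relative speed of the road from u to v.
    """
    return {edge: w for edge, w in edges.items() if w <= q}


def largest_cluster_size(edge_list):
    """Number of nodes in the largest weakly connected cluster."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edge_list:
        parent[find(u)] = find(v)  # merge the two clusters

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)
```

Sweeping q over a range of values and recording `largest_cluster_size(congested_edges(edges, q))` shows how the congested network shrinks as the threshold is lowered.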

Wherein the "risk feature extraction based on complex network" of step B2 is performed as follows: in this step, the grid traffic congestion networks and the grid node traffic network are constructed at each time under the percolation (seepage) threshold q(t), and, from the viewpoint of statistical physics, complex network theory and methods are used to preliminarily extract the microscopic and macroscopic features of the grid areas of the double-layer traffic network at each time from the two perspectives of structure and function. First, at the microscopic level, each grid traffic congestion network is taken as the object of study, and the microscopic features of each grid area are calculated at the critical percolation threshold at each time. The grid traffic congestion network has different features at different times, and the congested network in a grid area changes dynamically in space as time evolves, so the features are spatio-temporal. Secondly, at the macroscopic level, for the constructed grid node traffic network model, its nodes (grid areas) are taken as the objects of study and the macroscopic features of each grid area (node) are calculated at each time, as shown in fig. 2. Examples of microscopic features are the maximum congestion cluster, the mean node betweenness, the mean node degree, the mean clustering coefficient, the average speed and the growth rate of the congested network of the grid traffic congestion network. Examples of macroscopic features are the average path length, strength, betweenness, degree and growth rate of the nodes of the grid node traffic network;

The invention thus provides a method of extracting features from the complex network perspective, illustrated here by the feature extraction of grids. In practice, features can be preliminarily extracted in a targeted way from the structure and the function of the actual traffic system according to its background and conditions, so as to construct a sample feature set and the initial feature matrix $M_f$;
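As a rough illustration of how one row of an initial feature matrix such as $M_f$ might be assembled for a single grid at a single time, the sketch below computes three of the simpler micro-level features: the mean node degree of the congested network, the average relative speed of the grid, and the number of congested roads. The function name, the choice of features and their order are assumptions of this sketch; heavier features such as betweenness are omitted.

```python
def grid_features(edges, q):
    """Compute a small feature row for one grid's directed network.

    edges: dict (u, v) -> relative speed of the road from u to v.
    q: threshold below which (or at which) a road counts as congested.
    Returns [mean node degree of the congested network,
             average relative speed over all roads in the grid,
             number of congested roads].
    """
    congested = {edge: w for edge, w in edges.items() if w <= q}

    degree = {}
    for u, v in congested:
        degree[u] = degree.get(u, 0) + 1  # outgoing congested road
        degree[v] = degree.get(v, 0) + 1  # incoming congested road

    mean_degree = sum(degree.values()) / len(degree) if degree else 0.0
    mean_speed = sum(edges.values()) / len(edges) if edges else 0.0
    return [mean_degree, mean_speed, len(congested)]
```

Stacking such rows over all grids and times gives a matrix with one sample per grid and time and one column per feature.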

Wherein the "risk feature screening based on machine learning" of step B3 is performed as follows: in step B2 the functional and structural features of the grid areas at each time are extracted using complex network knowledge and the initial feature matrix $M_f$ is constructed. To improve the accuracy and precision of risk identification and prediction in the traffic system, this step applies machine learning methods to perform feature selection on the preliminarily constructed sample feature set, screening out a high-quality sample feature set. Screening the structural and functional features of the traffic system keeps the important features and removes irrelevant ones, which alleviates the curse of dimensionality, reduces the difficulty of the learning task, reduces over-fitting and enhances the generalization capability of the machine learning model. Given the highly complex spatio-temporal evolution of the traffic system, and in order to optimize for a given learner, the invention uses the classical wrapper method LVW (Las Vegas Wrapper) for feature selection, as shown in fig. 3. The specific steps are as follows:

(1) Setting an initial optimal error E to be infinite, setting the current optimal feature subset to be an attribute complete set A, and setting the repetition times t =0;

(2) Randomly generating a group of feature subsets A ', and calculating the error E' of the classifier when the feature subsets are used;

(3) If E' is smaller than E, set A = A' and E = E', and repeat steps (2) and (3); otherwise set t = t + 1, and exit the loop when t is greater than or equal to the stop control parameter T;

The LVW method directly uses the performance of the learner that will finally be used as the evaluation criterion for a feature subset. The selected subset is therefore tailored to the given learner and most beneficial to its performance; in this way the high-quality sample feature set is screened out and the screened feature matrix is constructed.
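The LVW loop in steps (1) to (3) can be sketched as follows. This is a minimal sketch under stated assumptions: `error_fn` stands for whatever routine evaluates the given learner on a feature subset (for example a cross-validated error) and is hypothetical, and resetting the stall counter after an improvement follows the textbook form of LVW rather than anything stated in the source.

```python
import random


def lvw(features, error_fn, max_stall, seed=0):
    """Las Vegas Wrapper feature selection.

    features: list of feature names (the full attribute set A).
    error_fn: callable taking a list of features and returning the learner's
              error on that subset (hypothetical, supplied by the caller).
    max_stall: stop control parameter T, the number of consecutive
               non-improving random subsets tolerated.
    """
    rng = random.Random(seed)
    best_subset = list(features)      # step (1): start from the full set
    best_error = float("inf")
    stall = 0
    while stall < max_stall:
        # Step (2): draw a random non-empty subset and evaluate it.
        size = rng.randint(1, len(features))
        candidate = sorted(rng.sample(features, size))
        error = error_fn(candidate)
        # Step (3): keep the subset if it improves the error.
        if error < best_error:
            best_subset, best_error = candidate, error
            stall = 0                 # assumption: reset on improvement
        else:
            stall += 1
    return best_subset, best_error
```

Because every candidate is scored by the final learner, the cost is dominated by `error_fn`; the stop parameter trades search effort against the chance of missing a better subset.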

Wherein the "risk identification and prediction based on ensemble learning theory" of step C is performed as follows: in order to accurately identify and predict congestion risk in the traffic system and control it effectively, an ensemble learning model is first constructed using machine learning and the relevant mathematics. Secondly, to eliminate the influence of differing dimensions (scales) among feature vectors on the model, the screened feature matrix is standardized by feature scaling to obtain the standard sample feature matrix. Finally, so that the model learns as much as possible about the characteristics of risk in the traffic system, the standard sample feature matrix is divided into a training set and a test set in a certain proportion (a:b); the ensemble learning model is trained on the training set, and the trained model is then used to identify and predict risk in the grid areas of the traffic system at the current time. The specific steps are as follows:

step C1: constructing an ensemble learning model;

step C2: identifying and predicting risk by applying the ensemble learning model;

Wherein the "constructing an ensemble learning model" of step C1 is performed as follows: the invention aims to learn a stable, well-performing model from historical risk data of the traffic system. An ensemble learning model generally learns better than a single classifier, so to make up for the shortcomings of a single classifier the invention introduces ensemble learning theory and constructs an ensemble learning model for risk identification and prediction in the traffic system. Ensemble learning combines several weakly supervised models to obtain a better and more comprehensive strongly supervised model; the underlying idea is that even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error. The current mainstream ensemble learning frameworks are Bagging, Boosting and Stacking. The invention uses the Bagging framework and its associated methods to construct a random forest model for identifying and predicting traffic system risk, as shown in fig. 4. The implementation steps are as follows:

(1) Suppose there is a data set $D = \{x_{i1}, x_{i2}, \dots, x_{in}, y_i\}$ $(i \in [1, m])$ with $n$ features; sampling with replacement generates the sampling space $(m \times n)^{m \times n}$;

(2) Build the base learners (decision trees): for each sample $d_j = \{x_{i1}, x_{i2}, \dots, x_{ik}, y_i\}$ $(i \in [1, m])$ (where K < M), generate a decision tree and record the result $h_j(x)$ of each decision tree;

(3) Train T times and combine the results $h_j(x)$ of the decision trees through $\varphi(x)$, where $\varphi(x)$ is a combination algorithm such as absolute majority voting, relative majority voting or weighted voting;

Through the above process a special binary classifier, namely a random forest model, is constructed to identify and predict risks in the traffic system. The classification function is a sign-type function whose output values 0 and 1 represent low risk and high risk in a grid area respectively:

$$f(x_i) = \begin{cases} 0, & \text{low risk} \\ 1, & \text{high risk} \end{cases}$$

In the above formula, $f(x_i)$ represents the risk state of the $i$-th grid area, with 0 representing low risk and 1 representing high risk;

Meanwhile, when an ensemble learning model is constructed to identify and predict traffic system risk, a suitable ensemble learning framework and model can be selected according to the distribution of the data samples, further improving the effect of risk identification and prediction;
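The Bagging idea behind steps (1) to (3), bootstrap samples, a random feature subset per base learner, and a vote over the base learners, can be sketched in a few lines. This is a deliberately simplified illustration and not a full random forest: one-level decision trees (stumps) stand in for complete decision trees, the vote is a relative-majority vote, and all function names are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Best single-feature threshold split (a one-level decision tree)."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            for high_label in (0, 1):
                preds = [high_label if row[f] > threshold else 1 - high_label
                         for row in rows]
                errors = sum(p != y for p, y in zip(preds, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, threshold, high_label)
    _, f, threshold, high_label = best
    return lambda row: high_label if row[f] > threshold else 1 - high_label


def fit_bagging(rows, labels, n_trees, k, seed=0):
    """Train n_trees base learners, each on a bootstrap sample and k features."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # sample with replacement
        feats = rng.sample(range(n_features), k)         # random feature subset
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], feats))
    return trees


def predict(trees, row):
    """Relative-majority vote: 0 = low risk, 1 = high risk."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]
```

In practice a library implementation of random forests would normally be preferred; the sketch is only meant to make the sampling and voting steps concrete.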

Wherein the "identifying and predicting risk by applying the ensemble learning model" of step C2 is performed as follows: in this step, based on the high-quality sample feature set extracted and screened in step B, i.e. the screened feature matrix, risks in the traffic system are identified and predicted using the ensemble learning model constructed in step C1. Because differences in dimension (scale) between features in the historical sample data set affect the performance of the ensemble learning model, the sample feature set is first feature-scaled before the model is applied. This removes the influence of differing dimensions between feature vectors on model precision, improves the convergence speed of the model, and yields the standard sample feature matrix. The mainstream feature scaling methods in machine learning mainly include min-max normalization, mean normalization and scaling to unit length. The invention applies such a mainstream method to the sample feature set of the traffic system; in practice a suitable feature scaling method can be selected according to the condition of the actual traffic system, the characteristics of the feature set and the machine learning method applied, so as to maximize the accuracy and precision of risk identification and prediction in the traffic system;

After the features of the sample data set have been scaled, risks in the traffic system are identified and predicted from the standard sample feature matrix using the ensemble learning model constructed in step C1. Because the model must learn the characteristics of risk in this process, the standard sample feature set is randomly divided into a training set and a test set in a certain proportion (a:b); the training set is used to train the random forest model so that it learns the characteristics of risk as fully as possible, and the test set is used to test the training effect of the model.
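Min-max normalization and the random a:b split can be sketched as below. The function names and the use of a training fraction a/(a+b) are assumptions made for illustration; the source leaves both the scaling method and the ratio open.

```python
import random


def min_max_scale(matrix):
    """Scale every column of a feature matrix to the range [0, 1]."""
    columns = list(zip(*matrix))
    lows = [min(col) for col in columns]
    # A constant column has zero span; use 1.0 to avoid dividing by zero.
    spans = [(max(col) - min(col)) or 1.0 for col in columns]
    return [[(x - lo) / span for x, lo, span in zip(row, lows, spans)]
            for row in matrix]


def split(rows, labels, train_fraction, seed=0):
    """Randomly split samples into a training set and a test set.

    train_fraction corresponds to a / (a + b) for a ratio a:b.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(round(train_fraction * len(rows)))
    train_idx, test_idx = idx[:cut], idx[cut:]

    def pick(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return pick(train_idx), pick(test_idx)
```

For example, a ratio of 8:2 corresponds to `train_fraction=0.8`.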

Wherein the "model evaluation and verification" of step D is performed as follows: when the ensemble learning model constructed in step C is used to identify and predict risk in the traffic system, its performance must be evaluated accurately and scientifically. In this step, evaluation indexes are first selected reasonably according to the actual traffic system and the final goal of the invention, for example accuracy, precision, recall and F1 score, all of which are calculated from the confusion matrix. Secondly, to prevent over-fitting and to assess the generalization ability of the model accurately, the ensemble learning model is evaluated by cross validation, further improving the soundness and reliability of the evaluation. The step comprises the following substeps:

step D1: selecting a model evaluation index;

step D2: evaluating and analyzing the model;

Wherein the "selecting a model evaluation index" of step D1 is performed as follows: the invention identifies and predicts risk in the traffic system, and its final goal is to use the ensemble learning model to identify that risk accurately and scientifically. In essence this is an anomaly detection problem in machine learning, whose main characteristic is class imbalance: the sample size of normal data is large and the sample size of risk data is small. Accuracy alone therefore cannot objectively reflect model performance. For this risk identification and detection scenario, the model is evaluated with two indexes, recall and accuracy, given by:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

In the formula, Accuracy is the accuracy and Recall is the recall; TP is the number of positive cases predicted correctly, TN is the number of negative cases predicted correctly, FP is the number of negative cases wrongly predicted as positive, and FN is the number of positive cases wrongly predicted as negative;

the fewer prediction errors made on truly risky units in the traffic system the better, because if a real congestion risk is not identified, the traffic system is damaged to a great extent once it occurs, so the recall needs more attention; meanwhile, in order to ensure that normal samples are correctly predicted as normal, to reduce the error rate on normal samples, and to let traffic managers control the real risks in the traffic system as accurately as possible under limited resource cost, both accuracy and recall are introduced as evaluation indexes of the model;

the "evaluation and analysis of the model" described in step D2 is specifically performed as follows: in this step, in order to prevent over-fitting and accurately evaluate the generalization ability of the model, the ensemble learning model is evaluated by cross validation in machine learning, which further improves the scientific soundness and reliability of the evaluation; the classical validation methods mainly include the leave-one-out method, K-fold cross validation, the bootstrap method and the like; the invention uses the bootstrap method, with the following steps:

(1) Randomly selecting one sample in a data set containing N samples each time, and taking the sample as a training sample;

(2) Putting the sample selected in (1) back into the original data set, and repeating this sampling with replacement N times to generate a data set of the same size as the original data set; the new data set is the training set;

(3) After N draws, approximately 36.8% of the samples in the original data set (the limit of (1 − 1/N)^N as N grows, i.e. 1/e) will not appear in the new data set; these samples are taken as the validation set;

(4) Repeating the above steps M times, M models can be trained, and the values of their evaluation indexes can be obtained, and then taking the average value, the performance evaluation value of the model can be obtained.
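The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `bootstrap_split` and `bootstrap_evaluate`, the callback `fit_and_score`, and the placeholder scorer (which reports the out-of-sample fraction rather than a real model metric) are hypothetical and not part of the patent.

```python
import random

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; never-drawn indices form the validation set."""
    train_idx = [rng.randrange(n_samples) for _ in range(n_samples)]
    seen = set(train_idx)
    val_idx = [i for i in range(n_samples) if i not in seen]
    return train_idx, val_idx

def bootstrap_evaluate(n_samples, n_rounds, fit_and_score, seed=0):
    """Repeat the split n_rounds times (M in the text) and average the scores."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        train_idx, val_idx = bootstrap_split(n_samples, rng)
        scores.append(fit_and_score(train_idx, val_idx))
    return sum(scores) / len(scores)

# Placeholder scorer (assumption): returns the out-of-sample fraction,
# where a real use would train a model on train_idx and score it on val_idx.
oob_fraction = bootstrap_evaluate(2000, 10, lambda train, val: len(val) / 2000)
```

In practice `fit_and_score` would train the classifier on the drawn indices and return accuracy or recall on the validation indices.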

Through the above steps, based on complex network theory and ensemble learning theory, the two dimensions of function and structure of the traffic system are considered together from the perspective of complex networks, and scientific and reliable technical and theoretical support is provided for the identification of traffic risks; the technical method provided by the invention can efficiently and accurately identify and predict the risk of the traffic system, and provides important support for risk diagnosis of the traffic system, formulation of targeted management and control measures, and improvement of traffic operation reliability.

(III) advantages and effects

The invention provides a traffic risk prediction method based on a complex network theory, which has the following advantages:

(1) Global property: the invention constructs the traffic network model from the micro and macro two levels to extract the function and structure characteristics, greatly improves the accuracy of the risk prediction of the traffic system, and has great significance for understanding the risk evolution mechanism of the traffic system and improving the reliability of the traffic system;

(2) Timeliness: the invention can monitor the traffic state and predict future risk in real time, and provides strong support for the formulation and implementation of risk control strategies for the traffic system, thereby ensuring the healthy and stable operation of the system;

(3) Expandability: the risk prediction method provided by the invention can be expanded to the risk identification and prediction of other types of complex systems, such as biological systems, communication systems, financial systems and the like.

(4) The method of the invention is scientific, has good manufacturability and has wide popularization and application value.

Drawings

Fig. 1 is a flow chart of a traffic risk prediction method according to the present invention.

FIG. 2 is a traffic risk profile of the present invention.

FIG. 3 is a logic diagram of the wrapper feature selection process of the present invention.

Fig. 4 is a random forest model architecture diagram of the present invention.

FIG. 5 is a trend chart of evaluation indexes of the random forest model of the present invention.

The sequence numbers, symbols and code numbers in the figure are explained as follows:

s: the area of the region of interest;

V_i: the speed vector of the R roads at time t_i;

an initial velocity matrix;

compensating the normalized speed matrix;

G ₁ (N ₁ ,L ₁ ): a grid traffic congestion network model;

G ₂ (N ₂ ,L ₂ ): a mesh node traffic network model;

q (t): a seepage threshold of the traffic network at time t;

V _{i_ratio} : a normalized velocity vector;

M _f : an initial feature matrix;

the screened high-quality feature matrix;

a high-quality feature matrix after feature scaling;

f(x _i ): risk status of ith grid area

Accuracy: the model accuracy rate;

recall: model recall;

TP: the number of correctly predicted positive examples;

TN: the number of negative cases correctly predicted;

FP: the number of negative examples predicted as positive;

FN: the number of positive examples predicted as negative.

Detailed Description

In order to make the technical problems and technical solutions to be solved by the present invention clearer, the following detailed description is made with reference to the accompanying drawings and specific embodiments. It is to be understood that the embodiments described herein are for purposes of illustration and explanation only and are not intended to limit the invention.

The invention is further described in the following description and embodiments with reference to the drawings.

The actual traffic system data used in the embodiment of the present invention is real-time floating-car speed data for each road section of all roads within the Fifth Ring Road area of Beijing over a certain time span, provided by QF technology corporation, recorded at a time interval of 1 minute (a relatively fine time granularity), and at the same time interval of 0 to 00-23.

The traffic risk prediction method based on the complex network theory of the embodiment of the invention is shown in figure 1, and the specific implementation steps are as follows:

Step A: establishing a double-layer traffic network model based on empirical data division grids;

Step B: extracting and screening features based on a complex network theory;

Step C: performing risk prediction based on an ensemble learning theory;

Step D: evaluating and verifying the model.

The step A of "establishing the double-layer traffic network model based on empirical data division grids" is performed as follows: firstly, basic information on the roads in the research area is acquired, mainly comprising traffic network road information and the longitude and latitude of road intersections; the research area is divided into N×M grid areas according to its extent and the longitude and latitude of road sections and intersections, and the grid areas are labeled; secondly, for each grid area, on the microscopic level, a grid traffic congestion network model is constructed from actual traffic data using complex network theory and methods, with the intersections in the grid as nodes, the road sections as edges and the relative speed of each road section as the edge weight; on the macroscopic level, each grid area is taken as a node, the existence of congested roads between two grids is the criterion for connecting them with an edge, and the number of congested roads between the grids is the edge weight, and a grid node traffic network model is constructed using complex network theory and methods.

Step A1: dividing grid areas based on geographic information;

Step A2: preprocessing the speed data to obtain the relative speed matrix;

Step A3: constructing the grid traffic congestion network model G₁(N₁, L₁);

Step A4: constructing the grid node traffic network model G₂(N₂, L₂).

The step A1 of "dividing the grid area based on the geographic information" is specifically performed as follows: firstly, the traffic network model and road information required for grid division are extracted from MapInfo files using Python; the extracted information mainly includes the vehicle speed on each road at each moment, the longitude and latitude of intersections, and the network topology of the traffic system within the Fifth Ring Road of Beijing; secondly, from the obtained road information and intersection coordinates, the area S within the Fifth Ring Road is calculated to be 667 square kilometers, with a longitude range of 116.20-116.56 and a latitude range of 39.76-40.03, and according to the actual conditions of the area the number of grids is set to 2500, so that each grid has a side length of about 516 m (667 km² / 2500 ≈ 0.267 km² per grid); finally, for each grid area, the intersections falling inside the grid are determined from the longitude and latitude of each intersection in the traffic network and recorded.
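A minimal sketch of this grid assignment follows. It assumes a uniform 50 × 50 split of the stated longitude and latitude ranges (the text gives only the total of 2500 grids, not the layout), and the names `grid_index` and `assign_intersections` are hypothetical.

```python
import math

LON_MIN, LON_MAX = 116.20, 116.56
LAT_MIN, LAT_MAX = 39.76, 40.03
N_COLS = N_ROWS = 50  # assumed layout giving 2500 cells

def grid_index(lon, lat):
    """Return the label (0..2499) of the cell containing the point, or None if outside."""
    if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
        return None
    # Points on the upper boundary are clamped into the last row/column.
    col = min(int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS), N_COLS - 1)
    row = min(int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS), N_ROWS - 1)
    return row * N_COLS + col

def assign_intersections(intersections):
    """Map each grid label to the ids of the intersections that fall inside it."""
    cells = {}
    for node_id, (lon, lat) in intersections.items():
        g = grid_index(lon, lat)
        if g is not None:
            cells.setdefault(g, []).append(node_id)
    return cells

# Side length implied by the stated area: sqrt(667 km^2 / 2500), in metres.
cell_side_m = math.sqrt(667e6 / 2500)
```

The computed `cell_side_m` is the side of a square with one 2500th of the stated area, which is where the figure of roughly 516 m comes from.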

The "speed data preprocessing to obtain the relative speed matrix" described in step A2 is specifically performed as follows: in this step, firstly, from the actual traffic operation data collected by the vehicle-mounted Global Positioning System (GPS), the speeds of all R roads at an arbitrary time t_i are obtained and arranged, following the order of the roads, in vector form V_i = (v₁, v₂, …, v_R); this process is repeated for all T moments, and the speed vectors V_i of all moments are combined to generate the initial speed matrix.

Secondly, when the speed information of the traffic system within the Fifth Ring Road of Beijing is collected with floating car technology, the speed of every road at every moment cannot be completely collected and retained, owing to the network communication technology and to human and natural factors, so the original speed information needs speed compensation: the initial speed matrix contains some missing values (recorded as 0), and these elements with value 0 must be found and compensated. To compensate the speed of road R_j at time t_i, the set of neighbor roads of R_j in the road network G(N, L) is found first, and the roads in the set are checked for speed records at that moment; if at least one element of the set has a speed record, the average of the recorded elements is taken. Writing Γ(R_j) for the neighbor road set, the formula is:

$$v_j(t_i) = \frac{1}{J} \sum_{R_k \in \Gamma(R_j),\; v_k(t_i) \neq 0} v_k(t_i)$$

In the above formula, the sum runs over the non-zero speed values in the neighbor road set of the speed-missing road R_j, and J is the number of non-zero elements in that set. The initial speed matrix is thereby updated to the compensated speed matrix.

After road speed compensation of the original absolute speed matrix is completed, the compensated speed matrix is normalized to obtain relative speeds, because roads of different grades differ and a unified judgment standard is needed. For any road R_j, the speed vector of that road at all moments is taken from the compensated speed matrix, the maximum speed limit of the road is extracted, and each speed is divided by the maximum speed limit to obtain the normalized speed; doing so for every road gives the normalized speed matrix. Writing v_j^max for the maximum speed limit of road R_j, the normalized speed is:

$$v_{j\_ratio}(t_i) = \frac{v_j(t_i)}{v_j^{\max}}$$
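The compensation and normalization described in step A2 can be sketched as below. The matrix layout (`speeds[t][j]` = speed of road j at moment t), the `neighbors` dictionary and the function names are assumptions made for illustration; missing values are the zeros mentioned in the text.

```python
def compensate(speeds, neighbors):
    """Fill zero entries of speeds[t][j] with the mean of non-zero neighbor speeds at time t."""
    filled = [row[:] for row in speeds]
    for t, row in enumerate(speeds):
        for j, v in enumerate(row):
            if v == 0:
                # Only originally recorded (non-zero) neighbor speeds are used.
                known = [row[k] for k in neighbors.get(j, []) if row[k] != 0]
                if known:  # leave the gap if no neighbor has a record
                    filled[t][j] = sum(known) / len(known)
    return filled

def normalize(speeds, speed_limits):
    """Divide each road's speed by that road's maximum speed limit."""
    return [[v / speed_limits[j] for j, v in enumerate(row)] for row in speeds]

speeds = [[60.0, 0.0, 30.0],
          [40.0, 20.0, 0.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}   # hypothetical road adjacency
limits = [80.0, 60.0, 60.0]               # hypothetical speed limits (km/h)

compensated = compensate(speeds, neighbors)
relative = normalize(compensated, limits)
```

Reading neighbor values from the original row rather than from the partially filled copy keeps one compensated value from feeding into another.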

The "construction of the grid traffic congestion network model G₁(N₁, L₁)" described in step A3 is specifically performed as follows: for each grid area divided in step A1, firstly, from the actual map data of the Fifth Ring Road area of Beijing, the road structure information and the road intersection information contained in each grid area are extracted with software tools such as Python and MapInfo; secondly, the traffic network within the Fifth Ring Road of Beijing is selected; then, following the complex network method, each road intersection in the grid area is abstracted as a node, each road in the grid area's traffic network is abstracted as a connecting edge between nodes, and the relative speed of each road is taken as the edge weight, so that a grid traffic congestion network is established for each grid area; since most roads of the traffic network within the Fifth Ring Road are two-way and therefore directional, the grid traffic congestion network constructed by the invention is a directed weighted network.

The "construction of the grid node traffic network model G₂(N₂, L₂)" described in step A4 is specifically performed as follows: firstly, from the intersection information contained in each grid area and the road topology of the whole traffic network within the Fifth Ring Road of Beijing (the whole network), an inter-grid traffic network model is constructed, namely the road topology contained inside each grid area is deleted from the whole network; secondly, the number of congested roads between grid areas is counted and recorded; finally, applying complex network theory and methods to this information, each grid area is abstracted as a node, an edge is placed between two grids if congested roads exist between them, and the number of congested roads between the two grids is taken as the edge weight, thereby establishing the grid node traffic network model.

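The two layers built in steps A3 and A4 can be illustrated with plain dictionaries. This is a sketch under assumptions: roads are given as directed (tail, head) intersection pairs, `node_grid` maps an intersection to its grid label, a road counts as congested when its relative speed is at most 0.5 (the threshold used later in step B), and the function names are hypothetical.

```python
def build_micro_network(roads, node_grid, rel_speed, grid_id):
    """Directed weighted adjacency {u: {v: relative speed}} for roads inside one grid."""
    graph = {}
    for road_id, (u, v) in roads.items():
        if node_grid.get(u) == grid_id and node_grid.get(v) == grid_id:
            graph.setdefault(u, {})[v] = rel_speed[road_id]
            graph.setdefault(v, {})  # make sure the head node is present
    return graph

def build_macro_network(roads, node_grid, rel_speed, q=0.5):
    """Edges {(grid_u, grid_v): count} counting congested roads that cross between grids."""
    edges = {}
    for road_id, (u, v) in roads.items():
        gu, gv = node_grid[u], node_grid[v]
        # Roads inside a single grid are dropped; equality with q is treated
        # as congested here, which is an assumption.
        if gu != gv and rel_speed[road_id] <= q:
            edges[(gu, gv)] = edges.get((gu, gv), 0) + 1
    return edges

# Hypothetical toy data: four intersections, five directed roads.
roads = {"r1": ("a", "b"), "r2": ("b", "a"), "r3": ("b", "c"),
         "r4": ("c", "d"), "r5": ("d", "c")}
node_grid = {"a": 0, "b": 0, "c": 0, "d": 1}
rel_speed = {"r1": 0.8, "r2": 0.3, "r3": 0.4, "r4": 0.2, "r5": 0.9}

g1 = build_micro_network(roads, node_grid, rel_speed, grid_id=0)
g2 = build_macro_network(roads, node_grid, rel_speed)
```

Here `g1` keeps every road inside grid 0 with its relative speed as weight, while `g2` keeps only the congested crossing road r4, so the edge from grid 0 to grid 1 has weight 1.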
The "extracting and screening features based on the complex network theory" in step B is performed as follows: firstly, for the grid traffic congestion networks and the grid node traffic network (together referred to as the double-layer traffic network) at each time t_i, a seepage (percolation) threshold q(t) is set for seepage analysis, and the threshold is determined to be q(t) = 0.5 through seepage analysis of the double-layer traffic network; secondly, with the seepage threshold at 0.5, complex network theory and methods are used to extract the features of each grid area at each moment from each grid traffic congestion network and from the nodes (grids) of the grid node traffic network, including structural and functional features such as the maximum congestion cluster, node betweenness, mean node degree, the average speed of the grid congestion network and the number of congested roads among first-order neighbors; on this basis, the extracted features are screened by a machine learning method, the features that contribute most to traffic risk identification and prediction are selected, and a high-quality sample feature set is constructed, improving the effect and efficiency of traffic risk identification and prediction to the greatest extent; meanwhile, each grid area at time t is labeled according to the proportion of congested roads in that grid area at time t+Δt. The specific steps of the process are as follows:

step B1: analyzing seepage of a traffic network;

Step B2: extracting risk characteristics based on a complex network;

Step B3: screening risk characteristics based on machine learning;

the "seepage analysis of the traffic network" described in step B1 is specifically performed as follows: seepage (percolation) theory is applied to the double-layer traffic network; firstly, a control variable, the seepage threshold q(t), is given for the traffic network at each moment, so that each road in the traffic network is in one of two states: unblocked (i.e. v_{i_ratio}(t) > q(t)) or congested (i.e. v_{i_ratio}(t) < q(t)); the unblocked edges are deleted from the original network and the congested edges are kept, and the remaining network is the traffic network in the congested state at time t, referred to as the congestion network for short; each value of q(t) thus corresponds to one congestion network at each moment, and as q(t) decreases only more severely congested roads remain, that is, more edges are removed and the congestion network becomes sparser; therefore a suitable seepage threshold q(t) = 0.5 is selected, at which the urban traffic network carries the richest congestion information, and the traffic congestion risk at the current moment is identified and predicted at this threshold;
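The thresholding step can be sketched as a simple edge filter. The function name `congestion_network` and the toy edge data are hypothetical, and treating a relative speed exactly equal to q as congested is an assumption.

```python
def congestion_network(rel_speed_edges, q):
    """Keep only congested edges; edges with relative speed above q are unblocked and dropped."""
    # Equality with q is counted as congested here (an assumption).
    return {edge: r for edge, r in rel_speed_edges.items() if r <= q}

# Hypothetical directed edges with their relative speeds.
edges = {("a", "b"): 0.2, ("b", "c"): 0.45, ("c", "d"): 0.7, ("d", "a"): 0.9}

# Number of edges left in the congestion network for several thresholds.
sizes = {q: len(congestion_network(edges, q)) for q in (0.8, 0.5, 0.3)}
```

Lowering q from 0.8 to 0.3 leaves fewer and fewer edges, which matches the statement that the congestion network becomes sparser as q(t) is reduced.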

the "extraction of risk features based on a complex network" described in step B2 is specifically performed as follows: in this step, for the grid traffic congestion networks and the grid node traffic network constructed at each moment with the seepage threshold q(t) = 0.5, complex network theory and methods are used, from the viewpoint of statistical physics, to preliminarily extract the micro and macro features of the grid areas of the double-layer traffic network at each moment in terms of structure and function. Firstly, on the microscopic level, each grid traffic congestion network is taken as the research object and the micro features of each grid area are calculated at the critical seepage threshold at each moment; the grid traffic congestion network has different features at different moments, and as time evolves the congestion network in a grid area also changes dynamically in space, so the grid traffic congestion network has spatio-temporal features; secondly, on the macroscopic level, for the constructed grid node traffic network model, its nodes (grid areas) are taken as research objects and the macro features of the grid areas (nodes) at each moment are calculated. The features are shown in FIG. 2. The micro features include: the maximum congestion cluster, the mean node betweenness, the mean node degree, the mean clustering coefficient, the average speed and the growth rate of the grid traffic congestion network, and the like; the macro features include: the node average path length, node strength, node betweenness, node degree and growth rate of the grid node traffic network, and the like.
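As an illustration of three of the micro features named above, the sketch below computes the size of the largest congested cluster, the mean node degree and the average relative speed of one grid's congestion network. The choices made here are assumptions: clusters are weakly connected components, degree counts both incoming and outgoing edges, and the function names are hypothetical.

```python
def largest_cluster_size(edges):
    """Size of the largest weakly connected cluster among the given directed edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        stack, size = [start], 0
        seen.add(start)
        while stack:  # depth-first traversal of one cluster
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best

def micro_features(congested_edges):
    """Feature dictionary for one grid's congestion network {(u, v): relative speed}."""
    nodes = {n for edge in congested_edges for n in edge}
    n_edges = len(congested_edges)
    return {
        "largest_cluster": largest_cluster_size(congested_edges),
        "mean_degree": 2 * n_edges / len(nodes) if nodes else 0.0,
        "mean_rel_speed": sum(congested_edges.values()) / n_edges if n_edges else 0.0,
    }

congested = {("a", "b"): 0.2, ("b", "c"): 0.4, ("d", "e"): 0.3}
features = micro_features(congested)
```

Betweenness, clustering coefficient and path-length features would normally come from a graph library; they are omitted to keep the sketch self-contained.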

The invention provides a method for extracting features from the perspective of complex networks, with the feature extraction of a grid given as an example; for the actual traffic system within the Fifth Ring Road of Beijing, features can be preliminarily extracted in a targeted manner from the two aspects of system structure and function, according to the actual background and situation of the system, so as to construct the sample feature set and the initial feature matrix M_f, whose dimension is (8752, 40, 30), i.e. 8752 samples, each sample having 40 features.

The "risk feature screening based on machine learning" described in step B3 is specifically performed as follows: in step B2, the functional and structural features of the grid areas at each moment were extracted using complex network knowledge and the initial feature matrix M_f was constructed; in order to improve the accuracy and precision of risk identification and prediction in the traffic system within the Fifth Ring Road of Beijing, machine learning methods are applied in this step to perform feature selection on the preliminarily constructed sample feature set, so that a high-quality sample feature set is screened out and the effect of risk identification and prediction is improved to the greatest extent; screening the structural and functional features, keeping the important ones and removing the irrelevant ones, also alleviates the curse of dimensionality, reduces the difficulty of the learning task, reduces over-fitting and enhances the generalization ability of the machine learning model; because the traffic system within the Fifth Ring Road of Beijing exhibits highly complex spatio-temporal evolution, and in order to optimize for the given learner, the invention uses the classical wrapper method LVW (Las Vegas Wrapper) for feature selection, as shown in FIG. 3. Applying the LVW method screens out 10 high-quality features in total, including the node betweenness variance, the edge betweenness variance, the proportion of congested roads in the grid and the node betweenness of the grid node traffic network, from which the high-quality feature matrix is constructed; its dimension is (8752, 10, 30), i.e. 8752 samples in total, each sample having 10 high-quality features.
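A minimal sketch of the Las Vegas Wrapper idea is given below: random feature subsets are drawn, each is scored by the error of a learner, and a subset is kept when it lowers the error or matches it with fewer features. Everything specific here is an assumption for illustration, not the patent's configuration: the learner is a 1-nearest-neighbour classifier scored by leave-one-out error, the search starts from the full feature set, and the data are synthetic.

```python
import random

def loo_error_1nn(X, y, features):
    """Leave-one-out error of a 1-nearest-neighbour classifier restricted to `features`."""
    errors = 0
    for i, xi in enumerate(X):
        best_d, best_label = None, None
        for j, xj in enumerate(X):
            if i == j:
                continue
            d = sum((xi[f] - xj[f]) ** 2 for f in features)
            if best_d is None or d < best_d:
                best_d, best_label = d, y[j]
        errors += best_label != y[i]
    return errors / len(X)

def lvw(X, y, max_stall=30, seed=0):
    """Stop after max_stall consecutive random subsets bring no improvement."""
    rng = random.Random(seed)
    all_features = list(range(len(X[0])))
    best, best_err = all_features, loo_error_1nn(X, y, all_features)
    stall = 0
    while stall < max_stall:
        k = rng.randint(1, len(all_features))
        candidate = sorted(rng.sample(all_features, k))
        err = loo_error_1nn(X, y, candidate)
        if err < best_err or (err == best_err and len(candidate) < len(best)):
            best, best_err, stall = candidate, err, 0
        else:
            stall += 1
    return best, best_err

# Synthetic data (assumption): feature 0 carries the label, features 1-2 are noise.
data_rng = random.Random(1)
y = [i % 2 for i in range(40)]
X = [[label + data_rng.gauss(0, 0.1), data_rng.gauss(0, 1), data_rng.gauss(0, 1)]
     for label in y]

full_err = loo_error_1nn(X, y, [0, 1, 2])
selected, selected_err = lvw(X, y)
```

Because the search starts from the full feature set and only accepts improvements, the returned subset never scores worse than using all features under the same learner.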

Wherein, the "risk identification and prediction based on ensemble learning theory" in step C is performed as follows: in order to accurately identify and predict the congestion risk in the traffic system within the Fifth Ring Road of Beijing so that it can be controlled effectively, firstly, an ensemble learning model is constructed using machine learning and related mathematics; secondly, in order to eliminate the influence of non-uniform dimensions among feature vectors on the model, the high-quality feature matrix is standardized by feature scaling to obtain the standard sample feature matrix, of dimension (8752, 10, 30); finally, so that the model learns as much as possible about the characteristics of risk in the traffic system, the standard sample feature matrix is divided into a training set and a test set in the ratio 7:3, namely 6126 samples in the training set and 2626 samples in the test set; the training set is used to train the ensemble learning model, and the trained model is then used to identify and predict risks in the grid areas of the traffic system at the current moment. The specific steps of the process are as follows:

step C1: constructing an integrated learning model;

Step C2: risk identification and prediction are carried out by applying an ensemble learning model;

the "building the ensemble learning model" described in step C1 is specifically performed as follows: the invention aims to learn a stable model with good performance from the historical risk data of the traffic system within the Fifth Ring Road of Beijing, and an ensemble learning model generally learns better than a single classifier. Ensemble learning combines several weakly supervised models into a better and more comprehensive strongly supervised model; the underlying idea is that even if one weak classifier makes a wrong prediction, the other weak classifiers can correct the error. The current mainstream ensemble learning frameworks include Bagging, Boosting and Stacking.

Through the above process a binary classifier, namely a random forest model, is constructed to identify and predict risks in the traffic system within the Fifth Ring Road of Beijing; in this process the classification function is a sign function whose output values are 0 and 1, representing low risk and high risk in a grid area respectively:

$$f(x_i) = \begin{cases} 0, & \text{low congestion risk} \\ 1, & \text{high congestion risk} \end{cases}$$

In the above formula, f(x_i) indicates the risk status of the ith grid area, with 0 representing a low congestion risk and 1 representing a high congestion risk.

Meanwhile, when ensemble learning theory is applied to build a model for identifying and predicting risks in the traffic system within the Fifth Ring Road of Beijing, a suitable ensemble learning framework and model can be selected according to the distribution characteristics of the data samples, which further improves the effect of risk identification and prediction for the traffic system.
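The Bagging idea behind a random forest (bootstrap samples, a random feature per learner, majority vote) can be sketched with one-level decision stumps. This is a deliberately simplified stand-in and not a full random forest or the patent's model; all names and the toy data are hypothetical.

```python
import random

def fit_stump(X, y, feature):
    """Pick the threshold and polarity on one feature with the lowest training error."""
    best = None
    for thr in sorted({row[feature] for row in X}):
        for polarity in (0, 1):
            preds = [polarity if row[feature] >= thr else 1 - polarity for row in X]
            err = sum(p != t for p, t in zip(preds, y))
            if best is None or err < best[0]:
                best = (err, feature, thr, polarity)
    return best[1:]

def stump_predict(stump, row):
    feature, thr, polarity = stump
    return polarity if row[feature] >= thr else 1 - polarity

def fit_bagged_stumps(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample using one randomly chosen feature."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        feature = rng.randrange(d)                   # random feature choice
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return stumps

def predict(stumps, row):
    """Majority vote: 1 (high risk) if more than half of the stumps vote 1, else 0."""
    votes = sum(stump_predict(s, row) for s in stumps)
    return 1 if 2 * votes > len(stumps) else 0

# Toy data (assumption): label 1 when the first feature is at least 0.5.
X = [[i / 10, 1 - i / 10] for i in range(10)]
y = [1 if i >= 5 else 0 for i in range(10)]
stumps = fit_bagged_stumps(X, y)
high = predict(stumps, [0.9, 0.1])
low = predict(stumps, [0.0, 1.0])
```

An individual stump trained on an unlucky bootstrap sample may misplace its threshold, but the vote over many stumps smooths such errors out, which is the correction effect described above.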

In step C2, "risk identification and prediction using the ensemble learning model" is specifically performed as follows: in this step, based on the high-quality sample feature set extracted and screened in step B, i.e. the high-quality feature matrix, the ensemble learning model constructed in step C1 is used to identify and predict risks in the traffic system. Because differences in dimension between features in the historical sample data set affect the performance of the ensemble learning model, the sample feature set is first feature-scaled before the model is used for risk identification and prediction; this eliminates the influence of differing dimensions among feature vectors on model precision, improves the convergence speed of the model, and yields the standard sample feature matrix. Among the mainstream feature scaling methods, the standardization method is selected according to the actual situation of the traffic system within the Fifth Ring Road of Beijing, the characteristics of the data feature set and the machine learning method applied, so as to maximize the accuracy and precision of risk identification and prediction in the traffic system.

After feature scaling of the sample data set, the random forest model constructed in step C1 is used in this step, on the basis of the standard sample feature matrix, to identify and predict risks in the traffic system. Since the random forest model must first learn the characteristics of risk, this embodiment randomly divides the standard sample feature set into a training set and a test set in the ratio 7:3, so that the training set contains 6126 samples and the test set 2626 samples; the training set is used to train the random forest model so that it learns the characteristics of congestion risk to the maximum extent.
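The feature scaling and the 7:3 split can be sketched as follows. Z-score standardization is assumed as the "standardized" scaling method, and the function names and the seed are hypothetical.

```python
import random
import statistics

def standardize(X):
    """Z-score each column: subtract the column mean, divide by its standard deviation."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def train_test_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into training and test parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    n_train = int(n_samples * train_fraction)
    return idx[:n_train], idx[n_train:]

scaled = standardize([[1.0, 10.0], [3.0, 10.0]])
train_idx, test_idx = train_test_split(8752)
```

With 8752 samples the split yields 6126 training and 2626 test indices, the counts stated in the embodiment.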

The "evaluating and verifying the model" in step D is performed as follows: when the ensemble learning model constructed in step C is used to identify and predict risks in the traffic system, its performance must be evaluated accurately and scientifically. In this step, firstly, evaluation indexes are selected based on the actual condition of the traffic system and the final goal of the invention, for example accuracy, precision, recall and F1 value, all of which are calculated from the confusion matrix; secondly, in order to prevent over-fitting and accurately evaluate the generalization ability of the model, the ensemble learning model is evaluated by cross validation, which further improves the scientific soundness and reliability of the evaluation. The step specifically comprises the following substeps:

step D1: selecting a model evaluation index;

step D2: evaluating and analyzing the model;

the specific way of selecting the model evaluation index in step D1 is as follows: the invention identifies and predicts risks in a traffic system, and its final goal is to use the ensemble learning model to identify those risks accurately and scientifically, which in essence is an anomaly detection problem in machine learning. Since the invention faces a risk identification and detection scenario, the model is evaluated with two indexes, recall and accuracy, given by:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

in the formula, Accuracy represents the accuracy, Recall represents the recall, TP is the number of correctly predicted positive cases, TN is the number of correctly predicted negative cases, FP is the number of negative cases predicted as positive, and FN is the number of positive cases predicted as negative.
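Both indexes follow directly from the four confusion-matrix counts. The sketch below assumes the label encoding 1 = risk (positive) and 0 = normal (negative); the function names and the example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = risk, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all samples that were predicted correctly."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

def recall(tp, tn, fp, fn):
    """Share of truly risky samples that were identified as risky."""
    return tp / (tp + fn) if (tp + fn) else 0.0

y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc = accuracy(*counts)
rec = recall(*counts)
```

In this toy example one of the three risky samples is missed, so the recall is lower than the accuracy, which is the situation the text warns about for imbalanced data.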

For units with real risk in the road traffic system within the Fifth Ring Road of Beijing, the fewer prediction errors the better, because if a real congestion risk is not identified, the traffic system is damaged to a great extent once it occurs, so the recall needs more attention; meanwhile, in order to ensure that normal samples are correctly predicted as normal, to reduce the error rate on normal samples, and to let traffic managers control the real risks in the traffic system as accurately as possible under limited resource cost, accuracy is also introduced as an evaluation index of the model. Using the random forest model from ensemble learning to identify and predict the congestion risk of the road traffic system within the Fifth Ring Road of Beijing gives an accuracy of 89.83% and a recall of 86.74%, both at a high level, indicating good model performance.

The "evaluation and analysis of the model" described in step D2 is specifically performed as follows: in this step, in order to prevent over-fitting and accurately evaluate the generalization ability of the model, the ensemble learning model is evaluated by cross validation in machine learning, further improving the scientific soundness and reliability of the evaluation. The classical validation methods mainly include the leave-one-out method, K-fold cross validation, the bootstrap method and the like; the invention uses the bootstrap method, with the following steps:

(1) Randomly selecting one sample at a time in a data set containing 8752 samples, and using the sample as a training sample;

(2) Putting the sample selected in (1) back into the original data set, and repeating this sampling with replacement 8752 times to generate a data set of the same size as the original data set; the new data set is the training set;

(3) After 8752 draws, about 3221 samples (roughly 36.8%) of the original data set do not appear in the new data set; these samples are taken as the validation set;

(4) Repeating the above steps 10 times, 10 models can be trained, and the values of the evaluation indexes can be obtained, and then averaging is performed, so that the performance evaluation value of the model can be obtained.
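The figure of 3221 samples in step (3) can be checked from the probability that a given sample is never drawn in N draws with replacement. That the figure was obtained by rounding this probability to 36.8% is an assumption:

```latex
P(\text{sample never drawn}) = \left(1 - \frac{1}{N}\right)^{N}
\xrightarrow{\,N \to \infty\,} \frac{1}{e} \approx 0.368,
\qquad 8752 \times 0.368 \approx 3221 .
```

The count is an expected value, so the validation set drawn in any single round will differ slightly from it.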

As shown in FIG. 5, a random forest model is used to identify and predict the congestion risk of the road traffic system within the Fifth Ring Road of Beijing, and the model is validated 10 times with the bootstrap method; the average accuracy is about 92.84% and the average recall is about 92.45%, both at a high level, which indicates that the model has strong generalization ability and good performance, can accurately and reliably identify and predict the congestion risk of the road traffic system within the Fifth Ring Road of Beijing, and provides a strong guarantee for the safe, stable and healthy operation of the traffic system.

Matters not described in detail in the present invention belong to techniques well known to those skilled in the art.

The above description is only a part of the embodiments of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.

Claims

1. A traffic risk prediction method based on a complex network theory is characterized in that: the method comprises the following steps:

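step A: establishing a double-layer traffic network model based on empirical data division grids;

step B: extracting and screening features based on a complex network theory;

step C: risk prediction is carried out based on an ensemble learning theory;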

step D: evaluating and verifying the model;

the method for constructing the double-layer traffic network model based on the empirical data division grids in step A comprises the following steps: firstly, acquiring basic information of the roads in the research area, the basic information comprising two parts, namely traffic network road information and longitude and latitude information of the road intersections; dividing the research area into N × M grid areas according to its area and extent and the longitude and latitude information of the road sections and intersections, and labeling the grid areas; secondly, on a microscopic level, for each grid area, constructing a grid traffic congestion network model from actual traffic data by using complex network theory and methods, with the intersections in the grid as nodes, the road sections as edges and the relative speed of each road section as the edge weight; on a macroscopic level, taking each grid area as a node, taking whether congested roads exist between two grids as the criterion for whether an edge connects them, taking the number of congested roads between the grids as the edge weight, and constructing a grid node traffic network model by applying complex network theory and methods; the specific method comprises the following steps:

step A1: dividing grid areas based on geographic information;

step A2: preprocessing speed data to obtain a relative speed matrix;

step A3: constructing a grid traffic jam network model;

step A4: constructing a grid node traffic network model;

the step A1 of dividing the grid areas based on the geographic information includes the following specific steps: firstly, the traffic network model and the traffic road information required for dividing the grid areas are extracted from a geographic information system (a MapInfo file) by using Python; the extracted information comprises the vehicle-mounted speed of each road at each moment, the longitude and latitude information of the intersections and the network topology information of the studied traffic system; Python is used to call the Baidu Map application programming interface (API), and the topology of the road network is matched with the intersection names by sequential traversal to obtain the longitude and latitude of the intersections; roads and intersections whose longitude and latitude could not be obtained because the intersection names differ between Baidu Map and MapInfo are then processed, so as to obtain an accurate, standard longitude and latitude data set for the road network of the traffic system; secondly, the area S and the longitude and latitude value ranges of the studied area are calculated from the obtained traffic road information and intersection longitude and latitude information, and the number of grids is determined as N × M according to the actual background conditions of the studied area, so that the area of each grid is S/(N × M); finally, according to the divided grid areas, the intersections located in each grid are determined from the longitude and latitude of each intersection in the traffic network and recorded;
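The final part of step A1, assigning each intersection to one of the N × M grid areas, might be sketched as below. The function name is hypothetical, and using the bounding box of the intersections as the study area is an assumption made only for this illustration.

```python
def assign_to_grid(points, n_rows, n_cols):
    """Map each intersection id to a (row, col) cell of an n_rows x n_cols grid.

    points: dict of intersection id -> (longitude, latitude).
    Assumption: the bounding box of the given points is the study area.
    """
    lons = [lon for lon, _ in points.values()]
    lats = [lat for _, lat in points.values()]
    lon_min, lat_min = min(lons), min(lats)
    lon_step = (max(lons) - lon_min) / n_cols
    lat_step = (max(lats) - lat_min) / n_rows
    cells = {}
    for pid, (lon, lat) in points.items():
        # Points on the upper boundary are clamped into the last row or column.
        col = min(int((lon - lon_min) / lon_step), n_cols - 1) if lon_step else 0
        row = min(int((lat - lat_min) / lat_step), n_rows - 1) if lat_step else 0
        cells[pid] = (row, col)
    return cells


# Illustrative coordinates only; they are not taken from the patent's data.
demo_points = {
    "a": (116.0, 39.0),
    "b": (117.0, 39.0),
    "c": (116.0, 40.0),
    "d": (117.0, 40.0),
    "e": (116.75, 39.25),
}
demo_cells = assign_to_grid(demo_points, n_rows=2, n_cols=2)
```

Equal-width cells in longitude and latitude are the simplest choice; a real study area would fix the grid from its own boundary rather than from the data.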

wherein, the speed data preprocessing described in step A2 to obtain the relative speed matrix is performed as follows: in this step, firstly, actual traffic operation data from the vehicle-mounted global positioning system (GPS) are acquired; at an arbitrary time t_i, the speeds corresponding to all R roads are expressed in vector form according to the order of the roads, V_i = (v_1, v_2, …, v_R); the above process is repeated for all T moments, and the velocity vectors V_i of all moments are finally integrated to generate the initial speed matrix;

secondly, the original speed information of the traffic system needs to be speed-compensated, because the original speed matrix contains some missing values; the missing speed values in the speed matrix, that is, the elements with a value of 0, are therefore located and compensated; to compensate the missing speed value of road R_j at time t_i, the set of neighbor roads of road R_j in the road network is first found, and the speed compensation value of road R_j at time t_i is then calculated from the speeds of the neighbor roads in this set, using the number of elements in the set whose speed is not 0;

if none of the neighbor roads of road R_j has a recorded speed, the speed of road R_j is compensated to 0; after each compensation, the original speed matrix is updated to the compensated speed matrix; the above process is repeated at each moment until all 0 values in the speed matrix are compensated, giving the completed speed matrix;

finally, for each road, the speed vector of the road at all moments is extracted from the absolute speed matrix, together with the maximum speed limit of the road section; the speed of the road at each moment is divided by the maximum speed limit to obtain the normalized speed, and the normalized speed matrix, namely the relative speed matrix, is thereby obtained;
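The compensation and normalization of step A2 might look like the sketch below. The compensation formula itself is not legible in this text, so the sketch assumes that a missing speed is replaced by the mean of the non-zero speeds of the neighbor roads at the same moment; that assumption and all function names are illustrative.

```python
def compensate_row(speeds, neighbors):
    """Fill zero entries of one time step from neighbor roads.

    speeds: list of R speeds at one moment (0 means missing).
    neighbors: dict of road index -> list of neighbor road indices.
    Assumption: the compensation value is the mean of the non-zero neighbor speeds.
    Passes are repeated until no further zero can be filled.
    """
    filled = list(speeds)
    changed = True
    while changed:
        changed = False
        updated = list(filled)
        for j, v in enumerate(filled):
            if v != 0:
                continue
            known = [filled[k] for k in neighbors.get(j, []) if filled[k] != 0]
            if known:
                updated[j] = sum(known) / len(known)
                changed = True
        filled = updated
    return filled


def relative_speeds(matrix, speed_limits):
    """Divide every road's speed by that road's maximum speed limit."""
    return [[v / lim for v, lim in zip(row, speed_limits)] for row in matrix]


# Road 1 is filled from roads 0 and 2; road 3 is filled from road 1 on the next pass.
row = compensate_row([60.0, 0.0, 30.0, 0.0], {1: [0, 2], 3: [1]})
rel = relative_speeds([row], [60.0, 90.0, 60.0, 90.0])
```

A road whose neighbors never obtain a speed stays at 0, which matches the fallback described in the text.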

the method for constructing the grid traffic congestion network model in the step A3 comprises the following specific steps: for each grid area divided in the step A, firstly, according to actual map data under each grid area, using Python and Mapinfo software tools to extract structure information among roads and road intersection information contained in each grid area; secondly, selecting a proper geographical coverage area of traffic according to the requirement of actual research, abstracting a road intersection in each grid area as a node in the network according to a complex network method, abstracting the road in the grid area traffic network as a connecting edge between nodes in the network, and taking the relative speed of each road as the weight of the connecting edge so as to establish a grid traffic congestion network in each grid area; meanwhile, most roads of the traffic network are driven in two directions and have directionality, so the constructed grid traffic jam network is a directed weighted network;

wherein, the construction of the mesh node traffic network model in the step A4 specifically comprises the following steps: firstly, according to intersection information contained in a plurality of grid areas and road topological structure information of a whole research area traffic network, namely the whole network, an intersection traffic network model between grids is constructed, namely the road topological structure information contained in the grid areas is deleted on the basis of the whole network; secondly, counting the number of congested roads between the grid areas and recording the number; finally, abstracting a grid area into nodes, abstracting whether congestion roads exist between grids into connecting edges or not by applying the theory and the method of a complex network, and establishing a grid node traffic network model by taking the number of the congestion roads between the grids as the weight of the connecting edges;
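A minimal sketch of step A4 is given below. It counts the congested roads that join two different grid areas and uses the count as the edge weight; the function name, the input layout and the choice of an undirected grid-level edge are assumptions for illustration.

```python
from collections import Counter


def build_grid_node_network(roads, cell_of, congested):
    """Return {(cell_a, cell_b): count} for congested roads that cross between cells.

    roads: list of (start_intersection, end_intersection) pairs.
    cell_of: dict of intersection -> grid cell label.
    congested: set of road indices that are congested at the moment considered.
    Assumption: grid-level edges are treated as undirected.
    """
    weights = Counter()
    for idx, (u, v) in enumerate(roads):
        a, b = cell_of[u], cell_of[v]
        if a == b or idx not in congested:
            continue  # roads inside one cell belong to the microscopic layer
        weights[tuple(sorted((a, b)))] += 1
    return dict(weights)


demo_roads = [("i1", "i2"), ("i2", "i3"), ("i3", "i4"), ("i1", "i4")]
demo_cell_of = {"i1": 0, "i2": 0, "i3": 1, "i4": 1}
grid_edges = build_grid_node_network(demo_roads, demo_cell_of, congested={0, 1, 3})
```

Two grid areas with no congested road between them simply have no entry, which corresponds to "no connecting edge" in the text.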

the feature extraction and screening based on the complex network theory in step B comprises the following steps: firstly, for the grid traffic congestion network and the grid node traffic network at each time t_i, referred to for short as the double-layer traffic network, seepage analysis is carried out and the seepage threshold q(t) is determined through the seepage analysis of the double-layer traffic network; secondly, for each grid traffic congestion network and for each node (namely each grid) of the grid node traffic network under the seepage threshold q(t) at each moment, the features of each grid area are extracted by using complex network theory and methods, the features comprising structural and functional features such as the maximum congestion subgroup, the mean node betweenness, the mean node degree, the average speed of the grid congestion network and the number of first-order neighbor congested roads; on this basis, the extracted features are screened by a machine learning method, and the features that contribute most to traffic risk identification and prediction are selected to construct a high-quality sample feature set, which improves the effect and efficiency of traffic risk identification and prediction; meanwhile, each grid area at time t is labeled according to the proportion of congested roads in that grid area at time t + Δt; the specific steps are as follows:
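The labeling rule at the end of this step can be illustrated as follows. The text does not state which proportion separates high risk from low risk, so `ratio_threshold` is a hypothetical parameter, as are the function and argument names.

```python
def label_cells(future_rel_speed, cell_of_road, q, ratio_threshold):
    """Label each grid cell 1 (high risk) or 0 (low risk).

    future_rel_speed: dict of road -> relative speed at time t + delta t.
    cell_of_road: dict of road -> grid cell label.
    q: threshold at or below which a road counts as congested.
    ratio_threshold: hypothetical cut-off on the congested share of a cell.
    """
    total, congested = {}, {}
    for road, v in future_rel_speed.items():
        cell = cell_of_road[road]
        total[cell] = total.get(cell, 0) + 1
        congested[cell] = congested.get(cell, 0) + (1 if v <= q else 0)
    return {c: int(congested[c] / total[c] >= ratio_threshold) for c in total}


demo_speed = {"r1": 0.2, "r2": 0.3, "r3": 0.9, "r4": 0.8, "r5": 0.7, "r6": 0.1}
demo_cell = {"r1": "g1", "r2": "g1", "r3": "g1", "r4": "g2", "r5": "g2", "r6": "g2"}
labels = label_cells(demo_speed, demo_cell, q=0.5, ratio_threshold=0.5)
```

Here two of three roads in g1 are congested and one of three in g2, so only g1 receives the high-risk label.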

step B1: analyzing seepage of a traffic network;

step B2: extracting risk features based on a complex network;

step B3: screening risk characteristics based on machine learning;

the seepage analysis of the traffic network in step B1 is specifically performed as follows: seepage (percolation) analysis is carried out on the double-layer traffic network by using seepage theory; firstly, a control variable, namely the seepage threshold q(t), is given to the traffic network at each moment, so that each road i in the traffic network is in one of two states: the unblocked state, v_{i_ratio}(t) > q(t), or the congested state, v_{i_ratio}(t) ≤ q(t); the unblocked edges are deleted from the original network and only the congested edges are retained; the remaining network is the traffic network in the congested state at time t, referred to for short as the congestion network; at each moment, each value of q(t) corresponds to one congestion network, and as q(t) decreases, the retained roads are more severely congested, more edges are removed and the congestion network becomes sparser; therefore, a suitable seepage threshold q(t) is selected, namely the one at which the urban traffic network is in the stage with the richest congestion information, and the traffic congestion risk at the current moment is identified and predicted at that threshold;
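The thresholding in step B1 can be sketched as below: roads whose relative speed is at or below q are kept, and the size of the largest connected cluster of the remaining network is measured. The function names are hypothetical, and measuring the cluster in nodes while ignoring edge direction is an assumption of this sketch.

```python
def congestion_network(edges, rel_speed, q):
    """Keep only the edges whose relative speed is at or below q (congested state)."""
    return [e for e in edges if rel_speed[e] <= q]


def largest_cluster_size(edges):
    """Number of nodes in the largest connected cluster, found with union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values(), default=0)


demo_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
demo_rel = {("a", "b"): 0.2, ("b", "c"): 0.3, ("c", "d"): 0.8, ("e", "f"): 0.4}
# Scan several thresholds and record the largest congested cluster at each one.
scan = {q: largest_cluster_size(congestion_network(demo_edges, demo_rel, q))
        for q in (0.1, 0.25, 0.5, 0.9)}
```

Scanning q in this way shows how the congestion network shrinks as the threshold is lowered; how the "suitable" threshold is picked from such a scan is left to the method described in the text.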

wherein, the extracting of the risk features based on the complex network in step B2 specifically includes the following steps: a grid traffic congestion network and a grid node traffic network are constructed at each moment under the seepage threshold q(t), and, from the viewpoint of statistical physics and by applying complex network theory and methods, the microscopic and macroscopic features of the grid areas of the double-layer traffic network at each moment are preliminarily extracted from the two aspects of structure and function; firstly, on a microscopic level, each grid traffic congestion network is taken as the research object, and the microscopic features of each grid area are calculated at the critical seepage threshold at each moment, the microscopic features comprising the maximum congestion subgroup, the mean node betweenness, the mean node degree, the mean clustering coefficient, the average speed and the growth rate of the congestion network of the grid traffic congestion network; the grid traffic congestion network has different features at different moments, and the congestion network in a grid area shows spatial dynamics as time evolves, so the grid traffic congestion network has spatio-temporal characteristics; secondly, on a macroscopic level, for the constructed grid node traffic network model, each node, namely each grid area, is taken as the research object and its macroscopic features are calculated at each moment, the macroscopic features comprising the average path length, strength, betweenness, degree and growth rate of the nodes of the grid node traffic network; the features are thus extracted in a targeted, preliminary way to construct a sample feature set and an initial feature matrix M_f;
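A few of the listed features can be computed with plain Python, as in the sketch below. It covers only the mean node degree and average speed of a grid congestion network and the node strength in the grid node network; the function names are hypothetical, and degree is counted as in-degree plus out-degree, which is an assumption.

```python
def mean_degree(edges):
    """Mean node degree of an edge list, counting both ends of every edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return sum(degree.values()) / len(degree) if degree else 0.0


def average_speed(edges, rel_speed):
    """Mean relative speed over the edges of a congestion network."""
    return sum(rel_speed[e] for e in edges) / len(edges) if edges else 0.0


def node_strength(weights, node):
    """Sum of the edge weights incident to one node of the grid node network."""
    return sum(w for (a, b), w in weights.items() if node in (a, b))


micro_edges = [("a", "b"), ("b", "c")]
micro_speed = {("a", "b"): 0.2, ("b", "c"): 0.4}
macro_weights = {(0, 1): 2, (1, 2): 3}
# One row of an initial feature matrix for a single grid area at a single moment.
feature_row = [
    mean_degree(micro_edges),
    average_speed(micro_edges, micro_speed),
    node_strength(macro_weights, 1),
]
```

Stacking one such row per grid area and moment gives a matrix of the kind the text calls M_f; betweenness, clustering coefficient and path length would need a graph library or further code.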

Wherein, the risk feature screening based on machine learning in step B3 specifically includes the following steps: in step B2, the functional and structural features of the grid areas at each moment are extracted on the basis of complex network knowledge, and the initial feature matrix M_f is constructed; in this step, machine learning methods are used to carry out feature selection on the preliminarily constructed sample feature set, so that a high-quality sample feature set is screened out and the effect of risk identification and prediction in the traffic system is improved; at the same time, the structural and functional features of the traffic system are screened, the important features being retained and the irrelevant features removed; the feature selection applies the classic wrapper method LVW (Las Vegas Wrapper), and the specific steps are as follows:

(1) Setting an initial optimal error E to be infinite, setting the current optimal feature subset as an attribute complete set A, and setting the repetition times t =0;

in the calculation process, the LVW method directly takes the performance of the learner that will finally be used as the evaluation criterion for feature subsets, selects the feature subset that is tailored to the given learner and most beneficial to its performance, screens out a high-quality sample feature set, and constructs a feature matrix;
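For orientation, the general LVW procedure as commonly described in the literature can be sketched as below; only its initialization is given in this text, so the loop is the textbook version rather than the patent's exact steps. The `error_of` callback, the `max_stall` stopping parameter and the toy error function are hypothetical.

```python
import random


def lvw(features, error_of, max_stall, seed=0):
    """Las Vegas Wrapper feature selection.

    features: list of feature names (the complete attribute set A).
    error_of(subset): hypothetical callback returning the learner's
        cross-validated error when trained on that feature subset.
    max_stall: number of consecutive non-improving rounds before stopping.
    """
    rng = random.Random(seed)
    best_err = float("inf")   # initial optimal error E
    best = list(features)     # current optimal subset starts as the full set
    stall = 0                 # repetition counter t
    while stall < max_stall:
        size = rng.randint(1, len(features))
        subset = sorted(rng.sample(features, size))
        err = error_of(subset)
        # Accept a lower error, or the same error with fewer features.
        if err < best_err or (err == best_err and len(subset) < len(best)):
            best_err, best, stall = err, subset, 0
        else:
            stall += 1
    return best, best_err


def toy_error(subset):
    """Illustrative error: low only when both 'degree' and 'speed' are present."""
    base = 0.1 if {"degree", "speed"} <= set(subset) else 0.5
    return base + 0.01 * len(subset)


demo_features = ["degree", "speed", "betweenness", "strength"]
chosen, chosen_err = lvw(demo_features, toy_error, max_stall=50)
```

Because the search is random, the result depends on the seed and on `max_stall`; a larger `max_stall` makes it more likely that the best subset is found, at the cost of more learner evaluations.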

Wherein, the risk identification and prediction based on the ensemble learning theory in step C is performed as follows: firstly, an ensemble learning model is constructed by using machine learning and related mathematical knowledge; secondly, the data feature set is scaled by using a feature scaling method to obtain a standard sample feature matrix; finally, the standard sample feature matrix is divided into a training set and a test set according to a preset proportion, the ensemble learning model is trained with the training set data, and the trained ensemble learning model is then used to identify and predict the risks in the grid areas of the traffic system at the current moment; the specific steps are as follows:

step C1: constructing an ensemble learning model;

step C2: carrying out risk identification and prediction by using an ensemble learning model;

wherein, in the step C1, the integrated learning model is constructed by the following specific steps: a random forest model is constructed by using a Bagging framework and an integrated learning related theoretical method to identify and predict risks of a traffic system, and the method comprises the following implementation steps:

(1) Let the dataset be D = {x_i1, x_i2, …, x_in, y_i}, i ∈ [1, m], with N features; sampling with replacement generates a sampling space (m × N)^(m×n);

(2) Constructing a base learner, namely a decision tree: for each sample d_j = {x_i1, x_i2, …, x_ik, y_i}, i ∈ [1, m], where K < M, a decision tree is generated and the result h_j(x) of each decision tree is recorded;

(3) Train for T times

Wherein the first and second phases are represented by phi (x),

a binary classifier, namely a random forest model, is constructed through the above process to identify and predict the risks in the traffic system; in this process the classification function is a sign function whose output values are 0 and 1, representing respectively the low risk and the high risk of a grid area;

meanwhile, an ensemble learning model is constructed by applying an ensemble learning theory to identify and predict risks of the traffic system, and a proper ensemble learning framework and model can be selected according to the distribution characteristics of data samples to identify and predict the risks, so that the effects of identifying and predicting the risks of the traffic system are improved;
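The Bagging idea behind step C1 can be illustrated with the simplified sketch below. To stay short it uses depth-1 decision stumps in place of full decision trees, so it is a stand-in for a random forest rather than the model of the invention; all names and the toy data are hypothetical.

```python
import random


def fit_stump(rows, labels, feature_ids):
    """Best single-threshold rule over the given features (a depth-1 tree)."""
    best = None
    for f in feature_ids:
        for thr in sorted({r[f] for r in rows}):
            for hi_label in (0, 1):
                pred = [hi_label if r[f] > thr else 1 - hi_label for r in rows]
                errors = sum(p != y for p, y in zip(pred, labels))
                if best is None or errors < best[0]:
                    best = (errors, f, thr, hi_label)
    _, f, thr, hi_label = best
    return lambda x: hi_label if x[f] > thr else 1 - hi_label


def fit_forest(rows, labels, n_trees, k, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and k random features."""
    rng = random.Random(seed)
    m, n = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(m) for _ in range(m)]   # sampling with replacement
        feats = rng.sample(range(n), k)              # random feature subset
        trees.append(fit_stump([rows[i] for i in idx],
                               [labels[i] for i in idx], feats))
    return trees


def predict(trees, x):
    """Majority vote: 1 (high risk) if more than half of the trees vote 1."""
    votes = sum(t(x) for t in trees)
    return 1 if 2 * votes > len(trees) else 0


toy_rows = [[0.1, 1.5], [0.2, 0.5], [0.3, 2.5], [0.7, 6.0], [0.8, 9.0], [0.9, 7.5]]
toy_labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(toy_rows, toy_labels, n_trees=25, k=1)
```

In practice a library implementation of random forests with full decision trees would normally be preferred; the sketch only shows how bootstrap sampling, random feature subsets and voting fit together.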

wherein, in step C2, the risk identification and prediction by using the ensemble learning model is specifically as follows: based on the high-quality sample feature set, namely the feature matrix, extracted and screened in step B, the risks in the traffic system are identified and predicted by using the ensemble learning model constructed in step C1; when the model is used for risk identification and prediction, firstly, feature scaling is applied to the sample feature set of the research object, which eliminates the influence of the different dimensions of the feature vectors on model accuracy, improves the convergence speed of the model, and gives a standard sample feature matrix; the mainstream feature scaling methods in machine learning comprise min-max normalization, mean normalization, standardization and scaling to unit length, and these are also the mainstream methods for scaling the sample feature set of a traffic system;

after feature scaling is carried out on the sample data set of the traffic system, the risks in the traffic system are identified on the basis of the standard sample feature matrix by applying the ensemble learning model constructed in step C; in the learning process, the ensemble learning model needs to learn the features of risk, so the standard sample feature set is randomly divided into a training set and a test set according to a preset proportion, wherein the training set is used for training the random forest model and learning the features of risk, and the test set is used for testing the training effect of the model; the training and testing are repeated with different proportions of training set to test set until the effect of the model is optimal;
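Two of the scaling methods named above, together with the random division into training and test sets, can be sketched as follows. The function names and the 0.8 ratio in the example are hypothetical; the text only speaks of a preset proportion.

```python
import random


def min_max_scale(matrix):
    """Rescale every column to [0, 1]; constant columns become 0."""
    cols = list(zip(*matrix))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [[(v - lo) / sp if sp else 0.0 for v, lo, sp in zip(row, lows, spans)]
            for row in matrix]


def standardize(matrix):
    """Shift every column to zero mean and unit (population) standard deviation."""
    cols = list(zip(*matrix))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - mu) ** 2 for v in c) / len(c)) ** 0.5
            for c, mu in zip(cols, means)]
    return [[(v - mu) / sd if sd else 0.0 for v, mu, sd in zip(row, means, stds)]
            for row in matrix]


def train_test_split(n_samples, train_ratio, seed=0):
    """Shuffle the sample indices and cut them at the preset proportion."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = int(n_samples * train_ratio)
    return idx[:cut], idx[cut:]


scaled = min_max_scale([[1.0, 10.0], [3.0, 10.0], [2.0, 10.0]])
train_idx, test_idx = train_test_split(10, 0.8)
```

Repeating the split with different ratios, as the text describes, only requires calling `train_test_split` with other values of `train_ratio`.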

wherein, the model evaluation and verification in step D is performed as follows: in the process of identifying and predicting risks in the traffic system by using the ensemble learning model constructed in step C, firstly, the evaluation indexes are reasonably selected based on the actual conditions of the traffic system and the final target, and are calculated from the confusion matrix; secondly, the ensemble learning model is evaluated by using a cross-validation method, which improves the scientificity and reliability of the model evaluation; the method specifically comprises the following substeps:

step D1: selecting a model evaluation index;

step D2: evaluating and analyzing the model;

selecting a model evaluation index in step D1 specifically comprises the following steps: risks in the traffic system are identified and predicted, and the model is evaluated by two evaluation indexes, accuracy and recall, with the formulas:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Recall = TP / (TP + FN)

in the formulas, Accuracy represents the accuracy and Recall represents the recall; TP is the number of positive cases correctly predicted as positive; TN is the number of negative cases correctly predicted as negative; FP is the number of negative cases wrongly predicted as positive; FN is the number of positive cases wrongly predicted as negative;
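The two indexes can be computed from the confusion matrix counts as in the short sketch below, which uses the standard definitions of accuracy and recall; the function names and the example labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels, where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of all cases that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Share of the positive cases that were found; 0.0 if there are none."""
    return tp / (tp + fn) if tp + fn else 0.0


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
```

In this example TP = 3, TN = 3, FP = 1 and FN = 1, so both accuracy and recall equal 0.75.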

the evaluation analysis of the model in step D2 is specifically performed as follows: the ensemble learning model is evaluated by using a cross-validation method from machine learning, which improves the scientificity and reliability of the model evaluation; the bootstrap (self-help) method is used for cross-validation, and the steps are as follows:

(1) Randomly selecting one sample at a time from a data set containing N samples, and using it as a training sample;

(2) Putting the sample randomly selected in step (1) back into the original data set, and sampling N times in this way, with replacement, to generate a data set of the same size as the original data set, wherein the new data set is the training set;

(3) After N draws, about (1 − 1/N)^N ≈ 36.8% of the samples in the original data set will not appear in the new data set; the samples that do not appear in the new data set are therefore taken as the validation set;

(4) Repeating the above steps M times, training M models and obtaining the values of their evaluation indexes, and then taking the average to obtain the performance evaluation value of the model.