CN109191828B

CN109191828B - Traffic participant accident risk prediction method based on ensemble learning

Info

Publication number: CN109191828B
Application number: CN201810783019.6A
Authority: CN
Inventors: 刘林; 陈凝; 吕伟韬; 李璐
Original assignee: Jiangsu Zhitong Traffic Technology Co ltd
Current assignee: Jiangsu Zhitong Traffic Technology Co ltd
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2021-05-28
Anticipated expiration: 2038-07-16
Also published as: CN109191828A

Abstract

The invention provides a traffic participant accident risk prediction method based on ensemble learning, which is characterized in that traffic violation data and accident data samples are obtained by an optimized sampling method, an ensemble learning algorithm is adopted to train a personnel traffic accident risk prediction model, automatic judgment of high-risk personnel is realized, and a traffic participant accident risk prediction index is obtained.

Description

Traffic participant accident risk prediction method based on ensemble learning

Technical Field

The invention relates to a traffic participant accident risk prediction method based on ensemble learning.

Background

Traffic participants are the key for influencing road traffic safety, but the traditional research and management application is limited by information acquisition and perception means, and the relevance between the attributes of people and the traffic safety is difficult to be mined, so that the targeted traffic safety control is difficult to be implemented. At present, the traffic safety and standard management work of China is mainly carried out by illegal investigation and treatment, and a large amount of traffic illegal data resources of vehicles and personnel are accumulated. Traffic violation and traffic safety have obvious relevance, so that necessary safety characteristic information of traffic participants can be extracted by performing data mining on traffic violation data.

Among data mining methods, ensemble learning performs particularly well. It is a meta-algorithm that combines several machine learning techniques into one prediction model in order to reduce variance (bagging), reduce bias (boosting), or improve prediction (stacking). Compared with a single model, combining several models can substantially improve prediction performance.

The invention constructs a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm. Model fitting relies mainly on traffic violation data, an optimized sampling method reduces the influence of the imbalanced data set on model performance, and both the model accuracy and the misjudgment rate are considered when optimizing model performance, which improves the accuracy of personnel risk prediction.

Disclosure of Invention

The invention aims to provide a traffic participant accident risk prediction method based on ensemble learning, which adopts an ensemble learning algorithm of optimized sampling to predict and evaluate the traffic safety risk of a traffic participant with traffic violation records, fills the deficiency of the current quantitative analysis method of participant factors in traffic safety, and further improves the initiative and pertinence of traffic safety management work.

According to the invention, a high-risk personnel data set and a general personnel data set are divided by a judgment rule, an optimized sampling method is adopted, classifier training and correction are carried out based on an ensemble learning algorithm, and the ensemble classifier with optimal performance is fitted as the traffic accident risk prediction model for traffic participants, which can output personnel traffic safety attributes and risk probabilities.

The technical solution of the invention is as follows:

a traffic participant accident risk prediction method based on ensemble learning comprises the following steps,

S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data;

S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified;

S3, setting a sampling interval S and a cycle step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size;

S4, setting the sample size n_m = s_0 + (m-1)·k, where s_0 is the lower limit of the sampling interval and m is the cycle number with an initial value of 1; randomly drawing a sample N_m of size n_m from the data set N;

S5, splitting the union G_m of the data set D and the sample N_m into a training set and a test set;

S6, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, where a_i = 1 when i = 1 and a_i = a_{i-1} + 1 when i > 1; the initial value of i is 1, and i has a set upper limit;

S7, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, where b_j = 1 when j = 1 and b_j = b_{j-1} + 1 when j > 1; the initial value of j is 1, and j has a set upper limit; for the SMOTE sampling ratio a_i : b_j, performing sample expansion and sample shrinkage on the two classes of labeled samples in the training set to form the training sample set of the classifier;

S8, training the high-risk personnel classifier with an ensemble learning algorithm and determining the model parameters to obtain the traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;

S9, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

S10, classifying the data in the complement N_m' of the sample N_m within the general personnel data subset N according to the number of violations, inputting the data into the model by category, counting the misjudgment rate of the personnel labels output by the model at different coverage rates, and drawing the model misjudgment rate curve for each category;

S11, judging whether j has reached its upper limit; if so, judging whether i has reached its upper limit, and if so going to S12, otherwise setting i = i + 1 and going to S6; if j has not reached its upper limit, setting j = j + 1 and going to S7;

S12, detecting whether n_m has reached the upper limit of the sampling interval; if so, going to S13; otherwise setting m = m + 1 and returning to S4;

S13, analyzing the model accuracy and the misjudgment rate obtained in S9 and S10 to obtain the model with optimal performance, and determining the optimal random sampling number M, the SMOTE sampling ratio I, J, the model coverage rate recall and the model discrimination threshold;

and S14, inputting the subset data to be identified in the step S2 into the model, and determining the corresponding data mark value and the risk probability.
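The loop structure of steps S3 to S12 can be read as a nested search over the random sample size and the two SMOTE ratios. The sketch below only enumerates the configurations in the order the steps would visit them; it is a minimal illustration, not the patented implementation. The function name `sampling_configs` and its parameters are hypothetical, and restarting i and j at 1 for each new outer iteration is an assumption, since the steps do not state it explicitly.

```python
def sampling_configs(s0, s_max, k, i_max, j_max):
    """Yield (m, n_m, a_i, b_j) for every configuration of the S4-S12 loop.

    s0, s_max:    lower and upper limits of the sampling interval S
    k:            cycle step
    i_max, j_max: upper limits of i and j (assumed to restart at 1
                  for each new value of the enclosing loop variable)
    """
    m = 1
    while True:
        n_m = s0 + (m - 1) * k              # S4: sample size for cycle m
        if n_m > s_max:                     # S12: interval upper limit passed
            break
        for i in range(1, i_max + 1):       # S6: a_1 = 1, a_i = a_{i-1} + 1
            a_i = i
            for j in range(1, j_max + 1):   # S7: b_1 = 1, b_j = b_{j-1} + 1
                b_j = j
                yield m, n_m, a_i, b_j      # S8-S10 would train and evaluate here
        m += 1


# Interval [200, 4000], step 200 and upper limits of 4, as in the embodiment.
configs = list(sampling_configs(s0=200, s_max=4000, k=200, i_max=4, j_max=4))
print(len(configs), configs[0], configs[-1])
```

Each yielded tuple corresponds to one trained and evaluated candidate model; S13 then selects among all of them.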

Further, the ensemble learning algorithm in step S8 includes a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm, and a GBDT algorithm;

further, the method for assigning the corresponding data label value label based on the classification rule in step S2 specifically includes:

high-risk personnel: one category is traffic participants who have violation records and a serious traffic accident record in which they bear major or full responsibility; the other category is traffic participants who have violation records and only slight accident records, with no fewer than 2 such accident records;

general personnel: traffic participants who have violation records but no accident records;

the data which do not satisfy the above-mentioned discrimination condition constitute a subset to be recognized.
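The classification rule above can be sketched as a small function. This is an illustrative reading only: the function name and the three count arguments are hypothetical, and the sketch assumes every person passed in already has at least one violation record, since the rule is applied to the illegal data set.

```python
def assign_label(serious_at_fault, serious_other, slight):
    """Return 'D' (high-risk), 'N' (general) or 'U' (to be identified).

    All arguments are accident-record counts for one person
    (hypothetical field names, not taken from the patent):
      serious_at_fault: serious accidents with major or full responsibility
      serious_other:    serious accidents with lesser or no responsibility
      slight:           slight accidents
    """
    if serious_at_fault >= 1:
        return "D"                          # first high-risk category
    if serious_other == 0 and slight >= 2:
        return "D"                          # only slight accidents, at least 2
    if serious_other == 0 and slight == 0:
        return "N"                          # violations but no accident records
    return "U"                              # neither rule applies


print(assign_label(0, 0, 1))  # one slight accident only: falls into U
```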

Further, the original traffic violation data and accident data in step S1 include the certificate information of the relevant person; collecting and classifying illegal records to obtain an illegal data set; the illegal data set is full sample data of illegal records of personnel, and the illegal data set information comprises personnel certificate numbers, illegal times, illegal types, punishment conditions, accident-related illegal behavior occurrence conditions and illegal occurrence time intervals.

Further, in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types that most strongly influence traffic accidents are extracted as data attributes of the illegal data set.

Further, in step S1, the illegal occurrence time interval is obtained by converting a time continuous variable into a discrete variable and classifying the discrete variable according to the illegal time characteristics.
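Converting the continuous violation time into a discrete period can be sketched as a simple lookup. The period boundaries and labels below are purely hypothetical; the patent only states that the periods follow the time characteristics of violations and does not list them here.

```python
# Hypothetical period boundaries: (start hour inclusive, end hour exclusive, label).
PERIODS = [
    (0, 7, "night"),
    (7, 9, "morning_peak"),
    (9, 17, "daytime"),
    (17, 19, "evening_peak"),
    (19, 24, "evening"),
]


def violation_period(hour):
    """Map an hour of day (0-23) to a nominal time-period label."""
    for start, end, name in PERIODS:
        if start <= hour < end:
            return name
    raise ValueError(f"hour out of range: {hour}")


print(violation_period(8))
```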

The invention has the beneficial effects that:

First, the traffic violation data are mined with an ensemble learning algorithm, so that safety risk is predicted from the violation records of traffic participants, and the model can output the probability and the attribute of a person's traffic safety risk.

Second, compared with traditional classification methods such as decision trees and neural networks, the ensemble learning algorithm adopted by the invention has clear advantages in prediction performance, which supports the accuracy of the personnel traffic accident risk prediction.

Third, the invention optimizes the sampling stage by improving both the random sampling and the SMOTE sampling, which mitigates to a certain extent the effect of an imbalanced data set on model accuracy and noticeably improves model performance.

Drawings

Fig. 1 is a flow chart of a traffic participant accident risk prediction method based on ensemble learning according to an embodiment of the present invention.

FIG. 2 is an explanatory diagram of a data set in the embodiment.

FIG. 3 is a diagram illustrating the top 20 attribute variables by importance in the embodiment.

FIG. 4 is a graph of model accuracy versus misjudgment rate in the embodiment.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Examples

A traffic participant accident risk prediction method based on ensemble learning is disclosed, as shown in fig. 1, and the specific method flow is as follows:

in an embodiment, the original traffic violation data and accident data in step S1 include the certificate information of the relevant person; preprocessing operations such as collection and classification are carried out on the original illegal records to obtain an illegal data set; the law violation data set is full sample data of law violation records of personnel, and the data set information comprises personnel certificate numbers, violation times, violation types, punishment conditions, accident-related law violation behavior occurrence conditions and violation occurrence time intervals.

The occurrence of accident-related violations in step S1 is obtained by correspondence analysis, and the violation types that most strongly influence traffic accidents are extracted as data attributes of the illegal data set.

The violation time period in step S1 is obtained by converting the continuous time variable into a discrete variable and classifying it according to the time characteristics of violations.

The classification rules are specifically as follows: high-risk personnel are (1) persons who have violation records and a serious traffic accident record in which they bear major or full responsibility, or (2) persons who have violation records and only slight accident records, with no fewer than 2 such accident records; general personnel are persons who have violation records but no accident records; the data which do not satisfy the above discrimination conditions constitute the subset to be identified.

S3, setting a sampling interval S and a cycle step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size.

S4, setting the sample size n_m = s_0 + (m-1)·k, where s_0 is the lower limit of the sampling interval and m is the cycle number with an initial value of 1; a sample N_m of size n_m is randomly drawn from the data set N.

S5, splitting the union G_m of the data set D and the sample N_m into a training set and a test set.

S6, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, where a_i = 1 when i = 1 and a_i = a_{i-1} + 1 when i > 1; the initial value of i is 1, i has a set upper limit, and the upper limit of i is usually 4;

S7, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, where b_j = 1 when j = 1 and b_j = b_{j-1} + 1 when j > 1; the initial value of j is 1, j has a set upper limit, and the upper limit of j is 4; for the SMOTE sampling ratio a_i : b_j, performing sample expansion and sample shrinkage on the two classes of labeled samples in the training set to form the training sample set of the classifier;

The model can output a marker value and a risk probability;

S9, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

Drawing a model misjudgment rate curve of the classification;

S13, analyzing the model accuracy and misjudgment rate obtained in S9 and S10 to obtain the model with optimal performance.

Specific examples

The present embodiment takes a driver of a motor vehicle as an analysis target.

S1, acquiring 2 years of traffic violation records and accident records for the area by connecting to the database.

Traffic accidents involving death, serious injury, or hit-and-run are treated as serious accidents, and all other accidents as slight accidents. The original accident records are classified accordingly, the accident type and driver certificate information are taken as attribute features of the serious accident data set and the slight accident data set, and sample data for both data sets are obtained.

Further, the original violation data are preprocessed, and the violation information of each driver is aggregated and counted, including the accumulated number of violations, violation types, accumulated penalty points, average penalty points (points per violation), maximum penalty points for a single violation, accumulated fine amount, and average fine amount (yuan per violation).

Correspondence analysis is used to reduce the dimensionality of the traffic accident data and the original violation data. Violation types are classified according to the association between violations and accident types, and the five types with the highest association are extracted as data attributes of the accident-related violation field, as shown in Table 1.

TABLE 1. Classification of accident-related violation types

According to the traffic flow characteristics of the road network in the area of this embodiment and the occurrence pattern of traffic violations, the violation times are aggregated, the analysis time periods are divided, and the continuous variable is converted into a nominal variable; in another embodiment, the time periods are divided by other statistical means such as clustering.

The driver's age, gender, and province and city code are extracted from the driver certificate number as driver characteristic data; the illegal data set is then generated from the information extracted in each of the above steps, as shown in Table 2.

TABLE 2. partial data of illegal data set
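Extracting age, gender, and region codes from the certificate number, as described above, can be sketched as follows. The sketch assumes the certificate number follows the 18-character resident identity number layout (region code in positions 1 to 6, birth date in positions 7 to 14, odd 17th digit for male); the patent does not state the format, and the function name and returned keys are hypothetical.

```python
from datetime import date


def parse_certificate(number, on_date):
    """Extract region codes, gender and age from an 18-character ID number."""
    if len(number) != 18:
        raise ValueError("expected an 18-character certificate number")
    birth = date(int(number[6:10]), int(number[10:12]), int(number[12:14]))
    # Age in completed years on the reference date.
    had_birthday = (on_date.month, on_date.day) >= (birth.month, birth.day)
    age = on_date.year - birth.year - (0 if had_birthday else 1)
    return {
        "province_code": number[:2],
        "city_code": number[:4],
        "gender": "male" if int(number[16]) % 2 == 1 else "female",
        "age": age,
    }


# Made-up number for illustration only; the check digit is not validated.
info = parse_certificate("320106199001011234", on_date=date(2018, 7, 16))
print(info)
```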

S2, classifying the full sample I in the illegal data set into two categories, namely high-risk drivers and general drivers. Referring to fig. 4, under the first condition, drivers who have violation records and a serious traffic accident record with major or full responsibility are taken as high-risk drivers, and the qualifying data form data set D1; under the second condition, drivers who have violation records and only slight accident records, with no fewer than 2 accident records, are taken as high-risk drivers, and the qualifying data form data set D2; the high-risk driver data set is D = D1 + D2. The data of drivers who have violation records but no accident records form the general driver data set N.

Accordingly, a high-risk or general data label value label is determined for the data in the illegal data set that satisfy the rules, and the data to which the classification rules do not apply form the subset U to be identified.

In this embodiment, the sample size of the data set exceeds 84000, the sampling interval S is [200,4000], and the loop step k is 200.

In this embodiment, the initial number of samples is 200.
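With the interval S = [200, 4000] and the step k = 200 given above, the sample-size schedule of step S4 works out as follows. This is a worked calculation from the stated values, not additional data from the patent.

```latex
n_m = s_0 + (m-1)\,k = 200 + 200\,(m-1) = 200\,m, \qquad m = 1, 2, \dots, 20,
\qquad n_1 = 200, \quad n_{20} = 4000 .
```

The outer loop therefore runs at most 20 times before n_m reaches the upper limit of the interval.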

In this embodiment, the split ratio of the training set to the test set is 9: 1.
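A 9:1 split can be sketched with a seeded shuffle. This is a plain random split; whether the original method stratifies the split by label is not stated, and the function name and parameters are hypothetical.

```python
import random


def split_train_test(records, test_fraction=0.1, seed=0):
    """Shuffle a copy of records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)       # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


train, test = split_train_test(range(100))
print(len(train), len(test))
```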

S6, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk driver data subset D, where a_1 = 1, a_i = a_{i-1} + 1, and the maximum value of i is 4.

S7, for the sample expansion ratio a_i of the high-risk drivers, setting the sample shrinkage ratio b_j of the general driver data subset N_m, where b_1 = 1, b_j = b_{j-1} + 1, and the maximum value of j is 4; for the SMOTE sampling ratio a_i : b_j, performing sample expansion and sample shrinkage on the two classes of labeled samples in the training set to form the training sample set of the classifier.
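The expansion and shrinkage in S6 and S7 can be sketched in plain Python. The `smote` function follows the basic SMOTE idea of interpolating between a minority sample and one of its nearest minority neighbours. How the ratio a_i : b_j maps onto sample counts is an assumption here: the sketch expands the high-risk set to a times its size and shrinks the general set to 1/b of its size. All names are hypothetical, and the features are assumed to be numeric.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (basic SMOTE)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()                  # position along the segment x -> nb
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic


def resample(high_risk, general, a, b, seed=0):
    """Assumed reading of ratio a:b: expand high_risk to a times its size
    with SMOTE and shrink general to 1/b of its size by random sampling."""
    rng = random.Random(seed)
    expanded = high_risk + smote(high_risk, (a - 1) * len(high_risk), rng=rng)
    reduced = rng.sample(general, len(general) // b)
    return expanded, reduced


high = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
general = [[float(i), 5.0] for i in range(8)]
expanded, reduced = resample(high, general, a=2, b=2)
print(len(expanded), len(reduced))
```

Under this reading, a ratio of 1:1 leaves both classes unchanged, which is consistent with the loop starting at 1:1.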

S8, training the high-risk driver classifier with a random forest algorithm and determining the model parameters to obtain the driver traffic accident risk prediction model; the model can output the driver label value and the risk probability.

S9, evaluating the model with the test set data to obtain the model accuracy at different coverage rates.

S10, classifying the data in the complement N_m' of the sample N_m within the general driver data subset N according to the number of violations, inputting the data for 1, 2, 3, 4, 5, and 6 or more violations into the model by category, counting the misjudgment rate of the driver labels output by the model at different coverage rates, and drawing the model misjudgment rate curve for each category.
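The evaluation in S9 and S10 can be sketched as two small functions. The patent does not define its metrics formally, so the sketch rests on assumptions: "coverage" is read as recall over true high-risk samples, "accuracy" as the share of flagged samples that are truly high-risk, and "misjudgment rate" as the share of held-out general drivers flagged as high-risk. All names are hypothetical.

```python
from collections import defaultdict


def accuracy_and_coverage(probs, labels, threshold):
    """Return (accuracy, coverage) of the high-risk flag at one threshold.

    labels use 1 for high-risk and 0 for general (assumed encoding).
    """
    flagged = [p >= threshold for p in probs]
    true_pos = sum(1 for f, y in zip(flagged, labels) if f and y == 1)
    n_flagged = sum(flagged)
    n_high = sum(1 for y in labels if y == 1)
    accuracy = true_pos / n_flagged if n_flagged else 0.0
    coverage = true_pos / n_high if n_high else 0.0
    return accuracy, coverage


def misjudgment_by_violations(held_out, threshold, cap=6):
    """held_out: (violation_count, risk_probability) pairs for general
    drivers in the complement N_m'. Counts of cap or more share one group."""
    flagged, total = defaultdict(int), defaultdict(int)
    for count, prob in held_out:
        group = min(count, cap)
        total[group] += 1
        flagged[group] += prob >= threshold   # True counts as 1
    return {g: flagged[g] / total[g] for g in sorted(total)}


acc, cov = accuracy_and_coverage([0.99, 0.95, 0.60, 0.99, 0.10], [1, 1, 1, 0, 0], 0.98)
rates = misjudgment_by_violations(
    [(1, 0.99), (1, 0.2), (2, 0.5), (7, 0.99), (6, 0.1)], 0.98
)
print(acc, cov, rates)
```

Sweeping the threshold and recording both outputs gives the kind of accuracy and misjudgment curves the steps describe.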

S11, judging whether j has reached the set maximum value; if so, judging whether i has reached the set maximum value, and if so going to S12, otherwise setting i = i + 1 and going to S6; if j has not reached the set maximum value, setting j = j + 1 and going to S7.

S12, detecting whether n_m has reached the upper limit of the interval S; if so, going to S13; otherwise setting m = m + 1 and returning to S4.

And determining an optimal random sampling number M, SMOTE sampling proportion I, J, a model coverage rate recall and a model discrimination threshold.

In the embodiment, a random forest algorithm is adopted, and the sample expansion and shrinkage ratio for high-risk versus general drivers in the training set ranges from 1:1 to 4:4. The model with optimal performance is determined by jointly comparing the misjudgment rate, the accuracy, and the stability of the indicators: the number of random samples is 2400 and the SMOTE ratio is 2:2. The top 20 attribute variables by importance in the model are shown in FIG. 3. The model coverage rate recall is 0.06, the corresponding model accuracy is 0.889, and the model discrimination threshold is 0.98; the misjudgment rate and accuracy curves of the model are shown in FIG. 4. The model performance shows that the misjudgment rate for drivers with 1 violation is slightly higher than for the other categories.

And S14, inputting the subset data to be identified in the step S2 into the model, and determining the corresponding data mark value and the risk probability. The results of the partial judgment are shown in Table 3.

Table 3. high risk driver identification result using the method of the present invention

Claims

1. A traffic participant accident risk prediction method based on ensemble learning, characterized by comprising the following steps:

s2, classifying the illegal data set into two categories according to the serious traffic accident record of the serious accident data set and the light accident record of the light accident data set, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified;

s3, setting a sampling interval S and a cycle step k according to the sample size of the data set N;

S4, setting the sample size n_m = s_0 + (m-1)·k, where s_0 is the lower limit of the sampling interval and m is the cycle number with an initial value of 1; randomly drawing a sample N_m of size n_m from the data set N;

S5, splitting the union G_m of the data set D and the sample N_m into a training set and a test set;

S6, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, where a_i = 1 when i = 1 and a_i = a_{i-1} + 1 when i > 1; the initial value of i is 1, and i has a set upper limit;

S7, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, where b_j = 1 when j = 1 and b_j = b_{j-1} + 1 when j > 1; the initial value of j is 1, and j has a set upper limit; for the SMOTE sampling ratio a_i : b_j, performing sample expansion and sample shrinkage on the two classes of labeled samples in the training set to form the training sample set of the classifier;

The model can output a marker value and a risk probability;

S9, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

S10, classifying the data in the complement N_m' of the sample N_m within the general personnel data subset N according to the number of violations, and inputting the data into the model by category

Drawing a model misjudgment rate curve of the classification;

S12, detecting whether n_m has reached the upper limit of the sampling interval; if so, going to S13; otherwise setting m = m + 1 and returning to S4;

2. The ensemble learning-based traffic participant accident risk prediction method according to claim 1, wherein the ensemble learning algorithm in step S8 includes a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm, a GBDT algorithm;

3. the ensemble learning-based traffic participant accident risk prediction method according to claim 1, wherein the method for assigning the corresponding data label value label based on the classification rule in step S2 specifically comprises:

and the data which do not meet the discrimination conditions of the high-risk personnel and the common personnel form a subset to be recognized.

4. The ensemble learning-based traffic participant accident risk prediction method of claim 1, wherein: the original traffic violation data and accident data in step S1 include personnel certificate information; the violation records are collected and classified to obtain the illegal data set; the illegal data set is the full sample data of personnel violation records, and the information in the illegal data set comprises personnel certificate numbers, number of violations, violation types, punishment conditions, occurrence of accident-related violations, and violation time periods.

5. The ensemble learning-based traffic participant accident risk prediction method of claim 1, wherein: in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types that most strongly influence traffic accidents are extracted as data attributes of the illegal data set.

6. The ensemble learning-based traffic participant accident risk prediction method of claim 4, wherein: in step S1, the violation time period is obtained by converting the continuous time variable into a discrete variable and classifying it according to the time characteristics of violations.