CN108596409B

CN108596409B - Method for improving accident risk prediction precision of traffic hazard personnel

Info

Publication number: CN108596409B
Application number: CN201810783017.7A
Authority: CN
Inventors: 刘林; 陈凝; 吕伟韬; 马党生
Original assignee: Jiangsu Zhitong Traffic Technology Co ltd
Current assignee: Jiangsu Zhitong Traffic Technology Co ltd
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2021-07-20
Anticipated expiration: 2038-07-16
Also published as: CN108596409A

Abstract

The invention provides a method for improving the accident risk prediction precision of traffic hazard personnel. The method obtains traffic violation data and accident data samples by an optimized sampling method, trains a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm, and optimizes the model with a genetic algorithm. The ensemble learning algorithm mines the safety characteristics of travelers from traffic violation data, the optimized sampling method applied in the sampling stage of model construction improves the performance of the initial model, and the genetic algorithm tunes the model parameters, so that the accident risk prediction precision for dangerous personnel is effectively improved.

Description

Method for improving accident risk prediction precision of traffic hazard personnel

Technical Field

The invention relates to a method for improving the accident risk prediction precision of traffic hazard personnel.

Background

Research shows that traffic violations and traffic accidents are correlated, and that the attributes and behaviors of drivers, pedestrians and other traffic participants recorded in violation data can support human-factor analysis in traffic safety. Using a classification approach, the safety characteristics of traffic offenders can be mined from personnel attribute variables.

Traditional classification methods search the space of possible functions for the classifier closest to the actual classification function, but in practice often only a weakly supervised model with certain preferences can be obtained, and the reliability of such a model is poor. Ensemble learning algorithms improve the performance of the final model by combining weakly supervised models. However, the complex parameter composition of an ensemble learning model makes it difficult to improve the model further. A genetic algorithm can search for a globally optimal or near-optimal solution, which provides a feasible way to improve precision.

Disclosure of Invention

The invention aims to provide a method for improving the accident risk prediction precision of traffic hazard personnel. The method adopts an ensemble learning algorithm with optimized sampling and tunes the parameters with a genetic algorithm, so that the risk degree of traffic participants with traffic violation records is evaluated quantitatively. This fills the gap in current quantitative analysis of traffic safety participant factors and effectively improves the accident risk prediction precision for traffic hazard personnel.

The technical solution of the invention is as follows:

A method for improving the accident risk prediction precision of traffic hazard personnel obtains traffic violation data and accident data samples by an optimized sampling method, trains a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm, and further optimizes the model with a genetic algorithm to improve the accuracy of the prediction result. The method comprises the following steps:

S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data.

S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified.

S3, constructing an initial traffic participant risk prediction model P0 by adopting an optimized sampling method and an integrated learning algorithm, and determining the sampling number and the SMOTE sampling proportion of the model.

S4, optimizing the performance of the model P0 with a genetic algorithm, wherein the optimization objective is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the cross selection probability, the mutation intervals, the number of population breeding generations and the initial population size.

S5, constructing an optimal fitting model P for predicting the accident risk of the dangerous personnel according to the target optimal model parameters output by the genetic algorithm, and determining the model coverage rate recall and the model discrimination threshold;

and S6, inputting the subset data to be identified of the S2 into the model P, and outputting the target object risk.

Further, the ensemble learning algorithm in step S3 includes a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm, and a GBDT algorithm.

Further, the optimized sampling method in step S3 comprises the following specific steps:

S31, setting a sampling interval and a loop step k according to the sample size of the data set N, wherein the upper bound S of the interval generally does not exceed 25% of the total sample size;

S32, computing the sample size n_m = s₀ + (m-1)·k, wherein s₀ is the lower limit of the sampling interval and m is the cycle number with an initial value of 1; randomly extracting from the data set N a sample N_m of size n_m;

S33, splitting the combined set G_m of the data set D and the sample N_m into a training set and a test set;

S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, wherein a_i = 1 when i = 1 and a_i = a_(i-1) + 1 when i > 1; the initial value of i is 1, and i has a set upper limit;

S35, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, wherein b_j = 1 when j = 1 and b_j = b_(j-1) + 1 when j > 1; the initial value of j is 1, and j has a set upper limit; according to the SMOTE sampling ratio a_i:b_j, expanding and shrinking the two classes of labelled samples in the training set to form the training sample set of the classifier;

S36, training the high-risk personnel classifier with the ensemble learning algorithm and determining the model parameters to obtain a traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;

S37, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

S38, classifying the data in N_m', the complement of the sample N_m in the general personnel data subset N, by number of violations, inputting the data into the model by category, and counting the misjudgment rate of the personnel labels output by the model at different coverage rates;

S39, determining whether j has reached its upper limit; if so, determining whether i has reached its upper limit, and if so, entering S310, otherwise setting i = i + 1 and entering S34; if j has not reached its upper limit, setting j = j + 1 and entering S35;

S310, determining whether n_m has reached the upper limit S of the sampling interval; if so, entering S311, otherwise setting m = m + 1 and returning to S32;

S311, selecting the model with the best performance according to the model accuracy and the misjudgment rate, thereby determining the optimal random sample number M and the SMOTE sampling ratio I, J.
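The nested loops of S32 to S310 amount to a grid search over the sample size and the two SMOTE ratios. The following is a minimal sketch of that enumeration only (no model is trained). The function names are hypothetical, and it assumes that j is reset to 1 each time i is incremented, which the text does not state explicitly. The numbers in the usage lines are the ones given in the embodiment (interval [200, 4000], step 200, upper limits of 4 for i and j).

```python
from itertools import product

def sample_sizes(s0, s_max, k):
    """Yield n_m = s0 + (m - 1) * k for m = 1, 2, ... while n_m <= s_max."""
    m = 1
    while True:
        n_m = s0 + (m - 1) * k
        if n_m > s_max:
            return
        yield n_m
        m += 1

def sampling_grid(s0, s_max, k, i_max, j_max):
    """Enumerate every (n_m, a_i, b_j) visited by the loops of S32 to S310.

    Because a_1 = 1 and a_i = a_(i-1) + 1, the ratio a_i equals i
    (and likewise b_j equals j). j is the innermost loop (assumption:
    j restarts at 1 whenever i advances).
    """
    for n_m in sample_sizes(s0, s_max, k):
        for a_i, b_j in product(range(1, i_max + 1), range(1, j_max + 1)):
            yield n_m, a_i, b_j

grid = list(sampling_grid(s0=200, s_max=4000, k=200, i_max=4, j_max=4))
print(len(grid))          # 20 sample sizes x 16 ratio pairs
print(grid[0], grid[-1])  # first and last candidate configuration
```

Each tuple corresponds to one candidate model to be trained in S36 and evaluated in S37 and S38.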

Further, the method for assigning the corresponding data label value label based on the classification rule in step S2 specifically includes:

high-risk personnel: either (1) persons who have violation records and a serious traffic accident record in which they bear major or full responsibility, or (2) persons who have violation records, have only slight accident records, and have no fewer than 2 accident records;

general personnel: persons with violation records but no accident records;

data that satisfy none of the above conditions constitute the subset to be identified.
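The classification rule of step S2 can be written as a small decision function. The sketch below is illustrative only: the function name, the argument layout, and the string encoding of accident liability ("full", "major", and so on) are assumptions, not part of the method as disclosed.

```python
def label_person(n_violations, serious_liabilities, n_minor_accidents):
    """Return 'high_risk', 'general', or 'to_identify' following the S2 rule.

    serious_liabilities: one liability string per serious accident record,
    e.g. "full", "major", "equal", "minor" (hypothetical encoding).
    """
    if n_violations < 1:
        raise ValueError("the illegal data set only contains people with violation records")
    has_serious = len(serious_liabilities) > 0
    # Condition 1: a serious accident with major or full responsibility.
    if any(liability in ("full", "major") for liability in serious_liabilities):
        return "high_risk"
    # Condition 2: only slight accidents, and at least two of them.
    if not has_serious and n_minor_accidents >= 2:
        return "high_risk"
    # General personnel: violation records but no accident records at all.
    if not has_serious and n_minor_accidents == 0:
        return "general"
    # Everything else belongs to the subset U to be identified.
    return "to_identify"

print(label_person(3, ["major"], 0))
print(label_person(1, [], 1))
```

Under this reading, a person with exactly one slight accident, or with a serious accident in which they bear less than major responsibility, falls into the subset to be identified.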

Further, the original traffic violation data and accident data in step S1 include the certificate information of the persons concerned; the violation records are collected and classified to obtain the illegal data set; the illegal data set is the full sample of violation records, and its fields comprise the personnel certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and the time period in which each violation occurred.

Further, in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.

Further, in step S1, the violation time period is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.

The invention has the beneficial effects that:

First, the parameters of the initial fitting model are optimized with a genetic algorithm, which markedly improves the accident risk prediction precision for traffic hazard personnel.

Second, compared with traditional classification methods such as decision trees and neural networks, the ensemble learning algorithm adopted by the invention has a clear advantage in prediction performance, which supports the accuracy of the traffic accident risk prediction for dangerous personnel.

Third, traffic violation data are mined with the optimized ensemble learning algorithm, so that the traffic safety risk degree is evaluated quantitatively from the violation records of traffic participants, and the model outputs the traffic risk degree of each person.

Drawings

FIG. 1 is a schematic flow chart of a method for improving the accident risk prediction accuracy of traffic hazard personnel according to an embodiment of the invention.

Fig. 2 is a schematic flow chart of the optimal sampling method adopted in S3 in the embodiment.

FIG. 3 is an explanatory diagram of a data set in the embodiment.

FIG. 4 is a schematic diagram of the propagation process of the genetic algorithm employed in S4 in the embodiment.

Detailed Description

Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

Examples

A method for improving the accident risk prediction precision of traffic hazard personnel obtains traffic violation data and accident data samples by an optimized sampling method, trains a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm, and further optimizes the model with a genetic algorithm to improve the accuracy of the prediction result, as shown in FIG. 1. The ensemble learning algorithm mines the safety characteristics of travelers from traffic violation data, the optimized sampling method applied in the sampling stage of model construction improves the performance of the initial model, and the genetic algorithm tunes the model parameters, so that the accident risk prediction precision for dangerous personnel is effectively improved. The specific method comprises the following steps:

The original traffic violation data and accident data include the certificate information of the persons concerned; the violation records are collected and classified to obtain the illegal data set; the illegal data set is the full sample of violation records, and its fields comprise the personnel certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and the time period in which each violation occurred; the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a higher degree of influence on traffic accidents are extracted as data attributes of the illegal data set; the violation time period is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.

The classification rules are specifically as follows: high-risk personnel are (1) traffic participants (including drivers of motor vehicles and non-motor vehicles, and pedestrians) who have violation records and a serious traffic accident record in which they bear major or full responsibility, and (2) traffic participants who have violation records, have only slight accident records, and have no fewer than 2 accident records; general personnel are traffic participants who have violation records but no accident records; data that satisfy none of the above conditions constitute the subset to be identified.

S3, constructing an initial traffic participant risk prediction model P0 by adopting an optimized sampling method and an integrated learning algorithm, and determining the sampling number and the SMOTE sampling proportion of the model; the ensemble learning algorithm comprises a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm and a GBDT algorithm. As shown in fig. 2, the specific process is as follows:

S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, wherein a_i = 1 when i = 1 and a_i = a_(i-1) + 1 when i > 1; the upper limit of i is usually 4;

S35, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, wherein b_j = 1 when j = 1 and b_j = b_(j-1) + 1 when j > 1; the upper limit of j is usually 4; according to the SMOTE sampling ratio a_i:b_j, expanding and shrinking the two classes of labelled samples in the training set to form the training sample set of the classifier;

The model can output a label value and a risk probability;

S37, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

S38, classifying the data in N_m', the complement of the sample N_m in the general personnel data subset N, by number of violations, and inputting the data into the model by category;

The optimal random sample number M and the SMOTE sampling ratio I, J are determined.
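The expansion and shrinkage of S34 and S35 can be illustrated with a basic SMOTE routine written with the standard library only. This is a sketch under a stated assumption about what the ratio a_i:b_j means: the high-risk subset D is grown to a_i times its original size with synthetic samples, and the general sample N_m is reduced to 1/b_j of its size by random undersampling. The text does not define the ratio at this level of detail, and the function names are hypothetical.

```python
import math
import random

def smote(minority, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE).
    Requires at least two minority samples."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

def resample(high_risk, general, a_i, b_j, rng=None):
    """Assumed reading of a_i:b_j: grow D to a_i times its size with SMOTE
    and shrink N_m to 1/b_j of its size by random undersampling."""
    rng = rng or random.Random(0)
    expanded = list(high_risk) + smote(high_risk, (a_i - 1) * len(high_risk), rng=rng)
    shrunk = rng.sample(list(general), max(1, len(general) // b_j))
    return expanded, shrunk

high = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
general = [(5.0 + n, 5.0) for n in range(12)]
expanded, shrunk = resample(high, general, a_i=2, b_j=2, rng=random.Random(42))
print(len(expanded), len(shrunk))
```

With a_i = 1 no synthetic samples are added, and with b_j = 1 the general sample is kept whole, so the pair 1:1 leaves the training set unchanged under this reading.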

Specific examples

The present embodiment takes a driver of a motor vehicle as an analysis target.

S1, acquiring 2 years of traffic violation records and accident records for the area by interfacing with the database.

Traffic accidents involving death, serious injury, or hit-and-run are taken as serious accidents, and all other accidents as slight accidents. The original accident records are classified accordingly, and the accident type and driver certificate information are used as attribute features of the serious accident data set and the slight accident data set to obtain the sample data of the two data sets.

Further, the raw violation data are preprocessed, and the violation information of each driver is collected and counted, including the cumulative number of violations, violation types, cumulative points deducted, average points deducted (points per violation), maximum points deducted in a single violation, cumulative fine amount, and average fine amount (yuan per violation).
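The per-driver statistics listed above can be computed with a single grouping pass. The sketch below assumes a hypothetical record layout of (driver_id, violation_type, points, fine_yuan); the field names and the sample records are made up for illustration and do not come from the embodiment's data.

```python
from collections import defaultdict

def aggregate_violations(records):
    """Group raw violation records by driver and compute summary features.

    Each record is (driver_id, violation_type, points, fine_yuan).
    """
    by_driver = defaultdict(list)
    for driver_id, vtype, points, fine in records:
        by_driver[driver_id].append((vtype, points, fine))

    features = {}
    for driver_id, rows in by_driver.items():
        n = len(rows)
        total_points = sum(p for _, p, _ in rows)
        total_fine = sum(f for _, _, f in rows)
        features[driver_id] = {
            "violation_count": n,
            "violation_types": sorted({t for t, _, _ in rows}),
            "total_points": total_points,
            "avg_points": total_points / n,   # points per violation
            "max_points": max(p for _, p, _ in rows),
            "total_fine": total_fine,
            "avg_fine": total_fine / n,       # yuan per violation
        }
    return features

records = [  # illustrative records only
    ("A001", "speeding", 6, 200),
    ("A001", "red_light", 6, 200),
    ("A001", "illegal_parking", 0, 100),
    ("B002", "speeding", 3, 150),
]
features = aggregate_violations(records)
print(features["A001"])
```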

Correspondence analysis is applied to reduce the dimensionality of the traffic accident data and the raw violation data; violation types are classified by their association with accident types, and the five types with the highest association are extracted as data attributes of the accident-risk violation field, as shown in Table 1.

TABLE 1. Division of accident-related violation types

According to the traffic flow operation of the road network in the area of this embodiment and the occurrence pattern of traffic violation events, time is aggregated and divided into analysis periods, converting the continuous variable into a nominal variable; in another embodiment, the time periods are divided by other statistical means such as clustering.
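Converting a continuous violation timestamp into a nominal time period can be sketched as a simple lookup. The period boundaries and labels below are hypothetical; the embodiment derives its own periods from local traffic flow and violation patterns and does not list them.

```python
from datetime import datetime

# Hypothetical period boundaries as (start_hour, end_hour, label); end is exclusive.
PERIODS = [
    (7, 9, "morning_peak"),
    (9, 17, "daytime_off_peak"),
    (17, 19, "evening_peak"),
]

def time_period(ts):
    """Map a timestamp to a nominal period label; hours outside PERIODS are 'night'."""
    for start, end, label in PERIODS:
        if start <= ts.hour < end:
            return label
    return "night"

print(time_period(datetime(2018, 7, 16, 8, 30)))
print(time_period(datetime(2018, 7, 16, 23, 15)))
```

Replacing the fixed table with cluster boundaries learned from the data would correspond to the clustering variant mentioned in the text.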

For the driver characteristic data, the age, gender and province/city code of the driver are extracted from the driver certificate number; the illegal data set is then generated from the information extracted in each step, as shown in Table 2.

TABLE 2. Partial data of the illegal data set
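Extracting age, gender and region code from a certificate number can be sketched as follows, assuming the certificate number is an 18-character resident identity number in the standard layout (6-digit region code, 8-digit birth date, 3-digit sequence code whose last digit is odd for male, and a check character). The text does not state which certificate format is used, so this layout, the function name and the reference date are assumptions. The number in the usage line is a commonly used specimen, not a real record.

```python
from datetime import date

def parse_id_number(id_number, on_date):
    """Extract province/city code, gender, and age from an 18-character ID number."""
    if len(id_number) != 18:
        raise ValueError("expected an 18-character ID number")
    birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    # Subtract one year if the birthday has not yet occurred on `on_date`.
    not_yet = (on_date.month, on_date.day) < (birth.month, birth.day)
    age = on_date.year - birth.year - int(not_yet)
    return {
        "province_code": id_number[:2],
        "city_code": id_number[:4],
        "gender": "male" if int(id_number[16]) % 2 == 1 else "female",
        "age": age,
    }

info = parse_id_number("11010519491231002X", on_date=date(2018, 7, 16))
print(info)
```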

S2, classifying the full sample I in the illegal data set into two categories, high-risk drivers and general drivers. Referring to FIG. 3, under the first condition, drivers who have violation records and a serious traffic accident record with major or full responsibility are taken as high-risk drivers, and the eligible data form data set D1; under the second condition, drivers who have violation records, have only slight accident records, and have no fewer than 2 accident records are taken as high-risk drivers, and the eligible data form data set D2; the high-risk driver data set is D = D1 + D2. The data of drivers with violation records but no accident records form the general driver data set N.

Accordingly, a high-risk or general data label value is determined for the data in the illegal data set that meet the rules, and the data subset U to which the classification rules do not apply is the subset to be identified.

S3, constructing an initial vehicle driver risk prediction model P0 by adopting an optimized sampling method and an XgBoost algorithm, and determining the model sampling number and the SMOTE sampling proportion;

S31, setting a sampling interval and a loop step k according to the sample size of the data set N, wherein the upper bound S of the interval generally does not exceed 25% of the total sample size; in this embodiment, the sample size of the data set exceeds 84000, the sampling interval is [200, 4000], and the loop step k is 200.

S32, computing the sample size n_m = s₀ + (m-1)·k, wherein s₀ is the lower limit of the sampling interval and m is the cycle number with an initial value of 1; randomly extracting from the data set N a sample N_m of size n_m; in this embodiment, the initial sample size is 200.

S33, splitting the combined set G_m of the data set D and the sample N_m into a training set and a test set; in this embodiment, the split ratio of the training set to the test set is 9:1.

S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk driver data subset D, wherein a_1 = 1 and a_i = a_(i-1) + 1; the initial value of i is 1, and the maximum value of i is 4;

S35, for the sample expansion ratio a_i of the high-risk drivers, setting the sample shrinkage ratio b_j of the general driver data subset N_m, wherein b_1 = 1 and b_j = b_(j-1) + 1; the initial value of j is 1, and the maximum value of j is 4; according to the SMOTE sampling ratio a_i:b_j, expanding and shrinking the two classes of labelled samples in the training set to form the training sample set of the classifier;

S36, training the high-risk driver classifier with the XgBoost algorithm and determining the model parameters to obtain a driver traffic accident risk prediction model; the model can output a driver label value and a risk probability; the model parameters comprise the learning rate, the number of weak classifiers, the maximum tree depth, the node minimum split value, the leaf node minimum sample number, the leaf node weight sum minimum value, the minimum loss function value, the row sampling rate, the column sampling rate, regularization term 1, regularization term 2, the positive and negative weight balance term, and the early termination training condition;

S37, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

S38, classifying the data in N_m', the complement of the sample N_m in the general driver data subset N, by number of violations, inputting the data into the model by category, and counting the misjudgment rate of the driver labels output by the model at different coverage rates;

S39, determining whether j has reached its set maximum value; if so, determining whether i has reached its set maximum value, and if so, entering S310, otherwise setting i = i + 1 and entering S34; if j has not reached its maximum value, setting j = j + 1 and entering S35;

S310, determining whether n_m has reached the interval upper limit S; if so, entering S311, otherwise setting m = m + 1 and returning to S32;

The optimal random sample number M and the SMOTE sampling ratio I, J are determined.

In this embodiment, the misjudgment rate, the accuracy and the index stability are compared together, and the best-performing model is the one with a random sample number of 2400 and a SMOTE ratio of 2:2.
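The misjudgment statistic of S38 can be sketched as follows. It assumes that every sample in the complement N_m' is a general driver, so any sample the model flags as high-risk at a given threshold counts as a misjudgment; it also assumes hypothetical violation-count bins, since the text does not give the grouping. The probabilities in the usage lines are made up.

```python
from collections import defaultdict

def misjudgment_rate_by_group(violation_counts, risk_probs, threshold, bins=(1, 2, 4, 8)):
    """Share of general-driver samples predicted high-risk, per violation-count bin.

    bins are hypothetical lower edges: [1, 2), [2, 4), [4, 8), [8, inf).
    Every count must be at least the smallest edge.
    """
    totals, flagged = defaultdict(int), defaultdict(int)
    for count, prob in zip(violation_counts, risk_probs):
        edge = max(b for b in bins if b <= count)  # bin the sample falls into
        totals[edge] += 1
        flagged[edge] += prob >= threshold         # True counts as 1
    return {edge: flagged[edge] / totals[edge] for edge in sorted(totals)}

counts = [1, 1, 2, 3, 5, 9]
probs = [0.1, 0.6, 0.4, 0.7, 0.2, 0.9]
rates = misjudgment_rate_by_group(counts, probs, threshold=0.5)
print(rates)
```

Repeating the call for several thresholds gives the misjudgment rate at different coverage rates, which can then be compared across candidate models alongside their accuracy.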

S4, optimizing the performance of the model P0 with a genetic algorithm, wherein the optimization objective is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the cross selection probability, the mutation intervals, the number of population breeding generations and the initial population size.
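The k-fold cross validation used as the objective can be sketched with the standard library. The helper names and the `fit`/`predict` callables are hypothetical stand-ins for whatever classifier is being tuned; the sketch only shows how the mean test-fold accuracy would be computed.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k shuffled folds; yield (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g, fold in enumerate(folds) if g != f for i in fold]
        yield train, folds[f]

def cross_val_accuracy(fit, predict, X, y, k=10):
    """Mean test-fold accuracy of a model given user-supplied fit/predict callables."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return sum(scores) / k

# Toy usage: a fixed-threshold "model" on one-dimensional data.
X = [n / 20 for n in range(20)]
y = [int(x > 0.5) for x in X]
accuracy = cross_val_accuracy(lambda X_tr, y_tr: 0.5, lambda m, x: int(x > m), X, y, k=10)
print(accuracy)
```

Each sample is used for testing exactly once, so the returned value is an average over k disjoint test folds.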

In this embodiment, the test set accuracy under 10-fold cross validation is used as the objective function, and the genetic algorithm parameters are set as follows: cross selection probability (crossselectivity) 0.8; mutation probability (MutationProbability) 0.5; mutation intervals (Sigma) [[-10,10], [-2,2], [-2,2], [-2,2]]; number of population breeding generations (Iteration) 500; initial population size (position) 100. The propagation process of the genetic algorithm for parameter optimization is shown in FIG. 4.
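A genetic search over real-valued parameters with these kinds of settings can be sketched as below. The text gives only the probabilities, mutation intervals, generation count and population size; the selection scheme (truncation with elitism), uniform crossover, per-gene mutation, and the reading of each mutation interval as a range of additive offsets are all assumptions made for illustration. The toy objective stands in for the cross-validated accuracy of the classifier.

```python
import random

def genetic_search(fitness, start, sigma, pop_size=100, generations=500,
                   p_cross=0.8, p_mut=0.5, seed=0):
    """Maximise `fitness` over real-valued parameter vectors.

    sigma[g] = (low, high) is the interval from which a mutation offset
    for gene g is drawn (assumed meaning of the mutation intervals).
    """
    rng = random.Random(seed)

    def mutate(ind):
        return [x + rng.uniform(*sigma[g]) if rng.random() < p_mut else x
                for g, x in enumerate(ind)]

    population = [list(start)] + [mutate(start) for _ in range(pop_size - 1)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]      # truncation selection (assumption)
        children = [ranked[0]]                 # elitism keeps the current best
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            if rng.random() < p_cross:         # uniform crossover (assumption)
                child = [rng.choice(pair) for pair in zip(a, b)]
            else:
                child = list(a)
            children.append(mutate(child))
        population = children
    return max(population, key=fitness)

def toy(params):
    """Stand-in objective with its maximum at (3, -1)."""
    x, y = params
    return -((x - 3) ** 2) - (y + 1) ** 2

best = genetic_search(toy, [0.0, 0.0], [(-10, 10), (-2, 2)], pop_size=20, generations=50)
print(best, toy(best))
```

Because the best individual is copied unchanged into each new generation, the best fitness never decreases, which is one simple way to avoid oscillation; smaller mutation intervals trade exploration for faster settling.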

S5, constructing the optimal fitting model P for predicting the accident risk of vehicle drivers from the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold.
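The relationship between the discrimination threshold and the coverage rate (recall) can be made concrete with a short sketch. The function names are hypothetical, and choosing the highest threshold that meets a target recall is only one possible selection rule; the text does not say how the threshold is chosen. The labels and probabilities are made up.

```python
def recall_precision(y_true, risk_probs, threshold):
    """Recall (coverage of true high-risk people) and precision at a threshold."""
    predicted = [p >= threshold for p in risk_probs]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and not p)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

def threshold_for_recall(y_true, risk_probs, target_recall):
    """Highest candidate threshold whose recall meets the target, or None."""
    for threshold in sorted(set(risk_probs), reverse=True):
        recall, _ = recall_precision(y_true, risk_probs, threshold)
        if recall >= target_recall:
            return threshold
    return None

y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
print(recall_precision(y_true, probs, 0.5))
print(threshold_for_recall(y_true, probs, 0.6))
```

Lowering the threshold raises coverage at the cost of flagging more general drivers, which is why the two quantities are fixed together in this step.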

In this embodiment, the specific parameters of the XgBoost-based initial model after optimization by the genetic algorithm are as follows: learning rate learning_rate_value = 0.09; number of weak classifiers n_estimators_value = 367; maximum tree depth max_depth_value = 4; node minimum split value min_samples_split_value = 10; leaf node minimum sample number min_samples_leaf_value = 6; leaf node weight sum minimum value min_child_weight_value = 3; minimum loss function value gamma_value = 0; row sampling rate subsample_value = 0.45; column sampling rate sample_byte_value = 0.1; regularization term 1 reg_lambda_value = 11. Regularization term 2, the positive and negative weight balance term, and the early termination training condition are likewise set from the genetic algorithm output.

The accuracy of the model after parameter optimization reaches 0.76.

S6, inputting the data of the subset to be identified from S2 into the model P, and outputting the risk degree of each driver. Some of the results are shown in Table 3.

Table 3 analysis results of the risk degree of high-risk drivers using the method of the present invention

Claims

1. A method for improving the accident risk prediction precision of traffic hazard personnel, characterized in that traffic violation data and accident data samples are obtained by an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized with a genetic algorithm to improve the accuracy of the prediction result, the method specifically comprising the following steps:

S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data;

S2, classifying the illegal data set into two categories, high-risk personnel and general personnel, according to the serious traffic accident records of the serious accident data set and the slight accident records of the slight accident data set, determining a data label value according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified;

S3, constructing an initial dangerous personnel accident risk prediction model P₀ by adopting an optimized sampling method and an ensemble learning algorithm, and determining the sampling number and the SMOTE sampling proportion of the model; the optimized sampling method in step S3 comprises the following steps:

S31, setting a sampling interval S and a cycle step k according to the sample size of the data set N;

S32, computing the sample size n_m = s₀ + (m-1)·k, wherein s₀ is the lower limit value of the sampling interval, m is the cycle number, and the initial value of m is 1; randomly extracting from the data set N a sample N_m of size n_m;

S33, splitting the combined set G_m of the data set D and the sample N_m into a training set and a test set;

S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, wherein a_i = 1 when i = 1 and a_i = a_(i-1) + 1 when i > 1; the initial value of i is 1, and i has a set upper limit;

S35, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, wherein b_j = 1 when j = 1 and b_j = b_(j-1) + 1 when j > 1; the initial value of j is 1, and j has a set upper limit; according to the SMOTE sampling ratio a_i:b_j, expanding and shrinking the two classes of labelled samples in the training set to form the training sample set of the classifier;

The model can output a marker value and a risk probability;

S37, evaluating the model with the test set data to obtain the model accuracy at different coverage rates;

S38, classifying the data in N_m', the complement of the sample N_m in the general personnel data subset N, according to the number of violations, and inputting the data into the model by category;

S310, detecting whether n_m reaches the sampling interval upper limit value S; if so, entering S311, otherwise setting m = m + 1 and returning to S32;

determining the optimal random sampling number M and the SMOTE sampling ratio I, J;

S4, optimizing the performance of the model P₀ with a genetic algorithm, wherein the optimization objective function is the maximization of the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the cross selection probability, the mutation intervals, the number of population breeding generations and the initial population size;

and S6, inputting the subset data to be identified in the step S2 into the model P, and outputting the target object risk.

2. The method for improving the accident risk prediction accuracy of traffic hazard personnel according to claim 1, wherein the ensemble learning algorithm in step S3 comprises a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm, and a GBDT algorithm.

3. The method for improving the accident risk prediction accuracy of the traffic hazard personnel according to claim 1, wherein the method for assigning the corresponding data label value label based on the classification rule in the step S2 specifically comprises:

general personnel: persons with violation records but no accident records;

and the data which do not meet the discrimination conditions of the high-risk personnel and the common personnel form a subset to be recognized.

4. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 1, wherein: the original traffic violation data and accident data in step S1 include personnel certificate information; collecting and classifying illegal records to obtain an illegal data set; the illegal data set records full sample data for the illegal, and the information of the illegal data set comprises personnel certificate numbers, illegal times, illegal types, punishment conditions, accident illegal behavior occurrence conditions and illegal occurrence time intervals.

5. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 1, wherein: in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.

6. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 4, wherein: in step S1, the time-continuous variable is converted into a discrete variable, and the discrete variable is classified according to the characteristics of the time of violation.