CN108596409A

CN108596409A - Method for improving the accuracy of accident risk prediction for high-risk traffic personnel

Info

Publication number: CN108596409A
Application number: CN201810783017.7A
Authority: CN
Inventors: 刘林; 陈凝; 吕伟韬; 马党生
Original assignee: JIANGSU INTELLIGENT TRANSPORTATION SYSTEMS Co Ltd
Current assignee: JIANGSU INTELLIGENT TRANSPORTATION SYSTEMS Co Ltd
Priority date: 2018-07-16
Filing date: 2018-07-16
Publication date: 2018-09-28
Anticipated expiration: 2038-07-16
Also published as: CN108596409B

Abstract

The present invention provides a method for improving the accuracy of accident risk prediction for high-risk traffic personnel. Traffic violation data and accident data samples are obtained with an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is optimized with a genetic algorithm. The invention uses ensemble learning to mine the safety characteristics of travellers from traffic violation data, uses the optimized sampling method in the sampling stage of model construction to improve the performance of the initial model, and tunes the model parameters with a genetic algorithm, effectively improving the accuracy of accident risk prediction for high-risk personnel.

Description

Method for improving the accuracy of accident risk prediction for high-risk traffic personnel

Technical field

The present invention relates to a method for improving the accuracy of accident risk prediction for high-risk traffic personnel.

Background art

Studies have shown that traffic violations are correlated with traffic accidents. The attributes and behaviour of drivers, pedestrians and other traffic participants retained in traffic violation records can provide data support for analysing human factors in traffic safety. Data mining can take a classification approach and extract the safety characteristics of traffic violators from personnel attribute variables.

Traditional classification methods search a space of possible functions for the classifier that is closest to the actual classification function. In practice, however, they usually yield only biased, weakly supervised models whose reliability is poor. Ensemble learning algorithms improve the performance of the final model by combining weakly supervised models, but the complex parameter composition of an ensemble learning model makes it difficult to improve model performance further. Genetic algorithms are well suited to finding a global optimum or a near-optimal result, and so provide a feasible way to improve accuracy.

Summary of the invention

The object of the present invention is to provide a method for improving the accuracy of accident risk prediction for high-risk traffic personnel. The method uses an ensemble learning algorithm with optimized sampling, tunes the model parameters with a genetic algorithm, and quantitatively assesses the danger level of traffic participants who have traffic violation records. It fills the current gap in quantitative methods for analysing the participant factor in traffic safety and effectively improves the accuracy of accident risk prediction for high-risk personnel.

The technical solution of the invention is:

A method for improving the accuracy of accident risk prediction for high-risk traffic personnel, in which traffic violation data and accident data samples are obtained with an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized with a genetic algorithm to improve the accuracy of the prediction results. The method comprises the following steps:

S1, based on the original traffic violation data and accident data, construct a violation data set, a serious accident data set and a minor accident data set.

S2, classify the violation data set into two classes, high-risk personnel and general personnel; determine the data label values according to the classification rules, and accordingly divide the violation data set into a high-risk personnel data subset D, a general personnel data subset N and a to-be-identified subset U.

S3, build an initial traffic participant danger level prediction model P0 using the optimized sampling method and an ensemble learning algorithm, and determine the model sample size and the SMOTE sampling ratio.

S4, optimize the performance of model P0 with a genetic algorithm. The optimization objective is to maximize the prediction accuracy on the test set, where test-set accuracy is evaluated by k-fold cross-validation. The genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided. The genetic algorithm parameters include the crossover probability, mutation probability, mutation interval, number of generations and initial population size.

S5, according to the optimal model parameters output by the genetic algorithm, build the optimal fitted model P for accident risk prediction of dangerous personnel, and determine the model's test coverage (recall) and discrimination threshold;

S6, input the data of the to-be-identified subset from S2 into model P and output the danger level of each target object.

Further, the ensemble learning algorithms in step S3 include the random forest, AdaBoost, XgBoost and GBDT algorithms.

Further, the specific steps of the optimized sampling method in step S3 are:

S31, set the sampling interval S and the loop step k according to the sample size of data set N; the upper bound s of the interval usually does not exceed 25% of the total sample size;

S32, the sample size is n_m = s_0 + (m-1)k, where s_0 is the lower bound of the sampling interval and m is the loop index, with initial value 1; randomly draw a sample N_m of size n_m from data set N;

S33, split the combined set (union) G_m of data sets D and N_m into a training set and a test set;

S34, apply SMOTE sampling to the training set and set the oversampling ratio a_i for the high-risk personnel data subset D, where a_i = 1 when i = 1 and a_i = a_(i-1) + 1 when i > 1; i starts at 1 and has a preset upper limit;

S35, for the high-risk oversampling ratio a_i, set the undersampling ratio b_j for the general personnel data subset N_m, where b_j = 1 when j = 1 and b_j = b_(j-1) + 1 when j > 1; j starts at 1 and has a preset upper limit; for the SMOTE sampling ratio a_i:b_j, oversample and undersample the two classes of labelled samples in the training set to obtain the training sample set of the classifier;

S36, train the high-risk personnel classifier with the ensemble learning algorithm, determine the model parameters, and fit the traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;

S37, evaluate the model with the test set data to obtain the model accuracy at different coverage rates;

S38, classify the data in N_m', the complement of the sample N_m within the general personnel data subset N, by number of violations, input them into the model by category, and compute the personnel label misjudgment rate output by the model at different coverage rates;

S39, check whether j has reached its upper limit; if so, check whether i has reached its upper limit: if so, go to S310, otherwise set i = i+1 and go to S34; if j has not reached its upper limit, set j = j+1 and go to S35;

S310, check whether n_m has reached the upper bound s of the sampling interval; if so, go to S311, otherwise set m = m+1 and return to S32;

S311, identify the model with the best performance by analysing model accuracy and misjudgment rate, and thereby determine the optimal random sample size M and the SMOTE sampling ratio I:J.
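The loop structure of S32 to S310 can be sketched as an enumeration of candidate settings. This is only an illustrative reading: the function name `sampling_grid` is hypothetical, the numeric values are the ones reported later in the specific example (S = [200, 4000], k = 200, upper limit 4 for i and j), and it is assumed that j restarts at 1 each time i is incremented, which the text does not state explicitly.

```python
def sampling_grid(s0, s_max, k, i_max, j_max):
    """Enumerate (n_m, a_i, b_j) candidates in the loop order of S32-S310."""
    grid = []
    m = 1
    while True:
        n_m = s0 + (m - 1) * k               # S32: sample size for loop index m
        if n_m > s_max:                      # S310: stop once the interval is exhausted
            break
        for a_i in range(1, i_max + 1):      # S34: a_1 = 1, a_i = a_(i-1) + 1
            for b_j in range(1, j_max + 1):  # S35: b_1 = 1, b_j = b_(j-1) + 1
                grid.append((n_m, a_i, b_j))
        m += 1
    return grid


grid = sampling_grid(s0=200, s_max=4000, k=200, i_max=4, j_max=4)
print(len(grid))  # 320 candidate settings (20 sample sizes x 4 x 4 ratios)
```

Each tuple corresponds to one pass through S33 to S38, so under these values a full search would train and evaluate 320 candidate models before S311 selects the best one.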

Further, the method in step S2 of assigning data label values based on the classification rules is specifically:

High-risk personnel: one class consists of persons who have violation records and a record of a serious traffic accident for which they bore primary or full responsibility; the other class consists of persons who have violation records and only minor accident records, with no fewer than 2 accident records;

General personnel: persons who have violation records but no accident records;

Data that do not meet the above criteria constitute the to-be-identified subset.
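The classification rules above can be written as a small labelling function. This is a minimal sketch: the function name, the argument names and the way accident counts are split into three arguments are assumptions made for illustration, not part of the method as described.

```python
def label_person(n_violations, n_serious_at_fault, n_serious_other, n_minor):
    """Return 'D' (high-risk), 'N' (general) or 'U' (to be identified).

    n_serious_at_fault: serious accidents with primary or full responsibility
    n_serious_other:    serious accidents with lesser or no responsibility
    n_minor:            minor accidents
    """
    if n_violations == 0:
        return "U"   # the rules only cover persons with violation records
    if n_serious_at_fault > 0:
        return "D"   # high-risk, first class
    if n_serious_other == 0 and n_minor >= 2:
        return "D"   # high-risk, second class: only minor accidents, at least 2
    if n_serious_other == 0 and n_minor == 0:
        return "N"   # general: violations but no accident record
    return "U"       # everything else stays unlabelled


print(label_person(3, 1, 0, 0), label_person(2, 0, 0, 2),
      label_person(1, 0, 0, 0), label_person(1, 0, 0, 1))
```

A person with a single minor accident, for example, satisfies neither the high-risk nor the general rule and therefore falls into the to-be-identified subset.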

Further, the original traffic violation data and accident data in step S1 include the identity document information of the persons involved; the violation records are aggregated and classified to obtain the violation data set; the violation data set is the full sample of violation records, and its information includes the person's identity document number, number of violations, violation types, penalty points and fines, occurrence of accident-related violations, and the time period in which the violations occurred.

Further, the occurrence of accident-related violations in step S1 is obtained by correspondence analysis, and the violation types with a higher influence on traffic accidents are extracted as data attributes of the violation data set.

Further, the violation time period in step S1 is obtained by converting the continuous time variable into a discrete variable, classified according to the temporal characteristics of violations.

The beneficial effects of the invention are:

First, the invention uses a genetic algorithm to optimize the parameters of the initial fitted model, which markedly improves the accuracy of accident risk prediction for high-risk traffic personnel.

Second, compared with conventional classification methods such as decision trees and neural networks, the ensemble learning algorithm used by the invention has a significant advantage in prediction performance, which ensures the accuracy of traffic accident risk prediction for dangerous personnel.

Third, the invention mines traffic violation data with an optimized ensemble learning algorithm and achieves a quantitative traffic safety risk assessment based on traffic participants' violation records; the model can output a person's traffic danger level.

Description of the drawings

Fig. 1 is a schematic flow chart of the method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to an embodiment of the invention.

Fig. 2 is a detailed flow chart of the optimized sampling method used in S3 in the embodiment.

Fig. 3 is a schematic diagram illustrating the data sets in the embodiment.

Fig. 4 is a schematic diagram of the genetic algorithm reproduction process used in S5 in the embodiment.

Detailed description of the embodiments

A preferred embodiment of the invention is described in detail below with reference to the accompanying drawings.

Embodiment

A method for improving the accuracy of accident risk prediction for high-risk traffic personnel, in which traffic violation data and accident data samples are obtained with an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized with a genetic algorithm to improve the accuracy of the prediction results, as shown in Fig. 1. The method of the embodiment uses ensemble learning to mine the safety characteristics of travellers from traffic violation data, uses the optimized sampling method in the sampling stage of model construction to improve the performance of the initial model, and tunes the model parameters with a genetic algorithm, effectively improving the accuracy of accident risk prediction for high-risk personnel. The specific method flow is:

The original traffic violation data and accident data include the identity document information of the persons involved; the violation records are aggregated and classified to obtain the violation data set; the violation data set is the full sample of violation records, and its information includes the person's identity document number, number of violations, violation types, penalty points and fines, occurrence of accident-related violations, and the time period in which the violations occurred; the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a higher influence on traffic accidents are extracted as data attributes of the violation data set; the violation time period is obtained by converting the continuous time variable into a discrete variable, classified according to the temporal characteristics of violations.

The classification rules are specifically: high-risk personnel are (1) traffic participants (including motor vehicle drivers, non-motor vehicle drivers and pedestrians) who have violation records and a record of a serious traffic accident for which they bore primary or full responsibility; and (2) traffic participants who have violation records and only minor accident records, with no fewer than 2 accident records. General personnel are traffic participants who have violation records but no accident records. Data that do not meet the above criteria constitute the to-be-identified subset.

S3, build an initial traffic participant danger level prediction model P0 using the optimized sampling method and an ensemble learning algorithm, and determine the model sample size and the SMOTE sampling ratio; the ensemble learning algorithms include the random forest, AdaBoost, XgBoost and GBDT algorithms. As shown in Fig. 2, the detailed process is:

S33, split the combined set (union) G_m of data sets D and N_m into a training set and a test set;

S34, apply SMOTE sampling to the training set and set the oversampling ratio a_i for the high-risk personnel data subset D, where a_i = 1 when i = 1 and a_i = a_(i-1) + 1 when i > 1; the upper limit of i is usually 4;

S35, for the high-risk oversampling ratio a_i, set the undersampling ratio b_j for the general personnel data subset N_m, where b_j = 1 when j = 1 and b_j = b_(j-1) + 1 when j > 1; the upper limit of j is usually 4; for the SMOTE sampling ratio a_i:b_j, oversample and undersample the two classes of labelled samples in the training set to obtain the training sample set of the classifier;

Specific example

This embodiment takes motor vehicle drivers as the objects of analysis.

S1, obtain the traffic violation records and accident records of the region over 2 years through a data interface.

Traffic accidents involving serious casualties or hit-and-run are treated as serious accidents, and all other accidents as minor accidents. The original accident records are classified accordingly, and the accident type and the driver's identity document information are used as the attribute features of the serious accident data set and the minor accident data set, yielding the sample data of the two data sets.

Further, the original violation data are pre-processed and each driver's violation information is aggregated, including the cumulative number of violations, violation types, cumulative penalty points, average penalty points (points per violation), maximum penalty points for a single violation, cumulative fines, and average fine (yuan per violation).
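The per-driver aggregation described above can be sketched as follows. The record layout `(driver_id, violation_type, points, fine)`, the feature names and the sample rows are all made up for illustration; the source does not specify a schema.

```python
from collections import defaultdict


def aggregate_violations(records):
    """Aggregate (driver_id, violation_type, points, fine) rows per driver."""
    by_driver = defaultdict(list)
    for driver_id, vtype, points, fine in records:
        by_driver[driver_id].append((vtype, points, fine))

    features = {}
    for driver_id, rows in by_driver.items():
        n = len(rows)
        total_points = sum(p for _, p, _ in rows)
        total_fine = sum(f for _, _, f in rows)
        features[driver_id] = {
            "violation_count": n,                                # cumulative violations
            "violation_types": sorted({t for t, _, _ in rows}),  # distinct types
            "total_points": total_points,                        # cumulative penalty points
            "avg_points": total_points / n,                      # points per violation
            "max_points": max(p for _, p, _ in rows),            # largest single deduction
            "total_fine": total_fine,                            # cumulative fines
            "avg_fine": total_fine / n,                          # yuan per violation
        }
    return features


# Hypothetical sample rows, not taken from the source data.
records = [
    ("A", "speeding", 3, 200),
    ("A", "red_light", 6, 200),
    ("B", "illegal_parking", 0, 100),
]
features = aggregate_violations(records)
print(features["A"]["avg_points"])  # 4.5
```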

Correspondence analysis is used to reduce the dimensionality of the traffic accident data and the original violation data. Violation types are classified according to the correlation between violation types and accident types, and the five classes with the highest correlation are extracted as the data attributes of the accident-risk violation field, as shown in Table 1.

Table 1. Classification of accident-related violation types

According to the traffic flow operation of the road network in the region of the embodiment and the occurrence pattern of traffic violations, time is aggregated and divided into analysis periods, converting the continuous variable into a nominal variable. In another embodiment, the periods are divided by other statistical means such as clustering.
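The conversion of a continuous time of day into a nominal period can be sketched with a simple lookup. The period boundaries and names below are hypothetical; the source derives its periods from local traffic flow and violation patterns and does not list them.

```python
# Hypothetical analysis periods as (start_hour, end_hour, name); the real
# boundaries would come from the region's traffic flow and violation patterns.
PERIODS = [
    (0, 7, "night"),
    (7, 9, "morning_peak"),
    (9, 17, "daytime"),
    (17, 19, "evening_peak"),
    (19, 24, "night"),
]


def time_period(hour):
    """Map an hour of day in [0, 24) to a nominal period name."""
    if not 0 <= hour < 24:
        raise ValueError("hour must be in [0, 24)")
    for start, end, name in PERIODS:
        if start <= hour < end:
            return name
    raise AssertionError("PERIODS must cover the whole day")


print(time_period(8.5))  # morning_peak
```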

Driver characteristics such as age, gender and home province or city are then extracted from the driver's identity document number and encoded. The violation data set is generated from the information extracted in the above steps, as shown in Table 2.

Table 2. Partial data of the violation data set
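The extraction of age, gender and province from a document number, described above, can be sketched as follows, assuming the document is an 18-digit Chinese resident identity number (first two digits: province code; digits 7 to 14: date of birth; 17th digit: odd for male, even for female). The source does not say which document type is used, so this layout is an assumption, and the number in the example is synthetic.

```python
from datetime import date


def parse_id_number(id_number, on_date):
    """Extract province code, age and gender from an 18-character ID number."""
    if len(id_number) != 18:
        raise ValueError("expected an 18-character identity number")
    birth = date(int(id_number[6:10]), int(id_number[10:12]), int(id_number[12:14]))
    # Subtract one year if the birthday has not yet occurred in on_date's year.
    had_birthday = (on_date.month, on_date.day) >= (birth.month, birth.day)
    age = on_date.year - birth.year - (0 if had_birthday else 1)
    gender = "male" if int(id_number[16]) % 2 == 1 else "female"
    return {"province_code": id_number[:2], "age": age, "gender": gender}


# Synthetic number for illustration only; the check digit is not validated.
info = parse_id_number("320102199001011234", on_date=date(2018, 7, 16))
print(info)
```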

S2, classify the full sample I of the violation data set into two classes, high-risk drivers and general drivers. As shown in Fig. 4, drivers who have violation records and a record of a serious traffic accident for which they bore primary or full responsibility are one case of high-risk drivers, and the qualifying data are assigned to data set D1; drivers who have violation records and only minor accident records, with no fewer than 2 accident records, are the other case of high-risk drivers, and the qualifying data are assigned to data set D2. The high-risk driver data set is D = D1 + D2. The data of drivers who have violation records but no accident records form the general driver data set N.

The data in the violation data set that satisfy the rules are accordingly given a high-risk or general label value. The remaining data subset U = I - N - D, to which the classification rules do not apply, is the to-be-identified data subset.

S3, build an initial motor vehicle driver danger level prediction model P0 using the optimized sampling method and the XgBoost algorithm, and determine the model sample size and the SMOTE sampling ratio;

S31, set the sampling interval S and the loop step k according to the sample size of data set N; the upper bound s of the interval usually does not exceed 25% of the total sample size. In this embodiment, the data set contains more than 84,000 samples, the sampling interval is S = [200, 4000] and the loop step k is 200.

S32, the sample size is n_m = s_0 + (m-1)k, where s_0 is the lower bound of the sampling interval and m is the loop index, with initial value 1; randomly draw a sample N_m of size n_m from data set N. In this embodiment, the initial sample size is 200.

S33, split the combined set (union) G_m of data sets D and N_m into a training set and a test set. In this embodiment, the initial ratio of training set to test set is 9:1.

S34, apply SMOTE sampling to the training set and set the oversampling ratio a_i for the high-risk driver data subset D, where a_1 = 1 and a_i = a_(i-1) + 1; i starts at 1 and has a preset upper limit, here a maximum of 4;

S35, for the high-risk driver oversampling ratio a_i, set the undersampling ratio b_j for the general driver data subset N_m, where b_1 = 1 and b_j = b_(j-1) + 1; j starts at 1 and has a preset upper limit, here a maximum of 4; for the SMOTE sampling ratio a_i:b_j, oversample and undersample the two classes of labelled samples in the training set to obtain the training sample set of the classifier;

S36, train the high-risk driver classifier with the XgBoost algorithm, determine the model parameters, and fit the driver traffic accident risk prediction model; the model can output a driver label value and a risk probability. The model parameters include the learning rate, number of weak classifiers, maximum tree depth, minimum node split value, minimum samples per leaf node, minimum sum of leaf node weights, minimum loss reduction, row sampling rate, column sampling rate, regularization term 1, regularization term 2, positive/negative weight balance term, and early stopping condition;

S38, classify the data in N_m', the complement of the sample N_m within the general driver data subset N, by number of violations, input them into the model by category, and compute the driver label misjudgment rate output by the model at different coverage rates;

S39, check whether j has reached its set maximum; if so, check whether i has reached its set maximum: if so, go to S310, otherwise set i = i+1 and go to S34; if j has not reached its maximum, set j = j+1 and go to S35;

S310, check whether n_m has reached the upper bound s of the interval; if so, go to S311, otherwise set m = m+1 and return to S32;

In this embodiment, the misjudgment rate, accuracy and index stability are compared and analysed together to determine the model with the best performance, for which the random sample size is 2400 and the SMOTE ratio is 2:2.
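The oversampling and undersampling in S34 and S35 can be illustrated with a minimal pure-Python SMOTE sketch. It interpolates between a minority sample and one of its nearest minority neighbours, which is the core idea of SMOTE. How the ratio a_i:b_j maps to sample counts is not spelled out in the source; the sketch assumes that the high-risk class is grown to a times its size and the general class is cut to 1/b of its size. The function names are hypothetical, and a production pipeline would more likely use an existing SMOTE implementation.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x by squared Euclidean distance
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


def resample(minority, majority, a, b, seed=0):
    """Assumed reading of ratio a:b: minority grows a-fold, majority shrinks to 1/b."""
    rng = random.Random(seed)
    grown = list(minority) + smote(minority, (a - 1) * len(minority), seed=seed)
    kept = rng.sample(list(majority), max(1, len(majority) // b))
    return grown, kept


# Toy two-feature data: 4 high-risk points and 20 general points.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(20)]
grown, kept = resample(minority, majority, a=2, b=2)
print(len(grown), len(kept))  # 8 10
```

Resampling is applied to the training set only, as S34 specifies, so that the test set keeps the original class distribution.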

S4, optimize the performance of model P0 with a genetic algorithm. The optimization objective is to maximize the prediction accuracy on the test set, where test-set accuracy is evaluated by k-fold cross-validation. The genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided. The genetic algorithm parameters include the crossover probability, mutation probability, mutation interval, number of generations and initial population size.

In the embodiment, the test-set accuracy under 10-fold cross-validation is used as the objective function, and the genetic algorithm parameters are set as follows: crossover probability CrossoverProbaiblity = 0.8, mutation probability MutationProbability = 0.5, mutation interval Sigma = [[-10,10], [-2,2], [-2,2], [-2,2], [-2,2]], number of generations Iteration = 500, initial population size Population = 100. The reproduction process of the genetic algorithm used for parameter optimization is shown in Fig. 4.
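A genetic search over model parameters can be sketched as below. This is not the patent's implementation: tournament selection, uniform crossover and elitism are choices made here for illustration, each mutation interval is read as the range of an additive perturbation, and `toy_fitness` is a stand-in for the 10-fold cross-validated accuracy, which would require training a model per evaluation. The defaults mirror the reported settings (crossover 0.8, mutation 0.5, 500 generations, population 100); the example run uses smaller values to stay fast.

```python
import random


def genetic_maximize(fitness, bounds, sigma, pop_size=100, generations=500,
                     p_cross=0.8, p_mut=0.5, seed=0):
    """Maximize fitness(genes) over box-bounded real-valued genes."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]                           # elitism: keep the best so far
        while len(children) < pop_size:
            p1 = max(rng.sample(pop, 3), key=fitness)  # tournament selection
            p2 = max(rng.sample(pop, 3), key=fitness)
            child = p1[:]
            if rng.random() < p_cross:                 # uniform crossover
                child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            if rng.random() < p_mut:                   # perturb one gene, then clip
                g = rng.randrange(len(child))
                step = rng.uniform(*sigma[g])
                lo, hi = bounds[g]
                child[g] = min(hi, max(lo, child[g] + step))
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)


def toy_fitness(genes):
    # Stand-in objective with its maximum (0) at learning_rate=0.1, max_depth=4.
    learning_rate, max_depth = genes
    return -((learning_rate - 0.1) ** 2) - (max_depth - 4.0) ** 2


bounds = [(0.01, 1.0), (1.0, 10.0)]   # hypothetical search ranges
sigma = [(-0.1, 0.1), (-2.0, 2.0)]    # hypothetical mutation intervals
best, score = genetic_maximize(toy_fitness, bounds, sigma,
                               pop_size=40, generations=60)
print(best, score)
```

Because the best individual is always carried into the next generation, the best score never decreases from one generation to the next, which is one simple way to avoid the oscillation the text warns about.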

S5, according to the optimal model parameters output by the genetic algorithm, build the optimal fitted model P for motor vehicle driver danger level prediction, and determine the model's test coverage (recall) and discrimination threshold.

In the embodiment, the parameters of the XgBoost-based initial model after genetic algorithm optimization are: learning rate learning_rate_value = 0.09; number of weak classifiers n_estimators_value = 367; maximum tree depth max_depth_value = 4; minimum node split value min_samples_split_value = 10; minimum samples per leaf node min_samples_leaf_value = 6; minimum sum of leaf node weights min_child_weight_value = 3; minimum loss reduction gamma_value = 0; row sampling rate subsample_value = 0.45; column sampling rate colsample_bytree_value = 0.1; regularization term 1 reg_lambda_value = 11; regularization term 2 reg_alpha_value = 11; positive/negative weight balance term scale_pos_weight_value = 1; early stopping condition early_stopping_rounds_value = 37.
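For readers who want to reuse these values, they can be collected in a dictionary keyed by the parameter names of the XGBoost scikit-learn interface (the `_value` suffix in the text is dropped). Two of the reported names, `min_samples_split` and `min_samples_leaf`, follow scikit-learn's gradient boosting naming and, as far as I know, have no XGBoost parameter of the same name, so they are kept separately rather than mapped. The grouping is an editorial assumption.

```python
# Reported values whose names match XGBoost scikit-learn estimator parameters.
xgb_params = {
    "learning_rate": 0.09,
    "n_estimators": 367,
    "max_depth": 4,
    "min_child_weight": 3,
    "gamma": 0,
    "subsample": 0.45,
    "colsample_bytree": 0.1,
    "reg_lambda": 11,
    "reg_alpha": 11,
    "scale_pos_weight": 1,
}

# Training control reported in the text (early stopping patience in rounds).
training_control = {"early_stopping_rounds": 37}

# Reported values with no same-named XGBoost parameter; left unmapped.
unmapped_reported = {"min_samples_split": 10, "min_samples_leaf": 6}

# If the xgboost package is installed, the first dictionary could be passed as
# xgboost.XGBClassifier(**xgb_params); that call is not executed here.
print(sorted(xgb_params))
```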

After parameter optimization, the model accuracy reaches 0.76.
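Step S5 also calls for a discrimination threshold tied to a coverage (recall) level. One simple way to choose such a threshold from predicted risk probabilities is sketched below; the function name and the toy scores are hypothetical, and the source does not describe how its threshold was actually chosen.

```python
import math


def threshold_for_recall(scores, labels, target_recall):
    """Highest threshold t such that predicting 'high-risk' when score >= t
    reaches at least target_recall on the true high-risk samples (label 1)."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no positive samples")
    # Number of positives that must be captured to reach the target recall.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


# Toy risk probabilities and true labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1]
print(threshold_for_recall(scores, labels, 0.75))  # 0.4
```

Lowering the threshold raises coverage at the cost of more general drivers being flagged, which is the trade-off that the misjudgment rate in S38 measures.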

S6, input the data of the to-be-identified subset from S2 into model P and output each driver's danger level. Partial results are shown in Table 3.

Table 3. Danger level analysis results for high-risk drivers using the method of the invention

Claims

1. A method for improving the accuracy of accident risk prediction for high-risk traffic personnel, characterized in that: traffic violation data and accident data samples are obtained with an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized with a genetic algorithm to improve the accuracy of the prediction results, specifically comprising the following steps:

S1, based on the original traffic violation data and accident data, construct a violation data set, a serious accident data set and a minor accident data set;

S2, classify the violation data set into two classes, high-risk personnel and general personnel; determine the data label values according to the classification rules, and accordingly divide the violation data set into a high-risk personnel data subset D, a general personnel data subset N and a to-be-identified subset U;

S3, build an initial dangerous personnel accident risk prediction model P₀ using the optimized sampling method and an ensemble learning algorithm, and determine the model sample size and the SMOTE sampling ratio;

S4, optimize the performance of model P₀ with a genetic algorithm, the optimization objective being to maximize the prediction accuracy on the test set, where test-set accuracy is evaluated by k-fold cross-validation; set the genetic algorithm parameters so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters include the crossover probability, mutation probability, mutation interval, number of generations and initial population size;

S5, according to the optimal model parameters output by the genetic algorithm, build the optimal fitted model P for accident risk prediction of dangerous personnel, and determine the model's test coverage (recall) and discrimination threshold;

S6, input the data of the to-be-identified subset from step S2 into model P and output the danger level of each target object.

2. The method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to claim 1, characterized in that the ensemble learning algorithms in step S3 include the random forest, AdaBoost, XgBoost and GBDT algorithms.

3. The method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to claim 1, characterized in that the specific steps of the optimized sampling method in step S3 are:

S31, set the sampling interval S and the loop step k according to the sample size of data set N;

S32, the sample size is n_m = s₀ + (m-1)k, where s₀ is the lower bound of the sampling interval and m is the loop index, with initial value 1; randomly draw a sample N_m of size n_m from data set N;

S33, split the combined set (union) G_m of data sets D and N_m into a training set and a test set;

S34, apply SMOTE sampling to the training set and set the oversampling ratio a_i for the high-risk personnel data subset D, where a_i = 1 when i = 1 and a_i = a_(i-1) + 1 when i > 1; i starts at 1 and has a preset upper limit;

S35, for the high-risk oversampling ratio a_i, set the undersampling ratio b_j for the general personnel data subset N_m, where b_j = 1 when j = 1 and b_j = b_(j-1) + 1 when j > 1; j starts at 1 and has a preset upper limit; for the SMOTE sampling ratio a_i:b_j, oversample and undersample the two classes of labelled samples in the training set to obtain the training sample set of the classifier;

S36, train the high-risk personnel classifier with the ensemble learning algorithm, determine the model parameters, and fit the traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;

S38, classify the data in N_m', the complement of the sample N_m within the general personnel data subset N, by number of violations, input them into the model by category, and compute the personnel label misjudgment rate output by the model at different coverage rates;

S39, check whether j has reached its upper limit; if so, check whether i has reached its upper limit: if so, go to S310, otherwise set i = i+1 and go to S34; if j has not reached its upper limit, set j = j+1 and go to S35;

S310, check whether n_m has reached the upper bound s of the sampling interval; if so, go to S311, otherwise set m = m+1 and return to S32;

S311, identify the model with the best performance by analysing model accuracy and misjudgment rate, and thereby determine the optimal random sample size M and the SMOTE sampling ratio I:J.

4. The method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to claim 1, characterized in that the method in step S2 of assigning data label values based on the classification rules is specifically:

High-risk personnel: one class consists of persons who have violation records and a record of a serious traffic accident for which they bore primary or full responsibility; the other class consists of persons who have violation records and only minor accident records, with no fewer than 2 accident records;

5. The method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to claim 1, characterized in that: the original traffic violation data and accident data in step S1 include the identity document information of the persons involved; the violation records are aggregated and classified to obtain the violation data set; the violation data set is the full sample of violation records, and its information includes the person's identity document number, number of violations, violation types, penalty points and fines, occurrence of accident-related violations, and the time period in which the violations occurred.

6. The method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to claim 1, characterized in that: the occurrence of accident-related violations in step S1 is obtained by correspondence analysis, and the violation types with a higher influence on traffic accidents are extracted as data attributes of the violation data set.

7. The method for improving the accuracy of accident risk prediction for high-risk traffic personnel according to claim 1, characterized in that: the violation time period in step S1 is obtained by converting the continuous time variable into a discrete variable, classified according to the temporal characteristics of violations.