CN108596409B - Method for improving accident risk prediction precision of traffic hazard personnel - Google Patents
- Publication number
- CN108596409B CN201810783017.7A
- Authority
- CN
- China
- Prior art keywords
- accident
- data
- personnel
- illegal
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention provides a method for improving the accident risk prediction precision of traffic hazard personnel. Samples of traffic violation data and accident data are obtained by an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is optimized by a genetic algorithm. The ensemble learning algorithm mines the safety characteristics of travelers from traffic violation data, the optimized sampling method used in the sampling stage of model construction improves the performance of the initial model, and the genetic algorithm optimizes the model parameters, so that the accident risk prediction precision for dangerous personnel is effectively improved.
Description
Technical Field
The invention relates to a method for improving the accident risk prediction precision of traffic hazard personnel.
Background
Research shows that traffic violations and traffic accidents are correlated, and that the attributes and behaviors of drivers, pedestrians and other traffic participants recorded in traffic violation data can provide data support for human factor analysis in traffic safety. Using a classification approach, the safety characteristics of traffic violators can be mined from their personnel attribute variables.
Traditional classification methods search the space of possible functions for the classifier closest to the actual classification function, but in practice only a weakly supervised model with some preference can be obtained, and the reliability of such a model is poor. Ensemble learning algorithms improve the performance of the final model by combining weakly supervised models. However, the complex parameter composition of an ensemble learning model makes it difficult to improve the model further. A genetic algorithm can find a globally optimal or near-optimal solution, which provides a feasible scheme for improving precision.
Disclosure of Invention
The invention aims to provide a method for improving the accident risk prediction precision of traffic hazard personnel. It adopts an ensemble learning algorithm with optimized sampling and optimizes the parameters with a genetic algorithm, so as to quantitatively evaluate the risk degree of traffic participants who have traffic violation records, fill the gap in current quantitative analysis of human factors in traffic safety, and effectively improve the accident risk prediction precision for traffic hazard personnel.
The technical solution of the invention is as follows:
A method for improving the accident risk prediction precision of traffic hazard personnel, in which samples of traffic violation data and accident data are obtained by an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized by a genetic algorithm to improve the accuracy of the prediction result. The method comprises the following steps:
S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data.
S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining the data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified.
S3, constructing an initial traffic participant risk prediction model P0 by adopting the optimized sampling method and an ensemble learning algorithm, and determining the sampling number and the SMOTE sampling ratio of the model.
S4, optimizing the performance of the model P0 by adopting a genetic algorithm, wherein the objective function is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation intervals, the number of generations and the initial population size.
S5, constructing an optimal fitting model P for predicting the accident risk of dangerous personnel according to the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold;
S6, inputting the data of the subset to be identified from S2 into the model P, and outputting the risk degree of the target object.
Further, the ensemble learning algorithm in step S3 includes a random forest algorithm, an AdaBoost algorithm, an XGBoost algorithm, and a GBDT algorithm.
Further, the optimized sampling method in step S3 comprises the following specific steps:
S31, setting a sampling interval S and a cycle step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size;
S32, setting the sample size n_m = s_0 + (m - 1)·k, where s_0 is the lower limit of the sampling interval and m is the cycle number, with initial value 1; randomly drawing from the data set N a sample N_m of size n_m;
S33, splitting G_m, the union of the data set D and N_m, into a training set and a test set;
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, where a_1 = 1 and a_i = a_(i-1) + 1 for i > 1; the initial value of i is 1, and i has a set upper limit;
S35, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, where b_1 = 1 and b_j = b_(j-1) + 1 for j > 1; the initial value of j is 1, and j has a set upper limit; according to the SMOTE sampling ratio a_i : b_j, expanding and shrinking the samples of the two label classes in the training set to form the training sample set of the classifier;
S36, training the high-risk personnel classifier with the ensemble learning algorithm, determining the model parameters, and obtaining a traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;
S38, classifying the data in N_m', the complement of the sample N_m in the general personnel data subset N, by number of violations, inputting them into the model by category, and counting the misjudgment rate of the personnel labels output by the model under different coverage rates;
S39, checking whether j has reached its upper limit; if so, checking whether i has reached its upper limit: if so, going to S310, otherwise setting i = i + 1 and returning to S34; if j has not reached its upper limit, setting j = j + 1 and returning to S35;
S310, checking whether n_m has reached the upper limit of the sampling interval S; if so, going to S311, otherwise setting m = m + 1 and returning to S32;
S311, analyzing, according to model accuracy and misjudgment rate, which model performs best, and thereby determining the optimal random sample number M and the optimal SMOTE sampling ratio I : J.
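The expansion and shrinkage in S34 and S35 can be illustrated with a minimal, standard-library sketch of SMOTE-style resampling. It is not the patented implementation: the function names `smote_like` and `resample` are hypothetical, and reading the ratio a : b as "expand subset D a-fold, shrink the general sample to 1/b of its size" is an assumption, since the text does not define the ratio precisely.

```python
import math
import random

def smote_like(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    point and one of its k nearest minority neighbours (needs >= 2 points)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + t * (m - b) for b, m in zip(base, mate)))
    return synthetic

def resample(minority, majority, a, b, rng=None):
    """Assumed reading of the ratio a:b: grow the minority class to a times
    its size with synthetic points, keep 1/b of the majority class."""
    rng = rng or random.Random(0)
    expanded = list(minority) + smote_like(
        minority, (a - 1) * len(minority), rng=rng)
    kept = rng.sample(majority, max(1, len(majority) // b))
    return expanded, kept

# Toy feature vectors standing in for subset D and the sample N_m.
D = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
N_m = [(float(x), 5.0) for x in range(20)]
train_D, train_N = resample(D, N_m, a=2, b=2, rng=random.Random(42))
print(len(train_D), len(train_N))
```

In practice a library implementation of SMOTE would normally replace `smote_like`; the sketch only shows how one (a_i, b_j) pair turns into a training sample set.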
Further, the method for assigning the corresponding data label value label based on the classification rule in step S2 specifically includes:
High-risk personnel: either a person who has violation records and has a serious traffic accident record in which the person bears major or full responsibility, or a person who has violation records, has only slight accident records, and has no fewer than 2 accident records;
General personnel: a person who has violation records but no accident records;
Data that satisfy neither of the above conditions constitute the subset to be identified.
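The labelling rule above can be written as a small function. This is a sketch under stated assumptions: the function name `assign_subset` and its four count arguments are hypothetical, and every person is assumed to come from the illegal data set, so at least one violation record is required.

```python
def assign_subset(n_violations, n_serious_at_fault, n_serious_other, n_minor):
    """Return "D" (high-risk), "N" (general) or "U" (to be identified).

    n_serious_at_fault: serious accidents with major or full responsibility
    n_serious_other:    serious accidents with lesser or no responsibility
    n_minor:            slight accidents
    """
    if n_violations < 1:
        raise ValueError("the illegal data set only holds people with violations")
    if n_serious_at_fault > 0:
        return "D"  # rule 1: serious accident with major or full responsibility
    if n_serious_other == 0 and n_minor >= 2:
        return "D"  # rule 2: only slight accidents, at least two of them
    if n_serious_other == 0 and n_minor == 0:
        return "N"  # violations but no accident record at all
    return "U"      # none of the rules applies

print(assign_subset(3, 1, 0, 0))  # D
print(assign_subset(2, 0, 0, 0))  # N
print(assign_subset(2, 0, 0, 1))  # U
```

A person with exactly one slight accident, or with a serious accident but only minor responsibility, falls through to subset U, which is the set later scored by model P.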
Further, the original traffic violation data and accident data in step S1 include the certificate information of the persons involved; the violation records are collected and classified to obtain the illegal data set; the illegal data set holds the full sample of violation records, and its fields comprise the personnel certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and the time interval in which each violation occurred.
Further, in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.
Further, in step S1, the violation time interval is obtained by converting the continuous time variable into a discrete variable, with the classes chosen according to the temporal characteristics of violations.
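As an illustration of converting the continuous time variable into a discrete one, the sketch below bins a violation timestamp into a named period. The period boundaries and names in `PERIODS` are purely hypothetical; the text only says that periods are chosen from the temporal characteristics of violations.

```python
from datetime import datetime

# Hypothetical periods: (start hour inclusive, end hour exclusive, name).
PERIODS = [
    (0, 7, "night"),
    (7, 9, "morning_peak"),
    (9, 17, "daytime"),
    (17, 19, "evening_peak"),
    (19, 24, "night"),
]

def time_period(ts):
    """Map a timestamp to the nominal period that contains its hour."""
    for start, end, name in PERIODS:
        if start <= ts.hour < end:
            return name
    raise ValueError("hour outside 0..23")

print(time_period(datetime(2018, 7, 17, 8, 30)))   # morning_peak
print(time_period(datetime(2018, 7, 17, 23, 5)))   # night
```

Other partitions, for example ones derived by clustering violation times, would only change the contents of `PERIODS`.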
The invention has the beneficial effects that:
First, the parameters of the initial fitting model are optimized by a genetic algorithm, which significantly improves the accident risk prediction precision for traffic hazard personnel.
Second, compared with traditional classification methods such as decision trees and neural networks, the ensemble learning algorithm adopted by the invention has a clear advantage in prediction performance, which ensures the accuracy of traffic accident risk prediction for dangerous personnel.
Third, traffic violation data are mined with an optimized ensemble learning algorithm, so that the traffic safety risk degree is evaluated quantitatively from the violation records of traffic participants, and the model outputs the traffic risk degree of each person.
Drawings
FIG. 1 is a schematic flow chart of a method for improving the accident risk prediction accuracy of traffic hazard personnel according to an embodiment of the invention.
Fig. 2 is a schematic flow chart of the optimal sampling method adopted in S3 in the embodiment.
FIG. 3 is an explanatory diagram of a data set in the embodiment.
FIG. 4 is a schematic diagram of the evolution process of the genetic algorithm employed in S4 in the embodiment.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Examples
A method for improving the accident risk prediction precision of traffic hazard personnel, in which samples of traffic violation data and accident data are obtained by an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized by a genetic algorithm to improve the accuracy of the prediction result, as shown in FIG. 1. The ensemble learning algorithm mines the safety characteristics of travelers from traffic violation data, the optimized sampling method used in the sampling stage of model construction improves the performance of the initial model, and the genetic algorithm optimizes the model parameters, so that the accident risk prediction precision for dangerous personnel is effectively improved. The specific method comprises the following steps:
S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data.
The original traffic violation data and accident data include the certificate information of the persons involved; the violation records are collected and classified to obtain the illegal data set; the illegal data set holds the full sample of violation records, and its fields comprise the personnel certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and the time interval in which each violation occurred; the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set; the violation time interval is obtained by converting the continuous time variable into a discrete variable, with the classes chosen according to the temporal characteristics of violations.
S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining a data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified.
The classification rules are specifically as follows. High-risk personnel are (1) traffic participants (including drivers of motor vehicles and non-motor vehicles, and pedestrians) who have violation records and have a serious traffic accident record in which they bear major or full responsibility; and (2) traffic participants who have violation records, have only slight accident records, and have no fewer than 2 accident records. General personnel are traffic participants who have violation records but no accident records. Data that satisfy neither of the above conditions constitute the subset to be identified.
S3, constructing an initial traffic participant risk prediction model P0 by adopting the optimized sampling method and an ensemble learning algorithm, and determining the sampling number and the SMOTE sampling ratio of the model; the ensemble learning algorithm includes a random forest algorithm, an AdaBoost algorithm, an XGBoost algorithm and a GBDT algorithm. As shown in FIG. 2, the specific process is as follows:
S31, setting a sampling interval S and a cycle step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size;
S32, setting the sample size n_m = s_0 + (m - 1)·k, where s_0 is the lower limit of the sampling interval and m is the cycle number, with initial value 1; randomly drawing from the data set N a sample N_m of size n_m;
S33, splitting G_m, the union of the data set D and N_m, into a training set and a test set;
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D, where a_1 = 1 and a_i = a_(i-1) + 1 for i > 1; the upper limit of i is usually 4;
S35, for the sample expansion ratio a_i of the high-risk personnel, setting the sample shrinkage ratio b_j of the general personnel data subset N_m, where b_1 = 1 and b_j = b_(j-1) + 1 for j > 1; the upper limit of j is usually 4; according to the SMOTE sampling ratio a_i : b_j, expanding and shrinking the samples of the two label classes in the training set to form the training sample set of the classifier;
S36, training the high-risk personnel classifier with the ensemble learning algorithm, determining the model parameters, and obtaining a traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;
S38, classifying the data in N_m', the complement of the sample N_m in the general personnel data subset N, by number of violations, inputting them into the model by category, and counting the misjudgment rate of the personnel labels output by the model under different coverage rates;
S39, checking whether j has reached its upper limit; if so, checking whether i has reached its upper limit: if so, going to S310, otherwise setting i = i + 1 and returning to S34; if j has not reached its upper limit, setting j = j + 1 and returning to S35;
S310, checking whether n_m has reached the upper limit of the sampling interval S; if so, going to S311, otherwise setting m = m + 1 and returning to S32;
S311, analyzing, according to model accuracy and misjudgment rate, which model performs best, and thereby determining the optimal random sample number M and the optimal SMOTE sampling ratio I : J.
S4, optimizing the performance of the model P0 by adopting a genetic algorithm, wherein the objective function is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation intervals, the number of generations and the initial population size.
S5, constructing an optimal fitting model P for predicting the accident risk of dangerous personnel according to the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold;
S6, inputting the data of the subset to be identified from S2 into the model P, and outputting the risk degree of the target object.
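Step S5 fixes a coverage rate (recall) together with a discrimination threshold on the risk probability. The text does not say how the two are chosen, so the sketch below shows only one plausible procedure: take the largest threshold whose recall on labelled data still meets a target. The function names and the target value are assumptions.

```python
def recall_at(probs, labels, threshold):
    """Share of true high-risk cases (label 1) scored at or above threshold."""
    positives = sum(labels)
    hits = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= threshold)
    return hits / positives if positives else 0.0

def pick_threshold(probs, labels, target_recall):
    """Largest observed score whose recall meets the target (assumed rule)."""
    for t in sorted(set(probs), reverse=True):
        if recall_at(probs, labels, t) >= target_recall:
            return t
    return min(probs)

# Toy risk probabilities and true labels, for illustration only.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 1, 0]
t = pick_threshold(probs, labels, target_recall=0.6)
print(t, recall_at(probs, labels, t))
```

Lowering the threshold raises recall at the cost of more general personnel being flagged, which is the trade-off the misjudgment rate in S38 measures.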
Specific examples
The present embodiment takes motor vehicle drivers as the analysis target.
S1, acquiring two years of traffic violation records and accident records for the area by connecting to the database.
Traffic accidents involving death or serious injury, and hit-and-run accidents, are treated as serious accidents, and all other accidents as slight accidents. The original accident records are classified accordingly, the accident type and driver certificate information are taken as the attributes of the serious accident data set and the slight accident data set, and the sample data of the two data sets are obtained.
Further, the original violation data are preprocessed, and each driver's violation information is collected and aggregated, including the accumulated number of violations, violation types, accumulated penalty points, average penalty points (points per violation), maximum penalty points for a single violation, accumulated fine amount and average fine amount (yuan per violation).
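The per-driver aggregation described above can be sketched as follows. The record layout `(driver_id, violation_type, points, fine)` and the output field names are assumptions made for illustration; the actual database schema is not given in the text.

```python
from collections import defaultdict

def aggregate(records):
    """Aggregate raw violation rows into one feature dict per driver.

    records: iterable of (driver_id, violation_type, points, fine_yuan).
    """
    by_driver = defaultdict(list)
    for driver_id, vtype, points, fine in records:
        by_driver[driver_id].append((vtype, points, fine))
    features = {}
    for driver_id, rows in by_driver.items():
        n = len(rows)
        total_points = sum(r[1] for r in rows)
        total_fine = sum(r[2] for r in rows)
        features[driver_id] = {
            "violations": n,                        # accumulated count
            "types": sorted({r[0] for r in rows}),  # distinct violation types
            "points_total": total_points,
            "points_avg": total_points / n,         # points per violation
            "points_max": max(r[1] for r in rows),  # single maximum
            "fine_total": total_fine,
            "fine_avg": total_fine / n,             # yuan per violation
        }
    return features

rows = [
    ("A", "speeding", 6, 200),
    ("A", "red_light", 6, 200),
    ("A", "parking", 0, 100),
    ("B", "speeding", 3, 150),
]
print(aggregate(rows)["A"])
```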
Correspondence analysis is applied to reduce the dimensionality of the traffic accident data and the original violation data; violation types are classified according to the association between violations and accident types, and the five types with the highest association are extracted as data attributes of the accident-risk violation field, as shown in Table 1.
Table 1. Classification of accident-related violation types
According to the traffic flow of the road network in the area of the embodiment and the pattern in which traffic violations occur, times are aggregated, analysis periods are defined, and the continuous variable is converted into a nominal variable; in another embodiment, the time intervals are defined by other statistical means such as clustering.
For driver characteristic data, the driver's age, gender, and province and city code are extracted from the driver certificate number; the illegal data set is then generated from the information extracted in each of the above steps, as shown in Table 2.
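A sketch of extracting age, gender and region code from a certificate number follows. It assumes the certificate number has the layout of an 18-digit Chinese resident identity number (region code in digits 1 to 6, birth date in digits 7 to 14, odd 17th digit for male); the text does not state the format, and the function name `parse_certificate` and the sample numbers are made up.

```python
from datetime import date

def parse_certificate(number, today):
    """Derive age, gender and province+city code from an 18-character ID.

    Assumed layout: [0:6] region code, [6:14] birth date YYYYMMDD,
    [14:17] sequence (odd last digit = male), [17] check character.
    """
    if len(number) != 18:
        raise ValueError("expected an 18-character certificate number")
    birth = date(int(number[6:10]), int(number[10:12]), int(number[12:14]))
    # Subtract one year if this year's birthday has not happened yet.
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    age = today.year - birth.year - (0 if had_birthday else 1)
    gender = "male" if int(number[16]) % 2 == 1 else "female"
    return {"age": age, "gender": gender, "region_code": number[:4]}

# Synthetic number, not a real person; check digit is not validated here.
print(parse_certificate("110105199001011234", date(2018, 7, 17)))
```

The first four digits are used as the province and city code; check-digit validation is left out to keep the sketch short.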
Table 2. Partial data of the illegal data set
S2, classifying the full sample I of the illegal data set into two categories, namely high-risk drivers and general drivers. Referring to FIG. 3, under the first condition, drivers who have violation records and have a serious traffic accident record in which they bear major or full responsibility are taken as high-risk drivers, and the qualifying data form data set D1; under the second condition, drivers who have violation records, have only slight accident records, and have no fewer than 2 accident records are taken as high-risk drivers, and the qualifying data form data set D2; the high-risk driver data set is D = D1 + D2. The data of drivers who have violation records but no accident records form the general driver data set N.
Accordingly, a high-risk or general label value label is assigned to the data in the illegal data set that meet the rules, and the data to which the classification rules do not apply form the subset U to be identified.
S3, constructing an initial motor vehicle driver risk prediction model P0 by adopting the optimized sampling method and the XGBoost algorithm, and determining the sampling number and the SMOTE sampling ratio of the model;
S31, setting a sampling interval S and a cycle step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size; in this embodiment, the sample size of the data set exceeds 84000, the sampling interval S is [200, 4000], and the cycle step k is 200.
S32, setting the sample size n_m = s_0 + (m - 1)·k, where s_0 is the lower limit of the sampling interval and m is the cycle number, with initial value 1; randomly drawing from the data set N a sample N_m of size n_m; in this embodiment, the initial sample size is 200.
S33, splitting G_m, the union of the data set D and N_m, into a training set and a test set; in this embodiment, the ratio of training set to test set is 9:1.
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk driver data subset D, where a_1 = 1 and a_i = a_(i-1) + 1; the initial value of i is 1, and the maximum value of i is set to 4;
S35, for the sample expansion ratio a_i of the high-risk drivers, setting the sample shrinkage ratio b_j of the general driver data subset N_m, where b_1 = 1 and b_j = b_(j-1) + 1; the initial value of j is 1, and the maximum value of j is set to 4; according to the SMOTE sampling ratio a_i : b_j, expanding and shrinking the samples of the two label classes in the training set to form the training sample set of the classifier;
S36, training the high-risk driver classifier with the XGBoost algorithm, determining the model parameters, and obtaining a driver traffic accident risk prediction model; the model can output a driver label value and a risk probability; the model parameters comprise the learning rate, number of weak classifiers, maximum tree depth, node minimum split value, leaf node minimum sample number, minimum sum of leaf node weights, minimum loss function value, row sampling rate, column sampling rate, regularization term 1, regularization term 2, positive and negative weight balance term, and early termination training condition;
S38, classifying the data in N_m', the complement of the sample N_m in the general driver data subset N, by number of violations, inputting them into the model by category, and counting the misjudgment rate of the driver labels output by the model under different coverage rates;
S39, checking whether j has reached its set maximum; if so, checking whether i has reached its set maximum: if so, going to S310, otherwise setting i = i + 1 and returning to S34; if j has not reached its maximum, setting j = j + 1 and returning to S35;
S310, checking whether n_m has reached the upper limit of the interval S; if so, going to S311, otherwise setting m = m + 1 and returning to S32;
S311, analyzing, according to model accuracy and misjudgment rate, which model performs best, and thereby determining the optimal random sample number M and the optimal SMOTE sampling ratio I : J.
In this embodiment, a comparative analysis combining misjudgment rate, accuracy and index stability determines the best-performing model: the number of randomly drawn samples is 2400 and the SMOTE ratio is 2:2.
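The size of the search in S31 to S311 for this embodiment can be checked with a few lines. The sketch enumerates the sample sizes n_m = s_0 + (m - 1)·k over the interval [200, 4000] with step 200, and the ratio pairs with i and j running from 1 to 4; the function name `sample_sizes` is hypothetical.

```python
def sample_sizes(s0, upper, k):
    """List n_m = s0 + (m - 1) * k for m = 1, 2, ... while n_m <= upper."""
    sizes = []
    m = 1
    while s0 + (m - 1) * k <= upper:
        sizes.append(s0 + (m - 1) * k)
        m += 1
    return sizes

sizes = sample_sizes(s0=200, upper=4000, k=200)
ratios = [(a, b) for a in range(1, 5) for b in range(1, 5)]

print(len(sizes), len(ratios), len(sizes) * len(ratios))  # 20 16 320
print(sizes.index(2400) + 1)                              # 12
```

Under these settings the loop trains 320 candidate models, and the reported optimum of 2400 samples corresponds to cycle number m = 12.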
S4, optimizing the performance of the model P0 by adopting a genetic algorithm, wherein the objective function is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation intervals, the number of generations and the initial population size.
In this embodiment, the test set accuracy under 10-fold cross validation is used as the objective function, and the genetic algorithm parameters are set as follows: crossover selection probability crossselectivity = 0.8, mutation probability MutationProbability = 0.5, mutation intervals Sigma = [[-10, 10], [-2, 2], [-2, 2], [-2, 2]], number of generations Iteration = 500, and initial population size position = 100. The evolution process of the genetic algorithm during parameter optimization is shown in FIG. 4.
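To make the roles of these parameters concrete, here is a minimal real-valued genetic algorithm. It is a sketch, not the patented procedure: the tournament selection, uniform crossover, single-gene mutation and elitism are assumed design choices, the function names are hypothetical, and a toy quadratic stands in for the 10-fold cross-validation accuracy that the embodiment uses as its objective.

```python
import random

def tournament(pop, fitness, rng):
    """Binary tournament: the fitter of two random individuals wins."""
    a, b = rng.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def genetic_maximize(fitness, bounds, sigma, pop_size=100, generations=500,
                     p_cross=0.8, p_mut=0.5, seed=0):
    """Maximize fitness over box-bounded real vectors.

    bounds: per-gene (low, high); sigma: per-gene mutation interval.
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        children = [best[:]]  # elitism: the best individual always survives
        while len(children) < pop_size:
            p1 = tournament(pop, fitness, rng)
            p2 = tournament(pop, fitness, rng)
            if rng.random() < p_cross:
                # uniform crossover: each gene taken from either parent
                child = [g1 if rng.random() < 0.5 else g2
                         for g1, g2 in zip(p1, p2)]
            else:
                child = p1[:]
            if rng.random() < p_mut:
                # shift one gene within its mutation interval, clip to bounds
                i = rng.randrange(len(child))
                lo, hi = bounds[i]
                child[i] = min(hi, max(lo, child[i] + rng.uniform(*sigma[i])))
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best, fitness(best)

# Toy objective with its maximum at (3, -1); stands in for CV accuracy.
toy = lambda v: -((v[0] - 3.0) ** 2) - ((v[1] + 1.0) ** 2)
best, score = genetic_maximize(
    toy, bounds=[(-10.0, 10.0), (-10.0, 10.0)],
    sigma=[(-2.0, 2.0), (-2.0, 2.0)], generations=50)
print(best, score)
```

In the embodiment each gene would be one XGBoost hyperparameter, and each fitness evaluation would train and cross-validate a model, so the number of generations and the population size dominate the running time.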
S5, constructing an optimal fitting model P for predicting the risk of motor vehicle drivers according to the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold.
In this embodiment, the parameters of the initial XGBoost-based model after optimization by the genetic algorithm are: learning rate learning_rate_value = 0.09, number of weak classifiers n_estimators_value = 367, maximum tree depth max_depth_value = 4, node minimum split value min_samples_split_value = 10, leaf node minimum sample number min_samples_leaf_value = 6, minimum sum of leaf node weights min_child_weight_value = 3, minimum loss function value gamma_value = 0, row sampling rate subsample_value = 0.45, and column sampling rate sample_byte_value = 0.1; regularization term 1, regularization term 2, the positive and negative weight balance term and the early termination training condition are set likewise.
The accuracy of the model after parameter optimization reaches 0.76.
S6, inputting the data of the subset to be identified from S2 into the model P, and outputting the risk degree of each driver. Some of the results are shown in Table 3.
Table 3. Risk degree analysis results for high-risk drivers obtained with the method of the invention
Claims (6)
1. A method for improving the accident risk prediction precision of traffic hazard personnel, characterized in that: traffic violation data and accident data samples are obtained by an optimized sampling method, a traffic accident risk prediction model for traffic participants is trained with an ensemble learning algorithm, and the model is further optimized by a genetic algorithm to improve the accuracy of the prediction result; the method specifically comprises the following steps:
S1. Construct a violation data set, a serious accident data set, and a minor accident data set from the original traffic violation data and accident data;
S2. Classify the violation data set into two categories, high-risk personnel and general personnel, according to the serious traffic accident records in the serious accident data set and the minor accident records in the minor accident data set; determine the data label value label according to the classification rule, thereby dividing the violation data set into a high-risk personnel data subset D, a general personnel data subset N, and a to-be-identified subset U;
S3. Construct an initial accident risk prediction model $P_0$ for dangerous personnel using an optimized sampling method and an ensemble learning algorithm, and determine the model's sampling number and SMOTE sampling ratio; the optimized sampling method in step S3 comprises the following steps:
S31. Set the sampling interval S and the loop step k according to the sample size of data set N;
S32. The sample size is $n_m = s_0 + (m-1)\cdot k$, where $s_0$ is the lower limit of the sampling interval and $m$ is the loop count, with initial value 1; randomly draw $n_m$ samples from data set N to form the sample set $N_m$;
S33. Merge data sets D and $N_m$ into the set $G_m$, and split $G_m$ into a training set and a test set;
S34. Apply SMOTE sampling to the training set and set the sample expansion ratio $a_i$ of the high-risk personnel data subset D, where $a_i = 1$ when $i = 1$ and $a_i = a_{i-1} + 1$ when $i > 1$; the initial value of $i$ is 1, and $i$ has a preset upper limit;
S35. For the high-risk sample expansion ratio $a_i$, set the sample reduction ratio $b_j$ of the general personnel data subset $N_m$, where $b_j = 1$ when $j = 1$ and $b_j = b_{j-1} + 1$ when $j > 1$; the initial value of $j$ is 1, and $j$ has a preset upper limit; using the SMOTE sampling ratio $a_i : b_j$, expand and reduce the two label classes in the training set to form the training sample set of the classifier;
S36. Train the high-risk personnel classifier with the ensemble learning algorithm and determine the model parameters, obtaining a traffic accident risk prediction model for traffic participants; the model can output a label value and a risk probability;
S37. Evaluate the model with the test set data to obtain the model accuracy at different coverage rates;
S38. Classify the data in $N_m'$, the complement of the sampled set $N_m$ within the general personnel data subset N, by number of violations; input the data into the model by category, and count the misjudgment rate of the personnel labels output by the model at different coverage rates;
S39. Check whether $j$ has reached its upper limit. If so, check whether $i$ has reached its upper limit: if so, go to S310; otherwise, go to S34. If $j$ has not reached its upper limit, set $j = j + 1$ and go to S35;
S310. Check whether $n_m$ has reached the upper limit of the sampling interval S; if so, go to S311; otherwise, set $m = m + 1$ and return to S32;
S311. Based on the model accuracy and the misjudgment rate, identify the best-performing model and determine the optimal random sampling number M and SMOTE sampling ratio I, J;
S4. Use a genetic algorithm to optimize the performance of model $P_0$; the optimization objective is to maximize prediction accuracy on the test set, with test-set accuracy assessed by k-fold cross-validation; set the genetic algorithm parameters so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters include the crossover probability, mutation interval, number of generations, and initial population size;
S5. Construct the best-fit model P for predicting the accident risk of dangerous personnel from the optimal model parameters output by the genetic algorithm, and determine the model coverage rate (recall) and the model discrimination threshold;
S6. Input the to-be-identified subset data from step S2 into model P and output the risk degree of the target object.
2. The method for improving the accident risk prediction accuracy of traffic hazard personnel according to claim 1, wherein the ensemble learning algorithm in step S3 comprises a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm, and a GBDT algorithm.
3. The method for improving the accident risk prediction accuracy of the traffic hazard personnel according to claim 1, wherein the method for assigning the corresponding data label value label based on the classification rule in the step S2 specifically comprises:
High-risk personnel: either (a) persons with violation records who also have a serious traffic accident record in which they bore primary or full responsibility, or (b) persons with violation records who have only minor accident records and at least 2 accident records;
General personnel: persons with violation records but no accident records;
Data that meet neither the high-risk nor the general personnel criteria form the to-be-identified subset.
4. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 1, wherein: the original traffic violation data and accident data in step S1 include personnel certificate information; violation records are collected and classified to obtain the violation data set; the violation data set contains the full sample of violation records, and its fields include the personnel certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and the time period in which each violation occurred.
5. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 1, wherein: in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types that most strongly influence traffic accidents are extracted as data attributes of the violation data set.
6. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 4, wherein: in step S1, the continuous time variable is converted into a discrete variable, categorized according to the characteristics of the violation time.
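Two pieces of the claims lend themselves to a short sketch: the labeling rule of claim 3 and the nested sampling loop of steps S32 to S310 in claim 1. The code below is an illustration under stated assumptions, not the claimed implementation. It assumes every person already has at least one violation record, that $a_1 = b_1 = 1$ so that $a_i = i$ and $b_j = j$, and that returning to S34 advances $i$ and restarts $j$ at 1. All function and argument names are hypothetical.

```python
def label_person(serious_at_fault, serious_other, minor):
    """Assign a person from the violation data set to subset D, N, or U.

    Arguments are counts of accident records: serious accidents with primary
    or full responsibility, other serious accidents, and minor accidents.
    """
    if serious_at_fault > 0:
        return "D"  # high-risk, case (a)
    if serious_other == 0 and minor >= 2:
        return "D"  # high-risk, case (b): only minor accidents, at least 2
    if serious_at_fault + serious_other + minor == 0:
        return "N"  # general personnel: violations but no accidents
    return "U"      # everything else is left to be identified


def sampling_schedule(s0, s_max, k, i_max, j_max):
    """Yield (m, n_m, a_i, b_j) in the order steps S32 to S310 visit them."""
    m = 1
    while True:
        n_m = s0 + (m - 1) * k             # S32: sample size for round m
        for a_i in range(1, i_max + 1):    # S34: expansion ratio for D
            for b_j in range(1, j_max + 1):  # S35: reduction ratio for N_m
                yield m, n_m, a_i, b_j     # S36-S38 would train and evaluate here
        if n_m >= s_max:                   # S310: upper limit of the interval
            break
        m += 1
```

Each yielded tuple corresponds to one trained and evaluated model; step S311 then compares the recorded accuracy and misjudgment rate across all tuples to choose M, I, and J.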
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810783017.7A CN108596409B (en) | 2018-07-16 | 2018-07-16 | Method for improving accident risk prediction precision of traffic hazard personnel |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108596409A (en) | 2018-09-28 |
CN108596409B (en) | 2021-07-20 |
Family
ID=63617732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810783017.7A Active CN108596409B (en) | 2018-07-16 | 2018-07-16 | Method for improving accident risk prediction precision of traffic hazard personnel |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108596409B (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109408557B (en) * | 2018-09-29 | 2021-09-28 | 东南大学 | Traffic accident cause analysis method based on multiple correspondences and K-means clustering |
CN109635990B (en) * | 2018-10-12 | 2022-09-16 | 创新先进技术有限公司 | Training method, prediction method, device, electronic equipment and storage medium |
CN109409430B (en) * | 2018-10-26 | 2021-07-13 | 江苏智通交通科技有限公司 | Traffic accident data intelligent analysis and comprehensive application system |
CN109558969A (en) * | 2018-11-07 | 2019-04-02 | 南京邮电大学 | A kind of VANETs car accident risk forecast model based on AdaBoost-SO |
CN109598931B (en) * | 2018-11-30 | 2021-06-11 | 江苏智通交通科技有限公司 | Group division and difference analysis method and system based on traffic safety risk |
CN110379161B (en) * | 2019-07-18 | 2021-02-02 | 中南大学 | Urban road network traffic flow distribution method |
CN111080012A (en) * | 2019-12-17 | 2020-04-28 | 北京明略软件***有限公司 | Personnel risk degree prediction method and device, electronic equipment and readable storage medium |
CN111081016B (en) * | 2019-12-18 | 2021-07-06 | 北京航空航天大学 | Urban traffic abnormity identification method based on complex network theory |
CN112016735B (en) * | 2020-07-17 | 2023-03-28 | 厦门大学 | Patrol route planning method and system based on traffic violation hotspot prediction and readable storage medium |
CN111881988B (en) * | 2020-07-31 | 2022-06-14 | 北京航空航天大学 | Heterogeneous unbalanced data fault detection method based on minority class oversampling method |
CN112667919A (en) * | 2020-12-28 | 2021-04-16 | 山东大学 | Personalized community correction scheme recommendation system based on text data and working method thereof |
CN113076974A (en) * | 2021-03-09 | 2021-07-06 | 麦哲伦科技有限公司 | Multi-task learning method with parallel filling and classification of missing values of multi-layer sensing mechanism |
CN113793502B (en) * | 2021-09-15 | 2022-08-09 | 国网电动汽车服务(天津)有限公司 | Pedestrian crossing prediction method under no-signal-lamp control |
CN115035722B (en) * | 2022-06-20 | 2024-04-05 | 浙江嘉兴数字城市实验室有限公司 | Road safety risk prediction method based on combination of space-time characteristics and social media |
CN117009767B (en) * | 2023-08-10 | 2024-04-26 | 中国环境科学研究院 | Soil benchmark formulation and risk assessment method based on bioavailability |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103246897A (en) * | 2013-05-27 | 2013-08-14 | 南京理工大学 | Internal structure adjusting method of weak classifier based on AdaBoost |
CN103462618A (en) * | 2013-09-04 | 2013-12-25 | 江苏大学 | Automobile driver fatigue detecting method based on steering wheel angle features |
JP5892663B2 (en) * | 2011-06-21 | 2016-03-23 | 国立大学法人 奈良先端科学技術大学院大学 | Self-position estimation device, self-position estimation method, self-position estimation program, and moving object |
CN107480839A (en) * | 2017-10-13 | 2017-12-15 | 深圳市博安达信息技术股份有限公司 | The classification Forecasting Methodology of high-risk pollution sources based on principal component analysis and random forest |
CN107563425A (en) * | 2017-08-24 | 2018-01-09 | 长安大学 | A kind of method for building up of the tunnel operation state sensor model based on random forest |
Also Published As
Publication number | Publication date |
---|---|
CN108596409A (en) | 2018-09-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 211100 No. 19 Suyuan Avenue, Jiangning Economic and Technological Development Zone, Nanjing City, Jiangsu Province. Applicant after: JIANGSU ZHITONG TRAFFIC TECHNOLOGY Co.,Ltd. Address before: 210006 Third floor, Building E10, Chenguang 1865 Technology Creative Industry Park, No. 388 Yingtian Avenue, Qinhuai District, Nanjing, Jiangsu Province. Applicant before: JIANGSU ZHITONG TRAFFIC TECHNOLOGY Co.,Ltd. | |
GR01 | Patent grant | ||