CN108596409B - Method for improving accident risk prediction precision of traffic hazard personnel - Google Patents

Method for improving accident risk prediction precision of traffic hazard personnel

Info

Publication number
CN108596409B
Authority
CN
China
Prior art keywords
accident
data
personnel
illegal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810783017.7A
Other languages
Chinese (zh)
Other versions
CN108596409A (en
Inventor
刘林
陈凝
吕伟韬
马党生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Zhitong Traffic Technology Co ltd
Original Assignee
Jiangsu Zhitong Traffic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Zhitong Traffic Technology Co ltd
Priority to CN201810783017.7A
Publication of CN108596409A
Application granted
Publication of CN108596409B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Operations Research (AREA)
  • Educational Administration (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention provides a method for improving the accident risk prediction precision of traffic hazard personnel. The method obtains traffic violation data and accident data samples by an optimized sampling method, trains a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm, and optimizes the model with a genetic algorithm. The ensemble learning algorithm mines the safety characteristics of traffic travelers from traffic violation data, the optimized sampling method applied in the sampling stage of model construction improves the performance of the initial model, and the genetic algorithm optimizes the model parameters, so that the accident risk prediction precision for dangerous personnel is effectively improved.

Description

Method for improving accident risk prediction precision of traffic hazard personnel
Technical Field
The invention relates to a method for improving the accident risk prediction precision of traffic hazard personnel.
Background
Research shows that traffic violations and traffic accidents are correlated, and the attributes and behaviors of drivers, pedestrians and other traffic participants recorded in traffic violation data can provide data support for human factor analysis in traffic safety. Using the idea of classification, the safety characteristics of traffic offenders can be mined from their personnel attribute variables.
The traditional classification method seeks, in the space of all possible functions, the classifier closest to the actual classification function; in practice, however, only a reasonably good weakly supervised model can be obtained, and the reliability of such a model is poor. An ensemble learning algorithm improves the performance of the final model by combining weakly supervised models. However, the complex parameter composition of an ensemble learning model makes it difficult to improve the model's effect. A genetic algorithm can find a globally optimal or near-optimal solution, which provides a feasible scheme for improving the precision.
Disclosure of Invention
The invention aims to provide a method for improving the accident risk prediction precision of traffic hazard personnel. The method adopts an ensemble learning algorithm with optimized sampling and performs parameter optimization with a genetic algorithm, thereby quantitatively evaluating the risk degree of traffic participants who have traffic violation records, remedying the lack of quantitative methods for analyzing the human factors of traffic safety, and effectively improving the accident risk prediction precision for traffic hazard personnel.
The technical solution of the invention is as follows:
A method for improving the accident risk prediction precision of traffic hazard personnel obtains traffic violation data and accident data samples by an optimized sampling method, trains a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm, and further optimizes the model with a genetic algorithm to improve the accuracy of the prediction result. The method comprises the following steps:
S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data.
S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining the data label value (label) according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified.
S3, constructing an initial traffic participant risk prediction model P0 by an optimized sampling method and an ensemble learning algorithm, and determining the number of samples drawn and the SMOTE sampling ratio of the model.
S4, optimizing the performance of the model P0 with a genetic algorithm, wherein the objective function is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation intervals, the number of population generations and the initial population size.
S5, constructing the optimal fitted model P for predicting the accident risk of dangerous personnel from the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold.
S6, inputting the data of the subset U to be identified from S2 into the model P, and outputting the risk degree of the target object.
Further, the ensemble learning algorithm in step S3 includes the random forest algorithm, the AdaBoost algorithm, the XGBoost algorithm and the GBDT algorithm.
Further, the optimized sampling method in step S3 comprises the following specific steps:
S31, setting a sampling interval S and a loop step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size;
S32, setting the sample size n_m = s_0 + (m - 1)·k, where s_0 is the lower limit of the sampling interval and m is the loop count, with initial value 1; randomly drawing a sample N_m of size n_m from the data set N;
S33, splitting the union G_m of the data set D and N_m into a training set and a test set;
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D; a_1 = 1 and a_i = a_{i-1} + 1 for i > 1; the initial value of i is 1, and i has a set upper limit;
S35, for the high-risk sample expansion ratio a_i, setting the sample shrinkage ratio b_j of the general personnel data subset N_m; b_1 = 1 and b_j = b_{j-1} + 1 for j > 1; the initial value of j is 1, and j has a set upper limit; for the SMOTE sampling ratio a_i:b_j, expanding and shrinking the samples of the two label classes in the training set to form the training sample set of the classifier;
S36, training the high-risk personnel classifier with the ensemble learning algorithm, determining the model parameters, and obtaining the traffic accident risk prediction model for traffic participants (symbol given as an image in the original: Figure BDA0001731680350000021); the model outputs a label value and a risk probability;
S37, evaluating the model (Figure BDA0001731680350000022) with the test set data to obtain the model accuracy at different coverage rates (Figure BDA0001731680350000023);
S38, classifying the data in the complement N_m' of the sample N_m within the general personnel data subset N by number of violations, inputting the data into the model (Figure BDA0001731680350000024) by category, and counting the misjudgment rate of the personnel labels output by the model at different coverage rates (Figure BDA0001731680350000031);
S39, checking whether j has reached its upper limit; if so, checking whether i has reached its upper limit: if so, going to S310, otherwise setting i = i + 1 and going to S34; if j has not reached its upper limit, setting j = j + 1 and going to S35;
S310, checking whether n_m has reached the upper limit of the sampling interval S; if so, going to S311; otherwise setting m = m + 1 and returning to S32;
S311, selecting the best-performing model (Figure BDA0001731680350000032) according to the model accuracy and the misjudgment rate, and determining the optimal random sample size M and the SMOTE sampling ratio I:J.
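The nested loops of S32 to S310 amount to a grid search over the sample size and the two sampling ratios. The following is a minimal sketch of that enumeration only; the function names are hypothetical, and the numeric values are borrowed from the embodiment described later in this document (interval [200, 4000], step 200, upper limits of 4). Training and evaluation of each candidate are left out.

```python
from itertools import product

def sample_sizes(s0, s_max, k):
    """Yield n_m = s0 + (m - 1) * k for m = 1, 2, ... while n_m <= s_max."""
    m = 1
    while s0 + (m - 1) * k <= s_max:
        yield s0 + (m - 1) * k
        m += 1

def candidate_settings(s0, s_max, k, i_max, j_max):
    """List (n_m, a_i, b_j) in loop order: m outermost, then i, then j.

    Because a_1 = b_1 = 1 and each ratio grows by 1 per step,
    a_i = i and b_j = j.
    """
    ratios = list(product(range(1, i_max + 1), range(1, j_max + 1)))
    return [(n, a, b) for n in sample_sizes(s0, s_max, k) for a, b in ratios]

# Values borrowed from the embodiment: S = [200, 4000], k = 200, limits of 4.
settings = candidate_settings(s0=200, s_max=4000, k=200, i_max=4, j_max=4)
print(len(settings), settings[0], settings[-1])
```

With these values the grid holds 20 sample sizes times 16 ratio pairs, so 320 candidate models would be trained and compared in S311.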
Further, the method for assigning the data label value (label) according to the classification rule in step S2 is specifically:
high-risk personnel: (1) persons who have violation records and have a serious traffic accident record in which they bore major or full responsibility; (2) persons who have violation records, have only slight accident records, and have no fewer than 2 accident records;
general personnel: persons who have violation records but no accident records;
the data that satisfy none of the above conditions constitute the subset to be identified.
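The labeling rule above can be sketched as a small function. This is an illustration under assumptions: the record layout (`PersonRecord` and its four counters) and the label strings "D", "N" and "U" are hypothetical names chosen to mirror the subsets named in S2, not a format given in the source.

```python
from dataclasses import dataclass

@dataclass
class PersonRecord:
    violations: int            # number of violation records
    severe_major_or_full: int  # serious accidents with major or full responsibility
    severe_other: int          # serious accidents with lesser responsibility
    minor: int                 # slight accidents

def assign_label(p):
    """Return 'D' (high risk), 'N' (general) or 'U' (to be identified)."""
    if p.violations == 0:
        raise ValueError("the illegal data set only holds people with violation records")
    if p.severe_major_or_full > 0:
        return "D"  # rule (1): serious accident with major or full responsibility
    if p.severe_other == 0 and p.minor >= 2:
        return "D"  # rule (2): only slight accidents, at least two of them
    if p.severe_other == 0 and p.minor == 0:
        return "N"  # violations but no accident records at all
    return "U"      # everything else is left for the model to score

print(assign_label(PersonRecord(violations=1, severe_major_or_full=0,
                                severe_other=0, minor=1)))
```

A person with a single slight accident, as in the usage line, matches neither rule and therefore falls into the subset to be identified.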
Further, the original traffic violation data and accident data in step S1 include the certificate information of the persons involved; the violation records are collected and classified to obtain the illegal data set; the illegal data set is the full sample of violation records, and its fields comprise the person's certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and violation time period.
Further, in step S1, the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a high degree of influence on traffic accidents are extracted as data attributes of the illegal data set.
Further, in step S1, the violation time period is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.
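Converting the continuous time of a violation into a discrete period can be sketched as a lookup over hour ranges. The period names and boundaries below are purely hypothetical; the source states only that the periods follow the temporal characteristics of violations and does not list them.

```python
from datetime import datetime

# Hypothetical periods: (start hour inclusive, end hour exclusive, name).
PERIODS = [
    (0, 6, "night"),
    (6, 9, "morning_peak"),
    (9, 17, "daytime"),
    (17, 20, "evening_peak"),
    (20, 24, "evening"),
]

def violation_period(timestamp):
    """Map a violation timestamp to the name of the period containing its hour."""
    for start, end, name in PERIODS:
        if start <= timestamp.hour < end:
            return name
    raise ValueError("hour outside 0..23")

print(violation_period(datetime(2018, 7, 17, 7, 30)))
```

The resulting nominal value can then be stored as the time period field of the illegal data set.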
The invention has the following beneficial effects:
First, the parameters of the initial fitted model are optimized with a genetic algorithm, which significantly improves the accident risk prediction precision for traffic hazard personnel.
Second, compared with traditional classification methods such as decision trees and neural networks, the ensemble learning algorithm adopted by the invention has a clear advantage in prediction performance and ensures accurate prediction of the traffic accident risk of dangerous personnel.
Third, traffic violation data are mined with an optimized ensemble learning algorithm, realizing quantitative evaluation of traffic safety risk based on the violation records of traffic participants; the model outputs each person's traffic risk degree.
Drawings
FIG. 1 is a schematic flow chart of the method for improving the accident risk prediction accuracy of traffic hazard personnel according to an embodiment of the invention.
FIG. 2 is a schematic flow chart of the optimized sampling method adopted in S3 in the embodiment.
FIG. 3 is an explanatory diagram of the data sets in the embodiment.
FIG. 4 is a schematic diagram of the propagation process of the genetic algorithm employed in S4 in the embodiment.
Detailed Description
Preferred embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Examples
A method for improving the accident risk prediction precision of traffic hazard personnel obtains traffic violation data and accident data samples by an optimized sampling method, trains a traffic accident risk prediction model for traffic participants with an ensemble learning algorithm, and further optimizes the model with a genetic algorithm to improve the accuracy of the prediction result, as shown in FIG. 1. The ensemble learning algorithm mines the safety characteristics of traffic travelers from traffic violation data, the optimized sampling method applied in the sampling stage of model construction improves the performance of the initial model, and the genetic algorithm optimizes the model parameters, so that the accident risk prediction precision for dangerous personnel is effectively improved. The specific method comprises the following steps:
S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data.
The original traffic violation data and accident data include the certificate information of the persons involved; the violation records are collected and classified to obtain the illegal data set; the illegal data set is the full sample of violation records, and its fields comprise the person's certificate number, number of violations, violation types, penalties, occurrence of accident-related violations, and violation time period; the occurrence of accident-related violations is obtained by correspondence analysis, and the violation types with a higher degree of influence on traffic accidents are extracted as data attributes of the illegal data set; the violation time period is obtained by converting the continuous time variable into a discrete variable and classifying it according to the temporal characteristics of violations.
S2, classifying the illegal data set into two categories, namely high-risk personnel and general personnel, determining the data label value (label) according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a subset U to be identified.
The classification rules are specifically as follows: high-risk personnel are (1) traffic participants (including motor vehicle drivers, non-motor vehicle drivers and pedestrians) who have violation records and a serious traffic accident record with major or full responsibility; (2) traffic participants who have violation records, only slight accident records, and no fewer than 2 accident records; general personnel are traffic participants who have violation records but no accident records; the data that satisfy none of the above conditions constitute the subset to be identified.
S3, constructing an initial traffic participant risk prediction model P0 by an optimized sampling method and an ensemble learning algorithm, and determining the number of samples drawn and the SMOTE sampling ratio of the model; the ensemble learning algorithm includes the random forest algorithm, the AdaBoost algorithm, the XGBoost algorithm and the GBDT algorithm. As shown in FIG. 2, the specific process is as follows:
S31, setting a sampling interval S and a loop step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size;
S32, setting the sample size n_m = s_0 + (m - 1)·k, where s_0 is the lower limit of the sampling interval and m is the loop count, with initial value 1; randomly drawing a sample N_m of size n_m from the data set N;
S33, splitting the union G_m of the data set D and N_m into a training set and a test set;
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk personnel data subset D; a_1 = 1 and a_i = a_{i-1} + 1 for i > 1; the upper limit of i is usually 4;
S35, for the high-risk sample expansion ratio a_i, setting the sample shrinkage ratio b_j of the general personnel data subset N_m; b_1 = 1 and b_j = b_{j-1} + 1 for j > 1; the upper limit of j is usually 4; for the SMOTE sampling ratio a_i:b_j, expanding and shrinking the samples of the two label classes in the training set to form the training sample set of the classifier;
S36, training the high-risk personnel classifier with the ensemble learning algorithm, determining the model parameters, and obtaining the traffic accident risk prediction model for traffic participants (symbol given as an image in the original: Figure BDA0001731680350000051); the model outputs a label value and a risk probability;
S37, evaluating the model (Figure BDA0001731680350000052) with the test set data to obtain the model accuracy at different coverage rates (Figure BDA0001731680350000053);
S38, classifying the data in the complement N_m' of the sample N_m within the general personnel data subset N by number of violations, inputting the data into the model (Figure BDA0001731680350000054) by category, and counting the misjudgment rate of the personnel labels output by the model at different coverage rates (Figure BDA0001731680350000055);
S39, checking whether j has reached its upper limit; if so, checking whether i has reached its upper limit: if so, going to S310, otherwise setting i = i + 1 and going to S34; if j has not reached its upper limit, setting j = j + 1 and going to S35;
S310, checking whether n_m has reached the upper limit of the sampling interval S; if so, going to S311; otherwise setting m = m + 1 and returning to S32;
S311, selecting the best-performing model (Figure BDA0001731680350000061) according to the model accuracy and the misjudgment rate, and determining the optimal random sample size M and the SMOTE sampling ratio I:J.
S4, optimizing the performance of the model P0 with a genetic algorithm, wherein the objective function is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation intervals, the number of population generations and the initial population size.
S5, constructing the optimal fitted model P for predicting the accident risk of dangerous personnel from the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold.
S6, inputting the data of the subset U to be identified from S2 into the model P, and outputting the risk degree of the target object.
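The objective function of S4, accuracy under k-fold cross validation, can be sketched in plain Python as follows. The helper names and the `fit`/`predict` callback interface are assumptions made for illustration, and a majority-class classifier stands in for the ensemble model so that the example stays self-contained.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean test accuracy over k folds; each fold is held out once."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for f, test_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        hits = sum(predict(model, X[i]) == y[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k

# Stand-in classifier (assumption): always predicts the most frequent label.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]

def predict_majority(model, x):
    return model

X = [[float(i)] for i in range(100)]
y = [0] * 80 + [1] * 20
print(cross_val_accuracy(fit_majority, predict_majority, X, y, k=10))
```

In the genetic algorithm, a function of this kind would be called once per candidate parameter set, with the returned accuracy used as that candidate's fitness.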
Specific examples
This embodiment takes motor vehicle drivers as the analysis target.
S1, obtaining 2 years of traffic violation records and accident records for the area by connecting to the database.
Traffic accidents involving death or serious injury, and hit-and-run accidents, are treated as serious accidents; all other accidents are treated as slight accidents. The original accident records are classified accordingly, the accident type and driver certificate information are taken as attribute features of the serious accident data set and the slight accident data set, and the sample data of the two data sets are obtained.
Further, the original violation data are preprocessed, and each driver's violation information is aggregated, including the cumulative number of violations, violation types, cumulative penalty points, average penalty points (points per violation), maximum penalty points for a single violation, cumulative fine amount and average fine amount (yuan per violation).
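The per-driver aggregation described above can be sketched with a dictionary of lists. The input tuple layout `(driver_id, violation_type, points, fine_yuan)` and the output field names are hypothetical; the source lists the statistics but not a record format.

```python
from collections import defaultdict

def aggregate_violations(records):
    """Summarize violation records per driver.

    records: iterable of (driver_id, violation_type, points, fine_yuan).
    """
    grouped = defaultdict(list)
    for driver_id, vtype, points, fine in records:
        grouped[driver_id].append((vtype, points, fine))

    summary = {}
    for driver_id, rows in grouped.items():
        n = len(rows)
        total_points = sum(r[1] for r in rows)
        total_fine = sum(r[2] for r in rows)
        summary[driver_id] = {
            "violation_count": n,
            "violation_types": sorted({r[0] for r in rows}),
            "total_points": total_points,
            "avg_points": total_points / n,   # points per violation
            "max_points": max(r[1] for r in rows),
            "total_fine": total_fine,
            "avg_fine": total_fine / n,       # yuan per violation
        }
    return summary

records = [
    ("A", "speeding", 6, 200),
    ("A", "red_light", 6, 200),
    ("A", "parking", 0, 100),
    ("B", "speeding", 3, 150),
]
summary = aggregate_violations(records)
print(summary["A"])
```

Each summary row corresponds to one driver and could be joined with the accident data sets by certificate number.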
Correspondence analysis is applied to the traffic accident data and the original violation data for dimensionality reduction; violation types are classified by their association with accident types, and the five types with the highest association are extracted as data attributes of the accident-risk violation field, as shown in Table 1.
TABLE 1. Classification of accident-related violation types
(table provided as images in the original: Figure BDA0001731680350000062, Figure BDA0001731680350000071)
According to the traffic flow operation of the road network in the area of this embodiment and the occurrence pattern of traffic violations, the time of day is aggregated into analysis periods, converting the continuous variable into a nominal variable; in other embodiments, the periods are divided by other statistical means such as clustering.
The driver characteristic data, namely age, gender and province/city code, are extracted from the driver certificate number; the illegal data set is generated from the information extracted in each of the above steps, as shown in Table 2.
TABLE 2. Partial data of the illegal data set
(table provided as an image in the original: Figure BDA0001731680350000072)
S2, classifying the full sample I of the illegal data set into two categories, namely high-risk drivers and general drivers. Referring to FIG. 3, under the first condition, drivers who have violation records and a serious traffic accident record with major or full responsibility are taken as high-risk drivers, and the qualifying data form data set D1; under the second condition, drivers who have violation records, only slight accident records, and no fewer than 2 accident records are taken as high-risk drivers, and the qualifying data form data set D2; the high-risk driver data set is D = D1 + D2. The data of drivers who have violation records but no accident records form the general driver data set N.
Accordingly, a high-risk or general label value (label) is determined for the data in the illegal data set that satisfy the rules, and the data subset U to which the classification rule cannot be applied is the subset to be identified.
S3, constructing an initial vehicle driver risk prediction model P0 by the optimized sampling method and the XGBoost algorithm, and determining the number of samples drawn and the SMOTE sampling ratio of the model;
S31, setting a sampling interval S and a loop step k according to the sample size of the data set N, wherein the upper bound of the interval S generally does not exceed 25% of the total sample size; in this embodiment, the sample size of the data set exceeds 84000, the sampling interval S is [200, 4000], and the loop step k is 200.
S32, setting the sample size n_m = s_0 + (m - 1)·k, where s_0 is the lower limit of the sampling interval and m is the loop count, with initial value 1; randomly drawing a sample N_m of size n_m from the data set N; in this embodiment, the initial sample size is 200.
S33, splitting the union G_m of the data set D and N_m into a training set and a test set; in this embodiment, the ratio of the training set to the test set is 9:1.
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio a_i of the high-risk driver data subset D, where a_1 = 1 and a_i = a_{i-1} + 1; the initial value of i is 1, i has a set upper limit, and the maximum value of i is 4;
S35, for the high-risk driver sample expansion ratio a_i, setting the shrinkage ratio b_j of the general driver data subset N_m, where b_1 = 1 and b_j = b_{j-1} + 1; the initial value of j is 1, j has a set upper limit, and the maximum value of j is 4; for the SMOTE sampling ratio a_i:b_j, expanding and shrinking the samples of the two label classes in the training set to form the training sample set of the classifier;
S36, training the high-risk driver classifier with the XGBoost algorithm, determining the model parameters, and obtaining the driver traffic accident risk prediction model (symbol given as an image in the original: Figure BDA0001731680350000081); the model outputs a driver label value and a risk probability; the model parameters comprise the learning rate, the number of weak classifiers, the maximum tree depth, the minimum node split value, the minimum number of samples per leaf node, the minimum sum of leaf node weights, the minimum loss function value, the row sampling rate, the column sampling rate, regularization term 1, regularization term 2, the positive/negative weight balance term and the early-termination training condition;
S37, evaluating the model (Figure BDA0001731680350000082) with the test set data to obtain the model accuracy at different coverage rates (Figure BDA0001731680350000083);
S38, classifying the data in the complement N_m' of the sample N_m within the general driver data subset N by number of violations, inputting the data into the model (Figure BDA0001731680350000091) by category, and counting the misjudgment rate of the driver labels output by the model at different coverage rates (Figure BDA0001731680350000092);
S39, checking whether j has reached the set maximum value; if so, checking whether i has reached the set maximum value: if so, going to S310, otherwise setting i = i + 1 and going to S34; if j has not reached the set maximum value, setting j = j + 1 and going to S35;
S310, checking whether n_m has reached the upper limit of the interval S; if so, going to S311; otherwise setting m = m + 1 and returning to S32;
S311, selecting the best-performing model (Figure BDA0001731680350000093) according to the model accuracy and the misjudgment rate, and determining the optimal random sample size M and the SMOTE sampling ratio I:J.
In this embodiment, the misjudgment rate, the accuracy and the stability of the indicators are compared together, and the best-performing model so determined is (symbol given as an image in the original: Figure BDA0001731680350000094), i.e., the random sample size is 2400 and the SMOTE ratio is 2:2.
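The sample expansion and shrinkage of S34 and S35 can be sketched without external libraries. Two assumptions are made here and are not confirmed by the source: a_i is read as the factor by which the high-risk (minority) training set is enlarged, and b_j as the divisor by which the general (majority) sample is reduced. The interpolation step follows the usual SMOTE idea of placing synthetic points between a minority sample and one of its nearest minority neighbors; all function names are hypothetical.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((u - v) ** 2 for u, v in zip(p, q))

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points between minority points and near neighbors."""
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: dist2(base, p))[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(u + t * (v - u) for u, v in zip(base, other)))
    return synthetic

def resample(minority, majority, a, b, k=3, seed=0):
    """Assumed reading of a:b, enlarge minority a-fold and keep 1/b of majority."""
    rng = random.Random(seed)
    expanded = list(minority) + smote_like(minority, (a - 1) * len(minority), k, rng)
    shrunk = rng.sample(majority, len(majority) // b)
    return expanded, shrunk

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
majority = [(float(i), 5.0) for i in range(10)]
expanded, shrunk = resample(minority, majority, a=2, b=2)
print(len(expanded), len(shrunk))
```

Under this reading, the ratio 2:2 reported for the embodiment would double the high-risk training samples and halve the general-driver sample before the classifier is trained.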
S4, optimizing the performance of the model P0 with a genetic algorithm, wherein the objective function is to maximize the prediction accuracy on the test set, and the test set accuracy is evaluated by k-fold cross validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation intervals, the number of population generations and the initial population size.
In this embodiment, the test set accuracy under 10-fold cross validation is used as the objective function, and the genetic algorithm parameters are set as follows: crossover selection probability crossselectivity = 0.8, mutation probability MutationProbability = 0.5, mutation intervals Sigma = [[-10,10], [-2,2], [-2,2], [-2,2]], number of population generations Iteration = 500, and initial population size position = 100. The propagation process of the genetic algorithm during parameter optimization is shown in FIG. 4.
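A genetic algorithm driven by these parameters can be sketched as below. This is an illustration, not the patented procedure: the selection scheme (truncation with elitism), the uniform crossover, the gene bounds, the four-gene layout and the toy fitness function are all assumptions, and the toy fitness stands in for the cross-validated accuracy of the model. Only the crossover probability, mutation probability and mutation intervals are taken from the text.

```python
import random

def clip(x, lo, hi):
    return max(lo, min(hi, x))

def genetic_search(fitness, bounds, sigma, pop_size=100, generations=500,
                   p_cross=0.8, p_mut=0.5, seed=0):
    """Maximize fitness over real-valued genes; returns (best genes, best fitness)."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        parents = scored[: pop_size // 2]   # truncation selection (assumption)
        children = [scored[0][:]]           # elitism: carry over the current best
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            child = p1[:]
            if rng.random() < p_cross:      # uniform crossover of the two parents
                child = [g1 if rng.random() < 0.5 else g2 for g1, g2 in zip(p1, p2)]
            if rng.random() < p_mut:        # shift one gene within its mutation interval
                g = rng.randrange(len(child))
                lo, hi = sigma[g]
                child[g] = clip(child[g] + rng.uniform(lo, hi), *bounds[g])
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    return best, fitness(best)

SIGMA = [(-10, 10), (-2, 2), (-2, 2), (-2, 2)]       # mutation intervals from the text
BOUNDS = [(-50, 50), (-10, 10), (-10, 10), (-10, 10)]  # hypothetical gene ranges
TARGET = [12.0, 3.0, -4.0, 1.5]                      # hypothetical optimum of the toy problem

def toy_fitness(genes):
    """Stand-in objective: negative squared distance to TARGET (maximum is 0)."""
    return -sum((g - t) ** 2 for g, t in zip(genes, TARGET))

best, score = genetic_search(toy_fitness, BOUNDS, SIGMA, pop_size=40, generations=80)
print(best, score)
```

The usage line runs with a smaller population and fewer generations than the embodiment to keep the example quick; because the best individual is always carried over, the best fitness never decreases from one generation to the next.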
S5, constructing the optimal fitted model P for predicting the risk of vehicle drivers from the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold.
In this embodiment, the parameters of the XGBoost-based initial model after genetic algorithm optimization are as follows: learning rate learning_rate_value = 0.09, number of weak classifiers n_estimators_value = 367, maximum tree depth max_depth_value = 4, minimum node split value min_samples_split_value = 10, minimum number of samples per leaf node min_samples_leaf_value = 6, minimum sum of leaf node weights min_child_weight_value = 3, minimum loss function value gamma_value = 0, row sampling rate subsample_value = 0.45, column sampling rate sample_byte_value = 0.1, regularization term 1reg _ lambda _ value 11, regularization term 2reg _ value, regularization term 11, positive and negative values of training term _ positive and negative values are terminated in advance by a condition of "weight value 1 _ value _ positive and negative values.
The accuracy of the model after parameter optimization reaches 0.76.
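The link between the coverage rate (recall) and the discrimination threshold determined in S5 can be sketched as follows. The function names and the toy probabilities are hypothetical; the idea shown is to pick the highest threshold on the predicted risk probability that still captures a target share of known high-risk drivers, then to report the accuracy at that threshold.

```python
import math

def threshold_for_recall(probs, labels, target_recall):
    """Highest threshold whose recall on the high-risk class (label 1) meets the target."""
    positives = sorted((p for p, y in zip(probs, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("no high-risk samples to calibrate on")
    # Round before ceil to avoid floating-point overshoot such as 7.000000000000001.
    needed = max(1, math.ceil(round(target_recall * len(positives), 9)))
    return positives[needed - 1]

def evaluate(probs, labels, threshold):
    """Recall on label 1 and overall accuracy when predicting 1 for prob >= threshold."""
    pred = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for yhat, y in zip(pred, labels) if yhat == 1 and y == 1)
    correct = sum(1 for yhat, y in zip(pred, labels) if yhat == y)
    return {"recall": tp / sum(labels), "accuracy": correct / len(labels)}

# Toy risk probabilities and true labels (illustrative values only).
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
thr = threshold_for_recall(probs, labels, target_recall=0.75)
print(thr, evaluate(probs, labels, thr))
```

Repeating the evaluation for several target recalls yields the kind of accuracy-versus-coverage comparison described in S37.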
S6, inputting the data of the subset U to be identified from S2 into the model P, and outputting the risk degree of each driver. Some of the results are shown in Table 3.
TABLE 3. Risk degree analysis results for high-risk drivers obtained with the method of the invention
(table provided as an image in the original: Figure BDA0001731680350000101)

Claims (6)

1. A method for improving the accident risk prediction precision of traffic hazard personnel is characterized by comprising the following steps: the method comprises the following steps of obtaining traffic violation data and accident data samples by an optimized sampling method, training a traffic accident risk prediction model of a traffic participant by adopting an integrated learning algorithm, and further optimizing the model by a genetic algorithm to improve the accuracy of a prediction result, wherein the method specifically comprises the following steps:
S1, constructing an illegal data set, a serious accident data set and a slight accident data set based on the original traffic violation data and accident data;
S2, classifying the persons in the illegal data set into two categories, high-risk personnel and general personnel, according to the serious traffic accident records of the serious accident data set and the slight accident records of the slight accident data set, determining the data label value label according to a classification rule, and accordingly dividing the illegal data set into a high-risk personnel data subset D, a general personnel data subset N and a to-be-identified subset U;
S3, constructing an initial dangerous personnel accident risk prediction model $P_0$ by adopting the optimized sampling method and the ensemble learning algorithm, and determining the sampling number and the SMOTE sampling ratio of the model; the optimized sampling method in step S3 comprises the following steps:
S31, setting a sampling interval S and a cycle step k according to the sample size of the data set N;
S32, computing the sample size $n_m = s_0 + (m-1)\cdot k$, where $s_0$ is the lower limit of the sampling interval and m is the cycle number, with initial value 1; randomly drawing $n_m$ samples from the data set N to obtain the sample $N_m$;
S33, merging the data sets D and $N_m$ into the set $G_m$, and splitting $G_m$ into a training set and a test set;
S34, performing SMOTE sampling on the training set and setting the sample expansion ratio $a_i$ of the high-risk personnel data subset D; wherein $a_i = 1$ when $i = 1$, and $a_i = a_{i-1} + 1$ when $i > 1$; the initial value of i is 1, and i has a set upper limit;
S35, for the high-risk personnel sample expansion ratio $a_i$, setting the sample reduction ratio $b_j$ of the general personnel data subset $N_m$; wherein $b_j = 1$ when $j = 1$, and $b_j = b_{j-1} + 1$ when $j > 1$; the initial value of j is 1, and j has a set upper limit; expanding and reducing the two classes of labeled samples in the training set according to the SMOTE sampling ratio $a_i : b_j$, the result serving as the training sample set of the classifier;
S36, training the high-risk personnel classifier by applying the ensemble learning algorithm and determining the model parameters, thereby obtaining a traffic accident risk prediction model for traffic participants [model symbol given as formula image FDA0003029175660000011]; the model can output a label value and a risk probability;
S37, evaluating the model [formula image FDA0003029175660000012] with the test set data to obtain the model accuracy at different coverage rates [formula image FDA0003029175660000013];
S38, classifying the data in $N_m'$, the complement of the sample $N_m$ within the general personnel data subset N, according to the number of violations, inputting the data by category into the model [formula image FDA0003029175660000021], and counting the misjudgment rate of the personnel labels output by the model at different coverage rates [formula image FDA0003029175660000022];
S39, judging whether j has reached its upper limit; if so, judging whether i has reached its upper limit, and if so, proceeding to S310, otherwise returning to S34; if j has not reached its upper limit, setting j = j + 1 and returning to S35;
S310, checking whether $n_m$ has reached the upper limit of the sampling interval S; if so, proceeding to S311; otherwise setting m = m + 1 and returning to S32;
S311, selecting the model with the best performance [formula image FDA0003029175660000023] according to the model accuracy and the misjudgment rate, and determining the optimal random sampling number M and the SMOTE sampling ratio parameters I and J;
S4, optimizing the performance of the model $P_0$ with a genetic algorithm, wherein the optimization objective is to maximize the prediction accuracy on the test set, and the test-set accuracy is evaluated by k-fold cross-validation; the genetic algorithm parameters are set so that the objective function converges quickly and oscillation without convergence is avoided; the genetic algorithm parameters comprise the crossover selection probability, the mutation interval, the number of population generations, and the initial population size;
S5, constructing an optimal fitting model P for predicting the accident risk of dangerous personnel according to the optimal model parameters output by the genetic algorithm, and determining the model coverage rate (recall) and the model discrimination threshold;
S6, inputting the to-be-identified subset data from step S2 into the model P, and outputting the risk degree of the target object.
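The nested search over sample size and SMOTE ratio in steps S31 to S311 of claim 1 can be sketched as follows. This is an illustrative outline under stated assumptions, not the claimed implementation: the `evaluate` callback stands in for steps S33 to S38 (splitting, SMOTE resampling, training, and scoring), the ratios are taken as $a_i = i$ and $b_j = j$ as implied by the recurrences, and combining the two metrics as accuracy minus misjudgment rate is an assumption because the claim does not state how they are weighed.

```python
import random

def search_sampling(general_ids, evaluate, s0, s_max, k, i_max, j_max, seed=0):
    """Return (M, I, J): best sample size and expansion/reduction ratios."""
    rng = random.Random(seed)
    best = None
    m = 1
    while True:
        n_m = s0 + (m - 1) * k                       # S32: sample size for cycle m
        sample = rng.sample(general_ids, n_m)        # S32: random sample N_m
        chosen = set(sample)
        held_out = [x for x in general_ids if x not in chosen]  # S38: complement N_m'
        for a_i in range(1, i_max + 1):              # S34: expansion ratio a_i = i
            for b_j in range(1, j_max + 1):          # S35: reduction ratio b_j = j
                # S33-S38 are delegated to the caller-supplied callback.
                accuracy, misjudgment = evaluate(sample, held_out, a_i, b_j)
                score = accuracy - misjudgment       # assumed way to combine both metrics
                if best is None or score > best[0]:
                    best = (score, n_m, a_i, b_j)
        if n_m >= s_max:                             # S310: upper limit of interval reached
            break
        m += 1
    return best[1:]                                  # S311: optimal M, I, J

def toy_evaluate(sample, held_out, a_i, b_j):
    """Made-up scoring stub; a real version would train and test a classifier."""
    accuracy = 1 - abs(len(sample) - 30) / 100 - abs(a_i - 2) / 10
    misjudgment = abs(b_j - 3) / 10
    return accuracy, misjudgment

M, I, J = search_sampling(list(range(100)), toy_evaluate,
                          s0=10, s_max=50, k=10, i_max=3, j_max=4)
```

The caller must ensure that the largest sample size does not exceed the number of general-personnel records; the stub scores are constructed only so that the search has a unique optimum.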
2. The method for improving the accident risk prediction accuracy of traffic hazard personnel according to claim 1, wherein the ensemble learning algorithm in step S3 comprises a random forest algorithm, an AdaBoost algorithm, an XgBoost algorithm, and a GBDT algorithm.
3. The method for improving the accident risk prediction accuracy of the traffic hazard personnel according to claim 1, wherein the method for assigning the corresponding data label value label based on the classification rule in the step S2 specifically comprises:
high-risk personnel: either persons who have violation records and serious traffic accident records in which they bear major or full responsibility, or persons who have violation records, only slight accident records, and no fewer than 2 accident records;
general personnel: persons who have violation records but no accident records;
data that satisfy neither the high-risk personnel condition nor the general personnel condition form the to-be-identified subset.
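The classification rule of claim 3 can be expressed compactly in code. The record layout and field names below are hypothetical; only the three conditions come from the claim, and the subset letters D, N and U follow step S2.

```python
from dataclasses import dataclass

@dataclass
class PersonRecord:
    violations: int             # number of violation records
    serious_major_or_full: int  # serious accidents with major or full responsibility
    serious_other: int          # serious accidents with lesser responsibility
    slight: int                 # slight accident records

def assign_subset(r):
    """Return 'D' (high-risk), 'N' (general) or 'U' (to be identified)."""
    if r.violations > 0:
        if r.serious_major_or_full > 0:
            return "D"          # serious accident with major or full responsibility
        if r.serious_other == 0 and r.slight >= 2:
            return "D"          # only slight accidents, at least two of them
        if r.serious_other == 0 and r.slight == 0:
            return "N"          # violations but no accident records at all
    return "U"                  # neither condition met

subsets = [assign_subset(PersonRecord(3, 1, 0, 0)),
           assign_subset(PersonRecord(2, 0, 0, 2)),
           assign_subset(PersonRecord(1, 0, 0, 0)),
           assign_subset(PersonRecord(1, 0, 0, 1))]
```

A person with violations and exactly one slight accident, as in the last example, meets neither condition and therefore falls into the to-be-identified subset that the trained model later scores.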
4. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 1, wherein: the original traffic violation data and accident data in step S1 include personnel certificate information; the violation records are collected and classified to obtain the illegal data set; the illegal data set contains the full sample of violation records, and its fields comprise the personnel certificate number, number of violations, violation types, penalties, occurrence of accident-related illegal behaviors, and the time periods in which violations occurred.
5. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 1, wherein: in step S1, the occurrence of accident-related illegal behaviors is obtained by correspondence analysis, and the types of illegal behavior that most strongly influence traffic accidents are extracted as data attributes of the illegal data set.
6. The method for improving the accident risk prediction accuracy of traffic hazard personnel of claim 4, wherein: in step S1, the continuous time variable is converted into a discrete variable, which is categorized according to the characteristics of the violation time.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810783017.7A CN108596409B (en) 2018-07-16 2018-07-16 Method for improving accident risk prediction precision of traffic hazard personnel

Publications (2)

Publication Number Publication Date
CN108596409A CN108596409A (en) 2018-09-28
CN108596409B true CN108596409B (en) 2021-07-20

Family

ID=63617732

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810783017.7A Active CN108596409B (en) 2018-07-16 2018-07-16 Method for improving accident risk prediction precision of traffic hazard personnel

Country Status (1)

Country Link
CN (1) CN108596409B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408557B (en) * 2018-09-29 2021-09-28 东南大学 Traffic accident cause analysis method based on multiple correspondences and K-means clustering
CN109635990B (en) * 2018-10-12 2022-09-16 创新先进技术有限公司 Training method, prediction method, device, electronic equipment and storage medium
CN109409430B (en) * 2018-10-26 2021-07-13 江苏智通交通科技有限公司 Traffic accident data intelligent analysis and comprehensive application system
CN109558969A (en) * 2018-11-07 2019-04-02 南京邮电大学 A kind of VANETs car accident risk forecast model based on AdaBoost-SO
CN109598931B (en) * 2018-11-30 2021-06-11 江苏智通交通科技有限公司 Group division and difference analysis method and system based on traffic safety risk
CN110379161B (en) * 2019-07-18 2021-02-02 中南大学 Urban road network traffic flow distribution method
CN111080012A (en) * 2019-12-17 2020-04-28 北京明略软件***有限公司 Personnel risk degree prediction method and device, electronic equipment and readable storage medium
CN111081016B (en) * 2019-12-18 2021-07-06 北京航空航天大学 Urban traffic abnormity identification method based on complex network theory
CN112016735B (en) * 2020-07-17 2023-03-28 厦门大学 Patrol route planning method and system based on traffic violation hotspot prediction and readable storage medium
CN111881988B (en) * 2020-07-31 2022-06-14 北京航空航天大学 Heterogeneous unbalanced data fault detection method based on minority class oversampling method
CN112667919A (en) * 2020-12-28 2021-04-16 山东大学 Personalized community correction scheme recommendation system based on text data and working method thereof
CN113076974A (en) * 2021-03-09 2021-07-06 麦哲伦科技有限公司 Multi-task learning method with parallel filling and classification of missing values of multi-layer sensing mechanism
CN113793502B (en) * 2021-09-15 2022-08-09 国网电动汽车服务(天津)有限公司 Pedestrian crossing prediction method under no-signal-lamp control
CN115035722B (en) * 2022-06-20 2024-04-05 浙江嘉兴数字城市实验室有限公司 Road safety risk prediction method based on combination of space-time characteristics and social media
CN117009767B (en) * 2023-08-10 2024-04-26 中国环境科学研究院 Soil benchmark formulation and risk assessment method based on bioavailability

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103246897A (en) * 2013-05-27 2013-08-14 南京理工大学 Internal structure adjusting method of weak classifier based on AdaBoost
CN103462618A (en) * 2013-09-04 2013-12-25 江苏大学 Automobile driver fatigue detecting method based on steering wheel angle features
JP5892663B2 (en) * 2011-06-21 2016-03-23 国立大学法人 奈良先端科学技術大学院大学 Self-position estimation device, self-position estimation method, self-position estimation program, and moving object
CN107480839A (en) * 2017-10-13 2017-12-15 深圳市博安达信息技术股份有限公司 The classification Forecasting Methodology of high-risk pollution sources based on principal component analysis and random forest
CN107563425A (en) * 2017-08-24 2018-01-09 长安大学 A kind of method for building up of the tunnel operation state sensor model based on random forest

Also Published As

Publication number Publication date
CN108596409A (en) 2018-09-28

Similar Documents

Publication Publication Date Title
CN108596409B (en) Method for improving accident risk prediction precision of traffic hazard personnel
CN110866677B (en) Driver relative risk evaluation method based on benchmark analysis
CN105303197B (en) A kind of vehicle follow the bus safety automation appraisal procedure based on machine learning
Ma et al. Driving style recognition and comparisons among driving tasks based on driver behavior in the online car-hailing industry
CN111242484B (en) Vehicle risk comprehensive evaluation method based on transition probability
CN104268599B (en) Intelligent unlicensed vehicle finding method based on vehicle track temporal-spatial characteristic analysis
CN109191828B (en) Traffic participant accident risk prediction method based on ensemble learning
CN103150900A (en) Traffic jam event automatic detecting method based on videos
CN109671274B (en) Highway risk automatic evaluation method based on feature construction and fusion
CN109086808B (en) Traffic high-risk personnel identification method based on random forest algorithm
CN110929939B (en) Landslide hazard susceptibility spatial prediction method based on clustering-information coupling model
CN109658272A (en) Driving behavior evaluation system and Insurance Pricing system based on driving behavior
CN111563555A (en) Driver driving behavior analysis method and system
CN114299742B (en) Speed limit information dynamic identification and update recommendation method for expressway
Agrawal et al. Towards real-time heavy goods vehicle driving behaviour classification in the united kingdom
CN109101568B (en) XgBoost algorithm-based traffic high-risk personnel identification method
CN112149922A (en) Method for predicting severity of accident in exit and entrance area of down-link of highway tunnel
CN109063751B (en) Traffic high-risk personnel identification method based on gradient lifting decision tree algorithm
CN115192026B (en) Tunnel driving load monitoring method and terminal
Hammit et al. Radar-vision algorithms to process the trajectory-level driving data in the SHRP2 Naturalistic Driving Study
TWI617998B (en) System and method for car number identification data filtering
CN110889468A (en) Multi-model uncertain reproduction result analysis method capable of eliminating error information
CN109145953B (en) Adaboost algorithm-based traffic high-risk personnel identification method
Abeyratne et al. Applying big data analytics on motor vehicle collision predictions in New York City
CN115035722B (en) Road safety risk prediction method based on combination of space-time characteristics and social media

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 211100 No. 19 Suyuan Avenue, Jiangning Economic and Technological Development Zone, Nanjing City, Jiangsu Province

Applicant after: JIANGSU ZHITONG TRAFFIC TECHNOLOGY Co.,Ltd.

Address before: 3rd Floor, Building E10, Chenguang 1865 Technology Creative Industry Park, No. 388 Yingtian Street, Qinhuai District, Nanjing, Jiangsu 210006

Applicant before: JIANGSU ZHITONG TRAFFIC TECHNOLOGY Co.,Ltd.

GR01 Patent grant