CN109344907A - Discrimination method based on a classification algorithm with improved evaluation criteria - Google Patents

Discrimination method based on a classification algorithm with improved evaluation criteria

Info

Publication number
CN109344907A
CN109344907A (application CN201811272036.XA)
Authority
CN
China
Prior art keywords
model
random forest
data
forest model
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811272036.XA
Other languages
Chinese (zh)
Inventor
顾海艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201811272036.XA priority Critical patent/CN109344907A/en
Publication of CN109344907A publication Critical patent/CN109344907A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A discrimination method based on a classification algorithm with improved evaluation criteria. Taking the random forest algorithm as an example, the method selects the random forest parameters using multiple evaluation metrics and up-samples the data to balance the sample distribution, thereby constructing a new random forest model. Comparing the improved random forest model with the original random forest model, a logistic regression model, and a support vector machine model shows that the improved random forest model performs best; that is, selecting algorithm parameters with multiple evaluation metrics is a feasible scheme. The method addresses a problem in the prior art: discrimination in actual classification scenes usually relies on data-mining classification algorithms, but these algorithms typically build models around a single metric, so the resulting discrimination is often unsatisfactory.

Description

Discrimination method based on a classification algorithm with improved evaluation criteria
Technical field
The invention belongs to the field of data-mining applications, and specifically relates to a discrimination method based on a classification algorithm with improved evaluation criteria.
Background technique
Data mining technology plays an increasingly important role in everyday life and production, and is applied in actual scenes such as speech recognition, image recognition, and product recommendation. Classification algorithms are one of its important pillars. A perfect classification algorithm could rival human perception of things. However, because today's traditional classification algorithms still suffer from various defects, none can yet be called perfect, and in special scenes they fail to classify things effectively. Traditional classification algorithms therefore need to be improved so that they come ever closer to a perfect classifier.
Summary of the invention
To solve the above problems, the present invention proposes a new method for classifying categories in actual scenes. The idea of the method is described below:
The random forest algorithm was proposed by Breiman in 2001 and, as an efficient discriminant classification method, has been applied in many fields. The principle of random forest is to build a forest of decision trees in a random manner, with little association between the trees in the forest. Once the random forest model has been built, inputting the features of a new sample yields the category of the sample under test, and the accuracy of the discrimination is considerably higher than that of an ordinary single decision tree.
The present invention is a discrimination method based on a classification algorithm with improved evaluation criteria. Its steps are:
One: first collect feature-indicator data as sample data and construct a random forest model;
Two: then, in the actual classification scene, collect the feature-indicator data of the person to be assessed, and use the random forest model obtained in step one to quickly discriminate on those data, determining the category of the person to be assessed.
The random forest algorithm in step one:
1, original random forests algorithm
A single decision tree carries a relatively large error and a risk of over-fitting. To solve these problems of the decision tree, Breiman proposed the random forest algorithm in 2001. Its core idea is:
1) First, draw from the original data set, with replacement, a sample of the same size as the original data;
2) then, extract a certain number of features from the original feature variables to form a feature subset;
3) finally, construct an unpruned decision tree from the drawn sample data and the feature subset.
The above three steps are repeated N times to form N decision trees; the decision trees are integrated using the majority-vote criterion, which completes the construction of the random forest model.
When the feature variables of a new sample are input to the model, the random forest takes the result agreed on by the majority of the decision trees as the final result. A minimal sketch of this construction is given below.
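The following is an illustration under stated assumptions, not the patent's implementation: scikit-learn's DecisionTreeClassifier (unpruned by default) stands in for the trees, labels are assumed to be 0/1, and the parameter values are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=50, m_features=3, seed=0):
    """Build a forest: bootstrap rows, random feature subset, unpruned tree."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                     # 1) bootstrap sample, with replacement
        cols = rng.choice(p, size=m_features, replace=False)  # 2) random feature subset
        tree = DecisionTreeClassifier()                       # 3) unpruned by default
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Majority vote over the trees; labels are assumed to be 0/1."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)            # majority of the N trees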
Random forests can handle high-dimensional data, require no feature selection, and build models quickly. But during training the model depends only on the out-of-bag (OOB) estimate, its evaluation metric is single, and selecting parameters with a single evaluation metric easily leads to an over-optimistic estimate of model performance. Moreover, when the sample data are imbalanced, the model easily favours the majority class, and discrimination of the minority class is poor. These three shortcomings must therefore be overcome. The invention proposes an improved random forest algorithm.
2, improved random forests algorithm
To address the problems that the original random forest model relies only on the OOB estimate, that its evaluation metric is single, and that the model favours the majority class when the samples are imbalanced, the present invention proposes an improved random forest algorithm.
A. Improvement for relying only on the OOB estimate.
The evaluation of the original random forest depends only on the OOB estimate, which easily leads to an over-optimistic assessment. To overcome this drawback, the present invention first divides the data into a training set and a test set, performs cross-validation on the training set, uses the cross-validation results for a preliminary assessment of model performance and to determine the parameters, and then assesses model performance on the test set.
The combined assessment from cross-validation and the test set is better than an assessment that relies only on the OOB estimate. A sketch of this procedure is given below.
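A sketch under stated assumptions: X and y are the preprocessed feature matrix and labels, the 3:1 split and 5-fold cross-validation are illustrative choices, and the (m, n) grid anticipates the one used in the empirical section below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)         # 3:1 split, held out once

cv_scores = {}
for m in (2, 3, 4):                                           # feature-subset sizes
    for n in (10, 50, 100, 150, 200, 300, 500):               # tree counts
        clf = RandomForestClassifier(n_estimators=n, max_features=m, random_state=0)
        cv_scores[(m, n)] = cross_val_score(
            clf, X_train, y_train, cv=5, scoring="f1").mean()

best_m, best_n = max(cv_scores, key=cv_scores.get)            # preliminary choice by CV
final = RandomForestClassifier(n_estimators=best_n, max_features=best_m).fit(X_train, y_train)
print("held-out test accuracy:", final.score(X_test, y_test)) # assessed on the test set last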
B. Improvement for the single model-evaluation metric.
The evaluation of the original random forest depends on a single metric, which cannot effectively reflect class imbalance or the relative importance of the classes. To overcome this drawback, the present invention proceeds as follows in the model-training stage:
First, the F1 statistic is computed, and the model parameters whose F1 statistic is optimal or within 1.5 standard deviations below the optimum are selected;
then, the classification accuracy is computed for the candidate parameters from the previous step, and the optimal accuracy together with accuracies within 1.5 standard deviations below it are selected; the parameter combinations corresponding to these accuracies become the candidate combinations;
finally, the AUC is computed for the remaining candidates, and the optimal AUC together with AUCs within 1.5 standard deviations below it are selected; the parameter combinations corresponding to these AUCs become the candidate combinations.
The candidate parameters from the steps above are then substituted on the test set; the parameter combination with the best F1 statistic on the test set becomes the final combination, and the performance of the model with the final parameters serves as the final assessment of model performance. A sketch of this staged selection is given below.
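This is an illustration rather than the patent's code: f1_scores, acc_scores, auc_scores and test_f1 are assumed dictionaries mapping each (m, n) combination to its F1 statistic, classification accuracy, AUC and test-set F1 respectively.

```python
import numpy as np

def winnow(candidates, scores):
    """Keep candidates whose score is within 1.5 standard deviations of the best."""
    vals = np.array([scores[c] for c in candidates])
    cutoff = vals.max() - 1.5 * vals.std()
    return [c for c, v in zip(candidates, vals) if v >= cutoff]

candidates = list(f1_scores)                   # all (m, n) parameter combinations
candidates = winnow(candidates, f1_scores)     # round 1: F1 statistic
candidates = winnow(candidates, acc_scores)    # round 2: classification accuracy
candidates = winnow(candidates, auc_scores)    # round 3: AUC
final_combo = max(candidates, key=lambda c: test_f1[c])  # round 4: best F1 on the test set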
C. Improvement for the model's bias toward the majority class under imbalanced sample data.
Here the main idea is to change the data distribution; the main strategies are up-sampling and down-sampling. When the data distribution is imbalanced and the counts of the two classes are not especially large, the up-sampling strategy is used to expand the number of minority-class samples; when the distribution is imbalanced and both classes are numerous, the down-sampling strategy is used to reduce the number of majority-class samples. A small illustrative decision rule is sketched below.
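A purely illustrative helper for choosing between the two strategies; the size threshold is an assumed placeholder, not a value fixed by the invention.

```python
# Hypothetical helper: pick a resampling strategy from the two class counts.
def choose_strategy(n_minority, n_majority, large=10_000):    # `large` is an assumed placeholder
    if n_minority == n_majority:
        return "none"                        # already balanced
    if min(n_minority, n_majority) >= large:
        return "down-sample"                 # both classes numerous: shrink the majority
    return "up-sample"                       # minority small: synthesise more of it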
In the prior art: the comprehensive evaluation metric F1 is the harmonic mean of precision P (also called the precision ratio) and recall R, i.e. F1 = 2PR / (P + R); AUC is the area under the ROC curve.
The method of the invention applies the random forest model to an actual classification scene. Given the shortcomings of the random forest model and the imbalanced distribution of the original data samples, it improves the existing random forest algorithm: it searches for the optimal parameters with multiple metrics and constructs artificial samples from the raw sample data, forming a new data set. The sample data are then fitted with the optimal parameters to construct a new random forest model. The results show that the performance of the improved random forest model is raised and that it is suitable for actual classification scenes.
Detailed description of the invention
Fig. 1 is the ROC curve corresponding to the maximum accuracy in the empirical section on the original random forest algorithm;
Fig. 2.1, 2.2 and 2.3 are, in the empirical section on the improved random forest model, the ROC curves corresponding to the AUC values of Table 2.3;
Fig. 2.4 shows, in the empirical section on the improved random forest model, the ROC curves of the three test sets;
Fig. 3.1, 3.2 and 3.3 are, in the model-comparison section, the ROC curves and AUC values of the three models over the three training-set and test-set divisions;
Fig. 3.4, 3.5 and 3.6 are, in the model-comparison section, the ROC curves and AUC values of the three test sets for the models before and after the improvement.
Specific embodiment
The present invention is further described below with specific embodiments and with reference to the accompanying drawings.
1. Empirical study of the original random forest algorithm
To demonstrate the effect of the improvement, a classified sample data set was selected as the data set to fit; the ratio of positive to negative samples in the data set is 1:3. The feature variables are feature1, feature2, feature3, feature4, feature5 and feature6, and y is the variable to be classified.
1.1 Data preprocessing
(1) Eliminating multicollinearity
The numeric feature variables feature1, feature2, feature3, feature4 and feature5 were tested for multicollinearity; the results are shown in Table 1.1:
Table 1.1
As Table 1.1 shows, the absolute values of the correlation coefficients between the numeric feature variables are all below 0.5, indicating that the linear dependence among the feature variables is weak; these feature variables can therefore be substituted into the random forest model.
(2) Correcting skewness
The numeric feature variables were tested for normality, using the skewness of each variable as the index. The skewness of each variable is shown in Table 1.2:
Table 1.2
Since the skewness of feature1, feature2, feature3 and feature5 is large, these feature variables need a skewness transformation; the Box-Cox transformation is used here. The skewness of the transformed data is shown in Table 1.3:
Table 1.3
The transformed feature variables are closer to a normal distribution than the original feature variables.
(3) Standardization
The numeric variables were standardized. The means and standard deviations of the Box-Cox-transformed data are shown in Table 1.4:
Table 1.4
The means and standard deviations of the standardized data are shown in Table 1.5:
Table 1.5
Because the categorical variable feature6 has only two states, it does not need one-hot encoding. A sketch of these preprocessing steps is given below.
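The three steps can be sketched as follows, assuming the sample sits in a pandas DataFrame df with numeric columns feature1 through feature5; scipy's boxcox requires strictly positive values, so a column may need shifting first.

```python
from scipy import stats
from sklearn.preprocessing import StandardScaler

num_cols = ["feature1", "feature2", "feature3", "feature4", "feature5"]

# (1) multicollinearity check: weak linear dependence if all |r| < 0.5
print(df[num_cols].corr().abs())

# (2) skewness correction with Box-Cox on the heavily skewed columns
# (Box-Cox requires strictly positive values; shift the column first if needed)
for col in ["feature1", "feature2", "feature3", "feature5"]:
    df[col], _ = stats.boxcox(df[col])

# (3) standardization to zero mean and unit variance
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df[num_cols].skew())                      # check how close to normal we got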
1.2 Construction of the random forest model
The random forest model is built as follows:
(1) the total number of feature variables is 6, and the number m of feature variables in the feature subset of a single decision tree may be 2, 3 or 4;
(2) the number of trees n in the forest is set to 10, 50, 100, 150, 200, 300 or 500;
(3) the Cartesian product of the subset sizes and tree counts gives the parameter combinations (m, n);
(4) a random forest model is fitted for each combination, giving 3 × 7 = 21 models;
(5) the out-of-bag (OOB) accuracy of each model is obtained, and the combination with the highest accuracy is chosen as the optimal parameter combination;
(6) the random forest model is fitted with the optimal parameter combination and the full data. A sketch of this procedure is given below.
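A sketch of steps (1)-(6) under the assumption that X and y are the preprocessed feature matrix and labels; scikit-learn's oob_score_ (out-of-bag accuracy) stands in for the OOB accuracy referred to above.

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

grid = list(product((2, 3, 4), (10, 50, 100, 150, 200, 300, 500)))  # 21 (m, n) pairs

oob = {}
for m, n in grid:
    clf = RandomForestClassifier(n_estimators=n, max_features=m,
                                 oob_score=True, random_state=0)
    clf.fit(X, y)
    oob[(m, n)] = clf.oob_score_                 # accuracy on the out-of-bag samples

best_m, best_n = max(oob, key=oob.get)           # e.g. (3, 50) in Table 1.6
final_model = RandomForestClassifier(
    n_estimators=best_n, max_features=best_m).fit(X, y)  # refit on the full data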
Table 1.6 gives the OOB accuracy of all the random forest models under all parameter combinations:
Table 1.6
Table 1.6 shows that when the feature subset contains 3 feature variables and the forest contains 50 trees, the random forest model attains its maximum OOB accuracy, 78.09%.
The OOB accuracy, precision, recall and F1 statistic corresponding to the maximum accuracy are shown in Table 1.7.
Table 1.7
The ROC curve corresponding to the maximum accuracy is shown in Fig. 1; the AUC is 0.77.
Analysis of the random forest model with parameter combination (3, 50) finds an accuracy of 78.09%, a precision of 75.36%, a recall of 70.27% and an F1 of 72.73%; the AUC of the model is 0.77. Because the negative samples outnumber the positive samples in the data, this result is to be expected.
The final result for the model built by the original random forest is thus a maximum accuracy of 78.09%, a precision of 75.36%, a recall of 70.27% and an F1 of 72.73%, which is to be expected given the sample imbalance. Since the original random forest model cannot effectively discriminate the positive samples, the original algorithm needs to be improved so that both positive and negative classes are handled; and the model's parameters should be determined from several metrics taken together rather than from a single metric.
2. Empirical study of the improved random forest model
2.1 Sample balancing
Since the sample distribution of the data is imbalanced and the numbers of positive and negative samples are fairly small, the up-sampling approach is the suitable one. The present invention mainly uses the SMOTE algorithm for up-sampling.
The basis of the SMOTE (Synthetic Minority Oversampling Technique) algorithm is random over-sampling. Because random over-sampling simply copies minority-class samples, it leads to over-fitting of the model. To address this drawback, SMOTE first analyses the minority-class samples and then synthesises artificial samples from the analysis rather than simply copying. The algorithm proceeds as follows:
(1) For each minority-class sample x, compute the Euclidean distance from x to all minority-class samples and determine its k nearest neighbours;
(2) compute the imbalance ratio of positive to negative samples and from it the sampling multiple n; randomly select neighbours from the k nearest, and suppose a selected neighbour is y;
(3) for each randomly selected neighbour y, construct a new sample by interpolating between x and y:
x_new = x + rand(0, 1) × (y − x)
Applying SMOTE to the data balances it: the ratio of positive to negative samples becomes approximately 1:1. A minimal sketch of these steps is given below.
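A minimal sketch of steps (1)-(3), assuming X_min holds the minority-class rows as a NumPy array; in practice the SMOTE implementation in the imbalanced-learn library covers the same ground.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Synthesise n_new artificial minority samples from minority rows X_min."""
    rng = np.random.default_rng(seed)
    # (1) k nearest minority neighbours of every minority sample (Euclidean)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                # idx[:, 0] is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # (2) pick a minority sample x ...
        j = rng.choice(idx[i, 1:])               # ... and one of its k neighbours y
        gap = rng.random()                       # (3) x_new = x + rand(0,1) * (y - x)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)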
2.2 Training-set and test-set division
Because random forests have the OOB estimate, they do not strictly require a division into training and test sets. But since the OOB estimate may lead to an over-optimistic view of model performance, a more faithful assessment of the model's generalization requires dividing the data into training and test sets. The ratio of training set to test set is set at 3:1, and the division is repeated three times so that the assessment of the model's generalization performance is more reliable.
2.3 Determining the optimal parameters
(1) With the F1 statistic on the original data as the evaluation metric, a first round of screening is applied to the parameter combinations. Table 2.1 gives the OOB F1 statistic of all random forest models under all parameter combinations.
Table 2.1
The maximum F1 statistic is 72.82% with a standard deviation of 2.5%, so the range within 1.5 standard deviations below the maximum is 68.98%–72.82%; the candidate combinations (2, 10), (3, 10), (3, 50) and (3, 100) therefore advance to the next round.
(2) With the accuracy on the original data as the evaluation metric, a second round of screening is applied to the parameter combinations. Table 2.2 gives the OOB accuracy of all random forest models under the second-round candidate combinations.
Table 2.2
The maximum accuracy is 78.09% with a standard deviation of 2.1%, so the range within 1.5 standard deviations below the maximum is 75.00%–78.09%; the candidates (3, 10), (3, 50) and (3, 100) advance to the next round.
(3) With the AUC on the original data as the evaluation metric, a third round of screening is applied. Table 2.3 gives the OOB AUC of all random forest models under the third-round candidates; Figs. 2.1, 2.2 and 2.3 are the corresponding ROC curves.
Table 2.3
The maximum AUC is 0.77 with a standard deviation of 0.05, so the range within 1.5 standard deviations below the maximum is 0.75–0.77; the candidates (3, 50) and (3, 100) advance to the next round.
(4) With the F1 statistic on the test set as the evaluation metric, a fourth round of screening is applied. Table 2.4 gives the F1 statistic on the test set for the remaining random forest models.
Table 2.4
Table 2.4 shows that a feature subset of 3 feature variables and a forest of 100 trees fit the best-performing random forest model. Since the optimal parameters were determined on the original data, the imbalance between positive and negative samples is still unresolved; with the optimal parameters now fixed, the final random forest model must be constructed on the up-sampled data set.
2.4 Model fitting
The model-fitting process is as follows:
(1) divide the data into training and test sets;
(2) construct artificial samples on the training set with the SMOTE algorithm and add them to the original data, forming a new training set;
(3) fit the random forest model on the new training set with the parameters determined above. A sketch of this pipeline is given below.
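A sketch under stated assumptions: the SMOTE step is supplied by the imbalanced-learn library, and (m, n) = (3, 100) is the combination selected in section 2.3.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

for seed in (0, 1, 2):                                        # three repeated divisions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)  # (1) 3:1 split
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)  # (2) balance the training set only
    clf = RandomForestClassifier(n_estimators=100, max_features=3,
                                 oob_score=True, random_state=seed)
    clf.fit(X_bal, y_bal)                                     # (3) fit with the chosen (m, n)
    print(classification_report(y_te, clf.predict(X_te)))     # test-set precision/recall/F1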
The full data were divided into training and test sets three times. The OOB accuracy, precision, recall and F1 statistic on the three training sets are shown in Table 2.5, and the prediction results on the three test sets in Table 2.6. The ROC curves of the three test sets are shown in Fig. 2.4.
Table 2.5
Table 2.6
Table 2.5 shows that the overall OOB accuracy of the improved random forest model is about 81%, its precision about 81%, its recall about 80% and its F1 statistic about 80%: the model's overall OOB performance is excellent. The original random forest model's OOB accuracy is about 78%, its precision about 75%, its recall about 70% and its F1 statistic about 72%. The improved random forest model therefore beats the original model on accuracy, precision, recall and F1 statistic.
Table 2.6 shows that on the test sets the improved random forest model's overall accuracy is about 81%, its precision about 80%, its recall about 80% and its F1 statistic about 80%: the model's overall test-set performance is excellent and consistent with its own OOB estimate.
Fig. 2.4 shows that the area under the improved model's test-set ROC curve (AUC) is about 0.84, so the model performs well on the ROC curve. The best AUC under the original random forest model's OOB estimate is about 0.77; the improved model therefore outperforms the original model on AUC.
3. Model comparison
3.1 Comparison with logistic regression and the support vector machine
Since the comparison is between different models, the data must be kept consistent: the data used are in every case the up-sampled data. Tables 3.1, 3.2 and 3.3 give the accuracy, precision, recall and F1 of the three models over the three training-set and test-set divisions; Figs. 3.1, 3.2 and 3.3 give the corresponding ROC curves and AUC values. A sketch of the comparison set-up is given below.
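The set-up can be sketched as follows, assuming X_bal, y_bal are an up-sampled training split and X_te, y_te the matching test split from section 2.4; the hyperparameters shown are illustrative, not the tuned values behind the tables.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.svm import SVC

models = {
    "random forest": RandomForestClassifier(n_estimators=100, max_features=3),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(probability=True),          # probability=True enables AUC
}
for name, clf in models.items():
    clf.fit(X_bal, y_bal)                                     # same up-sampled training data for all
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: acc={accuracy_score(y_te, pred):.4f} "
          f"prec={precision_score(y_te, pred):.4f} "
          f"rec={recall_score(y_te, pred):.4f} "
          f"f1={f1_score(y_te, pred):.4f} "
          f"auc={roc_auc_score(y_te, proba):.4f}")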
Table 3.1
Table 3.2
Table 3.3
Table 3.1 shows that after up-sampling the accuracy of the random forest is 81.22%, above logistic regression's 72.31% and the support vector machine's 78.27%; its precision is 80.25%, above 77.14% and 78.53%; its recall is 81.31%, above 71.85% and 78.14%; and its F1 is 80.76%, above 74.18% and 78.33%.
Table 3.2 shows that after up-sampling the accuracy of the random forest is 80.76%, above logistic regression's 72.52% and the support vector machine's 77.51%; its precision is 80.45%, above 77.43% and 78.58%; its recall is 80.83%, above 71.15% and 77.19%; and its F1 is 80.64%, above 74.31% and 77.88%.
Table 3.3 shows that after up-sampling the accuracy of the random forest is 80.57%, above logistic regression's 72.48% and the support vector machine's 79.11%; its precision is 81.11%, above 77.21% and 79.08%; its recall is 80.39%, above 71.82% and 79.16%; and its F1 is 80.75%, above 74.36% and 79.12%.
Fig. 3.1 shows that the AUC of the improved random forest model is 0.85, above logistic regression's 0.79 and the support vector machine's 0.82.
Fig. 3.2 shows that its AUC is 0.83, above logistic regression's 0.78 and the support vector machine's 0.80.
Fig. 3.3 likewise shows an AUC of 0.83, above logistic regression's 0.78 and the support vector machine's 0.80.
Comparing the accuracy, precision, recall, F1 and AUC of the improved random forest model with those of the logistic regression and support vector machine models, the improved random forest model is better across the board: on the same data set, the improved random forest outperforms both the logistic regression model and the support vector machine model.
3.2 Comparison with the original random forest model
Because the improved random forest model up-samples the training data, the two models are compared on the test sets. The original data are divided into training and test sets three times; the original random forest model is built from the original training data, while the improved model is built after up-sampling each of the three training sets. Tables 3.4, 3.5 and 3.6 give the accuracy, precision, recall and F1 of the two models on the three test sets; Figs. 3.4, 3.5 and 3.6 give the corresponding ROC curves and AUC values.
Table 3.4
Table 3.5
Table 3.6
Tables 3.4, 3.5 and 3.6 show that the improved random forest model exceeds the pre-improvement model on accuracy, precision, recall and F1.
Figs. 3.4, 3.5 and 3.6 show that the AUC of the improved random forest model is about 0.09 higher than that of the pre-improvement model, a considerable gain in model performance.
Tables 3.4–3.6 and Figs. 3.4–3.6 thus show that the improved random forest model outperforms the original random forest model across the board; the improvement scheme is practicable.
Having performed best in the comparisons against the original random forest model, the logistic regression model and the support vector machine model, the improved random forest model can be used in actual classification scenes to discriminate personnel categories.

Claims (1)

1. A discrimination method based on a classification algorithm with improved evaluation criteria, characterized in that the steps include:
(1) first collecting data as sample data and constructing a random forest model;
(2) then, in the actual classification scene, collecting the feature-indicator data of the person to be assessed, and using the random forest model obtained in step (1) to quickly discriminate on the feature-indicator data and learn the category of the person to be assessed;
the random forest model in step (1) is constructed by first building an original random forest model with the original random forest algorithm, and then improving the original model with the improved random forest algorithm to obtain the final random forest model:
the construction steps of the original random forest model include:
1) first, drawing from the original data set of the sample data, with replacement, a sample of the same size; 2) then, extracting a certain number of features from the original feature variables of the sample data to form a feature subset; 3) finally, constructing an unpruned decision tree from the sample data obtained in step 1) and the feature subset obtained in step 2); 4) repeating steps 1)–3) N times to form N decision trees, integrating the decision trees with the majority-vote criterion, and completing the construction of the random forest model;
in step (2), the feature variables in the feature-indicator data of the person to be assessed are input to the random forest model, which takes the result agreed on by the majority of the decision trees as the final result;
the original random forest model is improved as follows:
A. first dividing the raw data set into a training set and a test set, performing cross-validation on the training set, using the cross-validation results for a preliminary assessment of model performance and to determine the parameters, and then assessing model performance on the test set;
B. on the training set, first computing the F1 statistic and selecting, as candidate parameters, the model parameters whose F1 statistic is optimal or within 1.5 standard deviations below the optimum;
then computing the classification accuracy for the candidate parameters and selecting the optimal accuracy together with accuracies within 1.5 standard deviations below it, the parameter combinations corresponding to these accuracies becoming the candidate combinations;
next computing the AUC for the candidate parameters and selecting the optimal AUC together with AUCs within 1.5 standard deviations below it, the parameter combinations corresponding to these AUCs becoming the candidate combinations;
finally, substituting the candidate parameters on the test set: the parameter combination with the best F1 statistic on the test set becomes the final combination, and the performance of the model with the final parameters serves as the final assessment of model performance;
C. changing the data distribution with an up-sampling or a down-sampling strategy:
when the data distribution is imbalanced and the numbers of positive and negative samples are not especially large, using the up-sampling strategy to expand the number of minority-class samples;
when the data distribution is imbalanced and both classes are numerous, using the down-sampling strategy to reduce the number of majority-class samples.
CN201811272036.XA 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria Pending CN109344907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811272036.XA CN109344907A (en) 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811272036.XA CN109344907A (en) 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria

Publications (1)

Publication Number Publication Date
CN109344907A true CN109344907A (en) 2019-02-15

Family

ID=65310923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811272036.XA Pending CN109344907A (en) 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria

Country Status (1)

Country Link
CN (1) CN109344907A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222572A (en) * 2020-01-06 2020-06-02 紫光云技术有限公司 Office scene-oriented optical character recognition method
CN112257336A (en) * 2020-10-13 2021-01-22 华北科技学院 Mine water inrush source distinguishing method based on feature selection and support vector machine model
CN113283484A (en) * 2021-05-14 2021-08-20 中国邮政储蓄银行股份有限公司 Improved feature selection method, device and storage medium
CN113762712A (en) * 2021-07-26 2021-12-07 广西大学 Small hydropower cleaning rectification evaluation index screening strategy under big data environment
CN115512844A (en) * 2021-06-03 2022-12-23 四川大学 Metabolic syndrome risk prediction method based on SMOTE technology and random forest algorithm
CN116564409A (en) * 2023-05-06 2023-08-08 海南大学 Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
CN117092525A (en) * 2023-10-20 2023-11-21 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931224A (en) * 2016-04-14 2016-09-07 浙江大学 Pathology identification method for routine scan CT image of liver based on random forests
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy
US20180246112A1 (en) * 2017-02-28 2018-08-30 University Of Kentucky Research Foundation Biomarkers of Breast and Lung Cancer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931224A (en) * 2016-04-14 2016-09-07 浙江大学 Pathology identification method for routine scan CT image of liver based on random forests
US20180246112A1 (en) * 2017-02-28 2018-08-30 University Of Kentucky Research Foundation Biomarkers of Breast and Lung Cancer
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘继辉: "Analysis of the influence weights of process parameters in cut-tobacco production based on random forest regression", 《烟草科技》 (Tobacco Science & Technology) *
肖坚: "Research on an imbalanced-data classification method based on random forest", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222572A (en) * 2020-01-06 2020-06-02 紫光云技术有限公司 Office scene-oriented optical character recognition method
CN112257336A (en) * 2020-10-13 2021-01-22 华北科技学院 Mine water inrush source distinguishing method based on feature selection and support vector machine model
CN113283484A (en) * 2021-05-14 2021-08-20 中国邮政储蓄银行股份有限公司 Improved feature selection method, device and storage medium
CN115512844A (en) * 2021-06-03 2022-12-23 四川大学 Metabolic syndrome risk prediction method based on SMOTE technology and random forest algorithm
CN115512844B (en) * 2021-06-03 2023-05-23 四川大学 Metabolic syndrome risk prediction method based on SMOTE technology and random forest algorithm
CN113762712A (en) * 2021-07-26 2021-12-07 广西大学 Small hydropower cleaning rectification evaluation index screening strategy under big data environment
CN113762712B (en) * 2021-07-26 2024-04-09 广西大学 Small hydropower cleaning rectification evaluation index screening strategy in big data environment
CN116564409A (en) * 2023-05-06 2023-08-08 海南大学 Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
CN117092525A (en) * 2023-10-20 2023-11-21 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment
CN117092525B (en) * 2023-10-20 2024-01-09 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment

Similar Documents

Publication Publication Date Title
CN109344907A (en) Discrimination method based on a classification algorithm with improved evaluation criteria
US10606862B2 (en) Method and apparatus for data processing in data modeling
CN107544253B (en) Large missile equipment retirement safety control method based on improved fuzzy entropy weight method
CN105630743B (en) A kind of system of selection of spectrum wave number
CN108897834A (en) Data processing and method for digging
CN110346831B (en) Intelligent seismic fluid identification method based on random forest algorithm
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN106228389A (en) Network potential usage mining method and system based on random forests algorithm
CN106056136A (en) Data clustering method for rapidly determining clustering center
CN101957913B (en) Information fusion technology-based fingerprint identification method and device
CN110428270A (en) The potential preference client recognition methods of the channel of logic-based regression algorithm
CN109800810A (en) A kind of few sample learning classifier construction method based on unbalanced data
CN107784452A (en) A kind of objective integrated evaluating method of tobacco style characteristic similarity
CN110109902A (en) A kind of electric business platform recommender system based on integrated learning approach
CN110852600A (en) Method for evaluating dynamic risk of market subject
CN107239964A (en) User is worth methods of marking and system
CN110334773A (en) Model based on machine learning enters the screening technique of modular character
CN112396428A (en) User portrait data-based customer group classification management method and device
CN108344701A (en) Paraffin grade qualitative classification based on hyperspectral technique and quantitative homing method
CN113239199B (en) Credit classification method based on multi-party data set
Rofik et al. The Optimization of Credit Scoring Model Using Stacking Ensemble Learning and Oversampling Techniques
CN108776809A (en) A kind of dual sampling Ensemble classifier model based on Fisher cores
CN110222981B (en) Reservoir classification evaluation method based on parameter secondary selection
CN115481494B (en) Method for generating model line pedigree of Yangtze river all-line passenger ship
CN115186776B (en) Method, device and storage medium for classifying ruby producing areas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190215