CN109344907A - Discrimination method based on a classification algorithm with improved evaluation criteria - Google Patents

Discrimination method based on a classification algorithm with improved evaluation criteria

Info

Publication number
CN109344907A
CN109344907A (application CN201811272036.XA)
Authority
CN
China
Prior art keywords
model
random forest
data
forest model
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811272036.XA
Other languages
Chinese (zh)
Inventor
顾海艳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201811272036.XA priority Critical patent/CN109344907A/en
Publication of CN109344907A publication Critical patent/CN109344907A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A discrimination method based on a classification algorithm with improved evaluation criteria. Taking the random forest algorithm as an example, the method selects the random forest parameters using multiple evaluation metrics and up-samples the data to balance the sample distribution, thereby constructing a new random forest model. Comparing the improved random forest model with the original random forest model, a logistic regression model, and a support vector machine model shows that the improved random forest model performs best; that is, selecting algorithm parameters with multiple evaluation metrics is a feasible scheme. The method addresses a problem in the prior art: discrimination in actual classification scenes usually relies on data-mining classification algorithms, but these algorithms typically build models around a single metric, so the resulting discrimination is often unsatisfactory.

Description

Discrimination method based on a classification algorithm with improved evaluation criteria
Technical field
The invention belongs to the field of data-mining applications, and specifically relates to a discrimination method based on a classification algorithm with improved evaluation criteria.
Background technique
Data mining technology plays an increasingly important role in everyday life and production, and is applied in actual scenes such as speech recognition, image recognition, and product recommendation. Classification algorithms are one of its important pillars. A perfect classification algorithm could rival human perception of things. However, because today's traditional classification algorithms still suffer from various defects, none can yet be called perfect, and in special scenes they fail to classify things effectively. Traditional classification algorithms therefore need to be improved so that they come ever closer to a perfect classifier.
Summary of the invention
To solve the above problems, the present invention proposes a new method for classifying categories in actual scenes. The idea of the method is described below:
The random forest algorithm was proposed by Breiman in 2001 and, as an efficient discriminant classification method, has been applied in many fields. The principle of random forest is to build a forest of decision trees in a random manner, with little association between the trees in the forest. Once the random forest model has been built, inputting the features of a new sample yields the category of the sample under test, and the accuracy of the discrimination is considerably higher than that of an ordinary single decision tree.
The present invention is a discrimination method based on a classification algorithm with improved evaluation criteria. Its steps are:
One: first collect feature-indicator data as sample data and construct a random forest model;
Two: then, in the actual classification scene, collect the feature-indicator data of the person to be assessed, and use the random forest model obtained in step one to quickly discriminate on those data, determining the category of the person to be assessed.
The random forest algorithm in step one:
1, original random forests algorithm
A single decision tree carries a relatively large error and a risk of over-fitting. To solve these problems of the decision tree, Breiman proposed the random forest algorithm in 2001. Its core idea is:
1) First, draw from the original data set, with replacement, a sample of the same size as the original data;
2) then, extract a certain number of features from the original feature variables to form a feature subset;
3) finally, construct an unpruned decision tree from the drawn sample data and the feature subset.
The above three steps are repeated N times to form N decision trees; the decision trees are integrated using the majority-vote criterion, which completes the construction of the random forest model.
When the feature variables of a new sample are input to the model, the random forest takes the result agreed on by the majority of the decision trees as the final result. A minimal sketch of this construction is given below.
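The following is an illustration under stated assumptions, not the patent's implementation: scikit-learn's DecisionTreeClassifier (unpruned by default) stands in for the trees, labels are assumed to be 0/1, and the parameter values are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=50, m_features=3, seed=0):
    """Build a forest: bootstrap rows, random feature subset, unpruned tree."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, n, size=n)                     # 1) bootstrap sample, with replacement
        cols = rng.choice(p, size=m_features, replace=False)  # 2) random feature subset
        tree = DecisionTreeClassifier()                       # 3) unpruned by default
        tree.fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Majority vote over the trees; labels are assumed to be 0/1."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)            # majority of the N trees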
Random forests can handle high-dimensional data, require no feature selection, and build models quickly. But during training the model depends only on the out-of-bag (OOB) estimate, its evaluation metric is single, and selecting parameters with a single evaluation metric easily leads to an over-optimistic estimate of model performance. Moreover, when the sample data are imbalanced, the model easily favours the majority class, and discrimination of the minority class is poor. These three shortcomings must therefore be overcome. The invention proposes an improved random forest algorithm.
2, improved random forests algorithm
To address the problems that the original random forest model relies only on the OOB estimate, that its evaluation metric is single, and that the model favours the majority class when the samples are imbalanced, the present invention proposes an improved random forest algorithm.
A. Improvement for relying only on the OOB estimate.
The evaluation of the original random forest depends only on the OOB estimate, which easily leads to an over-optimistic assessment. To overcome this drawback, the present invention first divides the data into a training set and a test set, performs cross-validation on the training set, uses the cross-validation results for a preliminary assessment of model performance and to determine the parameters, and then assesses model performance on the test set.
The combined assessment from cross-validation and the test set is better than an assessment that relies only on the OOB estimate. A sketch of this procedure is given below.
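A sketch under stated assumptions: X and y are the preprocessed feature matrix and labels, the 3:1 split and 5-fold cross-validation are illustrative choices, and the (m, n) grid anticipates the one used in the empirical section below.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)         # 3:1 split, held out once

cv_scores = {}
for m in (2, 3, 4):                                           # feature-subset sizes
    for n in (10, 50, 100, 150, 200, 300, 500):               # tree counts
        clf = RandomForestClassifier(n_estimators=n, max_features=m, random_state=0)
        cv_scores[(m, n)] = cross_val_score(
            clf, X_train, y_train, cv=5, scoring="f1").mean()

best_m, best_n = max(cv_scores, key=cv_scores.get)            # preliminary choice by CV
final = RandomForestClassifier(n_estimators=best_n, max_features=best_m).fit(X_train, y_train)
print("held-out test accuracy:", final.score(X_test, y_test)) # assessed on the test set last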
B. Improvement for the single model-evaluation metric.
The evaluation of the original random forest depends on a single metric, which cannot effectively reflect class imbalance or the relative importance of the classes. To overcome this drawback, the present invention proceeds as follows in the model-training stage:
First, the F1 statistic is computed, and the model parameters whose F1 statistic is optimal or within 1.5 standard deviations below the optimum are selected;
then, the classification accuracy is computed for the candidate parameters from the previous step, and the optimal accuracy together with accuracies within 1.5 standard deviations below it are selected; the parameter combinations corresponding to these accuracies become the candidate combinations;
finally, the AUC is computed for the remaining candidates, and the optimal AUC together with AUCs within 1.5 standard deviations below it are selected; the parameter combinations corresponding to these AUCs become the candidate combinations.
The candidate parameters from the steps above are then substituted on the test set; the parameter combination with the best F1 statistic on the test set becomes the final combination, and the performance of the model with the final parameters serves as the final assessment of model performance. A sketch of this staged selection is given below.
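This is an illustration rather than the patent's code: f1_scores, acc_scores, auc_scores and test_f1 are assumed dictionaries mapping each (m, n) combination to its F1 statistic, classification accuracy, AUC and test-set F1 respectively.

```python
import numpy as np

def winnow(candidates, scores):
    """Keep candidates whose score is within 1.5 standard deviations of the best."""
    vals = np.array([scores[c] for c in candidates])
    cutoff = vals.max() - 1.5 * vals.std()
    return [c for c, v in zip(candidates, vals) if v >= cutoff]

candidates = list(f1_scores)                   # all (m, n) parameter combinations
candidates = winnow(candidates, f1_scores)     # round 1: F1 statistic
candidates = winnow(candidates, acc_scores)    # round 2: classification accuracy
candidates = winnow(candidates, auc_scores)    # round 3: AUC
final_combo = max(candidates, key=lambda c: test_f1[c])  # round 4: best F1 on the test set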
C. Improvement for the model's bias toward the majority class under imbalanced sample data.
Here the main idea is to change the data distribution; the main strategies are up-sampling and down-sampling. When the data distribution is imbalanced and the counts of the two classes are not especially large, the up-sampling strategy is used to expand the number of minority-class samples; when the distribution is imbalanced and both classes are numerous, the down-sampling strategy is used to reduce the number of majority-class samples. A small illustrative decision rule is sketched below.
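A purely illustrative helper for choosing between the two strategies; the size threshold is an assumed placeholder, not a value fixed by the invention.

```python
# Hypothetical helper: pick a resampling strategy from the two class counts.
def choose_strategy(n_minority, n_majority, large=10_000):    # `large` is an assumed placeholder
    if n_minority == n_majority:
        return "none"                        # already balanced
    if min(n_minority, n_majority) >= large:
        return "down-sample"                 # both classes numerous: shrink the majority
    return "up-sample"                       # minority small: synthesise more of it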
In the prior art: the comprehensive evaluation metric F1 is the harmonic mean of precision P (also called the precision ratio) and recall R, i.e. F1 = 2PR / (P + R); AUC is the area under the ROC curve.
The method of the invention applies the random forest model to an actual classification scene. Given the shortcomings of the random forest model and the imbalanced distribution of the original data samples, it improves the existing random forest algorithm: it searches for the optimal parameters with multiple metrics and constructs artificial samples from the raw sample data, forming a new data set. The sample data are then fitted with the optimal parameters to construct a new random forest model. The results show that the performance of the improved random forest model is raised and that it is suitable for actual classification scenes.
Detailed description of the invention
Fig. 1 is the ROC curve corresponding to the maximum accuracy in the empirical section on the original random forest algorithm;
Fig. 2.1, 2.2 and 2.3 are, in the empirical section on the improved random forest model, the ROC curves corresponding to the AUC values of Table 2.3;
Fig. 2.4 shows, in the empirical section on the improved random forest model, the ROC curves of the three test sets;
Fig. 3.1, 3.2 and 3.3 are, in the model-comparison section, the ROC curves and AUC values of the three models over the three training-set and test-set divisions;
Fig. 3.4, 3.5 and 3.6 are, in the model-comparison section, the ROC curves and AUC values of the three test sets for the models before and after the improvement.
Specific embodiment
The present invention is further described below with specific embodiments and with reference to the accompanying drawings.
1. Empirical study of the original random forest algorithm
To demonstrate the effect of the improvement, a classified sample data set was selected as the data set to fit; the ratio of positive to negative samples in the data set is 1:3. The feature variables are feature1, feature2, feature3, feature4, feature5 and feature6, and y is the variable to be classified.
1.1 Data preprocessing
(1) Eliminating multicollinearity
The numeric feature variables feature1, feature2, feature3, feature4 and feature5 were tested for multicollinearity; the results are shown in Table 1.1:
Table 1.1
As Table 1.1 shows, the absolute values of the correlation coefficients between the numeric feature variables are all below 0.5, indicating that the linear dependence among the feature variables is weak; these feature variables can therefore be substituted into the random forest model.
(2) Correcting skewness
The numeric feature variables were tested for normality, using the skewness of each variable as the index. The skewness of each variable is shown in Table 1.2:
Table 1.2
Since the skewness of feature1, feature2, feature3 and feature5 is large, these feature variables need a skewness transformation; the Box-Cox transformation is used here. The skewness of the transformed data is shown in Table 1.3:
Table 1.3
The transformed feature variables are closer to a normal distribution than the original feature variables.
(3) Standardization
The numeric variables were standardized. The means and standard deviations of the Box-Cox-transformed data are shown in Table 1.4:
Table 1.4
The means and standard deviations of the standardized data are shown in Table 1.5:
Table 1.5
Because the categorical variable feature6 has only two states, it does not need one-hot encoding. A sketch of these preprocessing steps is given below.
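The three steps can be sketched as follows, assuming the sample sits in a pandas DataFrame df with numeric columns feature1 through feature5; scipy's boxcox requires strictly positive values, so a column may need shifting first.

```python
from scipy import stats
from sklearn.preprocessing import StandardScaler

num_cols = ["feature1", "feature2", "feature3", "feature4", "feature5"]

# (1) multicollinearity check: weak linear dependence if all |r| < 0.5
print(df[num_cols].corr().abs())

# (2) skewness correction with Box-Cox on the heavily skewed columns
# (Box-Cox requires strictly positive values; shift the column first if needed)
for col in ["feature1", "feature2", "feature3", "feature5"]:
    df[col], _ = stats.boxcox(df[col])

# (3) standardization to zero mean and unit variance
df[num_cols] = StandardScaler().fit_transform(df[num_cols])
print(df[num_cols].skew())                      # check how close to normal we got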
1.2 Construction of the random forest model
The random forest model is built as follows:
(1) the total number of feature variables is 6, and the number m of feature variables in the feature subset of a single decision tree may be 2, 3 or 4;
(2) the number of trees n in the forest is set to 10, 50, 100, 150, 200, 300 or 500;
(3) the Cartesian product of the subset sizes and tree counts gives the parameter combinations (m, n);
(4) a random forest model is fitted for each combination, giving 3 × 7 = 21 models;
(5) the out-of-bag (OOB) accuracy of each model is obtained, and the combination with the highest accuracy is chosen as the optimal parameter combination;
(6) the random forest model is fitted with the optimal parameter combination and the full data. A sketch of this procedure is given below.
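A sketch of steps (1)-(6) under the assumption that X and y are the preprocessed feature matrix and labels; scikit-learn's oob_score_ (out-of-bag accuracy) stands in for the OOB accuracy referred to above.

```python
from itertools import product
from sklearn.ensemble import RandomForestClassifier

grid = list(product((2, 3, 4), (10, 50, 100, 150, 200, 300, 500)))  # 21 (m, n) pairs

oob = {}
for m, n in grid:
    clf = RandomForestClassifier(n_estimators=n, max_features=m,
                                 oob_score=True, random_state=0)
    clf.fit(X, y)
    oob[(m, n)] = clf.oob_score_                 # accuracy on the out-of-bag samples

best_m, best_n = max(oob, key=oob.get)           # e.g. (3, 50) in Table 1.6
final_model = RandomForestClassifier(
    n_estimators=best_n, max_features=best_m).fit(X, y)  # refit on the full data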
Table 1.6 gives the OOB accuracy of all the random forest models under all parameter combinations:
Table 1.6
Table 1.6 shows that when the feature subset contains 3 feature variables and the forest contains 50 trees, the random forest model attains its maximum OOB accuracy, 78.09%.
The OOB accuracy, precision, recall and F1 statistic corresponding to the maximum accuracy are shown in Table 1.7.
Table 1.7
The ROC curve corresponding to the maximum accuracy is shown in Fig. 1; the AUC is 0.77.
Analysis of the random forest model with parameter combination (3, 50) finds an accuracy of 78.09%, a precision of 75.36%, a recall of 70.27% and an F1 of 72.73%; the AUC of the model is 0.77. Because the negative samples outnumber the positive samples in the data, this result is to be expected.
The final result for the model built by the original random forest is thus a maximum accuracy of 78.09%, a precision of 75.36%, a recall of 70.27% and an F1 of 72.73%, which is to be expected given the sample imbalance. Since the original random forest model cannot effectively discriminate the positive samples, the original algorithm needs to be improved so that both positive and negative classes are handled; and the model's parameters should be determined from several metrics taken together rather than from a single metric.
2. Empirical study of the improved random forest model
2.1 Sample balancing
Since the sample distribution of the data is imbalanced and the numbers of positive and negative samples are fairly small, the up-sampling approach is the suitable one. The present invention mainly uses the SMOTE algorithm for up-sampling.
The basis of the SMOTE (Synthetic Minority Oversampling Technique) algorithm is random over-sampling. Because random over-sampling simply copies minority-class samples, it leads to over-fitting of the model. To address this drawback, SMOTE first analyses the minority-class samples and then synthesises artificial samples from the analysis rather than simply copying. The algorithm proceeds as follows:
(1) For each minority-class sample x, compute the Euclidean distance from x to all minority-class samples and determine its k nearest neighbours;
(2) compute the imbalance ratio of positive to negative samples and from it the sampling multiple n; randomly select neighbours from the k nearest, and suppose a selected neighbour is y;
(3) for each randomly selected neighbour y, construct a new sample by interpolating between x and y:
x_new = x + rand(0, 1) × (y − x)
Applying SMOTE to the data balances it: the ratio of positive to negative samples becomes approximately 1:1. A minimal sketch of these steps is given below.
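A minimal sketch of steps (1)-(3), assuming X_min holds the minority-class rows as a NumPy array; in practice the SMOTE implementation in the imbalanced-learn library covers the same ground.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, seed=0):
    """Synthesise n_new artificial minority samples from minority rows X_min."""
    rng = np.random.default_rng(seed)
    # (1) k nearest minority neighbours of every minority sample (Euclidean)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                # idx[:, 0] is the sample itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))             # (2) pick a minority sample x ...
        j = rng.choice(idx[i, 1:])               # ... and one of its k neighbours y
        gap = rng.random()                       # (3) x_new = x + rand(0,1) * (y - x)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)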
2.2 Training-set and test-set division
Because random forests have the OOB estimate, they do not strictly require a division into training and test sets. But since the OOB estimate may lead to an over-optimistic view of model performance, a more faithful assessment of the model's generalization requires dividing the data into training and test sets. The ratio of training set to test set is set at 3:1, and the division is repeated three times so that the assessment of the model's generalization performance is more reliable.
2.3 Determining the optimal parameters
(1) With the F1 statistic on the original data as the evaluation metric, a first round of screening is applied to the parameter combinations. Table 2.1 gives the OOB F1 statistic of all random forest models under all parameter combinations.
Table 2.1
The maximum F1 statistic is 72.82% with a standard deviation of 2.5%, so the range within 1.5 standard deviations below the maximum is 68.98%–72.82%; the candidate combinations (2, 10), (3, 10), (3, 50) and (3, 100) therefore advance to the next round.
(2) With the accuracy on the original data as the evaluation metric, a second round of screening is applied to the parameter combinations. Table 2.2 gives the OOB accuracy of all random forest models under the second-round candidate combinations.
Table 2.2
The maximum accuracy is 78.09% with a standard deviation of 2.1%, so the range within 1.5 standard deviations below the maximum is 75.00%–78.09%; the candidates (3, 10), (3, 50) and (3, 100) advance to the next round.
(3) With the AUC on the original data as the evaluation metric, a third round of screening is applied. Table 2.3 gives the OOB AUC of all random forest models under the third-round candidates; Figs. 2.1, 2.2 and 2.3 are the corresponding ROC curves.
Table 2.3
The maximum AUC is 0.77 with a standard deviation of 0.05, so the range within 1.5 standard deviations below the maximum is 0.75–0.77; the candidates (3, 50) and (3, 100) advance to the next round.
(4) With the F1 statistic on the test set as the evaluation metric, a fourth round of screening is applied. Table 2.4 gives the F1 statistic on the test set for the remaining random forest models.
Table 2.4
Table 2.4 shows that a feature subset of 3 feature variables and a forest of 100 trees fit the best-performing random forest model. Since the optimal parameters were determined on the original data, the imbalance between positive and negative samples is still unresolved; with the optimal parameters now fixed, the final random forest model must be constructed on the up-sampled data set.
2.4 Model fitting
The model-fitting process is as follows:
(1) divide the data into training and test sets;
(2) construct artificial samples on the training set with the SMOTE algorithm and add them to the original data, forming a new training set;
(3) fit the random forest model on the new training set with the parameters determined above. A sketch of this pipeline is given below.
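A sketch under stated assumptions: the SMOTE step is supplied by the imbalanced-learn library, and (m, n) = (3, 100) is the combination selected in section 2.3.

```python
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

for seed in (0, 1, 2):                                        # three repeated divisions
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)  # (1) 3:1 split
    X_bal, y_bal = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)  # (2) balance the training set only
    clf = RandomForestClassifier(n_estimators=100, max_features=3,
                                 oob_score=True, random_state=seed)
    clf.fit(X_bal, y_bal)                                     # (3) fit with the chosen (m, n)
    print(classification_report(y_te, clf.predict(X_te)))     # test-set precision/recall/F1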
The full data were divided into training and test sets three times. The OOB accuracy, precision, recall and F1 statistic on the three training sets are shown in Table 2.5, and the prediction results on the three test sets in Table 2.6. The ROC curves of the three test sets are shown in Fig. 2.4.
Table 2.5
Table 2.6
Table 2.5 shows that the overall OOB accuracy of the improved random forest model is about 81%, its precision about 81%, its recall about 80% and its F1 statistic about 80%: the model's overall OOB performance is excellent. The original random forest model's OOB accuracy is about 78%, its precision about 75%, its recall about 70% and its F1 statistic about 72%. The improved random forest model therefore beats the original model on accuracy, precision, recall and F1 statistic.
Table 2.6 shows that on the test sets the improved random forest model's overall accuracy is about 81%, its precision about 80%, its recall about 80% and its F1 statistic about 80%: the model's overall test-set performance is excellent and consistent with its own OOB estimate.
Fig. 2.4 shows that the area under the improved model's test-set ROC curve (AUC) is about 0.84, so the model performs well on the ROC curve. The best AUC under the original random forest model's OOB estimate is about 0.77; the improved model therefore outperforms the original model on AUC.
3. Model comparison
3.1 Comparison with logistic regression and the support vector machine
Since the comparison is between different models, the data must be kept consistent: the data used are in every case the up-sampled data. Tables 3.1, 3.2 and 3.3 give the accuracy, precision, recall and F1 of the three models over the three training-set and test-set divisions; Figs. 3.1, 3.2 and 3.3 give the corresponding ROC curves and AUC values. A sketch of the comparison set-up is given below.
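The set-up can be sketched as follows, assuming X_bal, y_bal are an up-sampled training split and X_te, y_te the matching test split from section 2.4; the hyperparameters shown are illustrative, not the tuned values behind the tables.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.svm import SVC

models = {
    "random forest": RandomForestClassifier(n_estimators=100, max_features=3),
    "logistic regression": LogisticRegression(max_iter=1000),
    "support vector machine": SVC(probability=True),          # probability=True enables AUC
}
for name, clf in models.items():
    clf.fit(X_bal, y_bal)                                     # same up-sampled training data for all
    pred = clf.predict(X_te)
    proba = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: acc={accuracy_score(y_te, pred):.4f} "
          f"prec={precision_score(y_te, pred):.4f} "
          f"rec={recall_score(y_te, pred):.4f} "
          f"f1={f1_score(y_te, pred):.4f} "
          f"auc={roc_auc_score(y_te, proba):.4f}")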
Table 3.1
Table 3.2
Table 3.3
Table 3.1 shows that after up-sampling the accuracy of the random forest is 81.22%, above logistic regression's 72.31% and the support vector machine's 78.27%; its precision is 80.25%, above 77.14% and 78.53%; its recall is 81.31%, above 71.85% and 78.14%; and its F1 is 80.76%, above 74.18% and 78.33%.
Table 3.2 shows that after up-sampling the accuracy of the random forest is 80.76%, above logistic regression's 72.52% and the support vector machine's 77.51%; its precision is 80.45%, above 77.43% and 78.58%; its recall is 80.83%, above 71.15% and 77.19%; and its F1 is 80.64%, above 74.31% and 77.88%.
Table 3.3 shows that after up-sampling the accuracy of the random forest is 80.57%, above logistic regression's 72.48% and the support vector machine's 79.11%; its precision is 81.11%, above 77.21% and 79.08%; its recall is 80.39%, above 71.82% and 79.16%; and its F1 is 80.75%, above 74.36% and 79.12%.
Fig. 3.1 shows that the AUC of the improved random forest model is 0.85, above logistic regression's 0.79 and the support vector machine's 0.82.
Fig. 3.2 shows that its AUC is 0.83, above logistic regression's 0.78 and the support vector machine's 0.80.
Fig. 3.3 likewise shows an AUC of 0.83, above logistic regression's 0.78 and the support vector machine's 0.80.
Comparing the accuracy, precision, recall, F1 and AUC of the improved random forest model with those of the logistic regression and support vector machine models, the improved random forest model is better across the board: on the same data set, the improved random forest outperforms both the logistic regression model and the support vector machine model.
3.2 Comparison with the original random forest model
Because the improved random forest model up-samples the training data, the two models are compared on the test sets. The original data are divided into training and test sets three times; the original random forest model is built from the original training data, while the improved model is built after up-sampling each of the three training sets. Tables 3.4, 3.5 and 3.6 give the accuracy, precision, recall and F1 of the two models on the three test sets; Figs. 3.4, 3.5 and 3.6 give the corresponding ROC curves and AUC values.
Table 3.4
Table 3.5
Table 3.6
Tables 3.4, 3.5 and 3.6 show that the improved random forest model exceeds the pre-improvement model on accuracy, precision, recall and F1.
Figs. 3.4, 3.5 and 3.6 show that the AUC of the improved random forest model is about 0.09 higher than that of the pre-improvement model, a considerable gain in model performance.
Tables 3.4–3.6 and Figs. 3.4–3.6 thus show that the improved random forest model outperforms the original random forest model across the board; the improvement scheme is practicable.
Having performed best in the comparisons against the original random forest model, the logistic regression model and the support vector machine model, the improved random forest model can be used in actual classification scenes to discriminate personnel categories.

Claims (1)

1. A discrimination method based on a classification algorithm with improved evaluation criteria, characterized in that the steps include:
(1) first collecting data as sample data and constructing a random forest model;
(2) then, in the actual classification scene, collecting the feature-indicator data of the person to be assessed, and using the random forest model obtained in step (1) to quickly discriminate on the feature-indicator data and learn the category of the person to be assessed;
the random forest model in step (1) is constructed by first building an original random forest model with the original random forest algorithm, and then improving the original model with the improved random forest algorithm to obtain the final random forest model:
the construction steps of the original random forest model include:
1) first, drawing from the original data set of the sample data, with replacement, a sample of the same size; 2) then, extracting a certain number of features from the original feature variables of the sample data to form a feature subset; 3) finally, constructing an unpruned decision tree from the sample data obtained in step 1) and the feature subset obtained in step 2); 4) repeating steps 1)–3) N times to form N decision trees, integrating the decision trees with the majority-vote criterion, and completing the construction of the random forest model;
in step (2), the feature variables in the feature-indicator data of the person to be assessed are input to the random forest model, which takes the result agreed on by the majority of the decision trees as the final result;
the original random forest model is improved as follows:
A. first dividing the raw data set into a training set and a test set, performing cross-validation on the training set, using the cross-validation results for a preliminary assessment of model performance and to determine the parameters, and then assessing model performance on the test set;
B. on the training set, first computing the F1 statistic and selecting, as candidate parameters, the model parameters whose F1 statistic is optimal or within 1.5 standard deviations below the optimum;
then computing the classification accuracy for the candidate parameters and selecting the optimal accuracy together with accuracies within 1.5 standard deviations below it, the parameter combinations corresponding to these accuracies becoming the candidate combinations;
next computing the AUC for the candidate parameters and selecting the optimal AUC together with AUCs within 1.5 standard deviations below it, the parameter combinations corresponding to these AUCs becoming the candidate combinations;
finally, substituting the candidate parameters on the test set: the parameter combination with the best F1 statistic on the test set becomes the final combination, and the performance of the model with the final parameters serves as the final assessment of model performance;
C. changing the data distribution with an up-sampling or a down-sampling strategy:
when the data distribution is imbalanced and the numbers of positive and negative samples are not especially large, using the up-sampling strategy to expand the number of minority-class samples;
when the data distribution is imbalanced and both classes are numerous, using the down-sampling strategy to reduce the number of majority-class samples.
CN201811272036.XA 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria Pending CN109344907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811272036.XA CN109344907A (en) 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811272036.XA CN109344907A (en) 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria

Publications (1)

Publication Number Publication Date
CN109344907A true CN109344907A (en) 2019-02-15

Family

ID=65310923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811272036.XA Pending CN109344907A (en) 2018-10-30 2018-10-30 Discrimination method based on a classification algorithm with improved evaluation criteria

Country Status (1)

Country Link
CN (1) CN109344907A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222572A (en) * 2020-01-06 2020-06-02 紫光云技术有限公司 Office scene-oriented optical character recognition method
CN112257336A (en) * 2020-10-13 2021-01-22 华北科技学院 Mine water inrush source distinguishing method based on feature selection and support vector machine model
CN113283484A (en) * 2021-05-14 2021-08-20 中国邮政储蓄银行股份有限公司 Improved feature selection method, device and storage medium
CN113762712A (en) * 2021-07-26 2021-12-07 广西大学 Small hydropower cleaning rectification evaluation index screening strategy under big data environment
CN115512844A (en) * 2021-06-03 2022-12-23 四川大学 Metabolic syndrome risk prediction method based on SMOTE technology and random forest algorithm
CN116564409A (en) * 2023-05-06 2023-08-08 海南大学 Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
CN117092525A (en) * 2023-10-20 2023-11-21 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931224A (en) * 2016-04-14 2016-09-07 浙江大学 Pathology identification method for routine scan CT image of liver based on random forests
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy
US20180246112A1 (en) * 2017-02-28 2018-08-30 University Of Kentucky Research Foundation Biomarkers of Breast and Lung Cancer

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105931224A (en) * 2016-04-14 2016-09-07 浙江大学 Pathology identification method for routine scan CT image of liver based on random forests
US20180246112A1 (en) * 2017-02-28 2018-08-30 University Of Kentucky Research Foundation Biomarkers of Breast and Lung Cancer
CN108038448A (en) * 2017-12-13 2018-05-15 河南理工大学 Semi-supervised random forest Hyperspectral Remote Sensing Imagery Classification method based on weighted entropy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘继辉: "Analysis of the influence weights of process parameters in cut-tobacco production based on random forest regression", 《烟草科技》 (Tobacco Science & Technology) *
肖坚: "Research on an imbalanced-data classification method based on random forest", 《中国优秀硕士学位论文全文数据库 信息科技辑》 (China Masters' Theses Full-text Database, Information Science and Technology) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222572A (en) * 2020-01-06 2020-06-02 紫光云技术有限公司 Office scene-oriented optical character recognition method
CN112257336A (en) * 2020-10-13 2021-01-22 华北科技学院 Mine water inrush source distinguishing method based on feature selection and support vector machine model
CN113283484A (en) * 2021-05-14 2021-08-20 中国邮政储蓄银行股份有限公司 Improved feature selection method, device and storage medium
CN115512844A (en) * 2021-06-03 2022-12-23 四川大学 Metabolic syndrome risk prediction method based on SMOTE technology and random forest algorithm
CN115512844B (en) * 2021-06-03 2023-05-23 四川大学 Metabolic syndrome risk prediction method based on SMOTE technology and random forest algorithm
CN113762712A (en) * 2021-07-26 2021-12-07 广西大学 Small hydropower cleaning rectification evaluation index screening strategy under big data environment
CN113762712B (en) * 2021-07-26 2024-04-09 广西大学 Small hydropower cleaning rectification evaluation index screening strategy in big data environment
CN116564409A (en) * 2023-05-06 2023-08-08 海南大学 Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
CN117092525A (en) * 2023-10-20 2023-11-21 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment
CN117092525B (en) * 2023-10-20 2024-01-09 广东采日能源科技有限公司 Training method and device for battery thermal runaway early warning model and electronic equipment

Similar Documents

Publication Publication Date Title
CN109344907A (en) Discrimination method based on a classification algorithm with improved evaluation criteria
US10606862B2 (en) Method and apparatus for data processing in data modeling
CN107544253B (en) Large missile equipment retirement safety control method based on improved fuzzy entropy weight method
CN105630743B (en) A kind of system of selection of spectrum wave number
CN108897834A (en) Data processing and method for digging
CN110346831B (en) Intelligent seismic fluid identification method based on random forest algorithm
CN105373606A (en) Unbalanced data sampling method in improved C4.5 decision tree algorithm
CN106228389A (en) Network potential usage mining method and system based on random forests algorithm
CN106056136A (en) Data clustering method for rapidly determining clustering center
CN101957913B (en) Information fusion technology-based fingerprint identification method and device
CN110428270A (en) The potential preference client recognition methods of the channel of logic-based regression algorithm
CN109800810A (en) A kind of few sample learning classifier construction method based on unbalanced data
CN107784452A (en) A kind of objective integrated evaluating method of tobacco style characteristic similarity
CN110109902A (en) A kind of electric business platform recommender system based on integrated learning approach
CN110852600A (en) Method for evaluating dynamic risk of market subject
CN107239964A (en) User is worth methods of marking and system
CN110334773A (en) Model based on machine learning enters the screening technique of modular character
CN112396428A (en) User portrait data-based customer group classification management method and device
CN108344701A (en) Paraffin grade qualitative classification based on hyperspectral technique and quantitative homing method
CN113239199B (en) Credit classification method based on multi-party data set
Rofik et al. The Optimization of Credit Scoring Model Using Stacking Ensemble Learning and Oversampling Techniques
CN108776809A (en) A kind of dual sampling Ensemble classifier model based on Fisher cores
CN110222981B (en) Reservoir classification evaluation method based on parameter secondary selection
CN115481494B (en) Method for generating model line pedigree of Yangtze river all-line passenger ship
CN115186776B (en) Method, device and storage medium for classifying ruby producing areas

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190215