CN108416369A - Classification method, system, medium and device based on Stacking and flipped random down-sampling - Google Patents

Classification method, system, medium and device based on Stacking and flipped random down-sampling

Info

Publication number
CN108416369A
Authority
CN
China
Prior art keywords
module
training
classification
component
component classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810132427.5A
Other languages
Chinese (zh)
Inventor
蒋昌俊
闫春钢
刘关俊
丁志军
张亚英
张裕威
栾文静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201810132427.5A
Publication of CN108416369A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Abstract

A classification method, system, medium and device based on Stacking and flipped random down-sampling, comprising: obtaining an original training set and dividing it into two or more different training subsets; establishing two or more different component classifiers for each training subset; training on the outputs of the component classifiers, taken as features, to generate an integrated classifier; obtaining a test sample, classifying it with each component classifier in turn to obtain component classification results, composing a new test sample whose features are those results, and classifying the new test sample with the integrated classifier to obtain the final classification result. The present invention solves the technical problems of the prior art: low classification precision on imbalanced data, inability to recognize the minority class, poor representation of the data's distribution characteristics, and low discrimination among classifiers.

Description

Classification method, system, medium and device based on Stacking and flipped random down-sampling
Technical field
The present invention relates to imbalanced data classification methods, and more particularly to a data classification method, system, medium and device based on Stacking and flipped random down-sampling.
Background technology
As the application range of data mining keeps expanding and the problems it solves deepen, new challenges and obstacles emerge one after another, producing a series of new problems; among them, the classification of imbalanced data sets is an important problem that has drawn wide attention. Imbalanced data classification concerns learning when the numbers of samples in different classes are unbalanced. Many current machine learning algorithms assume or expect a data set with a balanced class distribution or with equal misclassification costs; when handling complex imbalanced data sets, such algorithms cannot effectively represent the distribution characteristics of the data, which seriously degrades classifier performance. Since two-class problems are the most common in practice, this technique addresses binary classification only. Current techniques for the imbalance problem work mainly at the data level or at the algorithm level: the main data-level technique is resampling, and the main algorithm-level technique is classifier fusion. Resampling balances the data by adding minority-class samples or removing majority-class samples, but existing resampling techniques always leave the number of majority-class samples greater than or equal to the number of minority-class samples after resampling, so a traditional machine learning algorithm still tends to predict the majority class, minority-class samples cannot be classified accurately, and final classifier performance suffers. Classifier fusion divides the data set into multiple balanced data subsets, trains one classifier per subset, and then combines the classifiers through some strategy (such as voting); however, it cannot distinguish strong classifiers from weak ones, so the optimal classification effect cannot be reached.
In conclusion the unbalanced data classification method of the prior art cannot effectively show the distribution characteristics of data, pass The machine learning algorithm of system can be partial to predict most class samples, and minority class sample is caused not classified accurately, influence most Whole classifier performance, the performance that different classifications device cannot be distinguished is strong and weak, and there are unbalanced data classification low precision, None- identified are few Several classes of, the distribution characteristics performance technical problem that validity is low and grader discrimination is relatively low.
Invention content
In view of the above technical problems of the prior art, the object of the present invention is to provide a classification method, system, medium and device based on Stacking and flipped random down-sampling, solving the prior art's technical problems of low classification precision on imbalanced data, inability to recognize the minority class, poor representation of distribution characteristics and low discrimination among classifiers. A classification method based on Stacking and flipped random down-sampling includes:
obtaining an original training set and dividing it into two or more different training subsets;
establishing two or more different component classifiers for each training subset;
training on the outputs of the component classifiers, taken as features, to generate an integrated classifier;
obtaining a test sample, classifying it with each component classifier in turn to obtain component classification results, composing a new test sample whose features are those results, and classifying the new test sample with the integrated classifier to obtain the final classification result.
In one embodiment of the present invention, obtaining the original training set and dividing it into two or more different training subsets specifically includes:
receiving the original training set D;
dividing the original training set D into a majority-class sample set A and a minority-class sample set B;
initializing the sampling count i and the down-sampling count k;
judging whether the sampling count i is less than the down-sampling count k;
if so, repeatedly extracting majority-class samples from the majority-class sample set A without replacement, where the number of majority-class samples extracted each time is n = ceil(|B|²/|A|);
composing a training subset D_i from the n majority-class samples and all minority-class samples;
if not, ending the division of the original training set.
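As a minimal sketch of the subset-division steps above, the following Python fragment draws n = ceil(|B|²/|A|) majority samples per subset and builds k = 1.5 × ceil(|A|/n) subsets, each also containing all minority samples. The function name and the toy data are illustrative assumptions, not the patent's implementation.

```python
import math
import random

def flipped_downsample(majority, minority, seed=0):
    """Flipped random down-sampling: build k training subsets in which
    the minority class outnumbers the majority class."""
    rng = random.Random(seed)
    A, B = list(majority), list(minority)
    n = math.ceil(len(B) ** 2 / len(A))    # majority samples per subset
    k = int(1.5 * math.ceil(len(A) / n))   # number of subsets
    # Each subset draws n majority samples without replacement (a fresh
    # draw from the full majority set each time) plus ALL minority samples.
    subsets = [rng.sample(A, n) + B for _ in range(k)]
    return subsets, n, k

# Toy data with the sizes of the patent's worked example: |A| = 158, |B| = 56.
maj = [("maj", i) for i in range(158)]
mino = [("min", i) for i in range(56)]
subs, n, k = flipped_downsample(maj, mino)
print(n, k, len(subs[0]))   # → 20 12 76
```

With these sizes the 56 minority samples outnumber the 20 drawn majority samples in every subset, which is the flipped ratio the method relies on.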
In one embodiment of the present invention, establishing two or more different component classifiers for each training subset specifically includes:
initializing an ordered set C;
judging whether all k training subsets D_i have been processed;
if not, training on each training subset D_i to obtain m different component classifiers, looping over the k training subsets D_i;
if so, ending the training of component classifiers;
placing the component classifiers into the ordered set C in order, where the number of elements of C is |C| = c, c = m × k;
obtaining a validation set V;
classifying the validation samples in the validation set V with each component classifier to obtain validation classification results;
taking the validation classification results as features to generate a new data set D'.
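A compact sketch of these steps, with deliberately trivial threshold rules standing in for the m learning algorithms (the embodiments use decision tree, naive Bayes, Bayesian network, logistic regression and SVM); every name below is an illustrative assumption.

```python
import statistics

def mean_rule(subset):
    """Stand-in learner: predict 1 (minority) when x is at least the subset mean."""
    t = statistics.mean(x for x, _ in subset)
    return lambda x: 1 if x >= t else 0

def median_rule(subset):
    """Stand-in learner: predict 1 when x is at least the subset median."""
    t = statistics.median(x for x, _ in subset)
    return lambda x: 1 if x >= t else 0

def build_meta_dataset(subsets, algorithms, validation):
    # Ordered set C of c = m * k component classifiers.
    C = [algo(sub) for sub in subsets for algo in algorithms]
    # Each validation sample becomes one row of D': its features are the
    # c component predictions, its label carries over.
    d_prime = [([clf(x) for clf in C], y) for x, y in validation]
    return C, d_prime

subsets = [[(1, 0), (2, 0), (8, 1), (9, 1)],
           [(0, 0), (3, 0), (7, 1), (10, 1)]]
validation = [(1, 0), (9, 1)]
C, d_prime = build_meta_dataset(subsets, [mean_rule, median_rule], validation)
print(len(C), d_prime)   # → 4 [([0, 0, 0, 0], 0), ([1, 1, 1, 1], 1)]
```

Each row of d_prime pairs the c component predictions with the carried-over label, which is the shape the integrated classifier is trained on.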
In one embodiment of the present invention, training on the outputs of the component classifiers, taken as features, to generate the integrated classifier specifically includes:
obtaining the component classifiers and the new data set D';
performing m ten-fold cross-validations on the new data set D', one with each of m validation algorithms;
recording the m AUC values obtained by the m algorithms in the ten-fold cross-validations;
comparing the m AUC values to obtain the maximum AUC value and the algorithm that achieves it;
setting the classifier corresponding to the maximum-AUC algorithm as the integrated classifier.
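The selection step can be illustrated with the rank-sum (Mann-Whitney) form of the AUC; for brevity this sketch compares two candidate score vectors directly instead of running the full ten-fold cross-validation, and the algorithm names are placeholders.

```python
def auc(labels, scores):
    """AUC via the rank-sum formula: the fraction of positive/negative
    pairs ranked correctly, counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels   = [0, 0, 1, 1]
scores_a = [0.1, 0.4, 0.35, 0.8]   # one inverted pair
scores_b = [0.1, 0.2, 0.6, 0.9]    # perfect ranking
best = max([("algo_a", scores_a), ("algo_b", scores_b)],
           key=lambda kv: auc(labels, kv[1]))[0]
print(auc(labels, scores_a), auc(labels, scores_b), best)   # → 0.75 1.0 algo_b
```

In the method proper, the candidate scoring the highest cross-validated AUC on D' becomes the integrated classifier.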
In one embodiment of the present invention, obtaining the test sample, classifying it with each component classifier in turn to obtain component classification results, composing a new test sample whose features are those results, and classifying the new test sample with the integrated classifier to obtain the final classification result specifically includes:
obtaining the test sample x, the set C containing all component classifiers, and the integrated classifier;
classifying the test sample x with each component classifier in the set C in turn to obtain the component classification results;
taking all component classification results in order as features to compose a sample set T;
classifying the sample set T with the integrated classifier to obtain the final classification result;
outputting the final classification result.
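The test-time steps reduce to a short pipeline: the component predictions form the feature vector T, and the integrated classifier decides on T. The fixed rules below are placeholders for trained classifiers, assumed here only for illustration.

```python
def classify(x, C, meta):
    """Component predictions become the feature sample T, and the
    integrated (meta) classifier decides on T."""
    T = [clf(x) for clf in C]
    return meta(T)

# Illustrative components: three threshold rules plus a majority-vote meta rule.
C = [lambda x: int(x > 2), lambda x: int(x > 5), lambda x: int(x > 8)]
meta = lambda T: int(sum(T) >= 2)   # at least two components say "minority"
print(classify(7, C, meta), classify(1, C, meta))   # → 1 0
```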
In one embodiment of the present invention, a classification system based on Stacking and flipped random down-sampling includes: a data processing module, a component classifier training module, an integrated classifier training module and a judgment module. The data processing module obtains the original training set and divides it into two or more different training subsets. The component classifier training module establishes two or more different component classifiers for each training subset, and is connected to the data processing module. The integrated classifier training module trains on the outputs of the component classifiers, taken as features, to generate the integrated classifier, and is connected to the component classifier training module. The judgment module obtains the test sample, classifies it with each component classifier in turn to obtain component classification results, composes a new test sample whose features are those results, and classifies the new test sample with the integrated classifier to obtain the final classification result; the judgment module is connected to the component classifier training module and to the integrated classifier training module.
In one embodiment of the present invention, the data processing module includes: an original-set acquisition module, a majority/minority classification module, a count initialization module, a sampling-count judgment module, a cyclic sampling module, a training-subset construction module and a division termination module. The original-set acquisition module receives the original training set D. The majority/minority classification module divides D into a majority-class sample set A and a minority-class sample set B, and is connected to the original-set acquisition module. The count initialization module initializes the sampling count i and the down-sampling count k. The sampling-count judgment module judges whether the sampling count i is less than the down-sampling count k, and is connected to the count initialization module. The cyclic sampling module, when the sampling count is less than k, repeatedly extracts majority-class samples from A without replacement, where the number extracted each time is n = ceil(|B|²/|A|); it is connected to the sampling-count judgment module. The training-subset construction module composes a training subset D_i from the n majority-class samples and all minority-class samples, and is connected to the cyclic sampling module. The division termination module, when the sampling count is not less than k, ends the division of the original training set, and is connected to the sampling-count judgment module.
In one embodiment of the present invention, the component classifier training module includes: a set initialization module, a subset-count judgment module, a subset cyclic training module, a component training termination module, a classifier collection module, a validation-set acquisition module, a validation-result module and a new-data-set generation module. The set initialization module initializes the ordered set C. The subset-count judgment module judges whether the k training subsets D_i have all been processed, and is connected to the set initialization module. The subset cyclic training module, when the k training subsets D_i have not all been processed, trains on each training subset D_i to obtain m different component classifiers, looping over the k training subsets; it is connected to the subset-count judgment module. The component training termination module, when the k training subsets D_i have all been processed, ends the training of component classifiers, and is connected to the subset-count judgment module. The classifier collection module places the component classifiers into the ordered set C in order, where |C| = c, c = m × k; it is connected to the subset cyclic training module. The validation-set acquisition module obtains the validation set V, and is connected to the classifier collection module. The validation-result module classifies the validation samples in V with each component classifier to obtain validation classification results, and is connected to the validation-set acquisition module. The new-data-set generation module takes the validation classification results as features to generate the new data set D', and is connected to the validation-result module.
In one embodiment of the present invention, the integrated classifier training module includes: an integrated-training input module, a cross-validation module, an AUC recording module, an AUC comparison module and an integrated-classifier setting module. The integrated-training input module obtains the component classifiers and the new data set D'. The cross-validation module performs m ten-fold cross-validations on D', one with each of the m validation algorithms, and is connected to the integrated-training input module. The AUC recording module records the m AUC values obtained in the ten-fold cross-validations, and is connected to the cross-validation module. The AUC comparison module compares the m AUC values to obtain the maximum AUC value and the algorithm that achieves it, and is connected to the AUC recording module. The integrated-classifier setting module sets the classifier corresponding to the maximum-AUC algorithm as the integrated classifier, and is connected to the AUC comparison module.
In one embodiment of the present invention, the judgment module includes: a judgment input module, a component classification-result module, a component sample-set module, a final classification-result module and a judgment-result output module. The judgment input module obtains the test sample x, the set C containing all component classifiers, and the integrated classifier. The component classification-result module classifies the test sample x with each component classifier in C in turn to obtain the component classification results, and is connected to the judgment input module. The component sample-set module takes all component classification results in order as features to compose the sample set T, and is connected to the component classification-result module. The final classification-result module classifies the sample set T with the integrated classifier to obtain the final classification result, and is connected to the component sample-set module. The judgment-result output module outputs the final classification result, and is connected to the final classification-result module.
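Under the assumption that each of the four modules can be modeled as a function, their wiring can be sketched as one pipeline class; all names are illustrative, and the stand-in functions below exist only to keep the example runnable.

```python
class StackingFlippedPipeline:
    """Mirrors the module layout: data processing -> component classifier
    training -> integrated classifier training -> judgment."""

    def __init__(self, split_fn, train_fn, select_fn):
        self.split_fn = split_fn        # data processing module
        self.train_fn = train_fn        # component classifier training module
        self.select_fn = select_fn      # integrated classifier training module
        self.C, self.meta = [], None

    def fit(self, train, validation):
        subsets = self.split_fn(train)
        self.C, d_prime = self.train_fn(subsets, validation)
        self.meta = self.select_fn(d_prime)
        return self

    def predict(self, x):               # judgment module
        return self.meta([clf(x) for clf in self.C])

pipe = StackingFlippedPipeline(
    split_fn=lambda train: [train],                        # single subset
    train_fn=lambda subs, val: ([lambda x: int(x > 0)], None),
    select_fn=lambda d_prime: (lambda T: T[0]),            # pass-through meta
).fit(train=[(1, 1), (-1, 0)], validation=[])
print(pipe.predict(3), pipe.predict(-3))   # → 1 0
```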
In one embodiment of the present invention, a computer-readable storage medium stores a computer program which, when executed by a processor, implements the classification method based on Stacking and flipped random down-sampling.
In one embodiment of the present invention, a classification device based on Stacking and flipped random down-sampling includes a processor and a memory; the memory stores a computer program, and the processor executes the computer program stored in the memory, so that the device performs the classification method based on Stacking and flipped random down-sampling.
As described above, the present invention can effectively overcome the aforementioned disadvantages by using flipped random down-sampling and Stacking ensemble learning. Flipped random down-sampling, i.e., flipping the ratio of minority-class to majority-class samples, makes up for the defect of resampling techniques that the minority class cannot be accurately identified; Stacking ensemble learning overcomes the disadvantage of classifier fusion techniques that strong and weak classifiers cannot be distinguished. Aimed at the imbalanced data problem, the present invention proposes an imbalanced data classification technique based on Stacking and flipped random down-sampling, which improves the classification effect on imbalanced data. The rationale is that a classifier trained on imbalanced data is biased toward the majority class, so a minority-class sample is far more likely to be misclassified than a majority-class one; letting minority-class samples outnumber majority-class samples improves the recognition rate of the minority class, and combining multiple classifier models through Stacking ensemble learning improves overall classifier performance.
In summary, the present invention provides a classification method, system, medium and device based on Stacking and flipped random down-sampling, solving the prior art's technical problems of low classification precision on imbalanced data, inability to recognize the minority class, poor representation of distribution characteristics and low discrimination among classifiers.
Description of the drawings
Fig. 1 is a schematic diagram of the steps of a classification method based on Stacking and flipped random down-sampling according to the present invention.
Fig. 2 is a detailed flowchart of step S1 of Fig. 1 in one embodiment.
Fig. 3 is a detailed flowchart of step S2 of Fig. 1 in one embodiment.
Fig. 4 is a detailed flowchart of step S3 of Fig. 1 in one embodiment.
Fig. 5 is a detailed flowchart of step S4 of Fig. 1 in one embodiment.
Fig. 6 is a module diagram of a classification system based on Stacking and flipped random down-sampling according to the present invention.
Fig. 7 is a detailed module diagram of the data processing module 11 of Fig. 6 in one embodiment.
Fig. 8 is a detailed module diagram of the component classifier training module 12 of Fig. 6 in one embodiment.
Fig. 9 is a detailed module diagram of the integrated classifier training module 13 of Fig. 6 in one embodiment.
Fig. 10 is a detailed module diagram of the judgment module 14 of Fig. 6 in one embodiment.
Description of reference numerals
1 Classification system based on Stacking and flipped random down-sampling
11 Data processing module
12 Component classifier training module
13 Integrated classifier training module
14 Judgment module
111 Original-set acquisition module
112 Majority/minority classification module
113 Count initialization module
114 Sampling-count judgment module
115 Cyclic sampling module
116 Training-subset construction module
117 Division termination module
121 Set initialization module
122 Subset-count judgment module
123 Subset cyclic training module
124 Component training termination module
125 Classifier collection module
126 Validation-set acquisition module
127 Validation-result module
128 New-data-set generation module
131 Integrated-training input module
132 Cross-validation module
133 AUC recording module
134 AUC comparison module
135 Integrated-classifier setting module
141 Judgment input module
142 Component classification-result module
143 Component sample-set module
144 Final classification-result module
145 Judgment-result output module
Description of step numbers
S1~S4 Method steps
S11~S17 Method steps
S21~S28 Method steps
S31~S35 Method steps
S41~S45 Method steps
Detailed description of the embodiments
The embodiments of the present invention are described below through particular specific examples; those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification.
Please refer to Fig. 1 to Fig. 10. It should be noted that the structures depicted in the drawings of this specification serve only to illustrate the disclosed content so that those skilled in the art can understand and read it; they are not intended to limit the conditions under which the invention may be implemented and thus carry no essential technical meaning. Any structural modification, change of proportion or adjustment of size that does not affect the effect and purpose the invention can achieve still falls within the range covered by the disclosed technical content. Meanwhile, terms such as "upper", "lower", "left", "right", "middle" and "a" cited in this specification are merely for convenience of description, not to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantial change to the technical content, are also considered within the implementable scope of the invention.
Referring to Fig. 1, a schematic diagram of the steps of a classification method based on Stacking and flipped random down-sampling according to the present invention; as shown in Fig. 1, the classification method based on Stacking and flipped random down-sampling includes:
S1, obtaining the original training set and dividing it into two or more different training subsets. This handles the imbalance of the raw data set: the original training set is divided into multiple different training subsets, in each of which the number of minority-class samples exceeds the number of majority-class samples;
S2, establishing two or more different component classifiers for each training subset: for each training subset produced by the data processing module, a certain number of component classifiers are trained;
S3, training on the outputs of the component classifiers, taken as features, to generate the integrated classifier. When a new sample needs to be classified, it is first classified by each component classifier in turn, the classification results are taken as features to compose a new sample, and that sample is then classified by the integrated classifier. Stacking ensemble learning combines all component classifiers to generate the final integrated classifier: when a new sample x needs to be classified, all component classifiers give their own judgments, and these judgments are fed into the integrated classifier to generate the final classification result;
S4, obtaining the test sample, classifying it with each component classifier in turn to obtain component classification results, composing a new test sample whose features are those results, and classifying the new test sample with the integrated classifier to obtain the final classification result. When a test sample needs to be classified, only the judgment module works and the other three modules temporarily stop. This technique is applicable to various imbalanced data sets: the input data are the original training set D, the validation set V and the test sample x, and the output is the classification result of the test sample x.
Referring to Fig. 2, a detailed flowchart of step S1 of Fig. 1 in one embodiment; as shown in Fig. 2, step S1, obtaining the original training set and dividing it into two or more different training subsets, specifically includes:
S11, receiving the original training set D; the imbalance of the training set D is handled and multiple sub-data sets are generated. Suppose the data set is as shown in Table 1, which records the postoperative survival of breast cancer patients;
Table 1: Postoperative survival data set of breast cancer patients

Serial number  Age  Year of operation  Positive nodes detected  Label
1              34   59                 0                        Did not survive
2              30   64                 1                        Survived
...            ...  ...                ...                      ...
306            77   65                 3                        Survived
S12, dividing the original training set D into a majority-class sample set A and a minority-class sample set B. The data set has 306 samples in total, of which 81 are minority-class samples (did not survive the operation) and 225 are majority-class samples (survived the operation), with 3 attributes. The specific division is: let A and B be respectively the majority-class and minority-class sample sets of the original training set D, with |A| and |B| denoting the numbers of samples in A and B;
S13, initializing the sampling count i and the down-sampling count k: the sampling count is initialized to 0, and the number of extractions is k (k = 1.5 × ceil(|A|/n));
S14, judging whether the sampling count i is less than the down-sampling count k; A is subjected to flipped random down-sampling, i.e., n samples are randomly drawn from A each time;
S15, if so, repeatedly extracting majority-class samples from the majority-class sample set A without replacement, where the number of majority-class samples extracted each time is n = ceil(|B|²/|A|). Here |A| = 158 and |B| = 56, so n = ceil(56²/158) = 20 and k = 1.5 × ceil(158/20) = 12. Each time, 20 samples are drawn from A without replacement (within a single draw; each draw starts again from the original A) and combined with the 56 samples of B to form a new data subset, so each data subset contains 76 samples;
S16, composing a training subset D_i from the n majority-class samples (n = ceil(|B|²/|A|), where ceil denotes rounding up, likewise below) and all minority-class samples B;
S17, if not, ending the division of the original training set.
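The arithmetic of this worked example can be checked directly (a sketch; variable names are illustrative):

```python
import math

A_size, B_size = 158, 56              # majority / minority counts from the example
n = math.ceil(B_size ** 2 / A_size)   # majority samples drawn per subset
k = int(1.5 * math.ceil(A_size / n))  # number of training subsets
print(n, k, n + B_size)               # → 20 12 76
```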
Referring to Fig. 3, which shows a detailed flowchart of step S2 of Fig. 2 in one embodiment; as shown in Fig. 3, step S2, establishing two or more different component classifiers for each training subset, specifically includes:
S21: initialize an ordered set C; the ordered set C is used to store the trained component classifiers;
S22: judge whether the k training subsets Di have been obtained; optionally, in the present embodiment, c = 5 × 12 = 60, i.e. the set C contains 60 component classifiers in total; after all component classifiers have predicted the samples in the validation set, a new data set D' similar to Table 3 is obtained, containing 91 samples, 60 attributes and 1 label;
S23: if so, train each training subset Di to obtain m different component classifiers, cyclically training the k training subsets Di; in this example m = 5 is chosen, i.e. each data subset is learned separately with 5 classification algorithms: decision tree, naive Bayes, Bayesian network, logistic regression and support vector machine;
S24: if not, end the training of the component classifiers;
S25: put the component classifiers into the ordered set C in sequence, where the number of elements of the ordered set C is |C| = c, c = m × k;
S26: obtain a validation set V; several different component classifiers have been established for each training subset, and the output of each component classifier on the validation set is then used in turn as features to train a new integrated classifier;
S27: classify the validation samples in the validation set V with each component classifier to obtain validation classification results; every trained component classifier classifies all samples in the validation set V;
S28: generate a new data set D' with the validation classification results as features; the generated new data set D' is shown in Table 3.
Table 3. The generated new data set D'
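Steps S21-S28 can be illustrated with scikit-learn; this is a hedged sketch on synthetic toy data, scaled down to k = 2 subsets and m = 3 algorithms (the variable names and data are assumptions, not the patent's):

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; the patent uses k = 12 subsets and m = 5 algorithms.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)
subsets = [(X_train[:70], y_train[:70]), (X_train[70:], y_train[70:])]
algorithms = [DecisionTreeClassifier(random_state=0), GaussianNB(),
              LogisticRegression(max_iter=1000)]

C = []                                  # ordered set C of component classifiers
for X_sub, y_sub in subsets:            # S23: train m classifiers per subset
    for algo in algorithms:
        C.append(clone(algo).fit(X_sub, y_sub))

# S27-S28: each classifier's predictions on the validation set V
# become one feature column of the new data set D'.
D_prime = np.column_stack([clf.predict(X_val) for clf in C])
print(D_prime.shape)                    # one row per validation sample, m*k columns
```

In the patent's own numbers the resulting D' has 91 rows (the validation samples) and 60 prediction columns plus the true label; here it is 60 rows by 6 columns.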
Referring to Fig. 4, which shows a detailed flowchart of step S3 of Fig. 1 in one embodiment; as shown in Fig. 4, step S3, training with the output of each component classifier as features and generating the integrated classifier, specifically includes:
S31: obtain the component classifiers and the new data set D';
S32: perform m ten-fold cross-validations on the new data set D' with the m validation algorithms; the integrated classifier training module of this technique performs a 10-fold cross-validation on the obtained new data set D' with each of the above m different classification algorithms;
S33: record the m AUC values obtained by the m algorithms in the ten-fold cross-validations; in this example, the 10-fold cross-validation on the new data set D' is carried out with the 5 classification algorithms (decision tree, naive Bayes, Bayesian network, logistic regression, support vector machine);
S34: compare the m AUC values to obtain the maximum AUC value and the algorithm achieving it; in this example, the classification algorithm achieving the maximum AUC value is naive Bayes;
S35: set the classifier corresponding to the maximum-AUC algorithm as the integrated classifier; by combining the trained component classifiers through the Stacking technique, the integrated classifier meta is obtained.
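Steps S31-S35 amount to model selection by mean AUC under 10-fold cross-validation; a minimal scikit-learn sketch on a synthetic stand-in for D' (three of the five algorithms shown; all data here is hypothetical):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the stacked data set D' (rows: validation samples,
# columns: component-classifier outputs) and its labels.
D_prime, labels = make_classification(n_samples=120, n_features=6,
                                      random_state=1)

candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# S32-S33: one 10-fold cross-validation per algorithm, recording mean AUC.
auc = {name: cross_val_score(model, D_prime, labels,
                             cv=10, scoring="roc_auc").mean()
       for name, model in candidates.items()}

# S34-S35: the algorithm with the maximum AUC becomes the meta classifier.
best = max(auc, key=auc.get)
meta = candidates[best].fit(D_prime, labels)
print(best, round(auc[best], 3))
```

In the patent's experiment the winner of this comparison is naive Bayes; on other data a different candidate may win.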
Referring to Fig. 5, which shows a detailed flowchart of step S4 of Fig. 1 in one embodiment; as shown in Fig. 5, step S4, obtaining a test sample, classifying the test sample in turn with each component classifier to obtain component classification results, composing a new test sample with the component classification results as features, and classifying the new test sample with the integrated classifier to obtain the final classification result, specifically includes:
S41: obtain a test sample x, the set C containing all component classifiers, and the integrated classifier;
S42: classify the test sample x in turn with each component classifier in the set C to obtain component classification results; first, the 60 component classifiers in the set C obtained by the component classifier training module classify the test sample x in turn;
S43: use all component classification results in turn as features to form a sample set T;
S44: classify the sample set T with the integrated classifier to obtain the final classification result; in the present embodiment, after all component classifiers have classified, a new sample T similar to Table 4 is obtained;
Table 4. The new sample T
S45: output the final classification result; the naive Bayes classifier meta obtained by the integrated classifier training module then classifies the new sample T, and the classification result obtained is 0 (0 indicates non-survival, 1 indicates survival), i.e. the final classification state of the test sample x is that the patient did not survive.
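The prediction phase of steps S41-S45 can be sketched as follows; a toy illustration with two component classifiers, where for brevity the meta learner is fit on the components' training-set outputs rather than on a separate validation set as the patent specifies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=100, n_features=4, random_state=2)
components = [DecisionTreeClassifier(random_state=0).fit(X, y),
              GaussianNB().fit(X, y)]
# NOTE (simplification): meta is fit on training-set component outputs here;
# the patent fits it on the components' validation-set outputs.
meta = GaussianNB().fit(
    np.column_stack([c.predict(X) for c in components]), y)

def classify(x):
    # S42-S43: each component classifier votes; the votes form sample T.
    T = np.array([[c.predict(x.reshape(1, -1))[0] for c in components]])
    # S44-S45: the integrated classifier labels T (e.g. 0 = did not survive).
    return int(meta.predict(T)[0])

prediction = classify(X[0])
print(prediction)
```

The returned label plays the role of the final classification result of step S45.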
Referring to Fig. 6, which shows a module diagram of a classification system based on Stacking and flipped random down-sampling according to the present invention; as shown in Fig. 6, the classification system 1 based on Stacking and flipped random down-sampling includes: a data processing module 11, a component classifier training module 12, an integrated classifier training module 13 and a judgment module 14. The data processing module 11 is used to obtain an original training set and divide it into two or more different training subsets, handling the imbalance of the original data set; by dividing the original training set into multiple different training subsets, the number of minority-class samples in each training subset exceeds the number of majority-class samples. The component classifier training module 12 is used to establish two or more different component classifiers for each training subset, training a certain number of component classifiers on each training subset divided by the data processing module; the component classifier training module 12 is connected with the data processing module 11. The integrated classifier training module 13 is used to train with the output of each component classifier as features and generate the integrated classifier: when a new sample needs to be classified, it is first classified by each component classifier in turn, the classification results are treated as features composing a new sample, and this sample is then classified by the integrated classifier; the Stacking ensemble-learning technique combines all component classifiers to generate the final integrated classifier. When a new sample x needs to be classified, every component classifier gives its own judgment as a result, and these judgments are fed into the integrated classifier to produce the final classification result; the integrated classifier training module 13 is connected with the component classifier training module 12. The judgment module 14 is used to obtain a test sample, classify the test sample in turn with each component classifier to obtain component classification results, compose a new test sample with the component classification results as features, and classify the new test sample with the integrated classifier to obtain the final classification result; the judgment module is connected with the component classifier training module. When a test sample needs to be classified, the judgment module works while the other three modules temporarily stop working. This technique applies to various imbalanced data sets: the input data are the original training set D, the validation set V and the test sample x, and the output is the classification result of the test sample x; the judgment module 14 is connected with the integrated classifier training module 13.
Referring to Fig. 7, which shows a detailed module diagram of the data processing module 11 of Fig. 6 in one embodiment; as shown in Fig. 7, the data processing module 11 includes: an original set acquisition module 111, a majority-minority sorting module 112, a number initialization module 113, a sampling-count judgment module 114, a cyclic sampling module 115, a training subset construction module 116 and a division termination module 117. The original set acquisition module 111 is used to receive the original training set D, handle the imbalance of the training set D and generate multiple sub-data-sets. The majority-minority sorting module 112 is used to divide the original training set D into a majority-class sample set A and a minority-class sample set B; the data set has 306 samples in total, of which 81 are minority-class samples (did not survive after surgery) and 225 are majority-class samples (survived after surgery), and the data set contains 3 attributes. The specific division method is: let A and B be the majority-class and minority-class sample sets of the original training set D respectively, and let |A| and |B| denote the numbers of samples in A and B; the majority-minority sorting module 112 is connected with the original set acquisition module 111. The number initialization module 113 is used to initialize the sampling count i and the down-sampling count k; the initial sampling count is 0, and the number of draws is k (k = 1.5 × ceil(|A|/n)). The sampling-count judgment module 114 is used to judge whether the sampling count i is less than the down-sampling count k; the sampling-count judgment module 114 is connected with the number initialization module 113; A is subjected to flipped random down-sampling, i.e. n samples are randomly drawn from A each time. The cyclic sampling module 115 is used, when the sampling count is less than the down-sampling count k, to repeatedly extract majority-class samples from the majority-class sample set A without replacement within each draw, where the number n of majority-class samples extracted each time is n = ceil(|B|²/|A|); here |A| = 158, |B| = 56, n = ceil(56²/158) = 20 and k = 1.5 × ceil(158/20) = 12. Each draw takes 20 samples from A without replacement (samples are not replaced within a single draw, but every draw is made from the original A) and combines them with the 56 samples of B to form a new data subset, so each data subset contains 76 samples; the cyclic sampling module 115 is connected with the sampling-count judgment module 114. The training subset construction module 116 is used to form a training subset Di from the n majority-class samples together with all minority-class samples B (n = ceil(|B|²/|A|), where ceil denotes rounding up, likewise below); the training subset construction module 116 is connected with the cyclic sampling module 115. The division termination module 117 is used, when the sampling count is not less than the down-sampling count k, to end the division of the original training set; the division termination module 117 is connected with the sampling-count judgment module 114.
Referring to Fig. 8, which shows a detailed module diagram of the component classifier training module 12 of Fig. 6 in one embodiment; as shown in Fig. 8, the component classifier training module 12 includes: a set initialization module 121, a subset-count judgment module 122, a subset cyclic training module 123, a component training termination module 124, a classifier collection module 125, a validation set acquisition module 126, a validation result module 127 and a new data set generation module 128. The set initialization module 121 is used to initialize an ordered set C; the ordered set C stores the trained component classifiers. The subset-count judgment module 122 is used to judge whether the k training subsets Di have been obtained; the subset-count judgment module 122 is connected with the set initialization module 121. Optionally, in the present embodiment, c = 5 × 12 = 60, i.e. the set C contains 60 component classifiers in total; after all component classifiers have predicted the samples in the validation set, a new data set D' similar to Table 3 is obtained, containing 91 samples, 60 attributes and 1 label. The subset cyclic training module 123 is used, when the k training subsets Di have not yet been obtained, to train each training subset Di to obtain m different component classifiers, cyclically training the k training subsets Di; the subset cyclic training module 123 is connected with the subset-count judgment module 122. In this example m = 5 is chosen, i.e. each data subset is learned separately with 5 classification algorithms: decision tree, naive Bayes, Bayesian network, logistic regression and support vector machine. The component training termination module 124 is used, when the k training subsets Di have been obtained, to end the training of the component classifiers; the component training termination module 124 is connected with the subset-count judgment module 122. The classifier collection module 125 is used to put the component classifiers into the ordered set C in sequence, where the number of elements of the ordered set C is |C| = c, c = m × k; the classifier collection module 125 is connected with the subset cyclic training module 123. The validation set acquisition module 126 is used to obtain a validation set V; several different component classifiers are established for each training subset, and the output of each component classifier on the validation set is then used in turn as features to train a new integrated classifier; the validation set acquisition module 126 is connected with the classifier collection module 125. The validation result module 127 is used to classify the validation samples in the validation set V with each component classifier to obtain validation classification results; every trained component classifier classifies all samples in the validation set V; the validation result module 127 is connected with the validation set acquisition module 126. The new data set generation module 128 is used to generate a new data set D' with the validation classification results as features; the new data set generation module 128 is connected with the validation result module 127.
Referring to Fig. 9, which shows a detailed module diagram of the integrated classifier training module 13 of Fig. 6 in one embodiment; as shown in Fig. 9, the integrated classifier training module 13 includes: an integrated-training input module 131, a cross-validation module 132, an AUC value recording module 133, an AUC value comparison module 134 and an integrated classifier setting module 135. The integrated-training input module 131 is used to obtain the component classifiers and the new data set D'. The cross-validation module 132 is used to perform m ten-fold cross-validations on the new data set D' with the m validation algorithms; the integrated classifier training module of this technique performs a 10-fold cross-validation on the obtained new data set D' with each of the above m different classification algorithms; the cross-validation module 132 is connected with the integrated-training input module 131. The AUC value recording module 133 is used to record the m AUC values obtained by the m algorithms in the ten-fold cross-validations; in this example, the 10-fold cross-validation on the new data set D' is carried out with the 5 classification algorithms (decision tree, naive Bayes, Bayesian network, logistic regression, support vector machine); the AUC value recording module 133 is connected with the cross-validation module 132. The AUC value comparison module 134 is used to compare the m AUC values to obtain the maximum AUC value and the algorithm achieving it; in this example, the classification algorithm achieving the maximum AUC value is naive Bayes; the AUC value comparison module 134 is connected with the AUC value recording module 133. The integrated classifier setting module 135 is used to set the classifier corresponding to the maximum-AUC algorithm as the integrated classifier; by combining the trained component classifiers through the Stacking technique, the integrated classifier meta is obtained; the integrated classifier setting module 135 is connected with the AUC value comparison module 134.
Referring to Fig. 10, which shows a detailed module diagram of the judgment module 14 of Fig. 6 in one embodiment; as shown in Fig. 10, the judgment module 14 includes: a judgment input module 141, a classification result establishment module 142, a component sample set module 143, a final classification result module 144 and a classification judgment result output module 145. The judgment input module 141 is used to obtain a test sample x, the set C containing all component classifiers, and the integrated classifier. The classification result establishment module 142 is used to classify the test sample x in turn with each component classifier in the set C to obtain component classification results; first, the 60 component classifiers in the set C obtained by the component classifier training module classify the test sample x in turn; the classification result establishment module 142 is connected with the judgment input module 141. The component sample set module 143 is used to use all component classification results in turn as features to form a sample set T; the component sample set module 143 is connected with the classification result establishment module 142. The final classification result module 144 is used to classify the sample set T with the integrated classifier to obtain the final classification result; the final classification result module 144 is connected with the component sample set module 143. The classification judgment result output module 145 is used to output the final classification result: the naive Bayes classifier meta obtained by the integrated classifier training module then classifies the new sample T, and the classification result obtained is 0 (0 indicates non-survival, 1 indicates survival), i.e. the final classification state of the test sample x is that the patient did not survive; the classification judgment result output module 145 is connected with the final classification result module 144.
A computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the classification method based on Stacking and flipped random down-sampling. A person of ordinary skill in the art can understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to a computer program. The aforementioned computer program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disk or optical disk.
A classification device based on Stacking and flipped random down-sampling includes: a processor and a memory. The memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the classification device based on Stacking and flipped random down-sampling executes the classification method based on Stacking and flipped random down-sampling. The memory may include random access memory (RAM) and may also include non-volatile memory, e.g. at least one magnetic disk memory. The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In conclusion provided by the invention a kind of based on Stacking and overturning random down-sampled sorting technique, system, Jie Matter and equipment, have the advantages that:It is a kind of random down-sampled based on Stacking and overturning present invention aims at proposing Sorting technique is handled data using the random down-sampled technology of overturning, and making up traditional sampling technology, to be unable to Accurate Prediction few Several classes of defects carries out secondary study to component classification device result using Stacking integrated studies technology, distinguishes different classifications Device performance is strong and weak, improves classification accuracy, has good scalability, is suitable for a variety of different imbalanced data sets, this Invention provides a kind of trading activity profile structure and authentication method, system, medium and equipment, solve it is of the existing technology not Equalization data nicety of grading is poor, None- identified minority class, distribution characteristics performance validity are low and the lower skill of grader discrimination Art problem has very high commercial value and practicability.

Claims (12)

1. A classification method based on Stacking and flipped random down-sampling, characterized by comprising:
obtaining an original training set and dividing the original training set into two or more different training subsets;
establishing two or more different component classifiers for each training subset;
training with the output of each component classifier as features to generate an integrated classifier;
obtaining a test sample, classifying the test sample in turn with each component classifier to obtain component classification results, composing a new test sample with the component classification results as features, and classifying the new test sample with the integrated classifier to obtain a final classification result.
2. The method according to claim 1, characterized in that obtaining the original training set and dividing the original training set into two or more different training subsets specifically comprises:
receiving the original training set D;
dividing the original training set D into a majority-class sample set A and a minority-class sample set B;
initializing a sampling count i and a down-sampling count k;
judging whether the sampling count i is less than the down-sampling count k;
if so, repeatedly extracting majority-class samples from the majority-class sample set A without replacement, wherein the number n of majority-class samples extracted each time is n = ceil(|B|²/|A|);
forming a training subset Di from the n majority-class samples together with all minority-class samples;
if not, ending the division of the original training set.
3. The method according to claim 2, characterized in that establishing two or more different component classifiers for each training subset specifically comprises:
initializing an ordered set C;
judging whether k training subsets Di have been obtained;
if so, training each training subset Di to obtain m different component classifiers, cyclically training the k training subsets Di;
if not, ending the component classifier training;
putting the component classifiers into the ordered set C in sequence, wherein the number of elements of the ordered set C is |C| = c, c = m × k;
obtaining a validation set V;
classifying the validation samples in the validation set V with each component classifier to obtain validation classification results;
generating a new data set D' with the validation classification results as features.
4. The method according to claim 3, characterized in that training with the output of each component classifier as features to generate the integrated classifier specifically comprises:
obtaining the component classifiers and the new data set D';
performing m ten-fold cross-validations on the new data set D' with m validation algorithms;
recording the m AUC values obtained by the m algorithms in the ten-fold cross-validations;
comparing the m AUC values to obtain the maximum AUC value and the algorithm achieving the maximum AUC value;
setting the component classifier corresponding to the maximum-AUC algorithm as the integrated classifier.
5. The method according to claim 1, characterized in that obtaining the test sample, classifying the test sample in turn with each component classifier to obtain component classification results, composing a new test sample with the component classification results as features, and classifying the new test sample with the integrated classifier to obtain the final classification result specifically comprises:
obtaining a test sample x, the set C containing all the component classifiers, and the integrated classifier;
classifying the test sample x in turn with each component classifier in the set C to obtain the component classification results;
using all the component classification results in turn as features to form a sample set T;
classifying the sample set T with the integrated classifier to obtain the final classification result;
outputting the final classification result.
6. A classification system based on Stacking and flipped random down-sampling, characterized by comprising: a data processing module, a component classifier training module, an integrated classifier training module and a judgment module;
the data processing module is used to obtain an original training set and divide the original training set into two or more different training subsets;
the component classifier training module is used to establish two or more different component classifiers for each training subset;
the integrated classifier training module is used to train with the output of each component classifier as features and generate an integrated classifier;
the judgment module is used to obtain a test sample, classify the test sample in turn with each component classifier to obtain component classification results, compose a new test sample with the component classification results as features, and classify the new test sample with the integrated classifier to obtain a final classification result.
7. The system according to claim 6, characterized in that the data processing module comprises: an original set acquisition module, a majority-minority sorting module, a number initialization module, a sampling-count judgment module, a cyclic sampling module, a training subset construction module and a division termination module;
the original set acquisition module is used to receive the original training set D;
the majority-minority sorting module is used to divide the original training set D into a majority-class sample set A and a minority-class sample set B;
the number initialization module is used to initialize a sampling count i and a down-sampling count k;
the sampling-count judgment module is used to judge whether the sampling count i is less than the down-sampling count k;
the cyclic sampling module is used, when the sampling count is less than the down-sampling count k, to repeatedly extract majority-class samples from the majority-class sample set A without replacement, wherein the number n of majority-class samples extracted each time is n = ceil(|B|²/|A|);
the training subset construction module is used to form a training subset Di from the n majority-class samples together with all minority-class samples;
the division termination module is used, when the sampling count is not less than the down-sampling count k, to end the division of the original training set.
8. The system according to claim 7, characterized in that the component classifier training module comprises: a set initialization module, a subset-count judgment module, a subset cyclic training module, a component training termination module, a classifier collection module, a validation set acquisition module, a validation result module and a new data set generation module;
the set initialization module is used to initialize an ordered set C;
the subset-count judgment module is used to judge whether k training subsets Di have been obtained;
the subset cyclic training module is used, when the k training subsets Di have not yet been obtained, to train each training subset Di to obtain m different component classifiers, cyclically training the k training subsets Di;
the component training termination module is used, when the k training subsets Di have been obtained, to end the component classifier training;
the classifier collection module is used to put the component classifiers into the ordered set C in sequence, wherein the number of elements of the ordered set C is |C| = c, c = m × k;
the validation set acquisition module is used to obtain a validation set V;
the validation result module is used to classify the validation samples in the validation set V with each component classifier to obtain validation classification results;
the new data set generation module is used to generate a new data set D' with the validation classification results as features.
9. The system according to claim 6, characterized in that the integrated classifier training module comprises: an integrated-training input module, a cross-validation module, an AUC value recording module, an AUC value comparison module and an integrated classifier setting module;
the integrated-training input module is used to obtain the component classifiers and the new data set D';
the cross-validation module is used to perform m ten-fold cross-validations on the new data set D' with m validation algorithms;
the AUC value recording module is used to record the m AUC values obtained by the m algorithms in the ten-fold cross-validations;
the AUC value comparison module is used to compare the m AUC values to obtain the maximum AUC value and the algorithm achieving the maximum AUC value;
the integrated classifier setting module is used to set the component classifier corresponding to the maximum-AUC algorithm as the integrated classifier.
10. The system according to claim 6, characterized in that the judgment module comprises: a judgment input module, a classification result establishment module, a component sample set module, a final classification result module and a classification judgment result output module;
the judgment input module is used to obtain a test sample x, the set C containing all the component classifiers and the integrated classifier;
the classification result establishment module is used to classify the test sample x in turn with each component classifier in the set C to obtain the component classification results;
the component sample set module is used to use all the component classification results in turn as features to form a sample set T;
the final classification result module is used to classify the sample set T with the integrated classifier to obtain the final classification result;
the classification judgment result output module is used to output the final classification result.
11. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the classification method based on Stacking and flipped random down-sampling according to any one of claims 1 to 5.
12. A classification device based on Stacking and flipped random down-sampling, characterized by comprising: a processor and a memory; the memory is used to store a computer program, and the processor is used to execute the computer program stored in the memory, so that the classification device based on Stacking and flipped random down-sampling executes the classification method based on Stacking and flipped random down-sampling according to any one of claims 1 to 5.
CN201810132427.5A 2018-02-08 2018-02-08 Based on Stacking and the random down-sampled sorting technique of overturning, system, medium and equipment Pending CN108416369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810132427.5A CN108416369A (en) 2018-02-08 2018-02-08 Based on Stacking and the random down-sampled sorting technique of overturning, system, medium and equipment

Publications (1)

Publication Number Publication Date
CN108416369A true CN108416369A (en) 2018-08-17

Family

ID=63128139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810132427.5A Pending CN108416369A (en) 2018-02-08 2018-02-08 Classification method, system, medium, and device based on Stacking and random flip down-sampling

Country Status (1)

Country Link
CN (1) CN108416369A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530373A (en) * 2013-10-15 2014-01-22 无锡清华信息科学与技术国家实验室物联网技术中心 Mobile application classification method under imbalanced perception data
CN107169518A (en) * 2017-05-18 2017-09-15 北京京东金融科技控股有限公司 Data classification method and device, electronic device, and computer-readable medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109753742A (en) * 2019-01-11 2019-05-14 哈尔滨工业大学(威海) Aeroengine fault diagnosis method and system based on imbalanced samples
CN110070111A (en) * 2019-03-29 2019-07-30 国电南瑞科技股份有限公司 Distribution line classification method and system
CN110009045A (en) * 2019-04-09 2019-07-12 中国联合网络通信集团有限公司 Internet-of-Things terminal recognition method and device
CN112116180A (en) * 2019-06-20 2020-12-22 中科聚信信息技术(北京)有限公司 Ensemble scoring model generation method and device, and electronic equipment
CN112434878A (en) * 2020-12-09 2021-03-02 同济大学 Seismic fluid prediction method based on cascade sample equalization
CN112434878B (en) * 2020-12-09 2022-09-20 同济大学 Seismic fluid prediction method based on cascade sample equalization

Similar Documents

Publication Publication Date Title
CN108416369A (en) Classification method, system, medium, and device based on Stacking and random flip down-sampling
CN109977028A (en) Software defect prediction method based on genetic algorithm and random forest
US20200167466A1 (en) Data type recognition, model training and risk recognition methods, apparatuses and devices
Gupta et al. Breast cancer histopathological image classification: is magnification important?
CN106973057B (en) Classification method suitable for intrusion detection
CN109299741B (en) Network attack type identification method based on multi-layer detection
CN103679160B (en) Face recognition method and device
CN108363810A (en) Text classification method and device
CN109816044A (en) Imbalanced learning method based on WGAN-GP and over-sampling
CN108304884A (en) Cost-sensitive stacking ensemble learning framework based on feature inverse mapping
CN107273500A (en) Text classifier generation method, text classification method, device, and computer equipment
CN108345904A (en) Ensemble learning algorithm for imbalanced data based on random sensitivity sampling
CN109918584A (en) Bitcoin exchange address recognition method, system, and device
CN101739555A (en) Method and system for fake-face detection, and method and system for fake-face model training
CN109491914A (en) High-impact defect report prediction method based on imbalanced learning strategy
CN110442568A (en) Method and device for acquiring field labels, storage medium, and electronic device
CN112633337A (en) Imbalanced data processing method based on clustering and boundary points
CN103871044B (en) Image signature generation method, and image authentication method and device
CN105975611A (en) Adaptive combined down-sampling reinforced learning machine
CN110135167A (en) Security level assessment method for edge computing terminals based on random forest
CN109934269A (en) Open-set recognition method and device for electromagnetic signals
CN113922985A (en) Network intrusion detection method and system based on ensemble learning
CN108764346A (en) Mixed-sampling ensemble classifier based on entropy
CN109800790A (en) Feature selection method for high-dimensional data
CN109347719A (en) Image spam filtering method based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180817