CN108416364A

CN108416364A - Integrated study data classification method is merged in subpackage

Info

Publication number: CN108416364A
Application number: CN201810097334.3A
Authority: CN
Inventors: 李勇明; 张�成; 王品; 李淋玉; 谭晓衡; 颜芳
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2018-01-31
Filing date: 2018-01-31
Publication date: 2018-08-17

Abstract

The present invention discloses a kind of subpackage fusion integrated study data classification method, includes the following steps：S1：It obtains data and forms training set and test set；S2：Training set is divided into K subset using Subspace partition module；S3：Corresponding a subset trains a sorter model；S4：Calculate the corresponding weight factor of each sorter model；S5：Testing data is inputted in each sorter model, the sample label of each sorter model output obtains classification results to the end with corresponding weight factor multiplication rear weight.Its effect is：Learn by subpackage and to sample in every sub-spaces, weaken influence of the overlapping area to sorter model in sample space, then each subset misclassification sample is enhanced, is transferred in next subset and learns again, increases sample utilization rate.The prediction of all subsets is weighted using more spatial weightings fusion integrated study module integrated, to further influence of the reduction overlapping area sample to sorter model, improves nicety of grading.

Description

Integrated study data classification method is merged in subpackage

Technical field

The invention belongs to the data Classification and Identification technologies in big data field, and in particular to a kind of subpackage fusion integrated study Data classification method.

Background technology

In big data field, data classification has a wide range of applications, such as medical diagnosis, Judgment by emotion, semantics recognition And image recognition etc..Common grader mainly uses：Random forest (RF) algorithm, K arest neighbors (KNN) algorithm, support to Amount machine (SVM) model, extreme learning machine (ELM) model etc..Although existing research is in feature extraction, feature learning and classification Device design etc. all makes great progress, but sample study is not often taken seriously.

By taking the diagnosis of Parkinson disease based on voice data as an example, in speech sample and preprocessing process, it may be adopted Collect equipment, the influence of the factors such as noise, there may be large error, shapes between finally obtained numerical value sample and actual sample At exceptional sample.Exceptional sample normally results in different classes of sample aliasing in sample space and forms overlapping region, overlapping region Sample may mislead sorter model.There is presently no results of study can prove grader mould of this part sample to foundation Type is advantageous or harmful.Or existing method delete this part sample or be regarded as it is important as other samples, and It does not account for weakening influence of these samples to grader by algorithm.

Invention content

Based on drawbacks described above, the present invention provides a kind of subpackage fusion integrated study data classification method, and it is right that this method passes through Sample space is learnt, influence of the reduction overlapping region sample to disaggregated model.First, by each sample in training set Centroid distance measure ratio is calculated as sample weights.Sample in training sample is arranged according to sample weights descending.So The training set sample of sequence is divided into several subsets successively afterwards.Secondly, using staying a cross validation (LOO) method pair one The wrong classification samples and error rate of subset are calculated, and go out a sub- sorter model using each trained.It is based on Sample weights in each subset calculate penalty factor, and the weight factor of subset is calculated by the error rate of the subset after LOO. In the learning process of all subsets, it is transmitted in next subset after the misclassification sample from previous subset is enhanced, Next subset is learnt again.Again, the power of each subset is calculated using the weight factor of subset and penalty factor Weight, and the test result of each sub-classifier is weighted using subset weight.By being carried out to sample in every sub-spaces Study, and each subset misclassification sample is enhanced, it is transferred in next subset and learns again, realized to existing with this There is making full use of for sample, increases sample utilization rate.Integrated study module is merged to the pre- of all subsets using more spatial weightings Survey is weighted integrated, to further influence of the reduction overlapping area sample to sorter model, improves nicety of grading.

To achieve the above object, specific technical solution of the present invention is as follows：

A kind of subpackage fusion integrated study data classification method, feature include the following steps：

S1：It obtains data and forms training set and test set；

S2：Training set is divided into K subset using Subspace partition module, K is the integer more than or equal to 2；

S3：Corresponding a subset trains a sorter model；

S4：Calculate the corresponding weight factor of each sorter model；

S5：Testing data is inputted in each sorter model, the sample label of each sorter model output with it is right The weight factor multiplication rear weight answered obtains classification results to the end.

Further, Subspace partition module described in step S2 uses power of the class heart distance metric ratio as sample Weight by calculating the class heart distance metric ratio of each sample in training set, and is lined up, finally successively by from big size order It is divided into K subset.

Further, step S3 carries out the training of sorter model using subspace sample delivery type training method, specifically For：

S31：Set subset T_kTrue tag representation be：Y_k=[y₁,y₂,…,y_j,…,y_s], it is tested using an intersection is stayed Demonstration is verified to obtain prediction label set to be L_k；

S32：Count subset T_kThe classification error rate error_rate of middle misclassification sample and subset,

S33：According toCalculate the grader mould of K trained The weight factor of type.

Further, classification error rate in step S32Its In：

w_jIndicate the class heart distance metric ratio of j-th of sample,Indicate subset T_kMiddle s The class heart distance metric ratio weighted value of a sample, weight (j) represent the initialization weight of j-th of sample；I(Y_k(j)≠L_k (j) indicate j-th of sample by misclassification.

Further, setting subset T_kIn by staying the misclassification sample set after a cross validation to beThe sample Next subset T is transmitted to after enhancing_k+1It is middle to be learnt again.

Further, the enhancement method of misclassification sample isWherein： It is the wrong original weight for dividing sample,It is the wrong weight for dividing sample after enhancing.

Further, it is weighted processing using more spatial weighting integrated study modules, specially：

S41：According toCalculate separately the penalty factor of K subset；

S42：According to weight_k=β_k·α_kCalculate the weight of each partitions of subsets device；

S43：Calculate the weight of the sample predictions label of each partitions of subsets device output.

The present invention remarkable result be：

The Subspace partition module that this method proposes is based on the concept wrapped in bagging algorithms, by training set according to one Fixed criterion is directly divided into several subsets, rather than random sampling is repeated as bagging algorithms, and duplicate removal is saved on algorithm Multiple sampling process reduces time complexity and weakens overlapping area according to sample distribution characteristic dividing subset in sample space Sample influence to other samples in training sorter model, between subspace sample delivery type training module with reference to The thought for the concept and grader weight calculation that sample enhances in Adaboost algorithm, to sample in every sub-spaces It practises, and each subset misclassification sample is enhanced, be transferred in next subset and learn again, realized to existing with this Sample makes full use of, and increases sample utilization rate；Finally integrated study module is merged to all subsets using more spatial weightings Prediction is weighted integrated, to further influence of the reduction overlapping area sample to sorter model, improves nicety of grading.

Description of the drawings

Fig. 1 is the control flow chart of the present invention；

Fig. 2 is data subpackage flow chart in Subspace partition module；

Fig. 3 is the class heart apart from schematic diagram calculation；

Fig. 4 is subspace sample delivery type training flow chart；

Fig. 5 is the flow chart of more spatial weighting integrated studies；

The classification accuracy average result of randomly drawing sample when Fig. 6 is different subsets number；

Fig. 7 show each subset weight and test set prediction result under different situations；

Fig. 8 is the impact of performance figure of algorithms of different in specific embodiment.

Specific implementation mode

In order to keep the technical problem to be solved in the present invention, technical solution and advantage clearer, below in conjunction with attached drawing and Specific embodiment is described in detail.

As shown in Figure 1, the present embodiment provides a kind of subpackages to merge integrated study data classification method, include the following steps：

S1：It obtains data and forms training set and test set；

S3：Corresponding a subset trains a sorter model；

S4：Calculate the corresponding weight factor of each sorter model；

The present embodiment applies this method to during diagnosis of Parkinson disease, by carrying out classification processing to voice data, Realize the early diagnosis and prediction of Parkinson's disease.Using to data set be " Training set ", by Sakar et al. provide, And it is downloaded from the machine learning data set library website University of California, Irvine (UCI).The data set is divided into two portions Point：With Parkinson's disease subject and health volunteer.Wherein suffering from the subject of Parkinson's disease has male 14, and women 6 Example；There are male 10, women 10 in health volunteer.Therefore, data set shares 40 subjects.Entirely data set includes 1040 samples, each sample have 26 features.It is worth noting that, each subject has 26 samples, these sample representations 26 different semantic tasks.

The above method can be divided into Subspace partition module (SP) in specific implementation, and subspace sample delivery type trains mould Block (TST) and more spatial weighting integrated study module (MWEL) three parts realize that SP modules are used to carry out dividing son to training set Collection.Sub-classifier model is trained using each subset and calculate the relevant parameter of subset in TST modules.With MWEL modules Fusion is weighted to the prediction label of all subsets, obtains final classification results.

As shown in Fig. 2, bagging algorithms by training set using there is the method for putting back to stochastical sampling multiple to generate New training set, each newly trained sample number is identical as the sample number of original training set.Then each new training set trains one A sorter model is simultaneously verified with test set.Finally, to the sorter model of each new training set by way of ballot Prediction label is weighted, and obtains final result.Obviously, the training set in bagging algorithms is obtained by random sampling , which results in the uncertainties of result.When carrying out classification experiments using bagging algorithms, usually experiment is repeated more It is secondary, using the average value of many experiments result as final result.The time that such experimentation undoubtedly increases model is complicated Degree.Subspace partition module (SP) proposed by the present invention is based on the concept wrapped in bagging algorithms, by training set according to one Fixed criterion is directly divided into several subsets, rather than random sampling is repeated as bagging algorithms.In this process, Training set sample is weighted with sample class heart distance ratio.

Assuming that the training set T comprising K class samples is expressed as T=[S₀；S₁；...；S_t；...；S_K], t=1,2 ..., K；Its The sample set that middle classification is t is：S_t=[s₁；s₂；...；s_i；...；s_m], i=1,2 ... ..m, m are subset sample number.S_t In i-th of sample be expressed as s_i=[f₁；...；f_j；...；f_n], i=1,2 ... ..n, f_jIndicate j-th of feature of the sample. As shown in figure 3, point B is class S_tCenter of a sample's point, coordinate representation isWherein For j-th of feature of i-th of sample.Point C is class S_tIn i-th of sample coordinate.Point A is other foreign peoples The central point of sample, is expressed as：

And haveα=∠ CDB, β=∠ CDA.Then there is sample s_iWith it is similar Center of a sample's point distance is：

It is with foreign peoples center of a sample point distance：

It is possible thereby to which the class heart distance metric ratio for acquiring sample i is：

As seen from the figure there is three kinds of situations in the class heart distance metric ratio wi of sample：

From the point of view of geometric angle, due to AD=DB in triangle, CD=DC, alpha+beta=180 °, therefore the length of line segment AC and BC Spend (i.e. d₀And d₁) closely bound up with the size of angle α and β.If α ＜ β, d₀＜ d₁, it is meant that w_i＜ 1；If α=β, Then triangle Δ ADC and Δ DBC is congruent triangles, then d₀=d₁And w_i=1；If α ＞ β, d₀＞ d₁, also with regard to table Show it is w_i＞ 1.

It is found by analysis, w values are bigger, and the aliasing degree between sample and other different classes of samples is bigger, in d₀Phase With in the case of, sampled point is remoter from other classification samples, and w values are smaller.Based on this, class heart distance metric ratio can be used for Indicate the aliasing degree of sample and other classification samples.In the ideal case, w is smaller, and representative sample is in structure sorter model In it is more advantageous, w is bigger, and represented sample may more mislead sorter model in entire sample space.Therefore, originally Invention uses weight of the class heart distance metric ratio as sample.The class heart of each sample in training set is calculated using formula (3) Distance metric ratio is worth to w, and is ranked up from big to small to training set sample by w, the training set after finally sample sorts It is averagely divided into K subset successively, we term it subpackages for this process.

Assuming that original training set is divided into K subset after Subspace partition module, can be expressed as：

T=[T₁,T₂,…,T_k,…,T_K], k=1,2 ..., K.

The sample weights of training set are expressed as：W=[W₁,W₂,…,W_k,…,W_K], wherein：

W_k=[w₁,w₂,…,w_j,wj,w_s], (j=1,2 ..., s) represent k-th of subset T_kWeight set, s tables Show this subset T_kSample size.

By the study found that after training set is divided into K subset, with the increase of subsequence number, each subset The separability of sample should show as being concave function.Subset separability is bigger, and the performance of the sub-classifier after training is better, whole The weight of sub-classifier should bigger in a model.

Step S3 carries out the training of sorter model using subspace sample delivery type training method, specially：

S33：According to：

Calculate the weight factor of the sorter model of K trained.

With reference to the method for thought and grader weight calculation that sample in Adaboost algorithm enhances, statistics subset T_kMiddle mistake The classification error rate of classification samples and subset, and calculate the weight factor of the sorter model of trained.

Classification error rate in step S32：

Wherein：w_jIndicate the class heart distance metric ratio of j-th of sample,Indicate subset T_kThe class heart of middle s sample Distance metric ratio weighted value, weight (j) represent the initialization weight of j-th of sample；I(Y_k(j)≠L_k(j) it indicates j-th Sample is by misclassification.

Assuming that by staying the misclassification sample set after a cross validation to beAnd in TST modules, mistake classification Sample set in sample next subset T is passed to after enhancing_k+1It is middle to be learnt again.Therefore, it transmits successively, accidentally The sample of classification always can in next subset re -training, to increase the utilization rate of sample.The flow of TST modules Figure is specific as shown in Figure 4.

As shown in figure 4, before dividing sample to be transferred in next subset the mistake of each subset, need to mistake divide sample into Row enhancing.Because in Subspace partition module, subset is divided according to the descending of sample weights.Therefore in each subset Sample weights successively decrease successively, enhancing sample just need reduce mistake divide sample class heart distance measurement ratio.However, being more worth It must consider, the wrong classification samples of previous subset may mistake classification again in next subset learning process. Therefore, it is necessary to inhibit the influence of wrong classification samples in previous subset to the grader weight of next trained.Together When, α_kIt is bigger, indicate subset T_kSample separability it is bigger, be put into next subset misclassification sample be more possible under the influence of The weight of a subset.So these misclassification samples should reduce the interference to model to the greatest extent.Sample enhancing in the present invention Mode is expressed as：

Wherein：It is the wrong original weight for dividing sample,It is the wrong weight for dividing sample after enhancing.Wherein sample barycenter Distance metric ratio (is referred to as sample weights).Since α_kAlways meet α_k>=0, then just there is exp (α_k) >=1 is always It sets up.Formula (8) can reduce the class heart distance metric ratio for accidentally dividing sample well, realize sample enhancing.Moreover, exp (α_k) be a monotonic increase function.α_kBigger, the new weight for accidentally dividing sample that formula (8) generates is smaller, to next height Collect α_k+1The influence of parameter is smaller.In this way, can prevent from having well the sample of larger separability to have compared with The increase of small separability subset weight.

Next, as shown in figure 5, the prediction label of all subsets is weighted using MWEL modules it is integrated, further Weaken influence of the overlapping area sample to sorter model, final nicety of grading is obtained by integrated study.Each subset instruction Weight of the experienced sorter model in entire model can be calculated using equation (5).However, the test in sample space The distribution of sample is totally unknown.Formula (5) cannot be used for indicating completely point in final mask by each trained The weight of class device.In order to improve robustness of the model to test set, needs a penalty factor to carry out antithetical phrase collection weight and carry out about Beam.Therefore, it is weighted processing using more spatial weighting integrated study modules in the present embodiment, specially：

S41：According toCalculate separately the penalty factor of K subset；

Pass through above-mentioned design, α_kBy β_kConstraint can improve the generalization ability of model.Assuming that the weight set of K subset It is expressed as Weight=[weight₁,weight₂,…,weight_K], and weight of the subset in entire model is to depend on Weight is calculated.If the λ in entire model_kThe weight for representing k-th subset, in order to ensureThen λ_k's Calculation is：

Further, in order to which the feasibility to the above method is verified, following experiment has also been devised in the present invention.

(1) because the sample space sample distribution situation of different data collection is different, it is therefore desirable to determine draw for data sets The optimum number of Molecule Set.Moreover, the quantity of subpackage cannot be too big, it can not be too small.Sample size in too big each subset Too small, training is insufficient；Inhomogeneous sample aliasing depth is too big in subset if too small, is unfavorable for sample classification.Therefore, Using subpackage proposed by the invention fusion integrated study sorting technique by after training set training pattern, in random slave training set It extracts 26 samples to be verified, statistical forecast accuracy rate.For data set used in the present embodiment, subpackage number is from 5-9 Between select, in the case of different packet numbers, 20 times experiment predictablity rates average result it is as shown in Figure 6.

It will be appreciated from fig. 6 that the best subpackage number of the used data set of the present embodiment is 7.In order to verify the sample of each subset Separability, in certain experimentation, the classification for having counted 7 subset subset samples in the case where staying a cross validation is accurate A concave function is substantially presented in the classification accuracy of true rate, the respective sample of seven subsets, has confirmed method part to each subset The analysis of performance.It can be seen that classification accuracy it is high subset sample separability it is big, the subset weight in entire model should be more Greatly；The low subset sample separability of classification accuracy is small, and the subset weight in entire model should be smaller.

The wrong transmission for dividing sample, increases the utilization rate of sample between subset.In certain experimentation, each subset is to next The mistake that a subset is transmitted divides number of samples to be denoted as N_i, it is transferred to next subset and then secondary wrong point of number is denoted as M_i+1, system The results are shown in Table 1 for meter：

Table 1：Next subset mistake is transmitted to divide sample number and transfer samples wrong score mesh compares again

Number of subsets	1	2	3	4	5	6	7
								N_i	2	8	57	64	26	18	1
M_i+1	—	0	4	13	4	1	1

Table 1 the result shows that the mistake transmitted each time divides sample, have most can correctly be divided in next subset Class, to prove, mistake divides the utilization ratio that the transmission of sample increases sample, realizes making full use of for available sample.

Because the weight parameter of a subset under the influence of the wrong transmission meeting for dividing sample, therefore counted context of methods respectively and existed There is no mistake that sample is divided to transmit, wrong point of sample transmits but no specimen enhances, and mistake divides sample to transmit and enhance.

Fig. 7 each subset weight and test set prediction result in the case of showing three kinds.Sample is divided to pass to verify mistake with this Pass the influence enhanced experimental result influence and sample to experimental result.

As seen in Figure 7, mistake is not transmitted to be divided to sample and transmit the wrong son for being divided to sample but not reinforcing two kinds of situations Collect weight, it is seen that after sample transmits, for the weighting curve of each subset closer to a concave function, improving on the whole can The weight for dividing the big subset of property, reduces the weight of the small subset of separability.Comparative sample enhancing is in non-reinforced situation, it is seen that Sample enhancing antithetical phrase collection weights influence and little on the whole.

For performance more of the invention, method 1 is conventional method as a result, being directly to be classified to obtain to data set 's；Method 2 is the result of bagging algorithms；Method 3 is to use SP modules in context of methods, after TST modules, to each height Collection prediction result carries out the result of ballot succession.Method 4 is to classify fully according to method proposed by the present invention, four kinds of methods The comparison curves of accuracy rate is as shown in Figure 8.

As seen in Figure 8, method proposed by the present invention is on the graders such as RF, SVM (linear) or SVM (RBF), Its classification performance is significantly improved.

Finally, it should be noted that the present embodiment description is merely a preferred embodiment of the present invention, the ordinary skill of this field Personnel under the inspiration of the present invention, without prejudice to the purpose of the present invention and the claims, as can making multiple types It indicates, such transformation is each fallen within protection scope of the present invention.

Claims

1. integrated study data classification method is merged in a kind of subpackage, feature includes the following steps：

S1：It obtains data and forms training set and test set；

S3：Corresponding a subset trains a sorter model；

S4：Calculate the corresponding weight factor of each sorter model；

S5：Testing data is inputted in each sorter model, the sample label of each sorter model output with it is corresponding Weight factor multiplication rear weight obtains classification results to the end.

2. integrated study data classification method is merged in subpackage according to claim 1, it is characterised in that：Described in step S2 Subspace partition module uses weight of the class heart distance metric ratio as sample, by the class for calculating each sample in training set Heart distance metric ratio, and be lined up successively by descending order, finally it is divided into K subset.

3. integrated study data classification method is merged in subpackage according to claim 1 or 2, it is characterised in that：Step S3 is adopted The training of sorter model is carried out with subspace sample delivery type training method, specially：

S31：Set subset T_kTrue tag representation be：Y_k=[y₁,y₂,…,y_j,…,y_s], using staying a cross-validation method It is verified to obtain prediction label set to be L_k；

S33：According toCalculate the sorter model of K trained Weight factor.

4. integrated study data classification method is merged in subpackage according to claim 3, it is characterised in that：Divide in step S32 Class error rateWherein：

w_jIndicate the class heart distance metric ratio of j-th of sample,Indicate subset T_kMiddle s sample This class heart distance metric ratio weighted value, weight (j) represent the initialization weight of j-th of sample；I(Y_k(j)≠L_k(j) table Show j-th of sample by misclassification.

5. integrated study data classification method is merged in subpackage according to claim 3, it is characterised in that：Set subset T_kIn By staying the misclassification sample set after a cross validation to beThe sample is transmitted to next subset T after enhancing_k+1 It is middle to be learnt again.

6. integrated study data classification method is merged in subpackage according to claim 5, it is characterised in that：Misclassification sample Enhancement method isWherein：It is the wrong original weight for dividing sample,It is to increase Mistake divides the weight of sample after strong.

7. integrated study data classification method is merged in subpackage according to claim 3, it is characterised in that：Added using more spaces Power integrated study module is weighted processing, specially：

S41：According toCalculate separately the penalty factor of K subset；