CN105930856A - Classification method based on improved DBSCAN-SMOTE algorithm - Google Patents


Info

Publication number
CN105930856A
CN105930856A
Authority
CN
China
Prior art keywords
sample
class
algorithm
boundary
data
Prior art date
Legal status
Pending
Application number
CN201610169101.0A
Other languages
Chinese (zh)
Inventor
张春慨
Current Assignee
Shenzhen Yitong Technology Co Ltd
Original Assignee
Shenzhen Yitong Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Yitong Technology Co Ltd filed Critical Shenzhen Yitong Technology Co Ltd
Priority to CN201610169101.0A
Publication of CN105930856A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions


Abstract

The invention relates to a classification method based on an improved DBSCAN-SMOTE algorithm for handling intra-class imbalance in a data sample space. First, the boundary samples in a data sample set are identified and divided into majority-class boundary samples and minority-class boundary samples, and the boundary samples in the majority-class sample space are clustered. A PSO algorithm is then used to optimize the oversampling rates of the boundary samples and safe samples within each cluster, and the minority-class boundary samples are oversampled at these different rates with the SMOTE algorithm. The clustering is based on an improved DBSCAN algorithm that can generate minority-class clusters and oversample within them, fully addressing the uneven-distribution, data-fragmentation, and small-disjunct problems of intra-class imbalance.

Description

Classification method based on an improved DBSCAN-SMOTE algorithm
Technical field
The invention belongs to the field of classification and optimization in data mining, and in particular relates to a classification method suited to imbalanced samples.
Background technology
In imbalanced classification, an imbalanced data set is one in which the samples of one class differ enormously in number from those of the remaining class or classes, and it is usually the minority class that deserves the most attention. In medical diagnosis, for example, cancer and heart-disease data sets are imbalanced: the samples of interest are the diseased ones, and accurately classifying their attributes allows a patient's condition to be diagnosed precisely and targeted treatment to be given in time.
When handling imbalanced classification, traditional classifiers often pursue a higher overall accuracy by simply classifying minority-class samples as majority-class samples. In this way the records of potential patients may be classified as healthy, so patients may miss the best treatment window, causing irreparable harm.
Imbalanced classification in fact involves two kinds of imbalance: between-class and intra-class. Between-class imbalance is the numerical imbalance between the majority and minority classes. Intra-class imbalance is imbalance in the distribution position or distribution density among the sub-clusters inside the minority-class sample space, that is, data fragmentation and uneven density.
Many schemes exist for imbalanced classification, and they fall into two broad categories. The first attacks the problem at the data level: a suitable sampling method increases the number of minority-class samples so that the original sample space tends toward balance and the task reduces to a conventional machine-learning classification problem. The second attacks it at the algorithm level: traditional classifier algorithms are modified so that they can better recognize the small number of minority samples.
The main idea of data-level solutions is to reconstruct the original imbalanced sample space by resampling until it reaches a relatively balanced state, so that a standard classifier algorithm can then classify the reconstructed data well. Resampling takes many forms, chiefly the two broad families of oversampling and undersampling: oversampling the minority class, undersampling the majority class, or a hybrid of the two.
Researchers have proposed many effective resampling schemes. One-side selection, proposed by Kubat et al. in 1997, is a refinement of the sample-vector-distance idea into a new undersampling method. Building on sample vector distances combined with KNN, it divides the majority-class samples into four basic categories: safe samples, redundant samples, boundary samples, and noise samples. Safe and redundant samples lie far from the boundary between the minority and majority classes and are relatively easy to classify; boundary and noise samples, termed "unsafe samples", require special attention. In the sampling process of one-side selection, the safe samples of the majority class are therefore combined with the minority-class samples into a new data set, while the "unsafe samples" are handled by undersampling.
SMOTE (synthetic minority over-sampling technique), proposed by Chawla et al., is one of the most widely applied oversampling methods. Its main idea is to synthesize new samples between neighbouring samples in the minority-class sample space. SMOTE thereby avoids the overfitting problem and better extends the decision space of the minority class; applied to the majority-class sample space, it can likewise shrink the majority-class decision space. Many later oversampling methods build on SMOTE: Han et al. proposed Borderline-SMOTE, and He and Li proposed the NRSBoundary-SMOTE algorithm, whose main idea is to expand the classification space of the minority class while compressing that of the majority class.
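By way of illustration only (not the patent's exact procedure), the core SMOTE interpolation step can be sketched as follows; the function name, parameters, and toy data are all illustrative, and a fixed seed is used for reproducibility:

```python
import math
import random

def smote(minority, k=2, n_new=4, seed=0):
    """Sketch of SMOTE: for each synthetic point, pick a minority sample,
    pick one of its k nearest minority neighbours, and interpolate
    uniformly at random on the segment between the two."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority-class neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p != x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi) for xi, ni in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3)]
new_points = smote(minority, k=2, n_new=4)
```

Because each synthetic point lies on a segment between two minority samples, it always falls inside the bounding box of the minority class, which is why SMOTE extends the minority decision space without copying samples verbatim.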
Yen and Lee proposed SBC, a cluster-based undersampling method, in 2009. The algorithm holds that different clusters differ from one another, so different clusters should receive different undersampling strategies. Its main idea is to cluster the majority-class samples into k clusters with a traditional clustering algorithm, set an undersampling rate for each cluster according to the fraction of all majority-class samples it contains, and then undersample each cluster at its own rate, thereby deleting some majority-class samples.
Data-level improvements have achieved strong classification results and many successful applications for imbalanced classification. But the data level only attacks the problem from the angle of the data distribution and does not seek better solutions from the angle of the classification algorithm. More and more researchers have therefore sought breakthroughs at the algorithm level, mainly by modifying traditional classifier algorithms so that they can cope with imbalanced classification. Research at this level concentrates on cost-sensitive learning, ensemble learning, and various other optimized classification learning algorithms.
In general, the most common algorithm-level remedy for imbalanced classification is to choose an appropriate inductive bias. For decision trees, one approach adjusts the probability estimates at the leaf nodes and another refines the pruning technique; for SVMs, one sets different penalty factors for different classes or adjusts the separating hyperplane through the kernel function. Devising such an algorithmic solution requires knowledge of the corresponding classifier and its application, and in particular an understanding of why the learning algorithm fails completely when it meets imbalanced data.
Cost-sensitive learning assigns different misclassification costs to differently misclassified samples. A cost matrix expresses the penalty paid for assigning a sample of one class to another; following the convention of the risk formula below, let C(i, j) denote the cost of predicting class i for a sample whose true class is j, with Min the minority class and Maj the majority class. When solving imbalanced classification, the information carried by a minority-class sample is usually more valuable than that of a majority-class sample, so in the cost matrix the cost of misclassifying a minority sample as majority should exceed the cost of the reverse error, i.e. C(Maj, Min) > C(Min, Maj). Correctly classified samples incur no cost: C(Maj, Maj) = C(Min, Min) = 0. The conditional risk R(i | x) of assigning class i to sample x is then defined by formula (1):
R (i | x)=∑jP(j|x)C(i,j) (1)
where P(j | x) is the posterior probability that sample x belongs to class j. In summary, cost-sensitive learning rests on two key principles: the cost of misclassifying minority-class samples is set higher than that of majority-class samples, and the conditional risk is minimized. During training the classifier can then pay more attention to the samples whose misclassification cost is larger, correcting them in the next iteration.
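A minimal sketch of formula (1), under the assumption (stated here, not in the original) that C(i, j) is the cost of predicting class i when the true class is j, so that assigning Maj to a true Min sample carries the larger cost; the posterior and cost values are illustrative:

```python
def conditional_risk(posterior, cost, label):
    """R(label | x) = sum_j P(j | x) * C(label, j)  -- formula (1)."""
    return sum(posterior[j] * cost[(label, j)] for j in posterior)

# Illustrative cost matrix: correct decisions cost nothing, and deciding
# Maj for a true Min sample (missing a minority case) is penalised most.
cost = {("Min", "Min"): 0.0, ("Maj", "Maj"): 0.0,
        ("Min", "Maj"): 1.0,   # predicting Min when the truth is Maj
        ("Maj", "Min"): 5.0}   # predicting Maj when the truth is Min
posterior = {"Min": 0.3, "Maj": 0.7}   # P(j | x) from some base classifier

risks = {c: conditional_risk(posterior, cost, c) for c in ("Min", "Maj")}
decision = min(risks, key=risks.get)   # minimum-conditional-risk decision
```

Note that although the posterior favours Maj (0.7 vs 0.3), the minimum-risk decision is Min, which is exactly how the cost matrix shifts attention toward the minority class.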
The basic idea of ensemble classification is to build several base classifiers, let each make its own prediction on the original data, and then pool the predictions and decide by weighted majority voting. The main motivation for combining base classifiers is to improve their generalization ability: each base classifier is assumed to misclassify on some particular limited data set, and different base classifiers may assign the same instance different class labels. Ensemble learning has attracted so much attention and application in imbalanced classification chiefly because of its unique advantages: a statistically grounded learning method; strong performance on large-scale data; a distinctive learning effect on imbalanced classification; and divide-and-conquer processing of the data set combined with special treatment of boundary samples. These advantages make ensemble learning naturally suited to imbalanced classification.
Ensemble classification works essentially by learning statistical bias or variance. For a given base classifier, the bias-variance decomposition differs in its bias, variance, and irreducible-error components, and the performance gain of ensemble methods mainly shows up as a reduction in variance. Bagging, Random Forest, and AdaBoost form a family of ensemble learning methods that have been successfully applied to reduce misclassification variance.
AdaBoost has also been observed to reduce bias. With learners based on decision stumps (trees with a single split and only two terminal nodes, usually low in variance but high in bias), bagging performs very poorly while AdaBoost markedly improves the base classifier's performance. On an imbalanced data set, standard learning methods perform badly because of the bias error they introduce, so AdaBoost's ability to reduce bias is crucial when handling the class-imbalance problem. Its main variants are AdaC1, AdaC2, and AdaC3, which differ chiefly in how the weight coefficients of the base classifiers are calculated: each recomputes them in every iteration with the costs taken into account.
As described above, sampling methods change the distribution of the sample data and thereby solve the problem that standard classifier algorithms cannot adapt to imbalanced data sets. Much research nonetheless still attacks the low classification accuracy on imbalanced data sets from the angle of the classifier algorithm; cost-sensitive learning, for example, completes classification by seeking the lowest cost under appropriately set misclassification costs.
Traditional random undersampling and random oversampling have obvious shortcomings. Random undersampling may remove potentially valuable samples, making imbalanced classification inaccurate; random oversampling can cause overfitting. Even the mainstream SMOTE has evident problems: the synthesized samples extend the decision space only slightly, and because of its randomness it sometimes cannot handle what traditional classifiers regard as "minority-class noise samples", which significantly harms diagnostic performance.
In the real world, minority-class samples are often misclassified or treated as noise. Cost-sensitive learning assigns minority-class samples a higher misclassification cost and, conversely, majority-class samples a lower one, so that the harder-to-classify samples receive more "attention" during classification and the effort is well targeted. In addition, ensemble learning, as an important machine-learning framework, has in recent years increasingly been applied to imbalanced classification: by combining base classifiers it forms an ensemble classifier with higher accuracy. Many improvements to ensemble learning exist; Chawla N V et al., for example, integrated the SMOTE sampling algorithm into Boosting to obtain SMOTEBoost, improving the classification accuracy on the minority class.
Boosting algorithms also have certain drawbacks. In each iteration, the strategy used to assign sample weights tends to have a large effect on the final classification quality. Likewise, the voting weight of each base classifier is usually determined solely from that classifier's final classification result, yet within the combination the base classifiers are often interrelated, so the optimal coefficient combination cannot be determined from their individual accuracies alone; how to find the optimal combination of coefficients is itself a problem that needs careful consideration.
The chief drawback of most existing methods is that they handle only between-class imbalance and ignore the intra-class imbalance problem, so their final results often fall short of what is desired.
Summary of the invention
To solve these problems of the prior art, the invention provides a classification method based on an improved DBSCAN-SMOTE algorithm for intra-class imbalanced samples, using the improved DBSCAN-SMOTE algorithm to handle intra-class imbalance.
The present invention is achieved through the following technical solutions:
A classification method based on an improved DBSCAN-SMOTE algorithm, directed at intra-class imbalance in the processing of a data sample space. First, in the data sample set, the boundary samples are identified and divided into majority-class boundary samples and minority-class boundary samples, and clustering is applied to the boundary samples in the majority-class sample space. A PSO algorithm is then used to optimize the oversampling rates of the boundary samples and safe samples within each cluster, and the minority-class boundary samples are oversampled at different rates with the SMOTE algorithm. The clustering is based on an improved DBSCAN algorithm, which proceeds as follows. First, considering that the distribution density of the minority-class samples is uneven under intra-class imbalance, a group of EPS values based on distribution density is obtained. The average distance computed for each minority-class sample point is assembled into a distance vector array; taking these average distances as a raw data set, clustering is performed on the distances. After the distance array has been clustered into N clusters, all distances within each cluster are summed and averaged, and the resulting mean serves as the neighbourhood threshold of that cluster; computing this mean for each of the N clusters yields N neighbourhood thresholds EPSi, i = 1, 2, ..., N. Next, the N thresholds are sorted in ascending order and saved in an array. In the subsequent clustering, the smallest value in the threshold array is first selected as the EPS value of the DBSCAN algorithm and all minority-class samples are clustered; the next threshold in the array is then used to run DBSCAN again on the minority-class samples labelled as noise points, likewise obtaining some clusters and a remainder of noise points. Finally, the above operation is repeated until all minority-class samples have been clustered under the different EPS values, completing all clustering operations on the minority class; any data ultimately assigned to no cluster are treated as noise.
As a further improvement of the present invention, the identification of boundary samples in the data sample set and their division into majority-class and minority-class boundary samples uses the Borderline algorithm. The imbalanced data set to be classified is given in advance as S = {(x1, y1), (x2, y2), ..., (xn, yn)}, where each sample consists of a feature vector and its class label; the feature vector is written x, i.e. x = {x1, x2, ..., xn}, and the class label y ∈ {Maj, Min}. The majority-class sample set in S is written Smaj and, likewise, the minority class is written Smin. The Borderline algorithm then proceeds as follows:
(1) For each minority-class sample in Smin, use the k-nearest-neighbour method to find its K nearest samples in the whole data set S, and store them in the set KNNsmin associated with that Smin sample.
(2) Classify each sample in Smin as a boundary, noise, or safe sample according to the three formulas below:
|{(x, y) | (x, y) ∈ KNNsmin ∧ y = Maj}| > K/2 (1)
|{(x, y) | (x, y) ∈ KNNsmin ∧ y = Maj}| = K (2)
|{(x, y) | (x, y) ∈ KNNsmin ∧ y = Maj}| = 0 (3)
A minority-class sample point satisfying formula (1) is a boundary sample and is inserted into the minority-class boundary sample set SBmin; a point satisfying formula (2) is a noise sample; a point satisfying formula (3) is a safe sample.
As a further improvement of the present invention, the method additionally makes the sampling rate in the minority-class boundary sample space greater than that of the non-boundary samples; the boundary samples in the majority-class sample space are clustered, each cluster of majority-class boundary samples is replaced by its cluster centroid, and the other original samples in the cluster are removed.
The beneficial effects of the present invention are as follows: the invention can not only generate minority-class clusters and then oversample within those clusters, but can also fully solve the uneven-distribution, data-fragmentation, and small-disjunct problems of intra-class imbalance. The present invention is especially suitable for classification problems with intra-class imbalance and for similar problems of recognizing minority-class imbalanced samples. The proposed method is general when classifying data sets and can effectively handle data sets of different types.
Detailed description of the invention
The present invention is further described below in conjunction with specific embodiments.
Sample boundary overlap is another reason that makes imbalanced classification hard, so the present invention pays particular attention to the effect of boundary samples on the final classification quality and takes corresponding measures to eliminate it. The main measures are: the sampling rate in the minority-class boundary sample space is made greater than that of the non-boundary samples; the boundary samples in the majority-class sample space are clustered, each cluster of majority-class boundary samples is replaced by its centroid, and the other original samples in the cluster are removed. After these two treatments, the boundary samples of the original data space become sharper, largely eliminating the effect of boundary samples on imbalanced classification.
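The second treatment, collapsing a cluster of majority-class boundary samples into its centroid, can be sketched as follows; the function name and toy cluster are illustrative, and the clustering step that produces the cluster is assumed to have already run:

```python
def replace_with_centroid(cluster):
    """Collapse a cluster of majority boundary samples into its centroid:
    the coordinate-wise mean of the members, which then replaces them."""
    dim = len(cluster[0])
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(dim))

# A hypothetical cluster of majority-class boundary samples.
border_cluster = [(1.0, 2.0), (1.2, 2.2), (0.8, 1.8)]
centroid = replace_with_centroid(border_cluster)
# The cluster members are then removed and only the centroid is kept,
# thinning the majority side of the boundary region.
```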
For the problems noted above, the present invention proposes a clustering algorithm based on improved DBSCAN that can effectively handle the distribution-position and distribution-density problems of intra-class imbalance. Combined with the idea of the Borderline algorithm, boundary-sample information is taken into account; a PSO algorithm then optimizes the oversampling rates of the boundary samples and safe samples within each cluster, and the minority-class boundary samples are oversampled at different rates with the SMOTE algorithm. Secondly, the majority-class boundary samples are clustered and replaced by centroids, a separate treatment that doubly sharpens the boundary samples. The data samples processed in this way not only tend toward balance both between and within classes but also gain sharper boundaries, which effectively improves the final classification quality.
First, the Borderline algorithm is described. Its main function is to determine, within the data sample set, which samples are boundary samples, and to divide them into majority-class and minority-class boundary samples. In the algorithm, the imbalanced data set to be classified is given in advance as S; each sample consists of a feature vector and its class label, with the feature vector written x, i.e. x = {x1, x2, ..., xn}, and the class label y ∈ {Maj, Min}. The data set can thus be expressed as:
S = {(x1, y1), (x2, y2), ..., (xn, yn)}
The majority-class sample set in S is written Smaj and, likewise, the minority class is written Smin. The Borderline algorithm then proceeds as follows:
(1) For each minority-class sample in Smin, use the k-nearest-neighbour method to find its K nearest samples in the whole data set S, and store them in the set KNNsmin associated with that Smin sample.
(2) Classify each sample in Smin as a boundary, noise, or safe sample according to the three formulas below:
|{(x, y) | (x, y) ∈ KNNsmin ∧ y = Maj}| > K/2 (2)
|{(x, y) | (x, y) ∈ KNNsmin ∧ y = Maj}| = K (3)
|{(x, y) | (x, y) ∈ KNNsmin ∧ y = Maj}| = 0 (4)
A minority-class sample point satisfying formula (2) is a boundary sample and is inserted into the minority-class boundary sample set SBmin; a point satisfying formula (3) is a noise sample; a point satisfying formula (4) is a safe sample.
(3) By the method of (2), the majority-class samples are likewise categorized according to the Borderline algorithm; their boundary samples are inserted into a majority-class boundary sample set SBmaj, their noise samples are removed directly, and their safe samples are left untouched.
Through the three steps above, the boundary sample spaces of the minority class and the majority class are obtained, so the subsequent hybrid sampling can focus on the boundary samples, sharpening them and improving the classification quality.
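Steps (1) and (2) above, categorizing minority samples by counting majority-class points among their K nearest neighbours (formulas (2)-(4)), can be sketched as follows; step (3) applies the same counting to the majority class. The function name, K = 3, and the toy data are illustrative, and samples falling in the range the three formulas leave unassigned (0 < m ≤ K/2) are labelled "other" here:

```python
import math

def categorize_minority(samples, labels, k=3):
    """Borderline-style categorization of each minority sample as
    'border', 'noise', or 'safe', by counting majority-class points
    among its k nearest neighbours in the whole data set."""
    cats = {}
    for i, (x, y) in enumerate(zip(samples, labels)):
        if y != "Min":
            continue
        # k nearest neighbours of x over the whole set (excluding itself)
        nn = sorted((j for j in range(len(samples)) if j != i),
                    key=lambda j: math.dist(x, samples[j]))[:k]
        m = sum(1 for j in nn if labels[j] == "Maj")
        if m == k:
            cats[i] = "noise"    # surrounded entirely by majority points
        elif m > k / 2:
            cats[i] = "border"   # more than half the neighbours are majority
        elif m == 0:
            cats[i] = "safe"     # no majority neighbours at all
        else:
            cats[i] = "other"    # 0 < m <= k/2: none of the three rules fires
    return cats

# Majority cluster near the origin; one minority point deep inside it
# (noise), one at its edge (border), and a minority cluster far away (safe).
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.2), (0.2, 0.1),
           (0.05, 0.05), (0.12, 0.08),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.2, 5.1)]
labels = ["Maj"] * 5 + ["Min"] * 6
cats = categorize_minority(samples, labels, k=3)
```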
In imbalanced sample classification, if intra-class imbalance exists in the data set, a traditional clustering algorithm applied without any modification often cannot produce the desired clusters. Intra-class imbalance involves uneven distribution within the minority class as well as data fragmentation and small disjuncts, so the traditional algorithm must be suitably improved into a clustering algorithm that can specifically cluster under intra-class imbalance.
The purpose of the DBSCAN algorithm is to cluster samples by density: samples satisfying the density requirement within the unit radius are clustered, and sample points that do not satisfy this condition are treated as noise and discarded. Unlike traditional clustering algorithms, DBSCAN can discover clusters of arbitrary density-connected shape, and compared with traditional algorithms it has the following advantages:
(1) Compared with the K-means algorithm, DBSCAN does not require the number of clusters or the initial centroids to be specified manually in advance;
(2) The shapes of the clusters DBSCAN produces are not significantly distorted;
(3) Its parameters can be set according to the actual situation to achieve noise filtering.
The concrete clustering procedure of the standard DBSCAN algorithm is as follows. With a preset unit radius and density, find the points in the whole data set that satisfy the condition, called core points, and then expand each core point. Expansion means finding all other sample points density-connected to the core point: traverse all core points within the epsilon neighbourhood of the current core point (boundary points cannot be expanded), find the other minority-class sample points density-connected to it, judge whether each such point is itself a core point, and continue expanding from it until no further density-connected core point can be found. Next, rescan the minority-class sample points not yet assigned to any cluster to find a core point that has never been clustered, and expand it by the same method, until the whole data set contains no new unclustered core point. When clustering ends, the minority-class sample points assigned to no cluster are the noise points.
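The standard procedure just described can be sketched as a minimal pure-Python DBSCAN (illustrative only; here min_pts counts neighbours excluding the point itself, and -1 marks noise):

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: find core points (>= min_pts neighbours
    within eps), grow a cluster from each unvisited core point by
    density-connectivity, and leave the rest labelled -1 (noise)."""
    def neighbours(i):
        return [j for j in range(len(points))
                if j != i and math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)      # None = not yet visited
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbs = neighbours(i)
        if len(nbs) < min_pts:
            labels[i] = -1             # provisionally noise
            continue
        labels[i] = cluster_id         # i is a core point: grow a cluster
        queue = deque(nbs)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster_id # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbs_j = neighbours(j)
            if len(nbs_j) >= min_pts:  # j is itself core: keep expanding
                queue.extend(nbs_j)
        cluster_id += 1
    return labels

# Two dense blobs and one outlier: the blobs become clusters 0 and 1,
# the outlier stays noise.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
       (3.0, 3.0), (3.1, 3.0), (3.0, 3.1), (3.1, 3.1),
       (10.0, 10.0)]
labels = dbscan(pts, eps=0.3, min_pts=2)
```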
The traditional DBSCAN clustering algorithm can solve many problems that other clustering algorithms cannot, but under intra-class imbalance in imbalanced sample classification it still has certain drawbacks:
(1) Because DBSCAN uses a single ε and a single MinPts parameter, when the distribution density under intra-class imbalance is uneven, the clusters DBSCAN produces are often not the optimal clustering result. For an unevenly distributed intra-class-imbalanced data set, different ε and MinPts parameters must be used so that clusters corresponding to the different distribution densities can be generated.
(2) When intra-class imbalance brings data fragmentation or small disjuncts, the standard DBSCAN algorithm cannot allow for the very small size of the fragmented or small-disjunct regions, which the classifier is then likely to treat directly as noise points.
For the problems above, an improved DBSCAN algorithm is used to handle them and thus solve intra-class imbalance. Its basic idea is as follows.
First, considering that the distribution density of minority-class samples is uneven under intra-class imbalance, a group of EPS values based on distribution density can be obtained. In an intra-class-imbalanced data set, each minority-class sample point lies at a different distance from the other minority-class points, i.e. the distribution density varies. The distribution density is measured by the distances from any minority-class sample to its K nearest minority-class sample points. Concretely, for an arbitrary minority-class sample point Xi, take its K nearest other minority-class points, compute the distance from Xi to each of them, and average these distances. The resulting mean distance gives the distribution density of Xi, and this average-distance measure of density can be computed for every minority-class sample point.
Then the average distances computed for each minority-class sample point are collected into a distance vector array. Treating these average distances as a raw data set, clustering by distance is performed on it. After the distance array has been grouped into N clusters, all the distances within each cluster are summed and averaged, and the resulting mean is taken as the neighborhood threshold of that cluster. Computing this mean for each of the N clusters yields N neighborhood thresholds EPSi (i = 1, 2, ..., N).
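A minimal sketch of deriving the N thresholds: the patent does not fix the method used to cluster the one-dimensional distance array, so this example splits the sorted values at the largest gaps (a simple stand-in for any 1-D clustering) and averages each group:

```python
def eps_thresholds(avg_dists, n_clusters):
    """Group the per-sample average distances into n_clusters 1-D groups
    by cutting the sorted values at the (n_clusters - 1) largest gaps,
    then return the mean of each group as a candidate EPS value."""
    d = sorted(avg_dists)
    # indices of the largest gaps between consecutive sorted values
    gaps = sorted(range(len(d) - 1), key=lambda i: d[i + 1] - d[i],
                  reverse=True)[:n_clusters - 1]
    cuts = sorted(i + 1 for i in gaps)
    groups, start = [], 0
    for c in cuts + [len(d)]:
        groups.append(d[start:c])
        start = c
    # one EPS per group: the mean distance inside that group
    return sorted(sum(g) / len(g) for g in groups)
```

For example, `eps_thresholds([0.1, 0.12, 0.11, 5.0, 5.2], 2)` separates the dense values from the sparse ones and yields one small and one large EPS.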
Next, these N neighborhood thresholds are sorted in ascending order and saved in an array, to be used in turn as the EPS parameter of the improved DBSCAN algorithm in the next step.
In the subsequent clustering, the smallest value in the threshold array is first selected as the EPS value of the DBSCAN algorithm (MinPts can be specified manually and kept constant during training), and all minority-class samples are clustered. The clustering yields several minority-class clusters that satisfy this density; the remaining minority-class samples that do not satisfy the condition are labeled as noise samples. Then the next threshold in the array is used to continue DBSCAN clustering on the minority-class samples labeled as noise points, likewise obtaining some clusters and the remaining noise sample points.
Finally, the above operation is repeated, applying DBSCAN clustering to the minority-class samples still labeled as noise with successively larger thresholds from the array. After all minority-class samples have been clustered under the different EPS values, all clustering operations on the minority class are complete; the samples that in the end belong to no cluster are regarded as noise data.
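The multi-threshold loop described above can be sketched as follows. A minimal pure-Python DBSCAN stands in for a library implementation (e.g. scikit-learn's `DBSCAN`); the function names and point sets are illustrative:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)
    cluster = -1
    def neighbors(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                 # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster        # border point reclaimed
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:         # core point: expand from it
                seeds.extend(nb)
    return labels

def multi_eps_dbscan(points, eps_list, min_pts):
    """Improved-DBSCAN loop from the text: cluster with the smallest EPS,
    then re-cluster only the points still labelled noise with each larger
    EPS in turn; whatever remains noise at the end stays noise."""
    final = [-1] * len(points)
    remaining = list(range(len(points)))   # indices not yet clustered
    next_id = 0
    for eps in sorted(eps_list):
        sub = [points[i] for i in remaining]
        labels = dbscan(sub, eps, min_pts)
        still_noise = []
        for idx, lab in zip(remaining, labels):
            if lab == -1:
                still_noise.append(idx)
            else:
                final[idx] = next_id + lab
        if labels and max(labels) >= 0:
            next_id += max(labels) + 1     # keep cluster ids unique
        remaining = still_noise
        if not remaining:
            break
    return final
```

With two EPS values, a dense group is picked up in the first pass, a sparser group in the second, and an isolated outlier stays labeled -1.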
The improved DBSCAN algorithm not only produces minority-class clusters within which oversampling can then be performed, but also adequately addresses uneven within-class distributions and the problem of fragmented data or small disjuncts.
In the experiments verifying the performance of the method of the present invention, AUC (Area Under the Receiver Operating Characteristic Curve) is selected as an intuitive evaluation of classifier performance, represented in a two-dimensional coordinate system. The X-axis is the proportion of misclassified minority-class (positive) samples (FP_Rate), and the Y-axis is the proportion of correctly classified minority-class (positive) samples (TP_Rate). After classifying a set of sample data, each classifier produces a point (FP_Rate, TP_Rate); adjusting the classifier's threshold produces multiple points, which form the ROC curve, and AUC is the area under this curve. The larger the AUC, the stronger the discriminative ability of the classifier.
F-Measure is the metric most frequently applied in the evaluation of imbalanced-data classification, as shown in the equation below. F-Measure is obtained by combining recall, precision, and a balance factor; when both Recall and Precision reach high values, F-Measure yields a good result.
F-Measure = (1 + β²) × Recall × Precision / (β² × Recall + Precision)    (5)
where β is the balance factor regulating the weight of recall against precision (usually β is set to 1).
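The two evaluation metrics can be sketched as follows; `roc_auc` computes the AUC via its equivalent rank statistic rather than by tracing the curve (function names are illustrative):

```python
def f_measure(recall, precision, beta=1.0):
    """Equation (5): weighted harmonic mean of recall and precision."""
    if recall == 0 and precision == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * recall * precision / (b2 * recall + precision)

def roc_auc(scores, labels):
    """AUC as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With β = 1, F-Measure reduces to the familiar F1 score, 2 × Recall × Precision / (Recall + Precision).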
Test environment: Windows 7, Python 2.7, IDE: Spyder, CPU: Intel(R) Xeon(R) E5-2609, RAM: 16 GB.
The experimental data sets of the present invention comprise eight data sets with within-class imbalance and two data sets with only between-class imbalance, all obtained from the UCI repository. Table 1 describes the data sets used in all experiments, where the No. column is the data set number, DataSet is the data set name, #Attr is the number of attributes the data set contains, and %Min is the minority-class sample proportion. There are 10 data sets in total; the imbalance ratio of each differs, as does the total number of samples.
Table 1 Description of the experimental data sets
To verify the effectiveness of the improved data-level algorithm DBS (DBSCAN-SMOTE), one of the two improvements proposed by the present invention, the present invention chooses the traditional data-level resampling methods ROS (Random OverSampling) and SM (SMOTE) as comparison algorithms in the experiments.
Table 2 AUC, F-Measure, and overall accuracy of the DBS algorithm experiments
Table 2 shows the index results of the above oversampling algorithms on the 10 data sets, using AdaBoost.M1 as the classifier. The NO. column gives the data set number from Table 1; the classification performance of each algorithm is compared on three indices: AUC, F-Measure (F-Mea), and overall accuracy Acc.
As can be seen from Table 2, on the 10 experimental data sets used in the present invention, the DBS algorithm achieves the highest AUC on 7 of them, the most of any algorithm; the highest F-Measure on 6, likewise the most; and the highest overall accuracy Acc on 7, again the most. From these evaluation indices it can be concluded that, compared with existing oversampling algorithms that consider only between-class imbalance, the improved DBS oversampling algorithm, which also considers within-class imbalance, can effectively mitigate the impact of within-class imbalance on classification and can improve imbalanced-data classification across multiple indices.
The DBS algorithm not only handles within-class imbalance, but also performs well on data sets with only between-class imbalance. As the classification results in Table 2 show, earlier improved algorithms often attain comparatively high accuracy when trained on data sets with only between-class imbalance, because between-class imbalance is the easier case to handle: the sample space can readily be balanced. On the two data sets with only between-class imbalance, yeast5 and page-blocks0, the DBS, ROS, and SM algorithms all achieve very good classification results, showing that DBS is not only an effective algorithm focused on within-class imbalance, but also reaches good results on data sets with only between-class imbalance. While obtaining good classification performance, this demonstrates the generality of the improved algorithm proposed by the present invention: it can effectively handle different types of data sets.
The above content is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be deemed limited to these descriptions. For those of ordinary skill in the technical field of the present invention, several simple deductions or substitutions may be made without departing from the inventive concept, and all of these shall be deemed to fall within the protection scope of the present invention.

Claims (5)

1. A classification method based on an improved DBSCAN-SMOTE algorithm, characterized in that: the method addresses within-class imbalance in the processing of the data sample space; first, in the data sample set, the boundary samples are identified and divided into majority-class boundary samples and minority-class boundary samples, and clustering is applied to the boundary samples in the majority-class sample space; then a PSO algorithm is used to optimize the oversampling rates of the boundary samples and safe samples within each cluster, and the SMOTE algorithm is used to oversample the minority-class boundary samples at different sampling rates; wherein said clustering is based on an improved DBSCAN algorithm, the improved DBSCAN algorithm comprising: first, considering the uneven distribution density of the minority-class samples under within-class imbalance, obtaining a group of EPS values based on distribution density; then collecting the average distances computed for each minority-class sample point into a distance vector array, treating these average distances as the raw data set, and performing clustering by distance on this data set; after the distance array has been grouped into N clusters, averaging the distances within each cluster and taking the mean as the neighborhood threshold of that cluster, whereby computing this mean for each of the N clusters yields N neighborhood thresholds EPSi, i = 1, 2, ..., N; next, sorting the N neighborhood thresholds in ascending order and saving them in an array; in the subsequent clustering, first selecting the smallest value in the threshold array as the EPS value of the DBSCAN algorithm and clustering all minority-class samples, then using the next threshold in the array to continue DBSCAN clustering on the minority-class samples labeled as noise sample points, likewise obtaining some clusters and the remaining noise sample points; finally, repeating the above operation until all minority-class samples have been clustered under the different EPS values, completing all clustering operations on the minority class, whereupon the samples that in the end belong to no cluster are noise data.
2. The method according to claim 1, characterized in that: identifying the boundary samples in the data sample set and dividing them into majority-class boundary samples and minority-class boundary samples uses the Borderline algorithm; the imbalanced data set to be classified is denoted S = {(x1,y1),(x2,y2),...,(xn,yn)}, where each sample consists of a feature vector and the class label of that sample, the feature vector being denoted x and the class label y, i.e. x = {x1,x2,...,xn} and y ∈ {Maj, Min}; the majority-class sample set in S is denoted Smaj and, likewise, the minority-class set is denoted Smin. The procedure of the Borderline algorithm is then as follows:
(1) For each minority-class sample in Smin, use the k-nearest-neighbor method to find its K nearest samples in the whole data set S, and store these samples in the set KNNsmin corresponding to that Smin sample.
(2) By the three formulas below, each sample in Smin is classified as a boundary sample, a noise sample, or a safe sample:
|{(x,y) | (x,y) ∈ KNNsmin ∧ y = Maj}| > K/2    (1)
|{(x,y) | (x,y) ∈ KNNsmin ∧ y = Maj}| = K    (2)
|{(x,y) | (x,y) ∈ KNNsmin ∧ y = Maj}| = 0    (3)
A minority-class data sample point satisfying formula (1) is a boundary sample, and the border minority-class sample is inserted into the minority-class boundary sample set SBmin; a minority-class data sample point satisfying formula (2) is a noise sample; and a minority-class data sample point satisfying formula (3) is a safe sample.
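The categorization by formulas (1)-(3) can be sketched as follows (a brute-force k-NN search; the function name and data layout are illustrative; noise is checked before the boundary condition since m = K also satisfies m > K/2):

```python
import math

def borderline_categories(S, K):
    """Classify each minority sample by how many of its K nearest
    neighbours in the whole set S belong to the majority class.
    S is a list of (x, y) pairs, x a coordinate tuple, y 'Maj' or 'Min'."""
    cats = {}
    for x, y in S:
        if y != 'Min':
            continue
        knn = sorted((q for q in S if q[0] != x),
                     key=lambda q: math.dist(x, q[0]))[:K]
        m = sum(1 for _, yy in knn if yy == 'Maj')
        if m == K:
            cats[x] = 'noise'        # formula (2): all neighbours majority
        elif m > K / 2:
            cats[x] = 'boundary'     # formula (1): majority dominates
        elif m == 0:
            cats[x] = 'safe'         # formula (3): no majority neighbours
        else:
            cats[x] = 'other'        # between 0 and K/2 majority neighbours
    return cats
```

A minority point surrounded only by majority samples comes out as noise, one with a mixed but majority-dominated neighborhood as boundary, and one deep inside a minority region as safe.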
3. The method according to claim 1, characterized in that: the method further comprises making the sampling rate of the minority-class boundary sample space greater than the sampling rate of the non-boundary samples; for the boundary samples in the majority-class sample space, clustering is applied, and the cluster centroid then replaces the current majority-class boundary sample cluster, removing the other original samples in the cluster.
4. The method according to claim 1, characterized in that: the PSO algorithm mainly optimizes continuous values, whereas the feature vector here is discrete; in order to enable the PSO algorithm to handle discrete feature vectors, the sigmoid function is used to convert the generated continuous values into the discrete values 0 and 1.
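The sigmoid discretization can be sketched in the style of binary PSO: each continuous velocity is mapped through the sigmoid and a 0/1 bit is sampled with that probability (the function names and the injectable `rng` parameter are illustrative, added here to make the sketch testable):

```python
import math
import random

def sigmoid(v):
    """Squash a continuous value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def binarize_velocity(velocities, rng=random.random):
    """Convert a vector of continuous PSO velocities into 0/1 bits:
    bit = 1 with probability sigmoid(v), else 0."""
    return [1 if rng() < sigmoid(v) else 0 for v in velocities]
```

A strongly positive velocity almost surely yields 1, a strongly negative one almost surely yields 0, so the particle's position becomes a discrete feature-selection mask.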
5. The method according to claim 1, characterized in that: after the PSO algorithm has determined the boundary-sample sampling rate and safe-sample sampling rate of each cluster, together with the features that improve classification performance, the most representative data features and data sample set are selected by feature extraction; the minority-class samples are then oversampled at the oversampling rates obtained by the optimization, yielding the final balanced data set.
CN201610169101.0A 2016-03-23 2016-03-23 Classification method based on improved DBSCAN-SMOTE algorithm Pending CN105930856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610169101.0A CN105930856A (en) 2016-03-23 2016-03-23 Classification method based on improved DBSCAN-SMOTE algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610169101.0A CN105930856A (en) 2016-03-23 2016-03-23 Classification method based on improved DBSCAN-SMOTE algorithm

Publications (1)

Publication Number Publication Date
CN105930856A true CN105930856A (en) 2016-09-07

Family

ID=56840504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610169101.0A Pending CN105930856A (en) 2016-03-23 2016-03-23 Classification method based on improved DBSCAN-SMOTE algorithm

Country Status (1)

Country Link
CN (1) CN105930856A (en)

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548196A (en) * 2016-10-20 2017-03-29 中国科学院深圳先进技术研究院 A kind of random forest sampling approach and device for non-equilibrium data
CN106599935A (en) * 2016-12-29 2017-04-26 重庆邮电大学 Three-decision unbalanced data oversampling method based on Spark big data platform
CN107333294A (en) * 2017-07-31 2017-11-07 南昌航空大学 A kind of combination AdaBoost and SVMs link quality prediction method
CN107463766A (en) * 2017-06-23 2017-12-12 深圳市中识创新科技有限公司 Generation method, device and the computer-readable recording medium of blood glucose prediction model
CN107480721A (en) * 2017-08-21 2017-12-15 上海中信信息发展股份有限公司 A kind of ox only ill data analysing method and device
CN107563435A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Higher-dimension unbalanced data sorting technique based on SVM
CN108921604A (en) * 2018-06-22 2018-11-30 华南理工大学 A kind of ad click rate prediction technique integrated based on Cost-Sensitive Classifiers
CN109272040A (en) * 2018-09-20 2019-01-25 中国科学院电子学研究所苏州研究院 A kind of radar operation mode generation method
CN109871862A (en) * 2018-12-28 2019-06-11 北京航天测控技术有限公司 A kind of failure prediction method based on synthesis minority class over-sampling and deep learning
CN110222782A (en) * 2019-06-13 2019-09-10 齐鲁工业大学 There are supervision two-category data analysis method and system based on Density Clustering
CN110276401A (en) * 2019-06-24 2019-09-24 广州视源电子科技股份有限公司 Sample clustering method, apparatus, equipment and storage medium
CN110443281A (en) * 2019-07-05 2019-11-12 重庆信科设计有限公司 Adaptive oversampler method based on HDBSCAN cluster
CN110852388A (en) * 2019-11-13 2020-02-28 吉林大学 Improved SMOTE algorithm based on K-means
CN111062425A (en) * 2019-12-10 2020-04-24 中国人民解放军海军工程大学 Unbalanced data set processing method based on C-K-SMOTE algorithm
CN111639716A (en) * 2020-06-04 2020-09-08 云南电网有限责任公司电力科学研究院 Data sample selection method and device based on density deviation sampling
US10826781B2 (en) 2017-08-01 2020-11-03 Elsevier, Inc. Systems and methods for extracting structure from large, dense, and noisy networks
CN112000705A (en) * 2020-03-30 2020-11-27 华南理工大学 Active drift detection-based unbalanced data stream mining method
CN112235293A (en) * 2020-10-14 2021-01-15 西北工业大学 Over-sampling method for balanced generation of positive and negative samples for malicious flow detection
CN112365060A (en) * 2020-11-13 2021-02-12 广东电力信息科技有限公司 Preprocessing method for power grid internet of things perception data
CN112634022A (en) * 2020-12-25 2021-04-09 北京工业大学 Credit risk assessment method and system based on unbalanced data processing
CN112861928A (en) * 2021-01-19 2021-05-28 苏州大学 Data generation method and system for unbalanced voice data set
CN112951413A (en) * 2021-03-22 2021-06-11 江苏大学 Asthma diagnosis system based on decision tree and improved SMOTE algorithm
CN113052198A (en) * 2019-12-28 2021-06-29 中移信息技术有限公司 Data processing method, device, equipment and storage medium
CN113159137A (en) * 2021-04-01 2021-07-23 北京市燃气集团有限责任公司 Gas load clustering method and device
CN113221433A (en) * 2021-06-04 2021-08-06 北京航空航天大学 High-speed impact fragment reconstruction method based on density information
CN113568739A (en) * 2021-07-12 2021-10-29 北京淇瑀信息科技有限公司 User resource limit distribution method and device and electronic equipment
CN114579631A (en) * 2022-01-26 2022-06-03 苏州大学 Community correction rate prediction system and method based on probability weighted oversampling
CN114595742A (en) * 2022-01-18 2022-06-07 国网浙江省电力有限公司电力科学研究院 Fuel cell fault data sampling method and system
CN115374859A (en) * 2022-08-24 2022-11-22 东北大学 Method for classifying unbalanced and multi-class complex industrial data

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106548196A (en) * 2016-10-20 2017-03-29 中国科学院深圳先进技术研究院 A kind of random forest sampling approach and device for non-equilibrium data
CN106599935B (en) * 2016-12-29 2019-07-19 重庆邮电大学 Three decision unbalanced data oversampler methods based on Spark big data platform
CN106599935A (en) * 2016-12-29 2017-04-26 重庆邮电大学 Three-decision unbalanced data oversampling method based on Spark big data platform
CN107463766A (en) * 2017-06-23 2017-12-12 深圳市中识创新科技有限公司 Generation method, device and the computer-readable recording medium of blood glucose prediction model
CN107333294A (en) * 2017-07-31 2017-11-07 南昌航空大学 A kind of combination AdaBoost and SVMs link quality prediction method
US10826781B2 (en) 2017-08-01 2020-11-03 Elsevier, Inc. Systems and methods for extracting structure from large, dense, and noisy networks
CN107480721A (en) * 2017-08-21 2017-12-15 上海中信信息发展股份有限公司 A kind of ox only ill data analysing method and device
CN107563435A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Higher-dimension unbalanced data sorting technique based on SVM
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm
CN108921604A (en) * 2018-06-22 2018-11-30 华南理工大学 A kind of ad click rate prediction technique integrated based on Cost-Sensitive Classifiers
CN108921604B (en) * 2018-06-22 2022-03-29 华南理工大学 Advertisement click rate prediction method based on cost-sensitive classifier integration
CN109272040A (en) * 2018-09-20 2019-01-25 中国科学院电子学研究所苏州研究院 A kind of radar operation mode generation method
CN109272040B (en) * 2018-09-20 2020-08-14 中国科学院电子学研究所苏州研究院 Radar working mode generation method
CN109871862A (en) * 2018-12-28 2019-06-11 北京航天测控技术有限公司 A kind of failure prediction method based on synthesis minority class over-sampling and deep learning
CN110222782A (en) * 2019-06-13 2019-09-10 齐鲁工业大学 There are supervision two-category data analysis method and system based on Density Clustering
CN110276401A (en) * 2019-06-24 2019-09-24 广州视源电子科技股份有限公司 Sample clustering method, apparatus, equipment and storage medium
WO2020258772A1 (en) * 2019-06-24 2020-12-30 广州视源电子科技股份有限公司 Sample clustering method, apparatus and device and storage medium
CN110443281B (en) * 2019-07-05 2023-09-26 重庆信科设计有限公司 Text classification self-adaptive oversampling method based on HDBSCAN (high-density binary-coded decimal) clustering
CN110443281A (en) * 2019-07-05 2019-11-12 重庆信科设计有限公司 Adaptive oversampler method based on HDBSCAN cluster
CN110852388A (en) * 2019-11-13 2020-02-28 吉林大学 Improved SMOTE algorithm based on K-means
CN111062425A (en) * 2019-12-10 2020-04-24 中国人民解放军海军工程大学 Unbalanced data set processing method based on C-K-SMOTE algorithm
CN111062425B (en) * 2019-12-10 2022-10-28 中国人民解放军海军工程大学 Unbalanced data set processing method based on C-K-SMOTE algorithm
CN113052198A (en) * 2019-12-28 2021-06-29 中移信息技术有限公司 Data processing method, device, equipment and storage medium
CN112000705A (en) * 2020-03-30 2020-11-27 华南理工大学 Active drift detection-based unbalanced data stream mining method
CN112000705B (en) * 2020-03-30 2024-04-02 华南理工大学 Unbalanced data stream mining method based on active drift detection
CN111639716A (en) * 2020-06-04 2020-09-08 云南电网有限责任公司电力科学研究院 Data sample selection method and device based on density deviation sampling
CN111639716B (en) * 2020-06-04 2023-07-18 云南电网有限责任公司电力科学研究院 Data sample selection method and device based on density deviation sampling
CN112235293B (en) * 2020-10-14 2022-09-09 西北工业大学 Over-sampling method for balanced generation of positive and negative samples in malicious flow detection
CN112235293A (en) * 2020-10-14 2021-01-15 西北工业大学 Over-sampling method for balanced generation of positive and negative samples for malicious flow detection
CN112365060B (en) * 2020-11-13 2024-01-26 广东电力信息科技有限公司 Preprocessing method for network Internet of things sensing data
CN112365060A (en) * 2020-11-13 2021-02-12 广东电力信息科技有限公司 Preprocessing method for power grid internet of things perception data
CN112634022A (en) * 2020-12-25 2021-04-09 北京工业大学 Credit risk assessment method and system based on unbalanced data processing
CN112861928A (en) * 2021-01-19 2021-05-28 苏州大学 Data generation method and system for unbalanced voice data set
CN112951413B (en) * 2021-03-22 2023-07-21 江苏大学 Asthma diagnosis system based on decision tree and improved SMOTE algorithm
CN112951413A (en) * 2021-03-22 2021-06-11 江苏大学 Asthma diagnosis system based on decision tree and improved SMOTE algorithm
WO2022198761A1 (en) * 2021-03-22 2022-09-29 江苏大学 Asthma diagnosis system based on decision tree and improved smote algorithms
CN113159137A (en) * 2021-04-01 2021-07-23 北京市燃气集团有限责任公司 Gas load clustering method and device
CN113221433A (en) * 2021-06-04 2021-08-06 北京航空航天大学 High-speed impact fragment reconstruction method based on density information
CN113568739A (en) * 2021-07-12 2021-10-29 北京淇瑀信息科技有限公司 User resource limit distribution method and device and electronic equipment
CN114595742B (en) * 2022-01-18 2023-09-08 国网浙江省电力有限公司电力科学研究院 Fuel cell fault data sampling method and system
CN114595742A (en) * 2022-01-18 2022-06-07 国网浙江省电力有限公司电力科学研究院 Fuel cell fault data sampling method and system
CN114579631B (en) * 2022-01-26 2023-04-07 苏州大学 Community correction rate prediction system and method based on probability weighted oversampling
CN114579631A (en) * 2022-01-26 2022-06-03 苏州大学 Community correction rate prediction system and method based on probability weighted oversampling
CN115374859A (en) * 2022-08-24 2022-11-22 东北大学 Method for classifying unbalanced and multi-class complex industrial data

Similar Documents

Publication Publication Date Title
CN105930856A (en) Classification method based on improved DBSCAN-SMOTE algorithm
Song et al. Feature selection using bare-bones particle swarm optimization with mutual information
Pashaei et al. Binary black hole algorithm for feature selection and classification on biological data
Kang et al. A weight-incorporated similarity-based clustering ensemble method based on swarm intelligence
Aliniya et al. A novel combinatorial merge-split approach for automatic clustering using imperialist competitive algorithm
Okori et al. Machine learning classification technique for famine prediction
Anand et al. Predicting protein structural class by SVM with class-wise optimized features and decision probabilities
CN107291895B (en) Quick hierarchical document query method
Bayati et al. MSSL: A memetic-based sparse subspace learning algorithm for multi-label classification
El Moutaouakil et al. Optimal entropy genetic fuzzy-C-means SMOTE (OEGFCM-SMOTE)
Pervez et al. Literature review of feature selection for mining tasks
Li et al. Text classification based on ensemble extreme learning machine
CN111079074A (en) Method for constructing prediction model based on improved sine and cosine algorithm
CN113541834A (en) Abnormal signal semi-supervised classification method and system and data processing terminal
Huang et al. Identification of autistic risk candidate genes and toxic chemicals via multilabel learning
Zhou et al. Region purity-based local feature selection: A multi-objective perspective
Qiu et al. A multi-level knee point based multi-objective evolutionary algorithm for AUC maximization
Chen et al. Multi-granularity regularized re-balancing for class incremental learning
Han et al. An improved feature selection method based on angle-guided multi-objective PSO and feature-label mutual information
Li et al. Imbalanced data classification based on improved EIWAPSO-AdaBoost-C ensemble algorithm
Zhang et al. Discovering similar Chinese characters in online handwriting with deep convolutional neural networks
Tiwari et al. Entropy weighting genetic k-means algorithm for subspace clustering
Liu et al. A weight-incorporated similarity-based clustering ensemble method
Zheng et al. Adaptive Particle Swarm Optimization Algorithm Ensemble Model Applied to Classification of Unbalanced Data
CN114841241A (en) Unbalanced data classification method based on clustering and distance weighting

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160907