CN104679860B

CN104679860B - A classification method for imbalanced data

Info

Publication number: CN104679860B
Application number: CN201510089729.5A
Authority: CN
Inventors: 王理; 邓卫国; 钱中; 王祎旸; 许波; 雷超; 游越
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2015-02-27
Filing date: 2015-02-27
Publication date: 2017-11-07
Anticipated expiration: 2035-02-27
Also published as: CN104679860A

Abstract

The invention discloses a classification method for imbalanced data, comprising: learning from a training sample set of the imbalanced data to obtain a first classification decision function and a second classification decision function; obtaining a first membership degree and a second membership degree from the first and second classification decision functions, respectively; obtaining a classification decision function from the first and second membership degrees; determining the samples of a second overlap-region sample set within the test sample set of the imbalanced data; and classifying the samples of the second overlap-region sample set with the classification decision function.

Description

A classification method for imbalanced data

Technical field

The invention belongs to the technical field of data classification, and in particular relates to a classification method for imbalanced data.

Background technology

Today's society is in an era of information explosion. Faced with a vast sea of data, extracting useful information and knowledge from massive data sets has become a major challenge. This has given rise to data-driven statistical machine learning, now the principal method of knowledge acquisition: a suitable learning algorithm is designed from specific historical data to obtain a mathematical or statistical model that reflects the regularities in the data, and the model is then used to make predictions on future data. Because of its importance to knowledge acquisition, statistical machine learning has become a core problem in intelligent analysis and intelligent decision-making research, and it is widely applied in industry and commerce.

The most common machine learning problem is supervised classification, for example biometric recognition, text classification, web mining, speech recognition and network intrusion detection. Over the past few decades, researchers have studied classification learning thoroughly and proposed many highly effective algorithms that remain in wide use, including K-nearest neighbours, decision trees, neural networks, ensemble learning and the support vector machine (Support Vector Machine, SVM). Of these, the SVM has attracted the most attention. It is a learning machine built on statistical learning theory and the structural risk minimisation principle. Compared with traditional learning algorithms such as neural networks, the SVM has a solid theoretical foundation, and its training reduces to a convex quadratic optimisation problem, so a globally optimal solution can be obtained; this avoids the tendency of neural networks to become trapped in local optima, and good generalisation is still achieved when the sample size is small. Owing to these advantages, the SVM is currently one of the most widely studied and applied learning algorithms in both academia and industry.

However, as applications expand and practice deepens, new challenges continue to emerge, and classification learning on imbalanced data is one of the obstacles that the machine learning field urgently needs to overcome. Specifically, the imbalanced data classification problem refers to the situation in which one class has far fewer samples than the other classes, as in anomaly analysis, intrusion detection, fraud detection, video surveillance, fault diagnosis and medical diagnosis. When traditional machine learning classifiers handle such problems, their decisions are biased toward the majority class, so recognition of minority-class samples degrades severely; yet in many applications the classification accuracy on the minority class is what matters most. How to prevent the classifier from leaving a larger decision space to the majority class has therefore become one of the key problems in research on imbalanced data classification algorithms.

Researchers in machine learning have done a great deal of work on the imbalanced data classification problem and have proposed many different solutions, which fall broadly into two types. The first works at the data level, changing the sample distribution of the training set to reduce the degree of imbalance. The second works at the algorithm level, modifying an algorithm to overcome its own limitations on imbalanced data so that it adapts to the imbalanced classification problem.

Even for a classifier with learning ability as strong as the SVM, imbalanced data causes a sharp decline in learning performance. Given the effectiveness and wide use of SVM methods, many researchers have studied SVMs specifically for learning from imbalanced data and have proposed improved algorithms with some success; on the whole, however, the classification accuracy of existing methods on imbalanced data remains low.

Summary of the invention

To overcome the above defects, the invention provides a classification method for imbalanced data.

According to one aspect of the invention, a classification method for imbalanced data is proposed, the method comprising the following steps:

learning from the training sample set of the imbalanced data to obtain a first classification decision function and a second classification decision function;

obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively;

obtaining a classification decision function from the first membership degree and the second membership degree;

determining the samples of a second overlap-region sample set within the test sample set of the imbalanced data;

classifying the samples of the second overlap-region sample set with the classification decision function.

In the above scheme, obtaining the first membership degree and the second membership degree from the first and second classification decision functions includes:

using the first and second classification decision functions respectively to judge the samples in the first-class training sample set and the second-class training sample set; forming a first overlap-region sample set from the samples that belong to both the first-class and the second-class training sample sets; and calculating, for each sample in the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set.

In the above scheme, judging the samples in the first-class and second-class training sample sets with the first and second classification decision functions, and forming the first overlap-region sample set from the samples that belong to both sets, includes:

using the logical relation between the first and second classification decision functions to judge each sample in the first-class and second-class training sample sets as a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both the first-class and the second-class training sample sets; the samples belonging to both sets form the first overlap-region sample set.

In the above scheme, the first membership degree is calculated as follows:

where:

the first membership degree represents the probability that sample x_i in the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set; the first distance ratio is the ratio of the distance from sample x_i in the first overlap-region sample set to the centre of the minimum hypersphere of the first-class training sample set, to that hypersphere's radius; the second distance ratio is the ratio of the distance from sample x_i in the first overlap-region sample set to the centre of the minimum hypersphere of the second-class training sample set, to that hypersphere's radius.

In the above scheme, the second membership degree is calculated as follows:

where:

the second membership degree represents the probability that sample x_i in the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.

In the above scheme, obtaining the classification decision function from the first and second membership degrees includes:

building the sample set for the dual-membership support vector machine;

determining the dual-membership fuzzy support vector machine from the sample set for the dual-membership support vector machine;

obtaining the classification decision function from the dual-membership fuzzy support vector machine.

In the above scheme, the dual-membership fuzzy support vector machine is calculated as follows:

where the quantities in the model are:

w, the weight vector of the separating hyperplane; C, the noise penalty parameter; the first membership degree;

ξ_i, the first non-negative slack variable; the second membership degree; η_i, the second non-negative slack variable;

b, the threshold of the separating hyperplane; and the nonlinear mapping function.

In the above scheme, the classification decision function is calculated as follows:

where:

f(x) is the classification decision function; sign(·) is the sign function; α_i is the first Lagrange multiplier of a sample; β_i is the second Lagrange multiplier of a sample; K(x, x_i) is a kernel function satisfying the Mercer condition.

The invention obtains, from the training sample set of the imbalanced data, a classification decision function that characterises the classification features of the imbalanced data. Classifying the imbalanced data with this function allows the data to be classified accurately according to the features of the data themselves.

Brief description of the drawings

Fig. 1 is a flow chart of the classification method for imbalanced data of Embodiment 1;

Fig. 2 is a schematic diagram of the classification performance of the three classification models of Embodiment 2 on the Pima-indians data set;

Fig. 3 is a schematic diagram of the classification performance of the three classification models of Embodiment 2 on the Breast-w data set;

Fig. 4 is a schematic diagram of the classification performance of the three classification models of Embodiment 2 on the Ionosphere data set.

To show the structure of the embodiments clearly, certain dimensions, structures and devices are marked in the figures. These are for illustration only and are not intended to limit the invention to the specific dimensions, structures, devices and environments shown. A person of ordinary skill in the art may adjust or modify these devices and environments as needed, and such adjustments or modifications remain within the scope of the appended claims.

Embodiment

The classification method for imbalanced data provided by the invention is described in detail below with reference to the accompanying drawings and specific embodiments.

The following description covers several different aspects of the invention. A person of ordinary skill in the art can nevertheless implement the invention using only some, or all, of its structures or processes. Specific numbers, configurations and orders are set out for clarity of explanation, but the invention can obviously also be implemented without these specific details. In other cases, well-known features are not described in detail so as not to obscure the invention.

Embodiment 1

To address the low classification accuracy of existing methods on imbalanced data, this embodiment provides a classification method for imbalanced data. As shown in Fig. 1, the method of this embodiment comprises the following steps:

Step S101: learn from the training sample set of the imbalanced data to obtain the first classification decision function and the second classification decision function;

To classify imbalanced data accurately, a portion of the data is first extracted from the imbalanced data to form a training sample set, which should reflect the class proportions of the imbalanced data as a whole. The samples in the training sample set are divided, by their share of the training set, into a first-class training sample set and a second-class training sample set: the first-class training sample set contains the samples that make up the larger proportion of the training set, and the second-class training sample set contains the remaining samples. Because the first and second classification decision functions are obtained from the first-class and second-class training sample sets, they characterise the features of those two sets well, which lays the foundation for the subsequent classification of the imbalanced data.
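The division of the training set described above can be sketched in a few lines. This is a minimal illustration only: the function name `split_by_class_size` and the label values are hypothetical, and the source does not prescribe any particular implementation.

```python
from collections import Counter


def split_by_class_size(samples, labels):
    """Split a two-class training set into (first_class, second_class).

    first_class holds the samples of the larger class and second_class
    the samples of the smaller class, following Step S101.
    """
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    # most_common sorts by descending count, so the larger class comes first.
    (major_label, _), (minor_label, _) = counts.most_common(2)
    first_class = [x for x, y in zip(samples, labels) if y == major_label]
    second_class = [x for x, y in zip(samples, labels) if y == minor_label]
    return first_class, second_class


first, second = split_by_class_size([1, 2, 3, 4, 5], ["a", "a", "b", "a", "a"])
print(first, second)
```

Each of the two resulting sets would then be passed to its own single-class learner to obtain the first and second classification decision functions.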

Step S102: obtain the first membership degree and the second membership degree from the first classification decision function and the second classification decision function, respectively;

The first classification decision function divides the samples of the first-class training sample set into three classes: first, sample points inside the minimum hypersphere of the first-class training sample set; second, sample points on the boundary of that hypersphere; third, sample points outside that hypersphere. Similarly, the second classification decision function divides the samples of the second-class training sample set into the same three classes. Because the first-class and second-class training sample sets together make up the whole training set, the first and second classification decision functions can be used to identify the samples that belong to both the first-class and the second-class training sample sets; these samples form the first overlap-region sample set. The probabilities that each sample in the first overlap-region sample set belongs to the first-class and to the second-class training sample set are then calculated, giving the first and second membership degrees. That is, the samples of the first overlap-region sample set have the attributes of both training sample sets at once, and they are the samples most likely to be misclassified.

Step S103: obtain the classification decision function from the first membership degree and the second membership degree;

Once the first and second membership degrees of the samples in the first overlap-region sample set have been obtained, the sample set for the dual-membership support vector machine and the dual-membership fuzzy support vector machine can be built from them. Solving the dual-membership fuzzy support vector machine then yields the classification decision function used to classify the imbalanced data; this function classifies a sample according to its membership degrees in the first-class and second-class training sample sets.

Step S104: determine the samples of the second overlap-region sample set within the test sample set of the imbalanced data;

The test sample set of the imbalanced data is processed in the same way that the training sample set was processed to obtain the first overlap-region sample set, which gives the samples of the second overlap-region sample set.

Step S105: classify the samples of the second overlap-region sample set with the classification decision function.

The classification decision function is able to classify the imbalanced data accurately; applying it directly to the samples of the second overlap-region sample set classifies those samples.

This embodiment obtains, from the training sample set of the imbalanced data, a classification decision function that characterises the classification features of the imbalanced data. Classifying the imbalanced data with this function allows the data to be classified accurately according to the features of the data themselves.

Specifically, step S102 includes:

judging the samples in the first-class and second-class training sample sets with the first and second classification decision functions, and forming the first overlap-region sample set from the samples that belong to both sets, which includes:

using the logical relation between the first and second classification decision functions to judge each sample in the first-class and second-class training sample sets as one of four types (a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both sets), and forming the first overlap-region sample set from the samples that belong to both sets. Specifically:

If f⁺(x_i) < 0 and f⁻(x_i) < 0, sample x_i is a noise point; here f⁺(x_i) is the first classification decision function, f⁻(x_i) is the second classification decision function, and x_i is a sample in the first-class or second-class training sample set, i = 0, 1, …;

If f⁺(x_i) ≥ 0 and f⁻(x_i) < 0, sample x_i is a sample of the first-class training sample set;

If f⁺(x_i) < 0 and f⁻(x_i) ≥ 0, sample x_i is a sample of the second-class training sample set;

If f⁺(x_i) ≥ 0 and f⁻(x_i) ≥ 0, sample x_i is a sample of the first overlap-region sample set, which is the sample set that the invention sets out to classify.
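The four cases can be written as a single function of the two decision values. The sketch below assumes that a value of exactly 0 (a point on a hypersphere boundary) counts as inside that hypersphere, so that the four cases cover every sample exactly once; the function name and the returned strings are hypothetical.

```python
def assign_region(f_pos, f_neg):
    """Map the two decision values f+(x_i) and f-(x_i) to one of four regions.

    Assumption: a decision value of exactly 0 is treated as inside the
    corresponding hypersphere.
    """
    if f_pos < 0 and f_neg < 0:
        return "noise"          # outside both hyperspheres
    if f_pos >= 0 and f_neg < 0:
        return "first_class"    # inside the first-class hypersphere only
    if f_pos < 0 and f_neg >= 0:
        return "second_class"   # inside the second-class hypersphere only
    return "overlap"            # inside both hyperspheres


for values in [(-0.3, -0.1), (0.2, -0.4), (-0.2, 0.5), (0.1, 0.0)]:
    print(values, assign_region(*values))
```

Only samples labelled `"overlap"` go on to the membership calculation and the dual-membership fuzzy support vector machine.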

After the first overlap-region sample set is obtained, a membership degree must be computed for each of its samples. Membership can be computed in several ways; here a distance-based dual membership is used. Specifically, the first membership degree is calculated as follows:

where:

the first membership degree represents the probability that sample x_i in the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set;

the first distance ratio is the ratio of the distance from sample x_i in the first overlap-region sample set to the centre of the minimum hypersphere of the first-class training sample set, to that hypersphere's radius, i.e. ||Φ⁺(x_i) − a⁺|| / R⁺; here Φ⁺(x_i) is the image of sample x_i under the nonlinear mapping associated with the first-class training sample set, a⁺ is the centre of the minimum hypersphere of the first-class training sample set, and R⁺ is the radius of that hypersphere;

the second distance ratio is the ratio of the distance from sample x_i in the first overlap-region sample set to the centre of the minimum hypersphere of the second-class training sample set, to that hypersphere's radius, i.e. ||Φ⁻(x_i) − a⁻|| / R⁻; here Φ⁻(x_i) is the image of sample x_i under the nonlinear mapping associated with the second-class training sample set, a⁻ is the centre of the minimum hypersphere of the second-class training sample set, and R⁻ is the radius of that hypersphere.

The second membership degree is calculated as follows:

where:

In step S103, obtaining the classification decision function from the first and second membership degrees includes:

S1031: build the sample set for the dual-membership support vector machine;

The sample set for the dual-membership support vector machine must take into account both the first membership degree, of belonging to the first-class training sample set, and the second membership degree, of belonging to the second-class training sample set; the first and second membership degrees sum to 1.

S1032: determine the dual-membership fuzzy support vector machine from the sample set for the dual-membership support vector machine. The dual-membership fuzzy support vector machine is calculated as follows:

where the quantities in the model are:

w, the weight vector of the separating hyperplane;

C, the noise penalty parameter;

the first membership degree;

ξ_i, the first non-negative slack variable;

the second membership degree;

η_i, the second non-negative slack variable; ξ_i and η_i reflect the error tolerance of each sample point;

b, the threshold (intercept) of the separating hyperplane;

and the nonlinear mapping function.
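The model equation itself is not reproduced in this text. As a hedged reconstruction only, a fuzzy support vector machine in which every sample carries two memberships is commonly written in the bilateral-weighted form below, which uses each quantity in the list above. The symbols $m_i^{A}$ and $m_i^{B}$ for the first and second membership degrees, $\Phi$ for the nonlinear mapping and $l$ for the number of samples are assumptions made for illustration; the patent's own notation may differ.

$$\min_{w,\,b,\,\xi,\,\eta}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{l}\left(m_i^{A}\,\xi_i + m_i^{B}\,\eta_i\right)$$

$$\text{s.t.}\quad w^{\top}\Phi(x_i) + b \ \ge\ 1 - \xi_i,\qquad w^{\top}\Phi(x_i) + b \ \le\ -1 + \eta_i,\qquad \xi_i \ge 0,\ \eta_i \ge 0,\quad i = 1,\dots,l$$

In this form each sample is treated as a member of both classes at once: violating the class-A margin costs $m_i^{A}\xi_i$ and violating the class-B margin costs $m_i^{B}\eta_i$, so a sample that lies close to one class centre contributes mainly to that class's constraint.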

S1033: obtain the classification decision function from the dual-membership fuzzy support vector machine. The classification decision function is calculated as follows:

where:

f(x) is the classification decision function;

sign(·) is the sign function;

α_i is the first Lagrange multiplier of a sample;

β_i is the second Lagrange multiplier of a sample;

K(x, x_i) is a kernel function satisfying the Mercer condition.
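The decision function formula is not reproduced in this text. The sketch below assumes the form usually obtained for a two-multiplier fuzzy SVM, f(x) = sign(Σ_i (α_i − β_i) K(x, x_i) + b), which uses exactly the quantities listed above; that form, the Gaussian kernel, and the names `rbf_kernel` and `dfsvm_decide` are assumptions, and the multiplier values in the usage example are made up rather than obtained by training.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel, one common kernel satisfying the Mercer condition."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def dfsvm_decide(x, train_x, alphas, betas, b, kernel=rbf_kernel):
    """Evaluate the ASSUMED form sign(sum_i (alpha_i - beta_i) K(x, x_i) + b).

    Returns +1 (class A) or -1 (class B); a score of exactly 0 maps to +1.
    """
    score = b + sum(
        (alpha - beta) * kernel(x, xi)
        for xi, alpha, beta in zip(train_x, alphas, betas)
    )
    return 1 if score >= 0 else -1


# Illustrative values only: one sample pulling toward A, one toward B.
train_x = [(0.0, 0.0), (2.0, 2.0)]
alphas, betas, b = [1.0, 0.0], [0.0, 1.0], 0.0
print(dfsvm_decide((0.1, 0.0), train_x, alphas, betas, b))
print(dfsvm_decide((1.9, 2.0), train_x, alphas, betas, b))
```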

After the classification decision function is obtained, the samples of the second overlap-region sample set of the test sample set are obtained in the same way that the first overlap-region sample set was obtained from the training sample set. Applying the classification decision function to the samples of the second overlap-region sample set classifies the data of the test sample set of the imbalanced data.

Embodiment 2

This embodiment describes the invention in detail through a practical scenario.

The basic steps of this embodiment are:

(1) Using Support Vector Data Description (Support Vector Data Domain Description, SVDD), perform single-class learning separately on the samples of the two training sets (the first-class training sample set, which accounts for the larger proportion, and the second-class training sample set, which accounts for the remainder) to obtain the first classification decision function f⁺(x) and the second classification decision function f⁻(x), and thereby identify noise points, positive-class samples (samples of the first-class training sample set), negative-class samples (samples of the second-class training sample set) and the samples of the first overlap-region sample set;

(2) Based on f⁺(x), f⁻(x) and the minimum hyperspheres of the two classes, calculate the dual membership of each sample in the first overlap-region sample set;

(3) Train the dual-membership fuzzy support vector machine model on the samples of the first overlap-region sample set to obtain the classification decision function f(x) for overlap-region samples;

(4) For the test-set samples, first use f⁺(x) and f⁻(x) to identify each as a noise point, a positive-class sample, a negative-class sample or an overlap-region sample;

(5) For the overlap-region samples of the test set, calculate the dual membership, then discriminate them with the decision function f(x) of the dual-membership fuzzy support vector machine model.

The decision functions in step (1) are constructed as follows:

SVDD performs single-class learning: it finds a hypersphere in a high-dimensional space that encloses as much as possible of the image of the data in that space, and thereby obtains the boundary of the data. Given a set X = {x_i | i = 1, 2, ..., n} of n data objects, SVDD maps the input space to a high-dimensional space through a nonlinear mapping function Φ(·) and seeks a hypersphere of radius R and centre a that covers as many of the x_i as possible. SVDD sets up the following optimisation problem:

$$\min\ R^{2}$$

$$\text{s.t.}\quad \lVert \Phi(x_i) - a \rVert^{2} \le R^{2},\qquad i = 1, 2, \dots, n$$

Introducing a slack-variable vector ξ = (ξ₁, ξ₂, ..., ξ_n) into the above lets the hypersphere exclude some samples as noise, and the optimisation problem becomes:

$$\min\ q(R, \xi) = R^{2} + C\sum_{i=1}^{n}\xi_i$$

$$\text{s.t.}\quad \lVert \Phi(x_i) - a \rVert^{2} \le R^{2} + \xi_i,\qquad \xi_i \ge 0,\qquad i = 1, 2, \dots, n$$

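where q(R, ξ) is the objective function of the optimisation problem and C is the noise penalty parameter. Introducing the Lagrangian, with non-negative multipliers α_i and β_i, gives:

$$L(R, a, \xi, \alpha, \beta) = R^{2} + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(R^{2} + \xi_i - \lVert \Phi(x_i) - a \rVert^{2}\right) - \sum_{i=1}^{n}\beta_i\,\xi_i$$

Letting C = 1/(nv), the above can be transformed to:

$$L(R, a, \xi, \alpha, \beta) = R^{2} + \frac{1}{nv}\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(R^{2} + \xi_i - \lVert \Phi(x_i) - a \rVert^{2}\right) - \sum_{i=1}^{n}\beta_i\,\xi_i$$

where v is the rejection rate for target-class samples, 0 ≤ v ≤ 1. When v = 0, nv is the lower bound on the number of support vectors; when v = 1, nv is the upper bound on the number of exterior points (that is, the number of data points). Taking the partial derivatives of L with respect to R, a and ξ and setting them to 0 gives:

$$\sum_{i=1}^{n}\alpha_i = 1,\qquad a = \sum_{i=1}^{n}\alpha_i\,\Phi(x_i),\qquad \frac{1}{nv} - \alpha_i - \beta_i = 0$$

Replacing the inner product Φ(x_i)·Φ(x_j) with a Mercer kernel function K(x_i, x_j), the Wolfe dual of the original optimisation problem is:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i K(x_i, x_i) - \sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j K(x_i, x_j)$$

$$\text{s.t.}\quad \sum_{i=1}^{n}\alpha_i = 1,\qquad 0 \le \alpha_i \le \frac{1}{nv},\qquad i = 1, 2, \dots, n$$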

According to the Karush-Kuhn-Tucker (KKT) optimality conditions, the sample data can be divided into three classes:

The first class is interior points, the sample points located inside the hypersphere, for which ||Φ(x_i) − a||² < R², i.e. α_i = 0, β_i = C;

The second class is support vectors, the sample points located on the hypersphere boundary, for which ||Φ(x_i) − a||² = R², i.e. 0 < α_i < C, β_i > 0;

The third class is exterior points, the sample points located outside the hypersphere, for which ||Φ(x_i) − a||² > R², i.e. α_i = C, β_i = 0.

To determine the type of a sample, the decision function is:

$$F(x) = \operatorname{sgn}\left(R^{2} - \lVert \Phi(x) - a \rVert^{2}\right)$$

It follows that the decision function value is 0 for a support vector, greater than 0 for an interior point and less than 0 for an exterior point.
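Because Φ is never evaluated explicitly, the squared distance to the centre has to be computed through the kernel. With the centre written as a = Σ_i α_i Φ(x_i), as in standard SVDD, the distance expands to K(x, x) − 2 Σ_i α_i K(x_i, x) + Σ_i Σ_j α_i α_j K(x_i, x_j). The sketch below evaluates this expansion and the quantity inside sgn(·); the function names, the linear kernel and the multiplier values are illustrative assumptions, not output of a trained model.

```python
def linear_kernel(x, z):
    """Plain inner product; with this kernel, Phi is the identity mapping."""
    return sum(a * b for a, b in zip(x, z))


def svdd_sq_distance(x, train_x, alphas, kernel=linear_kernel):
    """||Phi(x) - a||^2 with a = sum_i alpha_i Phi(x_i), expanded via the kernel."""
    k_xx = kernel(x, x)
    cross = sum(a * kernel(xi, x) for xi, a in zip(train_x, alphas))
    const = sum(
        ai * aj * kernel(xi, xj)
        for xi, ai in zip(train_x, alphas)
        for xj, aj in zip(train_x, alphas)
    )
    return k_xx - 2.0 * cross + const


def svdd_decision(x, train_x, alphas, radius_sq, kernel=linear_kernel):
    """R^2 - ||Phi(x) - a||^2: positive inside, zero on the boundary, negative outside."""
    return radius_sq - svdd_sq_distance(x, train_x, alphas, kernel)


# Two support vectors with equal weight put the centre at (1, 0); R^2 = 1.
train_x = [(0.0, 0.0), (2.0, 0.0)]
alphas = [0.5, 0.5]
for point in [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]:
    print(point, svdd_decision(point, train_x, alphas, radius_sq=1.0))
```

Running one such function per class gives the two values f⁺(x) and f⁻(x) that are used to separate noise points, single-class samples and overlap-region samples.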

The dual-membership fuzzy SVM algorithm (Double-Fuzzy support vector machine, D-FSVM) in step (3) proceeds as follows:

The sample set of the dual-membership support vector machine has the form:

Each sample belongs to both classes with a certain probability: sample x_i belongs to class A (y_i = 1) with a probability given by its first membership degree and to class B (y_i = −1) with a probability given by its second membership degree. Here y_i is the class label of the i-th sample. In the two-class support vector machine model the samples are divided into class A and class B, so y_i ∈ {−1, +1}, i = 1, ..., l; that is, sample x_i corresponds to a single label y_i, where y_i = +1 means that x_i belongs to class A and y_i = −1 means that x_i belongs to class B.

The basic model of the dual-membership fuzzy support vector machine is:

The Lagrangian of this problem is:

where α_k, β_k, v_k and υ_k are the non-negative first, second, third and fourth Lagrange multipliers, respectively.

Finding the optimal solution of the original problem is equivalent to finding the optimal solution of its dual problem. The dual optimisation problem is:

i = 1, 2, ..., l

The objective function of the above dual problem involves inner products in the transformed high-dimensional space. If the dimension of the space after the nonlinear transformation is very high, the "curse of dimensionality" arises. To solve this problem, functional analysis shows that a kernel function K(x_i, x_j) satisfying the Mercer condition can replace the inner product in the high-dimensional feature space, so that the dual problem is expressed entirely through K.
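A small worked example shows why a kernel can stand in for the inner product. For two-dimensional inputs, the degree-2 polynomial kernel K(x, z) = (x·z)² equals the inner product of the explicit feature vectors φ(x) = (x₁², √2·x₁x₂, x₂²). The kernel choice and the names `phi` and `poly2_kernel` are illustrative assumptions; the source does not say which kernel the experiments use.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel, computed in the input space."""
    return dot(x, z) ** 2


x, z = (1.0, 2.0), (3.0, -1.0)
print(dot(phi(x), phi(z)))   # inner product in the 3-D feature space
print(poly2_kernel(x, z))    # same value without ever building phi
```

For kernels such as the Gaussian kernel the feature space is infinite-dimensional, so evaluating K directly is the only practical way to compute these inner products.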

The resulting classification function is:

The model above shows that the essential step distinguishing the dual-membership fuzzy support vector machine from the traditional support vector machine is determining each sample point's membership probabilities with respect to class A and class B. A crucial step is therefore how to build a membership model that describes the degree to which a training sample point belongs to each of the two classes.

The distance-based dual membership calculation is used:

where the two distance ratios are, respectively, the ratios of the distance from an overlap-region sample to the centre of each class's minimum hypersphere, to the corresponding radius. The first distance ratio, ||Φ⁺(x_i) − a⁺|| / R⁺, is the ratio for sample x_i in the first overlap-region sample set with respect to the minimum hypersphere of the first-class training sample set; Φ⁺(x_i) is the image of sample x_i under the nonlinear mapping associated with the first-class training sample set; a⁺ is the centre of the minimum hypersphere of the first-class training sample set; R⁺ is the radius of that hypersphere. The second distance ratio, ||Φ⁻(x_i) − a⁻|| / R⁻, is the ratio for sample x_i with respect to the minimum hypersphere of the second-class training sample set; Φ⁻(x_i) is the image of sample x_i under the nonlinear mapping associated with the second-class training sample set; a⁻ is the centre of the minimum hypersphere of the second-class training sample set; R⁻ is the radius of that hypersphere.
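The membership formula itself is not reproduced in this text, so the sketch below is an assumption rather than the patent's formula. It uses one simple normalisation that satisfies the two stated properties: the memberships sum to 1, and a sample closer to a class's hypersphere centre (smaller distance ratio) receives a larger membership in that class. The names `distance_ratio` and `dual_membership` are hypothetical.

```python
import math


def distance_ratio(sq_distance, radius):
    """Ratio of the distance to a hypersphere centre to that hypersphere's radius."""
    return math.sqrt(sq_distance) / radius


def dual_membership(d_pos, d_neg):
    """ASSUMED normalisation: m_A = d_neg / (d_pos + d_neg), m_B = 1 - m_A.

    d_pos and d_neg are the distance ratios with respect to the first-class
    and second-class hyperspheres.
    """
    total = d_pos + d_neg
    if total == 0:
        # The sample sits on both centres at once: no preference either way.
        return 0.5, 0.5
    m_a = d_neg / total
    return m_a, 1.0 - m_a


# A sample near the first-class centre and far from the second-class centre.
m_a, m_b = dual_membership(d_pos=0.2, d_neg=0.8)
print(m_a, m_b)
```

Any other monotone normalisation with the same two properties would fit the description equally well.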

The present embodiment is illustrated below with a practical scenario.

The present invention selected, from the University of California, Irvine (UCI) machine learning repository, the Pima Indians diabetes data set (Pima-indians), the University of Wisconsin breast cancer data set (Breast-w), and the Johns Hopkins University ionosphere data set (Inosphere). The details of each data set are shown in Table 1.

Table 1. Basic information on the UCI data sets

Data set	Dimension	Positive-class samples	Negative-class samples	Total samples	Imbalance ratio
Pima-indians	8	268	500	768	1:2
Breast-w	9	241	458	699	1:2
Inosphere	34	126	225	351	1:2

The present invention randomly splits each UCI data set, using 70% as the training set and the remaining 30% as the test set, while keeping the imbalance ratio unchanged during the split.
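A 70/30 split that preserves the class ratio can be sketched as below. This is only one possible way to do it, assuming that each class is shuffled and cut separately; the function name `stratified_split` and the seed are hypothetical, and the label counts are the Pima-indians figures from Table 1.

```python
import random


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), splitting each class separately so the
    class ratio is approximately preserved in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


# 268 positive and 500 negative samples, as listed for Pima-indians.
labels = [1] * 268 + [0] * 500
train_idx, test_idx = stratified_split(labels)
train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
```

Splitting per class rather than over the pooled samples is what keeps the positive-to-negative ratio nearly identical in the training and test parts.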

To analyze the performance of the SVDD-based dual-membership fuzzy support vector machine algorithm proposed by the present invention, the comparison models are SVM and the SVDD-based SVM algorithm. The SVDD-based SVM algorithm is similar to the SVDD-based dual-membership fuzzy support vector machine algorithm (D-FSVM), except that in steps 2 and 4 an ordinary SVM model is used to discriminate overlap-region samples, and the process does not need to assign dual memberships to overlap-region samples.

The classification evaluation metrics used by the present invention are sensitivity (abbreviated SE), specificity (abbreviated SP), and overall average classification accuracy (General Accuracy, abbreviated GA). The experimental results are as follows:

Table 2. Experimental results

The results (see Fig. 2, Fig. 3, and Fig. 4) show that on all three data sets the SVDD+SVM algorithm and the SVDD+(D-FSVM) algorithm perform substantially better than the ordinary SVM model. Therefore, first using the SVDD algorithm to identify noise points, positive-class samples, negative-class samples, and overlap-region samples, and then learning the overlap-region samples with an SVM model or a dual-membership fuzzy support vector machine model, yields a better classification result.
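The four-way identification step mentioned above can be sketched as a simple rule on two hypersphere tests. The sketch assumes that a sample counts as inside a class's hypersphere when its center-distance-to-radius ratio is at most 1, and that a sample inside neither hypersphere is treated as a noise point; the function name `assign_region` and the threshold are illustrative assumptions rather than the exact logic of the present invention.

```python
def assign_region(ratio_pos, ratio_neg):
    """Label a sample from its distance/radius ratios for the two hyperspheres.

    ratio_pos: ratio for the positive-class hypersphere.
    ratio_neg: ratio for the negative-class hypersphere.
    """
    inside_pos = ratio_pos <= 1.0
    inside_neg = ratio_neg <= 1.0
    if inside_pos and inside_neg:
        return "overlap"   # inside both hyperspheres
    if inside_pos:
        return "positive"  # inside the positive-class hypersphere only
    if inside_neg:
        return "negative"  # inside the negative-class hypersphere only
    return "noise"         # outside both hyperspheres


# Hypothetical ratio pairs covering the four cases.
regions = [assign_region(0.8, 0.5), assign_region(0.4, 1.7),
           assign_region(1.3, 0.9), assign_region(1.2, 1.5)]
```

Only the samples labelled "overlap" would then be passed on to the SVM or dual-membership model; the other three groups are already decided by the hypersphere tests.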

Meanwhile, on all three data sets, the SE, SP, and GA of the SVDD+(D-FSVM) algorithm proposed by the present invention are all the highest. Therefore, for overlap-region samples, dual membership better characterizes the relative degree to which a sample point belongs to the positive class and the negative class, and the dual-membership fuzzy support vector machine model classifies overlap-region samples better.
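The three indices compared above can be computed from predicted and true labels as in the following sketch. It assumes the usual definitions, which the text does not spell out: SE is the fraction of positive samples classified correctly, SP is the fraction of negative samples classified correctly, and GA is the fraction of all samples classified correctly. The function name and the example labels are hypothetical.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (SE, SP, GA) for binary labels, with `positive` as the positive class."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    se = tp / (tp + fn) if (tp + fn) else float("nan")  # sensitivity
    sp = tn / (tn + fp) if (tn + fp) else float("nan")  # specificity
    total = tp + fn + tn + fp
    ga = (tp + tn) / total if total else float("nan")   # overall accuracy
    return se, sp, ga


# Hypothetical labels: 3 positive and 5 negative samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
se, sp, ga = evaluate(y_true, y_pred)
```

Reporting SE and SP alongside GA matters for imbalanced data, because a classifier can reach a high GA while misclassifying most of the minority class.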

Finally, it should be noted that the above embodiments only describe the technical solution of the present invention and do not limit it. The application of the present invention can be extended to other modifications, variations, applications, and embodiments, and all such modifications, variations, applications, and embodiments are therefore considered to fall within the spirit and teaching of the present invention.

Claims

1. A method for classifying imbalanced data, characterized in that the method comprises the following steps:

dividing the training samples of the imbalanced data, according to their proportion of the training sample set, into a first-class training sample set and a second-class training sample set, the set formed by these samples being the first overlap-region sample set, and learning from the first overlap-region sample set to obtain a first classification decision function and a second classification decision function;

obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively;

determining the samples of the first overlap-region sample set in the test sample set of the imbalanced data;

classifying the samples of the first overlap-region sample set according to the classification decision function;

wherein obtaining the first membership degree and the second membership degree from the first classification decision function and the second classification decision function, respectively, comprises:

judging the samples in the first-class training sample set and the second-class training sample set by the first classification decision function and the second classification decision function, respectively, forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set, and calculating, for the samples in the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set;

the first membership degree is calculated as:

Wherein：

μ_i^A is the first membership degree, representing the probability that sample x_i in the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set; the two ratio terms are, respectively, the ratio of the distance from sample x_i in the first overlap-region sample set to the center of the minimal hypersphere corresponding to the first-class training sample set, to the radius of that hypersphere, and the ratio of the distance from sample x_i in the first overlap-region sample set to the center of the minimal hypersphere corresponding to the second-class training sample set, to the radius of that hypersphere;

the second membership degree is calculated as:

Wherein：

μ_i^B is the second membership degree, representing the probability that sample x_i in the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.

2. The method according to claim 1, characterized in that judging the samples in the first-class training sample set and the second-class training sample set by the first classification decision function and the second classification decision function, and forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set, comprises:

judging, by the logical relationship between the first classification decision function and the second classification decision function, each sample in the first-class training sample set and the second-class training sample set to be a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both the first-class training sample set and the second-class training sample set, and forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set.

3. The method according to claim 1, characterized in that obtaining the classification decision function according to the first membership degree and the second membership degree comprises:

constructing the sample set of the dual-membership support vector machine;

4. The method according to claim 3, characterized in that the dual-membership fuzzy support vector machine is calculated as:

$$\min_{w,b}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{l}\left(\mu_i^{A}\xi_i + \mu_i^{B}\eta_i\right)$$

$$\mu_i^{A} \ge 0,\ \mu_i^{B} \ge 0,\ \xi_i \ge 0,\ \eta_i \ge 0,\quad i = 1, 2, \ldots, l$$

Wherein：

w is the weight vector of the separating hyperplane; C is the noise penalty parameter; μ_i^A is the first membership degree; ξ_i is the first non-negative slack variable; μ_i^B is the second membership degree; η_i is the second non-negative slack variable; b is the threshold of the separating hyperplane; Φ(·) is the nonlinear mapping function.

5. The method according to claim 3, characterized in that the classification decision function is calculated as:

Wherein：

f(x) is the classification decision function; sign(·) is the sign function; α_i is the first Lagrange multiplier of a sample; β_i is the second Lagrange multiplier of a sample; K(x, x_i) is a kernel function satisfying the Mercer condition.