CN104679860B - A classification method for unbalanced data - Google Patents
- Publication number: CN104679860B (CN201510089729.5A)
- Authority: CN (China)
- Prior art keywords: sample, sample set, training sample, decision function, membership
- Legal status: Expired - Fee Related (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Complex Calculations (AREA); Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a classification method for unbalanced data, comprising: learning from a training sample set of the unbalanced data to obtain a first classification decision function and a second classification decision function; obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively; obtaining a classification decision function from the first membership degree and the second membership degree; determining the samples of a second overlap-region sample set in a test sample set of the unbalanced data; and classifying the samples of the second overlap-region sample set using the classification decision function.
Description
Technical field
The invention belongs to the technical field of data classification, and in particular relates to a classification method for unbalanced data.
Background art
Today's society is in an era of information explosion. Faced with a vast ocean of data, extracting useful information and knowledge from it has become a major challenge. For this reason, data-driven statistical machine learning has emerged and become the principal method of knowledge acquisition: based on specific historical data, a suitable learning algorithm is designed to obtain a mathematical or statistical model that reflects the regularities of the data, and the model is then used to make predictions on future data. Because of its importance for knowledge acquisition, statistical machine learning has become a core topic in intelligent analysis and intelligent decision-making research and is widely used in industry and commerce.
The most common machine learning problem is supervised classification, for example biometric recognition, text classification, web mining, speech recognition, and network intrusion detection. Over the past few decades, researchers in machine learning have studied classification learning thoroughly and proposed many effective algorithms that remain in wide use today, including K-nearest neighbors, decision trees, neural networks, ensemble learning, and the support vector machine (Support Vector Machine, SVM). The SVM has attracted the most attention. It is a learning machine built on statistical learning theory and the structural risk minimization principle. Compared with traditional learning algorithms such as neural networks, the SVM has a solid theoretical foundation, and its training reduces to a convex quadratic optimization problem, so a globally optimal solution can be obtained and the tendency of neural networks to become trapped in local optima is avoided. It also retains good generalization ability when the sample size is small. Because of these advantages, the SVM is currently one of the most widely studied and applied learning algorithms in both academia and industry.
However, as applications expand and practice deepens, new challenges and problems keep emerging, and classification learning on unbalanced data is one of the obstacles that the machine learning field urgently needs to overcome. Specifically, the unbalanced data classification problem refers to the situation in which the number of samples in one class is far smaller than in the other classes, as in anomaly analysis, intrusion detection, fraud detection, video surveillance, fault diagnosis, and medical diagnosis. When traditional machine learning classifiers handle unbalanced data, their decisions tend to favor the majority class, which severely degrades recognition of minority-class samples. Yet in many applications it is the classification accuracy on the minority class that matters most. How to prevent the classifier from leaving a larger decision space to the majority class has therefore become one of the key problems in research on unbalanced data classification algorithms.
Researchers in machine learning have done a great deal of work on the unbalanced data classification problem and have proposed many different solutions. These fall into two broad types. Data-level methods reduce the degree of imbalance by changing the sample distribution of the training set. Algorithm-level methods modify the algorithm itself to address its limitations on unbalanced data, so that it suits the unbalanced data classification problem.
Likewise, even for a classifier with strong learning ability such as the SVM, unbalanced data causes a sharp decline in learning performance. Given the effectiveness and wide use of the SVM, many researchers have studied SVM methods specifically for learning from unbalanced data and have proposed several improved algorithms, with some success. Overall, however, the classification accuracy of existing methods on unbalanced data remains low.
Summary of the invention
To overcome the defects of the prior art, the invention provides a classification method for unbalanced data.
According to one aspect of the invention, a classification method for unbalanced data is proposed, the method comprising the following steps:
learning from the training sample set of the unbalanced data to obtain a first classification decision function and a second classification decision function;
obtaining a first membership degree and a second membership degree from the first classification decision function and the second classification decision function, respectively;
obtaining a classification decision function from the first membership degree and the second membership degree;
determining the samples of a second overlap-region sample set in the test sample set of the unbalanced data;
classifying the samples of the second overlap-region sample set using the classification decision function.
In the above scheme, obtaining the first membership degree and the second membership degree from the first classification decision function and the second classification decision function comprises:
judging the samples in the first-class training sample set and the second-class training sample set using the first classification decision function and the second classification decision function, respectively; forming a first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set; and calculating, for each sample in the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set.
In the above scheme, judging the samples in the first-class training sample set and the second-class training sample set using the two classification decision functions and forming the first overlap-region sample set comprises:
using the logical relationship between the first classification decision function and the second classification decision function to assign each sample in the first-class and second-class training sample sets to one of four types: noise points, samples belonging to the first-class training sample set, samples belonging to the second-class training sample set, and samples belonging to both training sample sets; the samples belonging to both training sample sets form the first overlap-region sample set.
In the above scheme, the first membership degree is calculated as:
where:
the first membership degree is the probability that sample $x_i$ in the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set; the first-class distance ratio is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the first-class training sample set, divided by that hypersphere's radius; the second-class distance ratio is the distance from $x_i$ to the center of the minimum hypersphere of the second-class training sample set, divided by that hypersphere's radius.
In the above scheme, the second membership degree is calculated as:
where:
the second membership degree is the probability that sample $x_i$ in the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.
In the above scheme, obtaining the classification decision function from the first membership degree and the second membership degree comprises:
constructing the sample set of a dual-membership support vector machine;
determining a dual-membership fuzzy support vector machine from the sample set of the dual-membership support vector machine;
obtaining the classification decision function from the dual-membership fuzzy support vector machine.
In the above scheme, the dual-membership fuzzy support vector machine is computed as:
where:
$w$ is the weight vector of the separating hyperplane; $C$ is the noise penalty parameter; $\xi_i$ is the first non-negative slack variable; $\eta_i$ is the second non-negative slack variable; $b$ is the threshold of the separating hyperplane; the remaining symbols denote the first membership degree, the second membership degree, and the nonlinear mapping function.
In the above scheme, the classification decision function is computed as:
where:
$f(x)$ is the classification decision function; $\operatorname{sign}(\cdot)$ is the sign function; $\alpha_i$ is the first Lagrange multiplier of a sample; $\beta_i$ is the second Lagrange multiplier of a sample; $K(x, x_i)$ is a kernel function satisfying the Mercer condition.
The invention obtains, from the training sample set of the unbalanced data, a classification decision function that characterizes the classification features of the unbalanced data, and classifies the unbalanced data with that function, so that the unbalanced data can be classified accurately according to the characteristics of the data itself.
Brief description of the drawings
Fig. 1 is a flow chart of the classification method for unbalanced data of Embodiment 1;
Fig. 2 is a schematic diagram of the classification results of the three classification models in Embodiment 2 on the Pima-indians data set;
Fig. 3 is a schematic diagram of the classification results of the three classification models in Embodiment 2 on the Breast-w data set;
Fig. 4 is a schematic diagram of the classification results of the three classification models in Embodiment 2 on the Ionosphere data set.
To show the structure of the embodiments clearly, certain dimensions, structures, and devices are marked in the figures. These are for illustration only and are not intended to limit the invention to those specific dimensions, structures, devices, and environments. A person of ordinary skill in the art may adjust or modify these devices and environments as needed, and such adjustments or modifications remain within the scope of the appended claims.
Detailed description of the embodiments
The classification method for unbalanced data provided by the invention is described in detail below with reference to the accompanying drawings and specific embodiments.
The following description covers several different aspects of the invention. A person of ordinary skill in the art may, however, practice the invention using only some or all of its structures or processes. Specific numbers, configurations, and orders are set out for clarity of explanation, but the invention can evidently also be practiced without these specific details. In other cases, well-known features are not described in detail so as not to obscure the invention.
Embodiment 1
To address the low classification accuracy of existing methods on unbalanced data, this embodiment provides a classification method for unbalanced data. As shown in Fig. 1, the method comprises the following steps:
Step S101: Learn from the training sample set of the unbalanced data to obtain the first classification decision function and the second classification decision function.
To classify unbalanced data accurately, a portion of the data is first extracted from the unbalanced data to form the training sample set, which should reflect the class proportions of the unbalanced data as a whole. The samples in the training sample set are divided, according to their share of the training sample set, into a first-class training sample set and a second-class training sample set: the first-class training sample set contains the samples that make up the larger share, and the second-class training sample set contains the remaining samples. Because the two decision functions are learned from these two sets, the first classification decision function and the second classification decision function characterize the features of the first-class and second-class training sample sets well, which lays the foundation for the subsequent classification of the unbalanced data.
Step S102: Obtain the first membership degree and the second membership degree from the first classification decision function and the second classification decision function, respectively.
The first classification decision function divides the samples of the first-class training sample set into three categories: first, sample points inside the minimum hypersphere of the first-class training sample set; second, sample points on the boundary of that hypersphere; third, sample points outside that hypersphere. Similarly, the second classification decision function divides the samples of the second-class training sample set into the same three categories. Because the first-class and second-class training sample sets together make up the whole training set, the two decision functions can identify the samples that belong to both the first-class and the second-class training sample set; these samples form the first overlap-region sample set. The probabilities that each sample in the first overlap-region sample set belongs to the first-class training sample set and to the second-class training sample set are then calculated, giving the first membership degree and the second membership degree. The samples of the first overlap-region sample set thus carry the attributes of both training sample sets at once, and they are the samples most prone to misclassification.
Step S103: Obtain the classification decision function from the first membership degree and the second membership degree.
Once the first and second membership degrees of the samples in the first overlap-region sample set are available, they are used to build the sample set of the dual-membership support vector machine and the dual-membership fuzzy support vector machine. Solving the dual-membership fuzzy support vector machine yields the classification decision function used to classify the unbalanced data. This function classifies a sample according to its membership degrees with respect to the first-class and second-class training sample sets.
Step S104: Determine the samples of the second overlap-region sample set in the test sample set of the unbalanced data.
The test sample set of the unbalanced data is processed in the same way that the training sample set was processed to obtain the first overlap-region sample set, which yields the samples of the second overlap-region sample set.
Step S105: Classify the samples of the second overlap-region sample set using the classification decision function.
Applying the classification decision function directly to the samples of the second overlap-region sample set classifies the unbalanced data.
In this embodiment, a classification decision function that characterizes the classification features of the unbalanced data is obtained from the training sample set of the unbalanced data, and the unbalanced data is classified with it, so the data can be classified accurately according to its own characteristics.
Specifically, step S102 comprises:
judging the samples in the first-class training sample set and the second-class training sample set using the first classification decision function and the second classification decision function, respectively; forming the first overlap-region sample set from the samples that belong to both training sample sets; and calculating, for each sample in the first overlap-region sample set, the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set.
Here, judging the samples and forming the first overlap-region sample set comprises:
using the logical relationship between the first classification decision function and the second classification decision function to assign each sample in the first-class and second-class training sample sets to one of four types: noise points, samples belonging to the first-class training sample set, samples belonging to the second-class training sample set, and samples belonging to both training sample sets. The samples belonging to both training sample sets form the first overlap-region sample set. Specifically:
If $f^+(x_i) < 0$ and $f^-(x_i) < 0$, sample $x_i$ is a noise point, where $f^+(x_i)$ is the first classification decision function, $f^-(x_i)$ is the second classification decision function, and $x_i$ is a sample in the first-class or second-class training sample set, $i = 0, 1, \ldots$;
If $f^+(x_i) \ge 0$ and $f^-(x_i) < 0$, sample $x_i$ belongs to the first-class training sample set;
If $f^+(x_i) < 0$ and $f^-(x_i) \ge 0$, sample $x_i$ belongs to the second-class training sample set;
If $f^+(x_i) > 0$ and $f^-(x_i) > 0$, sample $x_i$ belongs to the first overlap-region sample set, which is the set of samples that the invention classifies specifically.
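The four rules above can be sketched as a small Python function. This is an illustrative sketch only: the function name `categorize` and the string labels are hypothetical, and the handling of boundary values is an assumption. The rules as stated use a strict `> 0` test for the overlap region, which leaves cases such as $f^+(x_i) = 0$ with $f^-(x_i) \ge 0$ unassigned; the sketch assigns them to the overlap region.

```python
def categorize(f_pos: float, f_neg: float) -> str:
    """Assign a sample to one of four regions from its two decision values.

    f_pos and f_neg stand for f+(x_i) and f-(x_i). The labels are
    hypothetical names used only in this sketch.
    """
    if f_pos < 0 and f_neg < 0:
        return "noise"         # outside both hyperspheres
    if f_pos >= 0 and f_neg < 0:
        return "first_class"   # accepted only by the first-class description
    if f_pos < 0 and f_neg >= 0:
        return "second_class"  # accepted only by the second-class description
    # Assumption: every remaining case (both values >= 0) counts as overlap.
    return "overlap"
```

Only samples labelled `"overlap"` would be passed on to the membership calculation and the dual-membership fuzzy support vector machine.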
After the first overlap-region sample set is obtained, membership degrees must be computed for its samples. There are many ways to compute membership degrees; a distance-based dual membership is used here. Specifically, the first membership degree is calculated as:
where:
the first membership degree is the probability that sample $x_i$ in the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set;
the first-class distance ratio is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the first-class training sample set, divided by that hypersphere's radius, i.e. $\|\Phi^+(x_i) - a^+\| / R^+$, where $\Phi^+(x_i)$ is the image of sample $x_i$ under the nonlinear mapping function of the first-class training sample set, $a^+$ is the center of the minimum hypersphere of the first-class training sample set, and $R^+$ is the radius of that hypersphere;
the second-class distance ratio is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the second-class training sample set, divided by that hypersphere's radius, i.e. $\|\Phi^-(x_i) - a^-\| / R^-$, where $\Phi^-(x_i)$ is the image of sample $x_i$ under the nonlinear mapping function of the second-class training sample set, $a^-$ is the center of the minimum hypersphere of the second-class training sample set, and $R^-$ is the radius of that hypersphere.
The second membership degree is calculated as:
where:
the second membership degree is the probability that sample $x_i$ in the first overlap-region sample set belongs to the second-class training sample set; B denotes the second-class training sample set.
In step S103, obtaining the classification decision function from the first membership degree and the second membership degree comprises:
S1031: Construct the sample set of the dual-membership support vector machine.
Each sample in this set carries both the first membership degree of belonging to the first-class training sample set and the second membership degree of belonging to the second-class training sample set, and the two membership degrees sum to 1.
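The membership formulas themselves are not reproduced in this text, so the following Python sketch does not claim to be the patent's formula. It shows one simple distance-based choice, assumed purely for illustration, that has the two properties the text states: it depends only on the two distance ratios, and the two membership degrees sum to 1. The function name `dual_membership` and its parameters are hypothetical.

```python
from typing import Tuple


def dual_membership(d_pos: float, d_neg: float) -> Tuple[float, float]:
    """Return (first membership, second membership) for one overlap sample.

    d_pos, d_neg: ratios of center distance to radius for the first-class
    and second-class minimum hyperspheres (both non-negative).
    Assumed rule (not taken from the source): the closer a sample is to a
    class center, relative to the radius, the larger its membership in
    that class.
    """
    total = d_pos + d_neg
    if total == 0.0:
        # Sample sits on both centers at once; split evenly (assumption).
        return 0.5, 0.5
    m_first = d_neg / total
    return m_first, 1.0 - m_first


print(dual_membership(0.5, 1.5))  # (0.75, 0.25)
```

Any other rule could be substituted as long as the two values remain non-negative and sum to 1, which is what the sample set of the dual-membership support vector machine requires.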
S1032: Determine the dual-membership fuzzy support vector machine from the sample set of the dual-membership support vector machine. The dual-membership fuzzy support vector machine is computed as:
where:
$w$ is the weight vector of the separating hyperplane;
$C$ is the noise penalty parameter;
$\xi_i$ is the first non-negative slack variable;
$\eta_i$ is the second non-negative slack variable; $\xi_i$ and $\eta_i$ reflect the allowed error of each sample point;
$b$ is the threshold of the separating hyperplane (the intercept of the hyperplane);
the remaining symbols denote the first membership degree, the second membership degree, and the nonlinear mapping function.
S1033: Obtain the classification decision function from the dual-membership fuzzy support vector machine. The classification decision function is computed as:
where:
$f(x)$ is the classification decision function;
$\operatorname{sign}(\cdot)$ is the sign function;
$\alpha_i$ is the first Lagrange multiplier of a sample;
$\beta_i$ is the second Lagrange multiplier of a sample;
$K(x, x_i)$ is a kernel function satisfying the Mercer condition.
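The decision function formula is not reproduced in this text. As a hedged sketch, assuming it takes the usual kernel-expansion form $f(x) = \operatorname{sign}\left(\sum_i (\alpha_i - \beta_i) K(x, x_i) + b\right)$ built from the symbols defined above, it could be evaluated as follows. The names `rbf_kernel` and `dfsvm_decide`, the RBF kernel choice, and the `gamma` value are all illustrative assumptions, and the multipliers are taken as already computed.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def rbf_kernel(x: Vector, z: Vector, gamma: float = 0.5) -> float:
    """Gaussian (RBF) kernel, one common Mercer kernel (assumed choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def dfsvm_decide(
    x: Vector,
    samples: Sequence[Vector],
    alphas: Sequence[float],
    betas: Sequence[float],
    b: float,
    kernel: Callable[[Vector, Vector], float] = rbf_kernel,
) -> int:
    """Evaluate the assumed form sign(sum_i (alpha_i - beta_i) K(x, x_i) + b).

    Returns +1 (class A) or -1 (class B). A score of exactly 0 is mapped
    to +1 here, which is a convention chosen for this sketch.
    """
    score = b + sum(
        (a_i - b_i) * kernel(x, x_i)
        for x_i, a_i, b_i in zip(samples, alphas, betas)
    )
    return 1 if score >= 0 else -1
```

For example, with two training samples `(0, 0)` and `(2, 0)`, multipliers `alphas=[1, 0]`, `betas=[0, 1]`, and `b=0`, a point near the first sample is assigned +1 and a point near the second is assigned -1.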
After the classification decision function is obtained, the samples of the second overlap-region sample set of the test sample set are determined in the same way that the first overlap-region sample set was obtained from the training sample set. Applying the classification decision function to the samples of the second overlap-region sample set classifies the data of the test sample set of the unbalanced data.
Embodiment 2
This embodiment describes the invention in detail through a practical scenario.
The basic steps of this embodiment are:
(1) Use support vector data domain description (Support Vector Data Domain Description, SVDD) to perform one-class learning separately on the samples of the two training sets (the first-class training sample set, which accounts for the larger share, and the second-class training sample set, which accounts for the remainder), obtaining the first classification decision function $f^+(x)$ and the second classification decision function $f^-(x)$, and thereby identifying noise points, positive-class samples (samples in the first-class training sample set), negative-class samples (samples in the second-class training sample set), and samples in the first overlap-region sample set;
(2) Based on $f^+(x)$, $f^-(x)$, and the minimum hyperspheres of the two classes, calculate the dual membership of the samples in the first overlap-region sample set;
(3) Train a dual-membership fuzzy support vector machine model on the samples in the first overlap-region sample set to obtain the classification decision function $f(x)$ for overlap-region samples;
(4) For each test-set sample, first use $f^+(x)$ and $f^-(x)$ to identify it as a noise point, a positive-class sample, a negative-class sample, or an overlap-region sample;
(5) For each overlap-region sample of the test set, calculate its dual membership and then classify it with the decision function $f(x)$ of the dual-membership fuzzy support vector machine model.
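The test-time part of these steps, (4) and (5), can be sketched as a routing function in Python. This is an illustration under assumptions: `f_pos`, `f_neg`, and `overlap_decision` are hypothetical callables standing in for trained $f^+(x)$, $f^-(x)$, and the D-FSVM score, and the toy one-dimensional spheres in the usage lines are invented for demonstration and are not data from the source.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def classify_test_sample(
    x: Vector, f_pos: Scorer, f_neg: Scorer, overlap_decision: Scorer
) -> str:
    """Route a test sample: one-class checks first, overlap model last."""
    p, n = f_pos(x), f_neg(x)
    if p < 0 and n < 0:
        return "noise"
    if p >= 0 and n < 0:
        return "positive"
    if p < 0 and n >= 0:
        return "negative"
    # Overlap region: defer to the classifier trained on overlap samples.
    return "positive" if overlap_decision(x) >= 0 else "negative"


# Toy 1-D stand-ins (assumptions): a sphere centered at 0 with radius 2,
# a sphere centered at 2.5 with radius 1, and a linear overlap rule.
f_pos = lambda x: 2.0 ** 2 - (x[0] - 0.0) ** 2
f_neg = lambda x: 1.0 ** 2 - (x[0] - 2.5) ** 2
overlap_decision = lambda x: 1.8 - x[0]

print(classify_test_sample((-1.0,), f_pos, f_neg, overlap_decision))  # positive
print(classify_test_sample((1.9,), f_pos, f_neg, overlap_decision))   # negative
```

The point of the routing is that the comparatively expensive overlap classifier is consulted only for samples that both one-class descriptions accept.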
The decision functions in step (1) are constructed as follows.
SVDD performs one-class learning: it finds a hypersphere in a high-dimensional space that covers as many images of the data in that feature space as possible, thereby capturing the boundary of the data. Given a set $X = \{x_i \mid i = 1, 2, \ldots, n\}$ of $n$ data objects, SVDD maps the input space into a high-dimensional space through a nonlinear mapping function $\Phi(\cdot)$ and finds a hypersphere with radius $R$ and center $a$ that covers as many $x_i$ as possible. SVDD sets up the following optimization problem:
$$\min R^2 \quad \text{s.t.} \quad \|\Phi(x_i) - a\|^2 \le R^2, \quad i = 1, 2, \ldots, n$$
Introducing a vector of slack variables $\xi = (\xi_1, \xi_2, \ldots, \xi_n)$ into the above problem allows the hypersphere to exclude some samples as noise, and the optimization problem becomes:
$$\min q(R, \xi) = R^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad \|\Phi(x_i) - a\|^2 \le R^2 + \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, n$$
where $q(R, \xi)$ is the objective function of the optimization problem and $C$ is the noise penalty parameter. Introducing the Lagrangian gives:
After a substitution, the above can be transformed to:
where $v$ is the degree to which target-class samples are rejected, $0 \le v \le 1$. When $v = 0$, $nv$ is the lower bound on the number of support vectors; when $v = 1$, $nv$ is the upper bound on the number of outliers (that is, the number of data points). Setting the partial derivatives of $L$ with respect to $R$, $a$, and $\xi$ to zero gives:
Replacing the inner product $\Phi(x_i) \cdot \Phi(x_j)$ with a Mercer kernel function $K(x_i, x_j)$, the Wolfe dual of the original optimization problem is:
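The formulas of this derivation are not reproduced in the text. As a sketch, assuming the standard soft-margin SVDD formulation and the substitution $C = 1/(nv)$ (both are assumptions, chosen because they agree with the symbols $R$, $a$, $\xi_i$, $C$, $v$, $\alpha_i$, and $\beta_i$ used in this section), the Lagrangian, the stationarity conditions, and the Wolfe dual would read:

```latex
% Sketch under assumptions: standard soft-margin SVDD with C = 1/(n v).
\begin{aligned}
L(R, a, \xi, \alpha, \beta) &= R^2 + C \sum_{i=1}^{n} \xi_i
  - \sum_{i=1}^{n} \alpha_i \left( R^2 + \xi_i - \|\Phi(x_i) - a\|^2 \right)
  - \sum_{i=1}^{n} \beta_i \xi_i,
  \qquad \alpha_i \ge 0,\ \beta_i \ge 0 \\
\frac{\partial L}{\partial R} = 0 &\;\Rightarrow\; \sum_{i=1}^{n} \alpha_i = 1 \\
\frac{\partial L}{\partial a} = 0 &\;\Rightarrow\; a = \sum_{i=1}^{n} \alpha_i \Phi(x_i) \\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; \alpha_i + \beta_i = C = \frac{1}{nv} \\
\max_{\alpha} \;& \sum_{i=1}^{n} \alpha_i K(x_i, x_i)
  - \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j)
  \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i = 1, \quad 0 \le \alpha_i \le \frac{1}{nv}
\end{aligned}
```

Under these assumptions the box constraint $0 \le \alpha_i \le 1/(nv)$ follows from $\alpha_i + \beta_i = C$ with $\beta_i \ge 0$.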
According to the Karush-Kuhn-Tucker (KKT) optimality conditions, the sample data fall into three categories:
The first category is interior points, the sample points inside the hypersphere, for which $\|\Phi(x_i) - a\|^2 < R^2$, i.e. $\alpha_i = 0$;
The second category is support vectors, the sample points on the hypersphere boundary, for which $\|\Phi(x_i) - a\|^2 = R^2$, i.e. $0 < \alpha_i < C$, $\beta_i > 0$;
The third category is outliers, the sample points outside the hypersphere, for which $\|\Phi(x_i) - a\|^2 > R^2$, i.e. $\alpha_i = C$, $\beta_i = 0$.
To determine the type of a sample, the decision function is:
$$f(x) = \operatorname{sgn}\left(R^2 - \|\Phi(x) - a\|^2\right)$$
Thus the decision function value is 0 for support vectors, greater than 0 for interior points, and less than 0 for outliers.
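The quantity inside the decision function can be evaluated without computing $\Phi$ explicitly. The sketch below assumes the standard SVDD result that the center is $a = \sum_i \alpha_i \Phi(x_i)$, so that $\|\Phi(x) - a\|^2 = K(x, x) - 2\sum_i \alpha_i K(x_i, x) + \sum_i \sum_j \alpha_i \alpha_j K(x_i, x_j)$. The function names are hypothetical, the multipliers `alphas` are taken as already obtained from the dual problem, and the linear kernel is used only to keep the example easy to check by hand.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Plain inner product; any Mercer kernel could be passed instead."""
    return sum(a * b for a, b in zip(x, z))


def squared_distance_to_center(
    x: Vector, train: Sequence[Vector], alphas: Sequence[float],
    kernel: Kernel = linear_kernel,
) -> float:
    """||Phi(x) - a||^2 via the kernel trick, assuming a = sum_i alpha_i Phi(x_i)."""
    cross = sum(a_i * kernel(x_i, x) for x_i, a_i in zip(train, alphas))
    const = sum(
        a_i * a_j * kernel(x_i, x_j)
        for x_i, a_i in zip(train, alphas)
        for x_j, a_j in zip(train, alphas)
    )
    return kernel(x, x) - 2.0 * cross + const


def svdd_decision_value(
    x: Vector, train: Sequence[Vector], alphas: Sequence[float],
    radius_sq: float, kernel: Kernel = linear_kernel,
) -> float:
    """R^2 - ||Phi(x) - a||^2: > 0 inside, 0 on the boundary, < 0 outside."""
    return radius_sq - squared_distance_to_center(x, train, alphas, kernel)
```

With `train = [(0, 0), (2, 0)]` and `alphas = [0.5, 0.5]`, the center under the linear kernel is `(1, 0)`; taking `radius_sq = 1`, the point `(1, 0)` gives a positive value, `(2, 0)` lies on the boundary, and `(3, 0)` gives a negative value.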
The dual-membership fuzzy SVM algorithm (Double-Fuzzy support vector machine, D-FSVM) in step (3) proceeds as follows.
In the dual-membership support vector machine, the sample set has the form:
Each sample belongs to each of the two classes with a certain probability: sample $x_i$ belongs to class A ($y_i = +1$) with probability equal to its first membership degree and to class B ($y_i = -1$) with probability equal to its second membership degree. Here $y_i$ is the class label of the $i$-th sample. In a two-class support vector machine model the samples are divided into class A and class B, so $y_i \in \{-1, +1\}$, $i = 1, \ldots, l$. That is, sample $x_i$ corresponds to a single label $y_i$: $y_i = +1$ means that $x_i$ belongs to class A, and $y_i = -1$ means that $x_i$ belongs to class B.
The basic model of the dual-membership fuzzy support vector machine is:
The Lagrangian of this problem is:
where $\alpha_k$, $\beta_k$, $v_k$, and $\upsilon_k$ are the non-negative first, second, third, and fourth Lagrange multipliers, respectively.
Solving for the optimum of the original problem is equivalent to solving for the optimum of its dual problem, which is:
$i = 1, 2, \ldots, l$
The objective function of this dual problem involves inner products in the transformed high-dimensional space. If the dimension of the space after the nonlinear transformation is very high, this leads to the "curse of dimensionality". To solve this problem, according to functional theory, a kernel function $K(x_i, x_j)$ satisfying the Mercer condition can be used in place of the inner product in the high-dimensional feature space:
The resulting classifier is:
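The model, Lagrangian, dual problem, and classifier formulas are not reproduced in this text. The following is a reconstruction sketch, not the source's own formulas. It assumes a bilateral-weighted fuzzy SVM form that matches the symbols defined in this document ($w$, $b$, $C$, $\xi_i$, $\eta_i$, four non-negative multipliers, $K$). The names $m_i^A$ and $m_i^B$ for the first and second membership degrees and $\varphi$ for the nonlinear mapping are placeholder notation introduced here, and the multipliers are indexed by $i$ rather than $k$.

```latex
% Sketch under assumptions: m_i^A, m_i^B and \varphi are placeholder names;
% the form follows a bilateral-weighted fuzzy SVM.
\begin{aligned}
\min_{w, b, \xi, \eta} \;& \frac{1}{2}\|w\|^2
  + C \sum_{i=1}^{l} \left( m_i^A \xi_i + m_i^B \eta_i \right) \\
\text{s.t.} \;& w \cdot \varphi(x_i) + b \ge 1 - \xi_i, \quad
  w \cdot \varphi(x_i) + b \le -1 + \eta_i, \quad
  \xi_i \ge 0,\ \eta_i \ge 0, \quad i = 1, \ldots, l \\[4pt]
L =\;& \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \left( m_i^A \xi_i + m_i^B \eta_i \right)
  - \sum_{i=1}^{l} \alpha_i \left( w \cdot \varphi(x_i) + b - 1 + \xi_i \right) \\
  & + \sum_{i=1}^{l} \beta_i \left( w \cdot \varphi(x_i) + b + 1 - \eta_i \right)
  - \sum_{i=1}^{l} v_i \xi_i - \sum_{i=1}^{l} \upsilon_i \eta_i \\[4pt]
\max_{\alpha, \beta} \;& \sum_{i=1}^{l} (\alpha_i + \beta_i)
  - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
    (\alpha_i - \beta_i)(\alpha_j - \beta_j) K(x_i, x_j) \\
\text{s.t.} \;& \sum_{i=1}^{l} (\alpha_i - \beta_i) = 0, \quad
  0 \le \alpha_i \le C m_i^A, \quad 0 \le \beta_i \le C m_i^B,
  \quad i = 1, 2, \ldots, l \\[4pt]
f(x) =\;& \operatorname{sign}\left( \sum_{i=1}^{l} (\alpha_i - \beta_i) K(x, x_i) + b \right)
\end{aligned}
```

In this assumed form the dual follows from the stationarity conditions $w = \sum_i (\alpha_i - \beta_i) \varphi(x_i)$, $\sum_i (\alpha_i - \beta_i) = 0$, $\alpha_i + v_i = C m_i^A$, and $\beta_i + \upsilon_i = C m_i^B$, so each sample's membership degrees cap how strongly it can pull the hyperplane toward either class.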
The above model shows that the essential step distinguishing the dual-membership fuzzy support vector machine from the traditional support vector machine is determining, for each sample point, its membership probabilities with respect to class A and class B. A crucial step is therefore how to build a membership model that describes the degree to which each training sample point belongs to each of the two classes.
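To see how the two memberships enter the model, the objective written out in claim 4, $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\mu_i^A \xi_i + \mu_i^B \eta_i)$, can be evaluated directly. The following is a minimal sketch; the function name and all numeric values are hypothetical, and it only evaluates the objective rather than solving the optimization problem.

```python
import numpy as np

def dfsvm_objective(w, C, mu_a, mu_b, xi, eta):
    """Evaluate 0.5*||w||^2 + C * sum(mu_a*xi + mu_b*eta)."""
    w = np.asarray(w, dtype=float)
    mu_a = np.asarray(mu_a, dtype=float)
    mu_b = np.asarray(mu_b, dtype=float)
    xi = np.asarray(xi, dtype=float)
    eta = np.asarray(eta, dtype=float)
    # Constraints taken from claim 4: memberships sum to one, all terms non-negative.
    if not np.allclose(mu_a + mu_b, 1.0):
        raise ValueError("mu_a + mu_b must equal 1 for every sample")
    if min(mu_a.min(), mu_b.min(), xi.min(), eta.min()) < 0:
        raise ValueError("memberships and slack variables must be non-negative")
    return 0.5 * float(np.dot(w, w)) + C * float(np.sum(mu_a * xi + mu_b * eta))

# Hypothetical values for two overlap-region samples.
objective_value = dfsvm_objective(
    w=[1.0, 2.0], C=2.0,
    mu_a=[0.8, 0.3], mu_b=[0.2, 0.7],
    xi=[0.5, 0.0], eta=[0.0, 1.0],
)
```

A slack on the class-A side is penalized in proportion to $\mu_i^A$, and a slack on the class-B side in proportion to $\mu_i^B$, which is why the membership model matters.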
The dual memberships are computed with a distance-based method:
where the two quantities in the formula equal, respectively, the ratios of the distance from a sample in the overlap region to the center of each class's minimum hypersphere to that hypersphere's radius. The first ratio is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the first-class training sample set, divided by that hypersphere's radius, i.e. $\|\Phi^+(x_i) - a^+\| / R^+$; $\Phi^+(x_i)$ is the image of sample $x_i$ in the first overlap-region sample set under the nonlinear mapping function of the first-class training sample set; $a^+$ is the center of the minimum hypersphere of the first-class training sample set; $R^+$ is the radius of the minimum hypersphere of the first-class training sample set. The second ratio is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the second-class training sample set, divided by that hypersphere's radius, i.e. $\|\Phi^-(x_i) - a^-\| / R^-$; $\Phi^-(x_i)$ is the image of sample $x_i$ in the first overlap-region sample set under the nonlinear mapping function of the second-class training sample set; $a^-$ is the center of the minimum hypersphere of the second-class training sample set; $R^-$ is the radius of the minimum hypersphere of the second-class training sample set.
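The distance ratios described above are straightforward to compute. In the sketch below, the ratios follow the text, but the step that turns them into memberships is an assumption made only for illustration: the membership formula itself is not reproduced in this text, so a simple normalization is used in which a sample relatively closer to one sphere center receives a higher membership for that class. The identity map is assumed for Φ, and all names and values are hypothetical.

```python
import numpy as np

def distance_ratio(x, center, radius):
    """Ratio of the distance from x to the sphere center over the sphere radius."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    return float(np.linalg.norm(diff) / radius)

def dual_membership(x, center_pos, radius_pos, center_neg, radius_neg):
    """Return (mu_a, mu_b) with mu_a + mu_b = 1.

    ASSUMPTION: the normalization below is illustrative only and is not
    claimed to be the membership formula of the patent.
    """
    d_pos = distance_ratio(x, center_pos, radius_pos)
    d_neg = distance_ratio(x, center_neg, radius_neg)
    total = d_pos + d_neg
    if total == 0.0:
        return 0.5, 0.5  # degenerate case: x coincides with both centers
    return d_neg / total, d_pos / total

# Hypothetical spheres for the two classes and one overlap-region sample.
mu_a, mu_b = dual_membership(
    x=[1.0, 0.0],
    center_pos=[0.0, 0.0], radius_pos=2.0,
    center_neg=[3.0, 0.0], radius_neg=2.0,
)
```

Whatever formula is used, it has to respect the constraints $\mu_i^A + \mu_i^B = 1$, $\mu_i^A \ge 0$ and $\mu_i^B \ge 0$ stated in claim 4, which this normalization does.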
The present embodiment is illustrated below with a practical scenario.
The present invention selects several databases from the University of California, Irvine (UCI) machine learning repository: the Pima Indians diabetes dataset (Pima-indians), the University of Wisconsin breast cancer dataset (Breast-w) and the Johns Hopkins University ionosphere dataset (Inosphere). The details of each database are shown in Table 1.
Table 1. Basic information on the UCI datasets
Data set | Dimension | Positive-class samples | Negative-class samples | Total samples | Imbalance ratio |
---|---|---|---|---|---|
Pima-indians | 8 | 268 | 500 | 768 | 1:2 |
Breast-w | 9 | 241 | 458 | 699 | 1:2 |
Inosphere | 34 | 126 | 225 | 351 | 1:2 |
The present invention randomly divides each UCI dataset, using 70% as the training set and the remaining 30% as the test set, and keeps the imbalance ratio unchanged during the division.
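A 70/30 split that preserves the imbalance ratio amounts to splitting each class separately. The sketch below shows one way to do this with NumPy; the function name, the random seed and the rounding rule are assumptions, and the label vector only mimics the class counts of the Pima-indians dataset in Table 1.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.7, seed=0):
    """Split indices into train/test while keeping each class's share."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the indices of this class, then cut at the train fraction.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_train = int(round(train_fraction * idx.size))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))

# Labels with the Pima-indians class counts: 268 positive, 500 negative.
labels = np.array([1] * 268 + [-1] * 500)
train_idx, test_idx = stratified_split(labels)
n_train_pos = int((labels[train_idx] == 1).sum())
n_train_neg = int((labels[train_idx] == -1).sum())
```

Because each class is cut at the same fraction, the ratio of positive to negative samples in both parts stays close to that of the full dataset.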
To analyze the performance of the SVDD-based dual-membership fuzzy support vector machine algorithm proposed by the present invention, the comparison models include SVM and the SVDD-based SVM algorithm. The SVDD-based SVM algorithm is similar to the SVDD-based dual-membership fuzzy support vector machine algorithm (D-FSVM), except that in the 2nd and 4th steps an ordinary SVM model is used to discriminate the overlap-region samples, so no dual memberships need to be assigned to the overlap-region samples.
The evaluation indices used by the present invention for the classification algorithms are sensitivity (abbreviated SE), specificity (abbreviated SP) and overall average classification accuracy (General Accuracy, abbreviated GA). The experimental results are as follows:
Table 2. Experimental results
The results (see Fig. 2, Fig. 3 and Fig. 4) show that, on all three datasets, the SVDD+SVM algorithm and the SVDD+(D-FSVM) algorithm perform clearly better than the ordinary SVM model. Therefore, first using the SVDD algorithm to identify noise points, positive-class samples, negative-class samples and overlap-region samples, and then using an SVM model or a dual-membership fuzzy support vector machine model to learn from the overlap-region samples, yields a better classification result.
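The four-way identification mentioned above can be sketched as a rule on the two SVDD decision values of a sample. The exact logic is not spelled out in this text, so the rule below is an assumption: a sample is treated as inside a class's hypersphere when that class's decision value is non-negative, and the function name and region labels are hypothetical.

```python
def assign_region(f_pos, f_neg):
    """Assign a sample to a region from its two SVDD decision values.

    f_pos: decision value w.r.t. the positive-class hypersphere.
    f_neg: decision value w.r.t. the negative-class hypersphere.
    ASSUMPTION: a value >= 0 means the sample lies inside or on that sphere.
    """
    in_pos = f_pos >= 0
    in_neg = f_neg >= 0
    if in_pos and in_neg:
        return "overlap"   # inside both spheres: passed on to SVM or D-FSVM
    if in_pos:
        return "positive"  # inside the positive-class sphere only
    if in_neg:
        return "negative"  # inside the negative-class sphere only
    return "noise"         # outside both spheres

regions = [assign_region(1, 1), assign_region(1, -1),
           assign_region(-1, 1), assign_region(-1, -1)]
```

Only the samples labelled as overlap need the second-stage classifier; the other three groups are decided by the hyperspheres alone.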
Moreover, on all three datasets, the SE, SP and GA indices of the SVDD+(D-FSVM) algorithm proposed by the present invention are the highest. For overlap-region samples, therefore, the dual memberships better describe the relative degree to which a sample point belongs to the positive class and the negative class, and the dual-membership fuzzy support vector machine model classifies the overlap-region samples better.
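The three indices used in this comparison can be computed from the confusion counts. The sketch below uses the standard definitions of sensitivity and specificity and assumes that GA is the overall fraction of correctly classified samples; the labels are toy values for illustration and are not the experimental data.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return (SE, SP, GA) for binary labels; GA is assumed to be overall accuracy."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    se = tp / (tp + fn)                   # sensitivity: recall on the positive class
    sp = tn / (tn + fp)                   # specificity: recall on the negative class
    ga = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    return se, sp, ga

# Toy labels: 4 positive and 6 negative samples.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, -1, -1, -1, -1, 1, 1]
se, sp, ga = evaluate(y_true, y_pred)
```

On unbalanced data SE and SP are reported alongside GA because a classifier can reach a high overall accuracy while misclassifying most of the minority class.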
Finally, it should be noted that the above embodiments only describe the technical solution of the present invention and do not limit it. The application of the present invention can be extended to other modifications, variations, applications and embodiments, and all such modifications, variations, applications and embodiments are therefore considered to fall within the spirit and teaching of the present invention.
Claims (5)
1. A method for classifying unbalanced data, characterized in that the method comprises the following steps:
dividing the training samples of the unbalanced data, according to their proportion in the training sample set, into a first-class training sample set and a second-class training sample set, the set formed by these samples being a first overlap-region sample set, and learning from the first overlap-region sample set to obtain a first classification decision function and a second classification decision function;
obtaining a first degree of membership and a second degree of membership from the first classification decision function and the second classification decision function, respectively;
obtaining a classification decision function from the first degree of membership and the second degree of membership;
determining the samples of the first overlap-region sample set in the test sample set of the unbalanced data;
classifying the samples of the first overlap-region sample set according to the classification decision function;
wherein obtaining the first degree of membership and the second degree of membership from the first classification decision function and the second classification decision function, respectively, comprises:
judging the samples in the first-class training sample set and the second-class training sample set with the first classification decision function and the second classification decision function, respectively; forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set; and calculating, for each sample in the first overlap-region sample set, the first degree of membership of belonging to the first-class training sample set and the second degree of membership of belonging to the second-class training sample set;
the first degree of membership is calculated as:
wherein:
$\mu_i^A$ is the first degree of membership and represents the probability that sample $x_i$ in the first overlap-region sample set belongs to the first-class training sample set; A denotes the first-class training sample set; the first ratio term is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the first-class training sample set, divided by that hypersphere's radius; the second ratio term is the distance from sample $x_i$ in the first overlap-region sample set to the center of the minimum hypersphere of the second-class training sample set, divided by that hypersphere's radius;
the second degree of membership is calculated as:
wherein:
$\mu_i^B$ is the second degree of membership and represents the probability that sample $x_i$ of the first overlap region belongs to the second-class training sample set; B denotes the second-class training sample set.
2. The method according to claim 1, characterized in that judging the samples in the first-class training sample set and the second-class training sample set with the first classification decision function and the second classification decision function, and forming the first overlap-region sample set from the samples that belong to both the first-class training sample set and the second-class training sample set, comprises:
using the logical relation between the first classification decision function and the second classification decision function to judge each sample in the first-class training sample set and the second-class training sample set as a noise point, a sample belonging to the first-class training sample set, a sample belonging to the second-class training sample set, or a sample belonging to both the first-class training sample set and the second-class training sample set, the samples belonging to both the first-class training sample set and the second-class training sample set constituting the first overlap-region sample set.
3. The method according to claim 1, characterized in that obtaining the classification decision function from the first degree of membership and the second degree of membership comprises:
constructing the sample set of a dual-membership support vector machine;
determining a dual-membership fuzzy support vector machine from the sample set of the dual-membership support vector machine;
obtaining the classification decision function from the dual-membership fuzzy support vector machine.
4. The method according to claim 3, characterized in that the dual-membership fuzzy support vector machine is computed as:
$$\min_{w,b}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\left(\mu_i^A \xi_i + \mu_i^B \eta_i\right)$$
$$\mu_i^A + \mu_i^B = 1$$
$$\mu_i^A \ge 0,\ \mu_i^B \ge 0,\ \xi_i \ge 0,\ \eta_i \ge 0,\quad i = 1, 2, \ldots, l$$
wherein:
$w$ is the weight vector of the separating hyperplane; $C$ is the noise penalty parameter; $\mu_i^A$ is the first degree of membership; $\xi_i$ is the first non-negative slack variable; $\mu_i^B$ is the second degree of membership; $\eta_i$ is the second non-negative slack variable; $b$ is the threshold of the separating hyperplane; and $\Phi(\cdot)$ is the nonlinear mapping function.
5. The method according to claim 3, characterized in that the classification decision function is calculated as:
wherein:
$f(x)$ is the classification decision function; $\operatorname{sign}(\cdot)$ is the sign function; $\alpha_i$ is the first Lagrange multiplier of the sample; $\beta_i$ is the second Lagrange multiplier of the sample; $K(x, x_i)$ is a kernel function satisfying Mercer's condition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510089729.5A CN104679860B (en) | 2015-02-27 | 2015-02-27 | A kind of sorting technique of unbalanced data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104679860A CN104679860A (en) | 2015-06-03 |
CN104679860B true CN104679860B (en) | 2017-11-07 |
Family
ID=53314902
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510089729.5A Expired - Fee Related CN104679860B (en) | 2015-02-27 | 2015-02-27 | A kind of sorting technique of unbalanced data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104679860B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105005589B (en) * | 2015-06-26 | 2017-12-29 | 腾讯科技(深圳)有限公司 | A kind of method and apparatus of text classification |
CN105447520A (en) * | 2015-11-23 | 2016-03-30 | 盐城工学院 | Sample classification method based on weighted PTSVM (projection twin support vector machine) |
CN107463938B (en) * | 2017-06-26 | 2021-02-26 | 南京航空航天大学 | Aero-engine gas circuit component fault detection method based on interval correction support vector machine |
CN108960056B (en) * | 2018-05-30 | 2022-06-03 | 西南交通大学 | Fall detection method based on attitude analysis and support vector data description |
CN110555054B (en) * | 2018-06-15 | 2023-06-09 | 泉州信息工程学院 | Data classification method and system based on fuzzy double-supersphere classification model |
CN109165694B (en) * | 2018-09-12 | 2022-07-08 | 太原理工大学 | Method and system for classifying unbalanced data sets |
CN109919931B (en) * | 2019-03-08 | 2020-12-25 | 数坤(北京)网络科技有限公司 | Coronary stenosis degree evaluation model training method and evaluation system |
CN111126577A (en) * | 2020-03-30 | 2020-05-08 | 北京精诊医疗科技有限公司 | Loss function design method for unbalanced samples |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402690A (en) * | 2011-09-28 | 2012-04-04 | 南京师范大学 | Data classification method based on intuitive fuzzy integration and system |
CN102945280A (en) * | 2012-11-15 | 2013-02-27 | 翟云 | Unbalanced data distribution-based multi-heterogeneous base classifier fusion classification method |
CN104268577A (en) * | 2014-06-27 | 2015-01-07 | 大连理工大学 | Human body behavior identification method based on inertial sensor |
Non-Patent Citations (2)
Title |
---|
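Mail filtering based on a dual-membership fuzzy support vector machine (基于双隶属度模糊支持向量机的邮件过滤); Sun Mingsong et al.; Computer Engineering and Applications (《计算机工程与应用》); 2010-01-20; Vol. 46, No. 2; p. 94, Sections 2 and 3.1; p. 95, Sections 4 and 5.2 * |
A class-weight-based fuzzy classification method for unbalanced data (基于类权重的模糊不平衡数据分类方法); Xue Zhenxia et al.; Computer Science (《计算机科学》); 2008-11-30; Vol. 35, No. 11; p. 171, Section 3 * |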
Also Published As
Publication number | Publication date |
---|---|
CN104679860A (en) | 2015-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104679860B (en) | A kind of sorting technique of unbalanced data | |
US20230141886A1 (en) | Method for assessing hazard on flood sensitivity based on ensemble learning | |
Chen et al. | Regional disaster risk assessment of China based on self-organizing map: clustering, visualization and ranking | |
He et al. | Mining transition rules of cellular automata for simulating urban expansion by using the deep learning techniques | |
CN106897738B (en) | A kind of pedestrian detection method based on semi-supervised learning | |
Wang et al. | Assessment of river water quality based on theory of variable fuzzy sets and fuzzy binary comparison method | |
CN107123123A (en) | Image segmentation quality evaluating method based on convolutional neural networks | |
Wu et al. | A hybrid support vector regression approach for rainfall forecasting using particle swarm optimization and projection pursuit technology | |
CN105487526A (en) | FastRVM (fast relevance vector machine) wastewater treatment fault diagnosis method | |
CN106408030A (en) | SAR image classification method based on middle lamella semantic attribute and convolution neural network | |
CN108764621A (en) | A kind of family endowment collaboration nurse dispatching method of data-driven | |
CN112785450A (en) | Soil environment quality partitioning method and system | |
Jia et al. | Fault diagnosis of industrial process based on the optimal parametric t-distributed stochastic neighbor embedding | |
Zhang et al. | Surface and high-altitude combined rainfall forecasting using convolutional neural network | |
CN107729922A (en) | Remote sensing images method for extracting roads based on deep learning super-resolution technique | |
CN106600046A (en) | Multi-classifier fusion-based land unused condition prediction method and device | |
Zhang et al. | Urban spatial risk prediction and optimization analysis of POI based on deep learning from the perspective of an epidemic | |
CN112418571A (en) | Method and device for enterprise environmental protection comprehensive evaluation | |
Zhang et al. | Information fusion for automated post-disaster building damage evaluation using deep neural network | |
CN114399212A (en) | Ecological environment quality evaluation method and device, electronic equipment and storage medium | |
CN107909278A (en) | A kind of method and system of program capability comprehensive assessment | |
CN117875517A (en) | Yellow river flood prevention command decision support system and method | |
CN111401683B (en) | Method and device for measuring tradition of ancient villages | |
Li et al. | Evaluation of livable city based on GIS and PSO-SVM: A case study of hunan province | |
Inyang et al. | Visual association analytics approach to predictive modelling of students’ academic performance |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20171107 |