CN107092927A

CN107092927A - An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm

Info

Publication number: CN107092927A
Application number: CN201710211459.XA
Authority: CN
Inventors: 王喆; 李冬冬; 朱昱锦; 高大启
Original assignee: East China University of Science and Technology
Current assignee: East China University of Science and Technology
Priority date: 2017-04-01
Filing date: 2017-04-01
Publication date: 2017-08-25

Abstract

The present invention provides an imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm. The collected samples are first converted into vectors. An initial decision surface is generated with the pseudoinverse algorithm, and two new decision surfaces are constructed parallel to it, each passing through the centroid of one of the two classes; the samples lying between the two new decision surfaces are retained as candidates and the remaining samples are removed. The distance from each majority-class candidate sample to the hyperplane through the majority-class centroid is then computed, and the minority class is treated in the same way, giving one distance vector per class. Finally, the distances from a test sample point to the two new decision surfaces are compared with the two class distance vectors, and for each class the number of candidate samples that lie farther from the plane than the test sample is counted. The test sample is assigned to the class with the larger count. Compared with traditional classification methods, the invention trains in two steps, combining the boundary-based pseudoinverse algorithm with a non-boundary-based heuristic nearest-neighbor algorithm, which improves classification accuracy and greatly shortens tuning time.

Description

An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm

Technical field

The present invention relates to the technical field of pattern classification, and more particularly to a boundary-eliminated pseudoinverse method and system for recognizing imbalanced data sets.

Background art

Pattern recognition studies the use of computers to imitate or realize the recognition ability of humans or other animals, so that the objects under study can be recognized automatically. In recent years, pattern recognition technology has been widely applied in many important fields such as artificial intelligence, machine learning, computer engineering, robotics, neurobiology, medicine, forensic science, archaeology, geological exploration, astronautics and weapons technology. However, as the range of applications expands, traditional pattern recognition technology faces new challenges. A prominent one is the handling of imbalanced data. Imbalanced data is data in which, among its many classes, some classes contain far fewer samples than the others. For simplicity, a class with few samples is called a minority class and a class with many samples is called a majority class. In practical applications, misclassifying a minority-class sample often costs more than misclassifying a majority-class sample; in medical diagnosis, for example, misjudging a potential patient costs more than misjudging a person who is actually healthy. Large amounts of imbalanced data likewise arise in fields such as error detection, hard measurement, financial forecasting and medical testing.

When traditional pattern classification methods are applied to imbalanced problems, the influence of the majority-class samples often leads to strongly biased results. Several dedicated methods have been designed to address this. At present, methods aimed specifically at imbalanced problems fall into three categories. The first category is data-level methods, which use sampling techniques in the preprocessing stage to remove redundant majority-class samples or to generate minority-class samples so that the data set becomes balanced, and then pass the data to a conventional classification model. Representative algorithms include One-Sided Selection and the Synthetic Minority Oversampling Technique. The second category is cost-sensitive methods, which assign different costs to misclassified samples and thereby correct the bias that imbalanced data induces in conventional models; in general, a misclassified minority-class sample receives a higher cost than a misclassified majority-class sample. Representative algorithms include cost-sensitive locality preserving projections, cost-sensitive principal component analysis and cost-sensitive linear discriminant analysis. The third category is ensemble methods, which combine different weak classifiers to make a joint decision on the data set. Representative algorithms include AdaCost.

At present, all three categories have shortcomings. Methods of the first category are relatively easy to implement, but most of them need all, or at least most, of the samples in the training module, so they cannot properly handle problems that are highly dynamic or lack prior information. Moreover, the sampling-based methods of the first category are now mainly used in combination with other methods and cannot serve on their own as a model that solves the problem. Methods of the second and third categories tend to be complex and require a large number of parameters to be tuned to reach an optimum. Methods of the second category additionally need to traverse most of the samples to compute the costs, which lowers efficiency. Methods of the third category need to feed the same batch of data into different sub-classifiers, which likewise lowers efficiency. A method that is structurally simple, has few parameters and still corrects the bias well would further improve the ability of pattern classification techniques to handle imbalanced problems.

Summary of the invention

Existing techniques are structurally complex, inefficient and insufficiently accurate, and cannot handle imbalanced problems that are large-scale, real-time or lacking in prior knowledge. To address this, the invention provides an imbalanced data set classification method based on the boundary-eliminated pseudoinverse algorithm. A typical boundary-based linear classification method, the pseudoinverse algorithm, performs a first round of training on a small sample set to obtain a candidate subset; a heuristic class-boundary-condition method performs a second round of training on the screened samples to obtain a boundary; finally, unknown samples are tested with the resulting fuzzy boundary and a similarity measurement strategy. This preserves classification accuracy on imbalanced data sets while improving efficiency in both model design and model computation.

The technical solution adopted by the invention to solve this technical problem, taking a two-class imbalanced problem as an example, is as follows. First, according to the description of the specific imbalanced problem, the back end converts the collected samples into a vector model that subsequent algorithms can process. Second, the vectorized data set is divided into a training set and a test set; if data are limited, all of them can be used for training. In the training step, the first round of training applies a classification strategy based on the pseudoinverse algorithm to the training sample points to obtain an approximate classification decision surface, and then generates two new decision surfaces that are parallel to it and pass through the centroids of the respective classes. Only the samples lying in the space between the two centroid decision surfaces are kept as the candidate subset; the remaining samples are removed. In the second round of training, the distance from each majority-class candidate sample point to the decision surface through the majority-class centroid is computed, and the minority class is treated in the same way; the distances of the samples in each class form two distance vectors. Third, in the test phase, the distances from the current test sample point to the two centroid decision surfaces are obtained and compared with the two class distance vectors produced by the training module, and the final decision is made by judging on which side the test sample point is closer to the centroid decision surface. When the results for the two sides are equal, the algorithm falls back on the two-class decision surface generated in the first round of training, and the whole method degenerates to the original pseudoinverse algorithm. Finally, the decided class label is output.

The technical solution can be further refined. In the first training step of the training module, the pseudoinverse search for the decision surface may use any improved variant, provided that the variant keeps complexity and training speed acceptable. Furthermore, any suitable linear classification model may replace the pseudoinverse algorithm; but because the pseudoinverse algorithm has the simplest structure of all linear classification models, the invention is still practised with it. In addition, after the first training step, when the two centroid decision surfaces are generated, efficiency can be improved by first judging whether the two classes of data overlap; if they do not, the data set is linearly separable and the subsequent steps need not be performed. Finally, the similarity measure used in the second training step and in testing defaults to the Euclidean distance, but any metric, such as the cosine distance or the Mahalanobis distance, may be used depending on the situation.

The beneficial effects of the invention are as follows. The structural simplicity of the boundary-eliminated pseudoinverse algorithm gives rapid feedback on imbalanced problems. Two-step training combines the boundary-based pseudoinverse algorithm with a non-boundary-based heuristic nearest-neighbor algorithm, improving classification accuracy. Because the pseudoinverse method itself is simple, only one parameter needs to be preset, which greatly shortens tuning time. The boundary formed by the method is fuzzy, so the bias does not become excessive even when training samples are few.

Brief description of the drawings

Fig. 1 is the system framework of the imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm of the present invention.

Fig. 2 is a flow chart of the training step of the present invention.

Fig. 3 is a flow chart of the testing step of the present invention.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawings and embodiments. The method of the invention is divided into three modules.

Part I: Data acquisition

Data acquisition digitizes the real-world imbalanced problem and generates a vectorized data set for the subsequent modules to process. Each sample yields a vector x; each element of the vector corresponds to one attribute of the sample, and the vector dimension d is the number of attributes of the sample, as shown below:

In a two-class problem, the minority-class samples and the majority-class samples are each merged into a matrix, namely the minority-class matrix X_pos and the majority-class matrix X_neg. The two matrices are then merged into one large matrix X_all, which is the data set for the whole problem, as shown below:
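As an illustration of the data layout described above, the following minimal sketch assembles the three matrices with NumPy. The toy values, the row-per-sample orientation and the +1 / -1 label encoding are assumptions made for the example and are not taken from the patent.

```python
import numpy as np

# Hypothetical toy data: each row is one sample, each column one attribute (d = 2).
X_pos = np.array([[1.0, 2.0],
                  [1.5, 1.8]])   # minority class
X_neg = np.array([[5.0, 8.0],
                  [6.0, 9.0],
                  [5.5, 7.5],
                  [6.5, 8.5]])   # majority class

# Stack the minority rows on top of the majority rows to form the full data set.
X_all = np.vstack([X_pos, X_neg])

# Assumed label encoding: +1 for the minority class, -1 for the majority class.
y_all = np.concatenate([np.ones(len(X_pos)), -np.ones(len(X_neg))])

print(X_all.shape, y_all)
```

Keeping the two class matrices separate as well as stacked is convenient later, because the centroid and distance steps operate on each class on its own.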

Part II: Training the classification model

In this module, the collected data set is fed into the core algorithm of the invention for training. The main steps are as follows.

1) Generate the classification decision surface l_d with the pseudoinverse algorithm: the pseudoinverse algorithm is a typical linear classification algorithm whose purpose is to generate a decision surface; a test sample falling on one side of the decision surface is judged to belong to the class lying on that side. The decision surface equation is expressed as:

The pseudoinverse algorithm first adds an all-ones vector as the first column of the data set matrix X_all, expanding it into an augmented matrix, as shown below:

The optimal w and w₀ are then obtained from the pseudoinverse formula, which is as follows:

Here, d is the preset theoretical class-label parameter.
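Step 1 can be sketched in a few lines, on the assumption that the decision surface has the usual linear form w·x + w₀ = 0 and that the pseudoinverse solution is the least-squares fit of the augmented matrix to the target labels. The function name `fit_pseudoinverse`, the toy data and the +1 / -1 targets (standing in for the label parameter d of the text) are hypothetical.

```python
import numpy as np

def fit_pseudoinverse(X_all, labels):
    """Least-squares fit of w0 + w.x to the target labels via the pseudoinverse."""
    ones = np.ones((X_all.shape[0], 1))
    X_aug = np.hstack([ones, X_all])        # augmented matrix: all-ones first column
    w_aug = np.linalg.pinv(X_aug) @ labels  # Moore-Penrose pseudoinverse times targets
    return w_aug[0], w_aug[1:]              # bias w0, weight vector w

# Hypothetical toy data: two minority rows followed by four majority rows.
X_all = np.array([[1.0, 2.0], [1.5, 1.8],
                  [5.0, 8.0], [6.0, 9.0], [5.5, 7.5], [6.5, 8.5]])
labels = np.array([1.0, 1.0, -1.0, -1.0, -1.0, -1.0])  # assumed +1 / -1 encoding

w0, w = fit_pseudoinverse(X_all, labels)
scores = X_all @ w + w0   # the sign of each score gives the side of l_d
print(np.sign(scores))
```

On this well-separated toy set the sign of each score matches its label; on overlapping or strongly imbalanced data the fitted surface is only approximate, which is the situation the later steps are meant to handle.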

2) Construct two classification surfaces l_pos and l_neg that are parallel to l_d and pass through the centroids of the two classes of training samples. The centroids of the two classes are computed as shown in the following formula:
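A minimal sketch of step 2, assuming that each centroid is the arithmetic mean of its class and that a plane parallel to l_d shares the normal vector w and differs only in its offset. The helper name `centroid_planes`, the offsets `b_pos` and `b_neg`, and the toy values are hypothetical.

```python
import numpy as np

def centroid_planes(X_pos, X_neg, w):
    """Offsets b such that w.x + b = 0 passes through each class centroid."""
    m_pos = X_pos.mean(axis=0)   # minority-class centroid
    m_neg = X_neg.mean(axis=0)   # majority-class centroid
    b_pos = -float(w @ m_pos)    # l_pos: w.x + b_pos = 0
    b_neg = -float(w @ m_neg)    # l_neg: w.x + b_neg = 0
    return m_pos, m_neg, b_pos, b_neg

# Hypothetical toy data laid out along the first axis.
X_pos = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
X_neg = np.array([[5.0, 0.0], [7.0, 0.0], [9.0, 0.0], [11.0, 0.0]])
w = np.array([-1.0, 0.0])        # hypothetical normal vector of l_d

m_pos, m_neg, b_pos, b_neg = centroid_planes(X_pos, X_neg, w)
print(m_pos, m_neg, b_pos, b_neg)
```

Because only the offset changes, the three surfaces l_d, l_pos and l_neg can be stored as one weight vector plus three scalars.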

3) Generate the candidate subset: the training samples lying between l_pos and l_neg are kept as the candidate subset C, and the remaining samples are removed.
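One way to realize the filtering in step 3 is to project every sample onto the normal vector w and keep those whose projection falls between the projections of the two centroids. This is a sketch under that assumption; the helper `candidate_mask`, the inclusive treatment of samples lying exactly on a centroid plane, and the toy data are hypothetical.

```python
import numpy as np

def candidate_mask(X, w, m_pos, m_neg):
    """True for rows of X lying between the two centroid planes (inclusive)."""
    proj = X @ w
    lo, hi = sorted((float(w @ m_pos), float(w @ m_neg)))
    return (proj >= lo) & (proj <= hi)

# Hypothetical toy data laid out along the first axis.
X_pos = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]])
X_neg = np.array([[5.0, 0.0], [7.0, 0.0], [9.0, 0.0], [11.0, 0.0]])
w = np.array([-1.0, 0.0])        # hypothetical normal vector of l_d
m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

C_pos = X_pos[candidate_mask(X_pos, w, m_pos, m_neg)]  # minority candidates
C_neg = X_neg[candidate_mask(X_neg, w, m_pos, m_neg)]  # majority candidates
print(C_pos, C_neg)
```

In this example the samples on the outer side of each centroid are discarded, so only the points facing the other class remain in C.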

4) Judge whether the current data set is linearly separable: check whether the two classes of samples in C overlap. If they do not, the current data set is linearly separable, a standard linear classification model can complete the recognition, and subsequent testing uses the decision surface l_d generated by the pseudoinverse algorithm; if they do, proceed to the next step. Overlap is judged by whether the points of the two classes in C that are farthest from l_pos and l_neg, x_pmax' and x_nmax', satisfy the inequality:

If the inequality is satisfied, the data set is linearly inseparable.

5) Generate the two distance vectors dis_pos and dis_neg: let d(x, l) denote the Euclidean distance from a sample point x to a classification surface l. The distances from all majority-class points in C to l_neg are computed in turn and stored in dis_neg; likewise, the distances from all minority-class points in C to l_pos are stored in dis_pos.
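For a plane written as w·x + b = 0, the Euclidean point-to-plane distance is |w·x + b| / ‖w‖, so step 5 reduces to one vectorized expression per class. The sketch below assumes that plane representation; `plane_distance` and the numeric values are hypothetical.

```python
import numpy as np

def plane_distance(X, w, b):
    """Euclidean distance from each row of X to the plane w.x + b = 0."""
    return np.abs(X @ w + b) / np.linalg.norm(w)

w = np.array([-1.0, 0.0])          # hypothetical normal vector shared by all planes
b_pos, b_neg = 2.0, 8.0            # hypothetical offsets of l_pos and l_neg
C_pos = np.array([[2.0, 0.0], [4.0, 0.0]])   # minority candidates in C
C_neg = np.array([[5.0, 0.0], [7.0, 0.0]])   # majority candidates in C

dis_pos = plane_distance(C_pos, w, b_pos)    # minority candidates to l_pos
dis_neg = plane_distance(C_neg, w, b_neg)    # majority candidates to l_neg
print(dis_pos, dis_neg)
```

The two vectors are all that the test phase needs from the candidate subset, so C itself does not have to be retained after training.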

Part III: Testing unknown data

In this module, unknown data whose class labels are to be determined are fed into the trained model, and the model makes the decision. Let the unknown sample be z. The test stage comprises the following steps.

1) Compute the Euclidean distances from the test sample point z to the two centroid classification planes, that is, d(z, l_pos) and d(z, l_neg).

2) Compare d(z, l_pos) with dis_pos to obtain the probability that z is classified into the minority class. The calculation formula is as follows:

Here, the numerator is the number of elements of dis_pos whose value is greater than d(z, l_pos), that is, the number of minority-class samples in C that are farther from l_pos than the test sample z is. The denominator is the total number of minority-class samples in the subset C.

3) Similarly, compare d(z, l_neg) with dis_neg. The calculation formula is as follows:

4) Compare P(y_pos | z) with P(y_neg | z); the class label of z is assigned to the side with the larger probability.

5) If P(y_pos | z) and P(y_neg | z) are equal, z is decided by the decision equation of l_d. In this case the algorithm degenerates to the original pseudoinverse method.
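The five test steps can be sketched as a single function. Each probability is taken as the fraction of entries in the class distance vector that exceed the test distance, as the numerator and denominator are described above. The names `classify`, `b_pos`, `b_neg` and the labels "pos" / "neg" are hypothetical, and the sketch assumes that both distance vectors are non-empty and that the minority class lies on the non-negative side of l_d.

```python
import numpy as np

def classify(z, w, w0, b_pos, b_neg, dis_pos, dis_neg):
    """Label z by comparing its plane distances with the candidate distances."""
    norm = np.linalg.norm(w)
    d_pos = abs(w @ z + b_pos) / norm   # d(z, l_pos)
    d_neg = abs(w @ z + b_neg) / norm   # d(z, l_neg)
    p_pos = np.mean(dis_pos > d_pos)    # share of minority candidates farther away
    p_neg = np.mean(dis_neg > d_neg)    # share of majority candidates farther away
    if p_pos > p_neg:
        return "pos"
    if p_neg > p_pos:
        return "neg"
    # Tie: fall back on the original pseudoinverse decision surface l_d.
    return "pos" if w @ z + w0 >= 0 else "neg"

w, w0 = np.array([-1.0, 0.0]), 5.0   # hypothetical l_d: -x1 + 5 = 0
b_pos, b_neg = 2.0, 8.0              # hypothetical offsets of l_pos and l_neg
dis_pos = np.array([0.0, 2.0])
dis_neg = np.array([3.0, 1.0])

for z in ([2.5, 0.0], [7.5, 0.0], [20.0, 0.0]):
    print(z, classify(np.array(z), w, w0, b_pos, b_neg, dis_pos, dis_neg))
```

The third test point lies far outside both centroid planes, so both fractions are zero and the tie-break against l_d decides its label.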

Experimental results

1) Selection of experimental data sets: the experiments use six imbalanced data sets from the open-source Knowledge Extraction based on Evolutionary Learning (KEEL) dataset repository. The number of classes, sample dimension, size (total number of samples) and imbalance ratio IR of the selected data sets are listed in the following table. Data sets with IR greater than 9 are moderately or more severely imbalanced.

Let n_neg be the number of majority-class samples and n_pos the number of minority-class samples; the imbalance ratio IR is calculated as:
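With n_neg and n_pos defined as above, the imbalance ratio is conventionally the ratio of majority to minority sample counts. The expression below is written on that assumption and is not copied from the patent's own formula.

```latex
\mathrm{IR} = \frac{n_{neg}}{n_{pos}}
```

Under this reading, a data set with 900 majority-class samples and 100 minority-class samples has IR = 9.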

All data sets are processed with five-fold cross-validation: each data set is divided into five roughly equal parts; in each round one part is chosen as the test data and the other four parts serve as the training data. The test data are chosen five times without repetition.
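The five-fold protocol can be sketched as an index generator. The text does not say whether the folds are shuffled or stratified by class, so the random, unstratified split and the helper name `five_fold_indices` below are assumptions.

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 5)  # five near-equal parts
    for i in range(5):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(5) if j != i])
        yield train_idx, test_idx

for train_idx, test_idx in five_fold_indices(23):
    print(len(train_idx), len(test_idx))
```

For strongly imbalanced data a stratified split, which keeps the class ratio similar in every fold, is a common alternative; whether the experiments used one is not stated.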

2) Comparison algorithms: the core algorithm of the invention, the boundary-eliminated pseudoinverse algorithm, is abbreviated BEPILD. In addition, kNN, SVM (Linear), SVM (Poly) and SVM (RBF) are selected as baseline algorithms. The parameters of each algorithm and their value ranges are set out in the following table.

3) Performance metric: all experiments use the area under the receiver operating characteristic curve (the Area Under the Receiver Operating Characteristic Curve, AUC) to record the classification results of the different methods on each data set. Each reported result is the one obtained when the corresponding algorithm is configured with its optimal parameters on that data set, that is, the best result. The AUC is calculated as:

Here TP is the true positive rate, FP is the false positive rate, TN is the true negative rate and FN is the false negative rate. The relationship among the four indicators is shown in the following table.
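For a classifier that outputs hard labels, the ROC curve has a single operating point, and the area under the two line segments through that point is (1 + TPR - FPR) / 2. The sketch below uses that expression as an assumption about how a single-point AUC might be computed; it is not taken from the patent's own formula, and the function name and the confusion-matrix counts are hypothetical.

```python
def auc_from_counts(tp, fn, fp, tn):
    """Single-operating-point AUC from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # true positive rate: minority samples correctly found
    fpr = fp / (fp + tn)   # false positive rate: majority samples wrongly flagged
    return (1.0 + tpr - fpr) / 2.0

# Hypothetical counts: 10 minority samples and 100 majority samples.
auc = auc_from_counts(tp=8, fn=2, fp=10, tn=90)
print(auc)
```

Because the two rates are normalized within each class, this measure is not dominated by the majority class in the way plain accuracy is.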

BEPILD is first compared with the baseline algorithms. In the experiments, the nearest-neighbor algorithm (kNN) achieved a higher average AUC with k = 1 than with any other number of neighbors, so for brevity only the 1NN results are listed in the table. The best result on each data set is marked in bold. The results are given in the following table.

As the table shows, BEPILD obtains the highest AUC, that is, the best result, on all data sets.

Next, the average running times (training time plus test time) of the proposed BEPILD and the classic kNN algorithm on the six data sets are recorded in the following table, with the best result on each data set marked in bold:

The table shows that, although BEPILD has both a training process and a test process whereas kNN, as a lazy learner, has only a test process, BEPILD is more efficient in terms of time.

Claims

1. An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm, characterized in that it comprises the following steps:

1) Sample collection: according to the description of the specific imbalanced problem, the back end converts the collected samples into a vector model that subsequent algorithms can process;

2) First training, obtaining three classification decision surfaces: a classification strategy based on the pseudoinverse algorithm is first applied to the training sample points to obtain an approximate classification decision surface, and two new decision surfaces parallel to that decision surface and passing through the centroids of the two classes of samples are then generated; only the samples lying in the space between the two centroid decision surfaces are kept as the candidate subset, and the remaining samples are removed;

3) Second training, obtaining two distance vectors: the distance from each majority-class candidate sample point to the decision surface through the majority-class centroid is computed, the minority class is treated in the same way, and the distances of the samples of each class form two distance vectors;

4) Test phase, computing the probabilities that the test sample point belongs to the two classes of candidate samples and making the final decision: the distances from the current test sample point to the two centroid decision surfaces are obtained and compared with the two class distance vectors generated by the training module, and the final decision is made by judging on which side the test sample point is closer to the centroid decision surface.

2. The imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to claim 1, characterized in that the specific steps of the first training stage, obtaining three classification decision surfaces, comprise: first, generating the classification decision surface l_d with the pseudoinverse algorithm; second, constructing two classification surfaces l_pos and l_neg that are parallel to l_d and pass through the centroids of the two classes of training samples; then keeping the training samples lying between l_pos and l_neg as the candidate subset C and removing the remaining samples; finally, to further simplify the computation, judging whether the current data set is linearly separable; if the current data set is linearly separable, the classification decision surface obtained with the pseudoinverse algorithm can be used directly, and if it is linearly inseparable, proceeding to the next step.

3. The imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to claim 1, characterized in that, in the second training stage, the two distance vectors are obtained as follows: two distance vectors dis_pos and dis_neg are generated, where the i-th element of dis_pos is the distance from the i-th minority-class sample in the candidate subset C to the hyperplane l_pos, which passes through the minority-class centroid and is parallel to the decision surface l_d; dis_neg is defined in the same way.

4. The imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to claim 1, characterized in that, in the test phase, the specific steps of computing the probabilities that the test sample point belongs to the two classes of candidate samples comprise: first, computing the Euclidean distances from the test sample point z to the two centroid classification planes l_pos and l_neg, the distance to l_pos being denoted d(z, l_pos) and the distance to l_neg being denoted d(z, l_neg); second, comparing d(z, l_pos) with the elements of the distance vector dis_pos one by one, taking the number of elements of dis_pos whose value is greater than d(z, l_pos) as the numerator and the total number of elements of dis_pos as the denominator, the ratio of the two giving a probability, which is the probability that the test sample point z belongs to the minority class; similarly, comparing dis_neg with d(z, l_neg) to obtain the probability that the test sample belongs to the majority class; finally, comparing the two probabilities, the class label of z being assigned to the side with the larger probability.