CN107092927A - An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm - Google Patents

An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm

Info

Publication number
CN107092927A
CN107092927A CN201710211459.XA CN201710211459A CN107092927A CN 107092927 A CN107092927 A CN 107092927A CN 201710211459 A CN201710211459 A CN 201710211459A CN 107092927 A CN107092927 A CN 107092927A
Authority
CN
China
Prior art keywords
distance
sample
class
pos
decision
Prior art date
Legal status
Pending
Application number
CN201710211459.XA
Other languages
Chinese (zh)
Inventor
王喆
李冬冬
朱昱锦
高大启
Current Assignee
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date
Filing date
Publication date
Application filed by East China University of Science and Technology
Priority to CN201710211459.XA
Publication of CN107092927A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/245 Classification techniques relating to the decision surface
    • G06F18/2451 Classification techniques relating to the decision surface linear, e.g. hyperplane
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm. Collected samples are first converted into vectors. An initial decision surface is generated with the pseudoinverse algorithm, and two new decision surfaces are constructed parallel to it, each passing through the centroid of one of the two classes. Samples lying between the two new decision surfaces are retained as candidates, and the remaining samples are removed. Next, the distance from each majority-class candidate to the hyperplane through the majority-class centroid is computed, and the minority class is treated in the same way, giving one distance vector per class. Finally, the distances from a test sample point to the two new decision surfaces are compared with the two class distance vectors, and for each class the number of candidate samples that lie farther from the plane than the test sample is counted. The test sample is assigned to the class with the larger count. Compared with traditional classification methods, the invention trains in two steps, combining the boundary-based pseudoinverse algorithm with a heuristic, non-boundary-based nearest-neighbor algorithm, which improves classification accuracy and greatly shortens parameter-tuning time.

Description

An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm
Technical field
The invention relates to the field of pattern classification, and in particular to a boundary-eliminated pseudoinverse method and system for recognizing imbalanced datasets.
Background
Pattern recognition studies how computers can imitate or realize the recognition abilities of humans or other animals, so that the objects under study can be recognized automatically. In recent years, pattern recognition has been widely applied in many important fields, including artificial intelligence, machine learning, computer engineering, robotics, neurobiology, medicine, criminal investigation, archaeology, geological prospecting, astronautics and weapons technology. As the range of applications expands, however, traditional pattern recognition faces new challenges, a prominent one being the handling of imbalanced data. Imbalanced data are data in which, among the classes present, some classes have far fewer samples than the others. For simplicity, a class with few samples is called a minority class and a class with many samples a majority class. In practice, misclassifying a minority-class sample usually costs more than misclassifying a majority-class sample. In medical diagnosis, for example, misjudging a potential patient costs more than misjudging a person who is actually healthy. Large amounts of imbalanced data likewise arise in fields such as error detection, hard measurement, financial forecasting and medical testing.
When traditional pattern classification methods handle imbalanced problems, the influence of the majority-class samples often leads to strongly biased results. Specific methods have been designed to address this. Current methods aimed at imbalanced problems fall into three classes. The first class is dataset-based: sampling techniques are used in the preprocessing stage to remove surplus majority-class samples or to generate minority-class samples, so that the dataset becomes balanced before it is passed to a conventional classification model. Representative algorithms include One Side Selection (a one-sided undersampling algorithm) and the Synthetic Minority Oversampling Technique. The second class is cost-sensitive: misclassified samples are assigned different costs, which corrects the bias that imbalanced data cause in conventional models; in general, a misclassified minority-class sample receives a higher cost than a misclassified majority-class sample. Representative algorithms include cost-sensitive locality preserving projections, cost-sensitive principal component analysis and cost-sensitive linear discriminant analysis. The third class is ensemble-based: different weak classifiers are combined to make a joint decision on the dataset. Representative algorithms include AdaCost.
At present, all three classes of methods have shortcomings. The first class is relatively easy to implement, but most of its methods need all, or at least most, of the samples in the training module, so they cannot properly handle problems that change rapidly or lack sufficient prior information. Sampling-based methods are also now mainly used in combination with other methods and cannot serve as a stand-alone model. The second and third classes are often complex and require many parameters to be tuned to obtain optimal values. In addition, computing the costs in the second class requires traversing most of the samples, which reduces efficiency, and the third class must feed the same batch of data into different sub-classifiers, which likewise reduces efficiency. A method with a simple structure and few parameters that can still correct the bias well would further improve the ability of pattern classification techniques to handle imbalanced problems.
Summary of the invention
Prior-art methods are structurally complex, inefficient and insufficiently accurate, and cannot handle imbalanced problems that are large-scale, real-time or lacking in prior knowledge. To address this, the invention provides an imbalanced dataset classification method based on the boundary-eliminated pseudoinverse algorithm. A typical boundary-based linear classifier, the pseudoinverse algorithm, performs a first training pass on a small-scale sample to obtain a candidate subset. A heuristic class-boundary method then performs a second training pass on the screened samples to obtain the boundary. Finally, unknown samples are tested using the resulting fuzzy boundary and a similarity-measurement strategy. This preserves classification accuracy on imbalanced datasets while improving efficiency in both model design and model computation.
The technical solution adopted by the invention to solve its technical problem (taking a two-class imbalanced problem as an example) is as follows. First, the back end converts the collected samples, according to the description of the specific imbalanced problem, into a vector model that subsequent algorithms can process. Second, the vectorized dataset is divided into a training set and a test set; if data are limited, all of them can be used for training. In the training step, the first training pass applies a pseudoinverse-based classification strategy to the training samples to mark out an approximate classification decision surface, and then generates two new decision surfaces that are parallel to it and pass through the centroid of each class. Only the samples in the space between the two centroid decision surfaces are retained as the candidate subset; the remaining samples are removed. In the second training pass, the distance from each majority-class candidate to the decision surface through the majority-class centroid is obtained, and the minority class is treated in the same way, so that the distances of the two classes form two distance vectors. Third, in the test phase, the distances from the current test sample point to the two centroid decision surfaces are obtained and compared with the respective class distance vectors generated by the training module, and the final decision is made by judging on which side the test sample point is relatively closer to the centroid decision surface. When the two sides are equal, the algorithm falls back on the two-class decision surface generated in the first training pass, and the whole method degenerates to the original pseudoinverse algorithm. Finally, the determined class label is output.
The technical solution can be refined further. In the first training step of the training module, the pseudoinverse search for the decision surface can use various improved variants, provided the modification preserves the complexity and training speed. The model can also replace the pseudoinverse algorithm with any suitable linear classification model; because the pseudoinverse algorithm has the simplest structure of all linear classification models, however, the invention is still practiced with it. In addition, after the first training step, when the two decision surfaces through the centroids have been generated, efficiency can be improved by first checking whether the data of the two classes overlap: if they do not, the dataset is linearly separable and the subsequent steps need not be performed. Finally, the similarity measure used in the second training step and in testing defaults to the Euclidean distance, but any metric can be used depending on the situation, such as the cosine distance or the Mahalanobis distance.
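As an illustration of swapping the default Euclidean metric, the sketch below computes point-to-hyperplane distances under either the Euclidean or the Mahalanobis metric. It is only a sketch under assumptions: the hyperplane is taken to be written as w·x + b = 0, the function name `hyperplane_distance` and the `cov` argument are hypothetical, and the Mahalanobis branch uses the closed form |w·x + b| / sqrt(wᵀΣw), which is a standard result rather than something stated in the patent.

```python
import numpy as np

def hyperplane_distance(X, w, b, cov=None):
    """Distance from each row of X to the hyperplane w.x + b = 0.

    cov=None gives the Euclidean distance. Passing a covariance matrix
    (an assumption of this sketch) gives the Mahalanobis distance to the
    nearest point on the hyperplane.
    """
    X = np.atleast_2d(np.asarray(X, dtype=float))
    w = np.asarray(w, dtype=float)
    if cov is None:
        scale = np.linalg.norm(w)
    else:
        scale = np.sqrt(w @ np.asarray(cov, dtype=float) @ w)
    return np.abs(X @ w + b) / scale

points = np.array([[2.0, 0.0], [0.0, 5.0]])
w, b = np.array([1.0, 0.0]), 0.0
euclid = hyperplane_distance(points, w, b)
mahal = hyperplane_distance(points, w, b, cov=np.diag([4.0, 1.0]))
```

With the first feature's variance set to 4, the first point's distance of 2 shrinks to 1 under the Mahalanobis metric, while the point lying on the hyperplane stays at 0.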
The beneficial effects of the invention are as follows. The structural simplicity of the boundary-eliminated pseudoinverse algorithm allows rapid feedback on imbalanced problems. Two-step training combines the boundary-based pseudoinverse algorithm with a heuristic, non-boundary-based nearest-neighbor algorithm, improving classification accuracy. Because the pseudoinverse method itself is simple, only one parameter needs to be preset, which greatly shortens parameter-tuning time. The boundary formed by the method is fuzzy, so the bias does not become excessive even when training samples are few.
Brief description of the drawings
Fig. 1 is the system framework of the imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to the invention.
Fig. 2 is a flow chart of the training step of the invention.
Fig. 3 is a flow chart of the testing step of the invention.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments. The method of the invention is divided into three modules.
Part I: Data acquisition
Data acquisition digitizes the real-world imbalanced problem and generates a dataset in vector form for the subsequent modules to process. Each sample generates one vector $x$; each element of the vector corresponds to one attribute of the sample, and the dimension $d$ of the vector is the number of attributes of the sample: $x = [x_1, x_2, \ldots, x_d]^\top$.
In a two-class problem, the minority-class samples and the majority-class samples are each merged into one matrix, namely the minority-class matrix $X_{pos}$ and the majority-class matrix $X_{neg}$. The two matrices are then merged into one large matrix $X_{all}$, which is the dataset of the whole problem (the matrix layout is not reproduced in this text).
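A minimal sketch of this data layout follows. It assumes one sample per row, which is consistent with the later step that prepends a column of ones to $X_{all}$; the function name `build_dataset` and the toy values are hypothetical.

```python
import numpy as np

def build_dataset(minority_samples, majority_samples):
    """Stack per-class samples into X_pos, X_neg and the merged X_all."""
    X_pos = np.asarray(minority_samples, dtype=float)  # shape (n_pos, d)
    X_neg = np.asarray(majority_samples, dtype=float)  # shape (n_neg, d)
    X_all = np.vstack([X_pos, X_neg])                  # shape (n_pos + n_neg, d)
    return X_pos, X_neg, X_all

# Toy data: 1 minority sample and 3 majority samples with d = 2 attributes.
X_pos, X_neg, X_all = build_dataset(
    [[5.1, 0.2]],
    [[1.0, 3.3], [0.9, 3.1], [1.2, 2.8]],
)
```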
Part II: Training the classification model
In this module, the collected dataset is fed into the core algorithm of the invention for training. The main steps are as follows.
1) Generate the classification decision surface $l_d$ with the pseudoinverse algorithm. The pseudoinverse algorithm is a typical linear classification algorithm whose purpose is to generate a decision surface; a test sample falling on one side of the decision surface is judged to belong to the class on that side. The decision surface equation is expressed as $w^\top x + w_0 = 0$.
The pseudoinverse algorithm first adds an all-ones vector as the first column of the dataset matrix $X_{all}$, expanding it into an augmented matrix (matrix not reproduced in this text).
The optimal $w$ and $w_0$ are then obtained from the pseudoinverse formula (formula not reproduced in this text),
where $d$ in that formula denotes the preset theoretical class-label parameter (distinct from the sample dimension $d$ used in Part I).
2) Through the centroids of the two classes of training samples, construct two classification surfaces $l_{pos}$ and $l_{neg}$ parallel to $l_d$. The two class centroids are computed by a formula that is not reproduced in this text.
3) Generate the candidate subset: training samples located between $l_{pos}$ and $l_{neg}$ are retained as the candidate subset $C$, and the remaining samples are removed.
4) Judge whether the current dataset is linearly separable: check whether the two classes of samples in $C$ overlap. If they do not, the current dataset is linearly separable, an ordinary linear classification model can complete the recognition, and the decision surface $l_d$ generated by the pseudoinverse algorithm is used for subsequent testing; if they do, proceed to the next step. Overlap is judged by whether the points in $C$ farthest from $l_{pos}$ and $l_{neg}$, denoted $x'_{pmax}$ and $x'_{nmax}$, satisfy an inequality (not reproduced in this text).
If the inequality is satisfied, the dataset is linearly inseparable.
5) Generate two distance vectors $dis_{pos}$ and $dis_{neg}$: let $d(x, l)$ denote the Euclidean distance from a sample point $x$ to a classification surface $l$. The distance from every majority-class point in $C$ to $l_{neg}$ is computed in turn and stored in $dis_{neg}$; likewise, the distance from every minority-class point in C to $l_{pos}$ is stored in $dis_{pos}$.
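The training steps above can be sketched in NumPy roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `fit_bepild`, the +1/-1 target encoding and the returned dictionary are hypothetical; the pseudoinverse solve is written as `pinv(Y) @ targets` because the source formula is not reproduced in this text; and step 4 (the separability check) is omitted because its inequality is not available.

```python
import numpy as np

def fit_bepild(X_pos, X_neg, t_pos=1.0, t_neg=-1.0):
    """Two-pass training sketch; returns the parameters needed at test time."""
    X_all = np.vstack([X_pos, X_neg])
    targets = np.concatenate([np.full(len(X_pos), t_pos),
                              np.full(len(X_neg), t_neg)])

    # Step 1: prepend a column of ones and solve by pseudoinverse
    # (assumed form of the solve; see the lead-in).
    Y = np.hstack([np.ones((len(X_all), 1)), X_all])
    w_aug = np.linalg.pinv(Y) @ targets
    w0, w = w_aug[0], w_aug[1:]

    # Step 2: hyperplanes parallel to l_d through each class centroid,
    # written as w.x + b_pos = 0 and w.x + b_neg = 0.
    b_pos = -(w @ X_pos.mean(axis=0))
    b_neg = -(w @ X_neg.mean(axis=0))

    # Step 3: keep samples lying between the two centroid hyperplanes,
    # i.e. whose signed values against the two planes differ in sign.
    def between(X):
        return (X @ w + b_pos) * (X @ w + b_neg) <= 0.0
    C_pos, C_neg = X_pos[between(X_pos)], X_neg[between(X_neg)]

    # Step 5: Euclidean distance of each candidate to its own class plane.
    norm = np.linalg.norm(w)
    dis_pos = np.abs(C_pos @ w + b_pos) / norm
    dis_neg = np.abs(C_neg @ w + b_neg) / norm
    return {"w": w, "w0": w0, "b_pos": b_pos, "b_neg": b_neg,
            "dis_pos": dis_pos, "dis_neg": dis_neg}

# Toy data: the classes differ only along the first feature.
X_pos = np.array([[0.0, 1.0], [0.0, -1.0], [2.0, 1.0], [2.0, -1.0]])
X_neg = np.array([[4.0, 1.0], [4.0, -1.0], [5.0, 1.0], [5.0, -1.0],
                  [7.0, 1.0], [7.0, -1.0], [8.0, 1.0], [8.0, -1.0]])
model = fit_bepild(X_pos, X_neg)
```

In this toy case the centroid planes sit at first-feature values 1 and 6, so only the minority points at 2 and the majority points at 4 and 5 survive as candidates.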
Part III: Testing unknown data
In this module, the unknown data whose class labels need to be determined are fed into the trained model, and the model makes the decision. Let the unknown sample be $z$. Testing comprises the following steps.
1) Compute the Euclidean distances from the test sample point $z$ to the two classification planes through the centroids, i.e. obtain $d(z, l_{pos})$ and $d(z, l_{neg})$.
2) Compare $d(z, l_{pos})$ with $dis_{pos}$ to obtain the probability that $z$ is classified into the minority class: $P(y_{pos} \sim z) = \dfrac{|\{\, i : dis_{pos}[i] > d(z, l_{pos}) \,\}|}{|dis_{pos}|}$,
where the numerator is the number of elements of $dis_{pos}$ greater than $d(z, l_{pos})$, i.e. the number of minority-class samples in $C$ that are farther from $l_{pos}$ than the test sample $z$ is, and the denominator is the total number of minority-class samples in $C$.
3) Similarly, compare $d(z, l_{neg})$ with $dis_{neg}$: $P(y_{neg} \sim z) = \dfrac{|\{\, i : dis_{neg}[i] > d(z, l_{neg}) \,\}|}{|dis_{neg}|}$.
4) Compare $P(y_{pos} \sim z)$ with $P(y_{neg} \sim z)$; the class label of $z$ is decided in favor of the side with the larger probability.
5) If $P(y_{pos} \sim z)$ and $P(y_{neg} \sim z)$ are equal, $z$ is decided by the decision equation of $l_d$. In this case the algorithm degenerates to the original pseudoinverse method.
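The test steps above can be sketched as a single function. This is an illustration under assumptions: the name `predict_bepild` and the +1 (minority) / -1 (majority) return values are hypothetical, the hyperplanes are assumed to be stored as w·x + b = 0, the tie-break assumes the minority class was trained with the positive target so that w·z + w0 ≥ 0 means minority, and both distance vectors are assumed to be non-empty.

```python
import numpy as np

def predict_bepild(z, w, w0, b_pos, b_neg, dis_pos, dis_neg):
    """Return +1 (minority) or -1 (majority) for one test sample z."""
    z = np.asarray(z, dtype=float)
    norm = np.linalg.norm(w)
    # Step 1: distances from z to the two centroid hyperplanes.
    d_pos = abs(w @ z + b_pos) / norm
    d_neg = abs(w @ z + b_neg) / norm
    # Steps 2-3: fraction of candidates lying farther from the plane than z.
    p_pos = np.count_nonzero(dis_pos > d_pos) / len(dis_pos)
    p_neg = np.count_nonzero(dis_neg > d_neg) / len(dis_neg)
    # Step 4: the larger probability wins.
    if p_pos > p_neg:
        return 1
    if p_neg > p_pos:
        return -1
    # Step 5: on a tie, fall back to the original decision surface l_d.
    return 1 if w @ z + w0 >= 0 else -1

# Hand-made parameters: l_pos at x1 = 4, l_neg at x1 = 0, l_d at x1 = 2.
w, w0 = np.array([1.0, 0.0]), -2.0
b_pos, b_neg = -4.0, 0.0
dis_pos = np.array([0.5, 1.0, 1.5])
dis_neg = np.array([0.5, 1.0, 1.5])
args = (w, w0, b_pos, b_neg, dis_pos, dis_neg)

near_minority = predict_bepild([3.2, 0.0], *args)
near_majority = predict_bepild([0.3, 0.0], *args)
tie_right = predict_bepild([10.0, 0.0], *args)
tie_left = predict_bepild([-10.0, 0.0], *args)
```

The last two calls lie far from both planes, so both probabilities are zero and the sign of w·z + w0 decides the label.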
Experimental results
1) Experimental datasets: six imbalanced datasets are selected from the open-source Knowledge Extraction based on Evolutionary Learning (KEEL) dataset repository. The number of classes, sample dimension, size (total number of samples) and imbalance ratio IR of the selected datasets are listed in a table (not reproduced in this text). Datasets with IR greater than 9 are moderately or more severely imbalanced.
Let $n_{neg}$ be the number of majority-class samples and $n_{pos}$ the number of minority-class samples; the imbalance ratio is computed as $IR = n_{neg} / n_{pos}$.
All datasets are processed by five-fold cross-validation: each dataset is divided into five roughly equal parts, and in each round one part is chosen as test data while the other four parts are training data. This is done five times, with no part chosen as test data more than once.
2) Compared algorithms: the core algorithm of the invention, the boundary-eliminated pseudoinverse algorithm, is abbreviated BEPILD. kNN, SVM (Linear), SVM (Poly) and SVM (RBF) are chosen as baseline algorithms. The parameter descriptions and value ranges of each algorithm are set out in a table (not reproduced in this text).
3) Performance metric: all experiments use the area under the receiver operating characteristic curve (AUC) to record the classification results of the different methods on each dataset. Each reported result is the one obtained when the corresponding algorithm is configured with the optimal parameters on that dataset, i.e. the best result. AUC is computed by a formula that is not reproduced in this text,
where TP is the true positive rate, FP is the false positive rate, TN is the true negative rate and FN is the false negative rate. The relation among the four indicators is given in a table (not reproduced in this text).
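The evaluation protocol described in items 1) and 3) can be sketched as below. Several parts are assumptions rather than the patent's own formulas: IR is taken as the majority count divided by the minority count, the folds are made by a plain shuffled split (the text does not say whether the split is stratified), and AUC is taken as the common single-operating-point form (1 + TPR - FPR) / 2, since the source formula is not reproduced in this text. All function names are hypothetical.

```python
import numpy as np

def imbalance_ratio(y, minority=1):
    """IR = n_neg / n_pos (assumed definition)."""
    y = np.asarray(y)
    n_pos = np.count_nonzero(y == minority)
    return (y.size - n_pos) / n_pos

def five_fold_indices(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 5)
    for k in range(5):
        train_idx = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train_idx, folds[k]

def single_point_auc(y_true, y_pred, minority=1):
    """AUC from hard labels as (1 + TPR - FPR) / 2 (assumed form)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    pos = y_true == minority
    tpr = np.mean(y_pred[pos] == minority)   # true positive rate
    fpr = np.mean(y_pred[~pos] == minority)  # false positive rate
    return (1.0 + tpr - fpr) / 2.0

y_true = np.array([1, 1, -1, -1, -1, -1])
y_pred = np.array([1, -1, -1, -1, -1, 1])
ir = imbalance_ratio(y_true)
auc = single_point_auc(y_true, y_pred)
splits = list(five_fold_indices(23))
```

In the toy labels, one of two minority samples and one of four majority samples are predicted as minority, so TPR is 0.5 and FPR is 0.25.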
BEPILD is first compared with the baseline algorithms. In the experiments, the nearest-neighbor algorithm (kNN) achieves a higher average AUC with k = 1 than with other values, so for brevity only the 1NN results are listed. The best result on each dataset is marked in bold. The results table is not reproduced in this text.
According to that table, BEPILD obtains the highest AUC, i.e. the best result, on all datasets.
Next, the average time (the sum of training time and test time) of the proposed BEPILD and the classic kNN algorithm on the six datasets is recorded in a table (not reproduced in this text), with the best result on each dataset marked in bold.
The results show that although BEPILD has both a training and a testing process, whereas kNN, as a lazy learner, has only a testing process, BEPILD is more efficient in terms of time.

Claims (4)

1. An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm, characterized by the following specific steps:
1) Sample collection: the back end converts the collected samples, according to the description of the specific imbalanced problem, into a vector model that subsequent algorithms can process;
2) First training, obtaining three classification decision surfaces: a pseudoinverse-based classification strategy is first applied to the training samples to mark out an approximate classification decision surface, and two new decision surfaces that are parallel to it and pass through the centroids of the two classes are then generated; only the samples in the space between the two centroid decision surfaces are retained as the candidate subset, and the remaining samples are removed;
3) Second training, obtaining two distance vectors: the distance from each majority-class candidate to the decision surface through the majority-class centroid is obtained, the minority class is treated in the same way, and the distances of the samples of the two classes form two distance vectors;
4) Test phase, computing the probabilities that the test sample point belongs to the two classes of candidate samples and making the final decision: the distances from the current test sample point to the two centroid decision surfaces are obtained and compared with the respective class distance vectors generated by the training module, and the final decision is made by judging on which side the test sample point is relatively closer to the centroid decision surface.
2. The imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to claim 1, characterized in that the specific steps of the first training stage, obtaining three classification decision surfaces, comprise: first, generating the classification decision surface $l_d$ with the pseudoinverse algorithm; second, constructing, through the centroids of the two classes of training samples, two classification surfaces $l_{pos}$ and $l_{neg}$ parallel to $l_d$; then retaining the training samples located between $l_{pos}$ and $l_{neg}$ as the candidate subset $C$ and removing the remaining samples; finally, to simplify computation further, judging whether the current dataset is linearly separable; if it is linearly separable, the classification decision surface is obtained directly with the pseudoinverse algorithm, and if it is linearly inseparable, the next step is entered.
3. The imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to claim 1, characterized in that the second training stage, obtaining two distance vectors, is as follows: two distance vectors $dis_{pos}$ and $dis_{neg}$ are generated, where the i-th element of $dis_{pos}$ is the distance from the i-th minority-class sample in the candidate subset $C$ to the hyperplane $l_{pos}$, which passes through the minority-class centroid and is parallel to the decision surface $l_d$; $dis_{neg}$ is defined in the same way for the majority class.
4. The imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm according to claim 1, characterized in that the specific steps of the test phase, computing the probabilities that the test sample point belongs to the two classes of candidate samples, comprise: first, computing the Euclidean distances from the test sample point $z$ to the two centroid classification planes $l_{pos}$ and $l_{neg}$, the distance to $l_{pos}$ being denoted $d(z, l_{pos})$ and the distance to $l_{neg}$ being denoted $d(z, l_{neg})$; second, comparing $d(z, l_{pos})$ with the elements of the distance vector $dis_{pos}$ one by one, taking the number of elements of $dis_{pos}$ greater than $d(z, l_{pos})$ as the numerator and the total number of elements of $dis_{pos}$ as the denominator, the ratio of the two being the probability that the test sample point $z$ belongs to the minority class; similarly, comparing $dis_{neg}$ with $d(z, l_{neg})$ to obtain the probability that the test sample belongs to the majority class; finally, comparing the two probabilities, the class label of $z$ being decided in favor of the side with the larger probability.
CN201710211459.XA 2017-04-01 2017-04-01 An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm Pending CN107092927A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710211459.XA CN107092927A (en) 2017-04-01 2017-04-01 An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710211459.XA CN107092927A (en) 2017-04-01 2017-04-01 An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm

Publications (1)

Publication Number Publication Date
CN107092927A (en) 2017-08-25

Family

ID=59646424

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710211459.XA Pending CN107092927A (en) 2017-04-01 2017-04-01 An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm

Country Status (1)

Country Link
CN (1) CN107092927A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492096A (en) * 2018-10-23 2019-03-19 华东理工大学 A kind of unbalanced data categorizing system integrated based on geometry
CN110348481A (en) * 2019-06-05 2019-10-18 华东理工大学 One kind being based on the gravitational network inbreak detection method of neighbour's sample
CN110348481B (en) * 2019-06-05 2023-04-28 华东理工大学 Network intrusion detection method based on universal gravitation of neighbor samples

Similar Documents

Publication Publication Date Title
Jiang et al. Beyond synthetic noise: Deep learning on controlled noisy labels
CN104573669A (en) Image object detection method
CN112070128A (en) Transformer fault diagnosis method based on deep learning
CN108416373A (en) A kind of unbalanced data categorizing system based on regularization Fisher threshold value selection strategies
CN113988215B (en) Power distribution network metering cabinet state detection method and system
Das et al. An oversampling technique by integrating reverse nearest neighbor in SMOTE: Reverse-SMOTE
CN108877947A (en) Depth sample learning method based on iteration mean cluster
Gohar et al. Terrorist group prediction using data classification
Wang et al. DPGCN model: a novel fault diagnosis method for marine diesel engines based on imbalanced datasets
Yuan et al. Review of resampling techniques for the treatment of imbalanced industrial data classification in equipment condition monitoring
Wan et al. Logit inducing with abnormality capturing for semi-supervised image anomaly detection
CN114093445B (en) Patient screening marking method based on partial multi-marking learning
Ghosh et al. Leukox: leukocyte classification using least entropy combiner (lec) for ensemble learning
CN107092927A (en) An imbalanced data classification system based on the boundary-eliminated pseudoinverse algorithm
Liu et al. A dual-branch balance saliency model based on discriminative feature for fabric defect detection
Liu et al. MRD-NETS: multi-scale residual networks with dilated convolutions for classification and clustering analysis of spacecraft electrical signal
CN108898157B (en) Classification method for radar chart representation of numerical data based on convolutional neural network
Palaniappan et al. Diagnosis of acute respiratory syndromes from x-rays using customised CNN architecture
Zha et al. Recognizing plans by learning embeddings from observed action distributions
Tusar et al. Detecting chronic kidney disease (CKD) at the initial stage: A novel hybrid feature-selection method and robust data preparation pipeline for different ML techniques
Ma et al. A Novel Fuzzy Neural Network Architecture Search Framework for Defect Recognition With Uncertainties
Abhilasa et al. Classification of agricultural leaf images using hybrid combination of activation functions
Dwivedi et al. EMViT-Net: A novel transformer-based network utilizing CNN and multilayer perceptron for the classification of environmental microorganisms using microscopic images
Zhou et al. Imbalanced Multi-Fault Diagnosis via Improved Localized Feature Selection
Pristyanto et al. Comparison of ensemble models as solutions for imbalanced class classification of datasets

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170825