CN107092927A - An imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm - Google Patents
- Publication number
- CN107092927A (application number CN201710211459.XA)
- Authority
- CN
- China
- Prior art keywords
- distance
- sample
- class
- pos
- decision
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/245—Classification techniques relating to the decision surface
- G06F18/2451—Classification techniques relating to the decision surface linear, e.g. hyperplane
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The present invention provides an imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm. The collected samples are first converted into vectors. An original decision surface is generated with the pseudoinverse algorithm, and two new decision surfaces are constructed parallel to it, each passing through the centroid of one of the two classes; the samples lying between the two new decision surfaces are retained as candidates and the remaining samples are removed. Next, the distance from each candidate majority-class sample to the hyperplane through the majority-class centroid is computed, the minority class is treated in the same way, and the distances of each class form that class's distance vector. Finally, the distances from a test sample point to the two new decision surfaces are compared with the two class distance vectors, and for each class the number of candidate samples lying farther from the plane than the test sample is counted. The test sample is assigned to the class with the larger count. Compared with traditional classification methods, the invention trains in two steps, fusing the boundary-based pseudoinverse algorithm with a non-boundary-based heuristic nearest neighbor algorithm, which improves classification accuracy and greatly shortens tuning time.
Description
Technical field
The present invention relates to the field of pattern classification technology, and in particular to a boundary-elimination pseudoinverse method and system for recognizing and processing imbalanced datasets.
Background art
Pattern recognition studies how to use computers to imitate or realize the recognition abilities of humans or other animals, so that the task of automatically recognizing the objects under study can be completed. In recent years, pattern recognition technology has been widely applied in many important fields such as artificial intelligence, machine learning, computer engineering, robotics, neurobiology, medicine, criminal investigation, archaeology, geological prospecting, astronautics, and weapons technology. However, as the range of applications expands, traditional pattern recognition technology faces new challenges. A prominent one is the problem of processing imbalanced data. Imbalanced data is data in which, among its classes, the number of samples in some classes is much smaller than in the remaining classes. For simplicity, a class with few samples is called a minority class and a class with many samples is called a majority class. In practical applications, misclassifying a minority-class sample often costs more than misclassifying a majority-class sample. In medical diagnosis, for example, misjudging a potential patient costs more than misjudging a person who is actually healthy. Likewise, large amounts of imbalanced data exist in fields such as error detection, hard measurement, financial forecasting, and medical testing.
When traditional pattern classification methods handle imbalance problems, the influence of the majority-class samples frequently leads to strongly biased results. To solve imbalance problems, a number of dedicated methods have been designed. At present, methods designed specifically for imbalance problems fall into three classes. The first class consists of dataset-based methods, which use sampling techniques in the preprocessing stage to cut down redundant majority-class samples or to generate minority-class samples so that the dataset becomes balanced, and then feed the result into a subsequent traditional classification model. Representative algorithms include the one-sided down-sampling algorithm (One Side Selection) and the synthetic minority over-sampling algorithm (Synthetic Minority Oversampling Technique). The second class consists of cost-sensitive methods, which assign costs of different weights to misclassified samples in order to correct the bias that imbalanced data causes in conventional models; in general, a misclassified minority-class sample receives a higher cost than a misclassified majority-class sample. Representative algorithms include cost-sensitive locality preserving projections, cost-sensitive principal component analysis, and cost-sensitive linear discriminant analysis. The third class consists of ensemble-based methods, which combine different weak classifiers to make a comprehensive judgment on the dataset. Representative algorithms include AdaCost.
At present, all three classes of methods have their own shortcomings. Methods of the first class are easy to implement, but most of them need to feed all, or at least most, of the samples into the training module, so they cannot properly handle problems that are highly dynamic or that lack prior information. Moreover, sampling-based methods of the first class are now mainly used in combination with other methods and cannot stand alone as a model that solves the problem. Methods of the second and third classes tend to be structurally complex and require tuning a large number of parameters to reach the optimum. Besides computing costs, methods of the second class need to traverse most of the samples, which reduces efficiency. Methods of the third class need to feed the same batch of data into different sub-classification models, which likewise reduces efficiency. A method with a simple structure and few parameters that also corrects the bias well would further improve the ability of pattern classification technology to handle imbalance problems.
Summary of the invention
Existing techniques are structurally complex, inefficient, and not sufficiently accurate, and cannot cope with imbalance problems that are large-scale, real-time, or lacking in prior knowledge. To address this, the invention provides an imbalanced dataset classification method based on the boundary-elimination pseudoinverse algorithm. The pseudoinverse algorithm, a typical boundary-based linear classification technique, performs a first round of training on a small-scale sample to obtain a candidate subset; a heuristic class-boundary method performs a second round of training on the filtered samples to obtain the boundary; finally, unknown samples are tested with the resulting fuzzy boundary and a similarity measurement strategy. In this way, efficiency is improved in both model design and model computation while the classification accuracy on imbalanced datasets is maintained.
The technical solution adopted by the present invention to solve its technical problem is as follows (taking a two-class imbalance problem as an example). First, according to the description of the specific imbalance problem, the back end converts the collected samples into a vector model that subsequent algorithms can process. Second, the dataset in vector representation is divided into a training dataset and a test dataset; if the dataset is limited, all of it can be used for training. In the training step, the first round of training applies a classification strategy based on the pseudoinverse algorithm to the training sample points to mark out an approximate classification decision surface, and then generates two new decision surfaces that are parallel to this decision surface and pass through the centroids of the respective classes. Only the samples located in the space between the two centroid decision surfaces are kept as the candidate subset; the remaining samples are removed. In the second round of training, the distance from each candidate majority-class sample point to the decision surface through the majority-class centroid is obtained, the minority class is treated in the same way, and the distances of the samples of the two classes form two distance vectors. Third, in the test stage, after the distances from the current test sample point to the two centroid decision surfaces are obtained, these two distances are compared with the two class distance vectors generated by the training module, and the final decision is made by judging on which side the test sample point is closer to the centroid decision surface. When the two sides tie, the algorithm decides with the two-class decision surface generated in the first round of training, and the whole method degenerates to the original pseudoinverse algorithm. Finally, the determined class label is output.
The technical solution adopted by the present invention can be further refined. In the first training step of the training module, the pseudoinverse method may use any of various improved variants to find the decision surface, provided that the modified method preserves the complexity and training speed. Furthermore, any suitable linear classification model may replace the pseudoinverse algorithm. However, because the pseudoinverse algorithm has the simplest structure of all linear classification models, the invention is still practiced with it. In addition, after the first training step of the training module, when the two decision surfaces through the centroids are generated, efficiency can be improved by first determining whether the two classes of data overlap; if they do not, the dataset is linearly separable and the subsequent steps need not be performed. Finally, in the similarity measurement steps of the second round of training and of testing, the similarity measure defaults to the Euclidean distance, but any metric, such as the cosine distance or the Mahalanobis distance, may be used depending on the situation.
The invention has the following advantages: the simplicity of the boundary-elimination pseudoinverse algorithm's structure enables rapid feedback on imbalance problems; the two-step training fuses the boundary-based pseudoinverse algorithm with a non-boundary-based heuristic nearest neighbor algorithm, which improves classification accuracy; owing to the simplicity of the pseudoinverse method itself, the method needs only one preset parameter, which greatly shortens tuning time; and because the boundary formed by the method is fuzzy, the bias does not become excessive even when the number of training samples is small.
Brief description of the drawings
Fig. 1 is the system framework of the imbalanced data classification system based on the boundary-elimination pseudoinverse algorithm of the present invention.
Fig. 2 is a flow chart of the training step of the present invention.
Fig. 3 is a flow chart of the testing step of the present invention.
Embodiment
The invention is further described below with reference to the accompanying drawings and embodiments. The method of the present invention is divided into three modules.
Part I: Data acquisition
Data acquisition digitizes the real-world imbalance problem and generates a dataset in vector representation so that subsequent modules can process it. Each sample generates one vector x; each element of the vector corresponds to one attribute of the sample, and the dimension d of the vector is the number of attributes of the sample:
$$x = [x_1, x_2, \ldots, x_d]^T$$
In a two-class problem, the minority-class samples and the majority-class samples are each merged into one matrix, namely the minority-class matrix X_pos and the majority-class matrix X_neg. The two matrices are then merged into one large matrix X_all, which is the dataset of the whole problem:
$$X_{all} = \begin{bmatrix} X_{pos} \\ X_{neg} \end{bmatrix}$$
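The data layout described in Part I can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the toy attribute values, the helper name `build_dataset`, and the label values +1 (minority) and -1 (majority) are all assumptions made for the example.

```python
def build_dataset(x_pos, x_neg):
    """Stack minority rows on top of majority rows and return (x_all, labels).

    Each sample is a list of d attribute values. The label values +1 and -1
    are an assumption; the source only says that the labels are preset.
    """
    dims = {len(x) for x in x_pos + x_neg}
    if len(dims) != 1:
        raise ValueError("all samples must have the same dimension d")
    x_all = [list(x) for x in x_pos] + [list(x) for x in x_neg]
    labels = [1.0] * len(x_pos) + [-1.0] * len(x_neg)
    return x_all, labels


# Hypothetical toy data with d = 2 attributes per sample.
X_pos = [[1.0, 2.0], [1.5, 1.8]]                          # minority class
X_neg = [[5.0, 8.0], [6.0, 9.0], [5.5, 7.5], [7.0, 8.5]]  # majority class
X_all, y = build_dataset(X_pos, X_neg)
print(len(X_all), len(X_all[0]))  # number of samples and dimension d
```

Keeping one sample per row matches the later step in which a column of ones is prepended to X_all.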
Part II: Training the classification model
In this module, the collected dataset is fed into the core algorithm of the invention for training. The main steps are as follows.
1) Generate the classification decision surface l_d with the pseudoinverse algorithm: The pseudoinverse algorithm is a typical linear classification algorithm whose purpose is to generate a decision surface; a test sample that falls on one side of the decision surface is judged to belong to the class on that side. The decision surface equation is expressed as:
$$w^T x + w_0 = 0$$
The pseudoinverse algorithm first prepends an all-ones column to the dataset matrix X_all, expanding it into an augmented matrix:
$$\tilde{X}_{all} = [\mathbf{1}, X_{all}]$$
The optimal w and w_0 are then obtained from the pseudoinverse formula:
$$[w_0, w^T]^T = \tilde{X}_{all}^{+} d$$
where d is the preset theoretical class label parameter (not the sample dimension d of Part I).
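2) Through the centroids of the two classes of training samples, construct two classification surfaces l_pos and l_neg parallel to l_d: The centroids of the two classes are computed as
$$m_{pos} = \frac{1}{n_{pos}} \sum_{x \in X_{pos}} x, \qquad m_{neg} = \frac{1}{n_{neg}} \sum_{x \in X_{neg}} x$$
where n_pos and n_neg are the numbers of minority-class and majority-class training samples.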
3) Generate the candidate subset: The training samples located between l_pos and l_neg are kept as the candidate subset C, and the remaining samples are removed.
4) Determine whether the current dataset is linearly separable: Check whether the two classes of samples in C overlap. If they do not, the current dataset is linearly separable, an ordinary linear classification model can complete the recognition, and the decision surface l_d generated by the pseudoinverse algorithm is used directly for testing. If they do, proceed to the next step. Overlap is judged by checking whether x_pmax' and x_nmax', the points of the two classes in C farthest from l_pos and l_neg respectively, satisfy the inequality:
If the inequality is satisfied, the data are linearly inseparable.
5) Generate the two distance vectors dis_pos and dis_neg: Let d(x, l) denote the Euclidean distance from a sample point x to a classification surface l. The distance from each majority-class point in C to l_neg is computed in turn and stored in dis_neg; likewise, the distance from each minority-class point in C to l_pos is stored in dis_pos.
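Training steps 1), 2), 3) and 5) can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the labels are assumed to be +1 and -1, and the least-squares weights are obtained by solving the normal equations, which coincides with the pseudoinverse solution only when the augmented matrix has full column rank. Step 4) is left out because its inequality is not available in this text.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_decision_surface(xs, labels):
    """Step 1: least-squares fit of w^T x + w0 = 0 on the augmented matrix."""
    aug = [[1.0] + list(x) for x in xs]  # prepend the all-ones column
    k = len(aug[0])
    ata = [[sum(r[i] * r[j] for r in aug) for j in range(k)] for i in range(k)]
    atd = [sum(r[i] * t for r, t in zip(aug, labels)) for i in range(k)]
    sol = solve(ata, atd)
    return sol[1:], sol[0]  # (w, w0)


def centroid(xs):
    return [sum(col) / len(xs) for col in zip(*xs)]


def train(x_pos, x_neg):
    labels = [1.0] * len(x_pos) + [-1.0] * len(x_neg)  # assumed label values
    w, w0 = fit_decision_surface(x_pos + x_neg, labels)
    norm = math.sqrt(dot(w, w))
    # Step 2: planes w^T x + b = 0 through each centroid, parallel to l_d.
    b_pos = -dot(w, centroid(x_pos))
    b_neg = -dot(w, centroid(x_neg))
    # Step 3: keep only samples whose projection lies between the two planes.
    low, high = sorted((-b_pos, -b_neg))
    cand_pos = [x for x in x_pos if low <= dot(w, x) <= high]
    cand_neg = [x for x in x_neg if low <= dot(w, x) <= high]
    # Step 5: distance of each candidate to its own class's centroid plane.
    dis_pos = [abs(dot(w, x) + b_pos) / norm for x in cand_pos]
    dis_neg = [abs(dot(w, x) + b_neg) / norm for x in cand_neg]
    return {"w": w, "w0": w0, "b_pos": b_pos, "b_neg": b_neg,
            "dis_pos": dis_pos, "dis_neg": dis_neg}


# Hypothetical 2-D toy data: 4 minority and 8 majority samples.
X_pos = [[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]]
X_neg = [[3.0, 2.5], [4.0, 4.0], [5.0, 3.5], [4.5, 5.0],
         [6.0, 5.5], [5.5, 4.5], [2.5, 3.5], [6.5, 6.0]]
model = train(X_pos, X_neg)
print(len(model["dis_pos"]), len(model["dis_neg"]))
```

Because l_pos and l_neg share the normal vector w with l_d, only the two offsets b_pos and b_neg need to be stored alongside w and w0.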
Part III: Testing unknown data
In this module, unknown data whose class label is to be determined is fed into the trained model, and the model makes the decision. Let the unknown sample be z. The test stage comprises the following steps.
1) Compute the Euclidean distances from the test sample point z to the two centroid classification planes, that is, obtain d(z, l_pos) and d(z, l_neg).
2) Compare d(z, l_pos) with the elements of dis_pos to obtain the probability that z is classified into the minority class:
$$P(y_{pos} \sim z) = \frac{\left|\{\, i : \mathrm{dis}_{pos}(i) > d(z, l_{pos}) \,\}\right|}{\left|\mathrm{dis}_{pos}\right|}$$
where the numerator is the number of elements of dis_pos whose value is greater than d(z, l_pos), that is, the number of minority-class samples in C that are farther from l_pos than the test sample z is. The denominator is the total number of minority-class samples in the subset C.
3) Similarly, compare d(z, l_neg) with dis_neg:
$$P(y_{neg} \sim z) = \frac{\left|\{\, i : \mathrm{dis}_{neg}(i) > d(z, l_{neg}) \,\}\right|}{\left|\mathrm{dis}_{neg}\right|}$$
4) Compare P(y_pos ~ z) with P(y_neg ~ z); the class label of z is assigned to the side with the larger probability.
5) If P(y_pos ~ z) and P(y_neg ~ z) are equal, z is decided by the decision equation of l_d. In this case the algorithm degenerates to the original pseudoinverse method.
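The five test steps can be sketched as follows. The model dictionary, its values, and the function names are hypothetical; returning 0 for an empty candidate list and assigning a point lying exactly on l_d to the minority class are assumptions that the source does not specify.

```python
import math


def plane_distance(z, w, b):
    """Euclidean distance from point z to the hyperplane w^T x + b = 0."""
    value = sum(wi * zi for wi, zi in zip(w, z)) + b
    return abs(value) / math.sqrt(sum(wi * wi for wi in w))


def membership(dist_z, dis_class):
    """Fraction of candidate distances strictly greater than the test distance."""
    if not dis_class:  # assumption: an empty candidate list scores 0
        return 0.0
    return sum(1 for v in dis_class if v > dist_z) / len(dis_class)


def predict(z, model):
    d_pos = plane_distance(z, model["w"], model["b_pos"])
    d_neg = plane_distance(z, model["w"], model["b_neg"])
    p_pos = membership(d_pos, model["dis_pos"])
    p_neg = membership(d_neg, model["dis_neg"])
    if p_pos > p_neg:
        return 1
    if p_neg > p_pos:
        return -1
    # Tie: fall back to the original pseudoinverse decision surface l_d.
    score = sum(wi * zi for wi, zi in zip(model["w"], z)) + model["w0"]
    return 1 if score >= 0 else -1


# Hypothetical trained model: l_d is x1 = 2, l_pos is x1 = 3, l_neg is x1 = 1.
model = {"w": [1.0, 0.0], "w0": -2.0, "b_pos": -3.0, "b_neg": -1.0,
         "dis_pos": [0.2, 0.5, 0.9], "dis_neg": [0.1, 0.4, 0.8, 1.0]}
print(predict([2.7, 0.0], model), predict([1.2, 5.0], model))
```

A test point far outside both centroid planes scores 0 for both classes, so it is resolved by the fallback branch, which is the degenerate pseudoinverse case of step 5).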
Experimental result
1) Selection of experimental datasets: The experiments use six imbalanced datasets from the open-source Knowledge Extraction based on Evolutionary Learning (KEEL) dataset repository. The number of classes, sample dimension, size (total number of samples), and imbalance ratio IR of the selected datasets are listed in the following table. Datasets with IR greater than 9 are moderately or more severely imbalanced.
Let n_neg be the number of majority-class samples and n_pos the number of minority-class samples. The imbalance ratio IR is computed as:
$$IR = \frac{n_{neg}}{n_{pos}}$$
All datasets are processed with five-fold cross-validation: each dataset is divided into five roughly equal parts; in each round one part is selected as test data and the other four parts serve as training data. Over the five rounds, no part is selected as test data more than once.
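The imbalance ratio and the five-fold protocol can be sketched as below. The helper names are hypothetical, and the contiguous, unshuffled, unstratified split is an assumption; the source does not say how samples are assigned to the five parts.

```python
def imbalance_ratio(labels, minority=1):
    """IR = n_neg / n_pos, with `minority` marking minority-class labels."""
    n_pos = sum(1 for y in labels if y == minority)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos


def k_fold_indices(n, k=5):
    """Return k (train, test) index pairs; fold sizes differ by at most one."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds


# Hypothetical label vector: 2 minority and 18 majority samples.
labels = [1] * 2 + [-1] * 18
print(imbalance_ratio(labels))
for train_idx, test_idx in k_fold_indices(len(labels)):
    print(len(train_idx), len(test_idx))
```

On strongly imbalanced data a contiguous split can leave a fold without minority samples, so in practice the indices would usually be shuffled or stratified by class first.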
2) Comparison algorithms: The core algorithm used in the invention, the boundary-elimination pseudoinverse algorithm, is abbreviated BEPILD. In addition, kNN, SVM (Linear), SVM (Poly), and SVM (RBF) are selected as baseline algorithms. The parameter descriptions and value ranges of each algorithm are set as in the following table.
3) Performance metric: All experiments use the area under the receiver operating characteristic curve (Area Under the Receiver Operating Characteristic Curve, AUC) to record the classification results of the different methods on each dataset. Each reported result is the one obtained when the corresponding algorithm is configured with its optimal parameters on that dataset, that is, the best result. The AUC is computed as:
where TP is the true positive rate, FP is the false positive rate, TN is the true negative rate, and FN is the false negative rate. The relationship among the four indicators is shown in the following table.
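The AUC formula itself is not available in this text. One commonly used form for a classifier evaluated at a single operating point is AUC = (1 + TPR - FPR) / 2, with TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The sketch below uses that form as an assumption; it may not match the patent's formula exactly, and it takes confusion-matrix counts as input rather than rates. The function name and the example counts are hypothetical.

```python
def single_point_auc(tp, fp, tn, fn):
    """AUC of a single operating point, assuming AUC = (1 + TPR - FPR) / 2."""
    tpr = tp / (tp + fn)  # true positive rate on the minority (positive) class
    fpr = fp / (fp + tn)  # false positive rate on the majority (negative) class
    return (1.0 + tpr - fpr) / 2.0


# Hypothetical confusion-matrix counts for an imbalanced test fold.
print(single_point_auc(tp=8, fp=10, tn=90, fn=2))
```

Because the two rates are each normalized by their own class size, this value is not dominated by the majority class the way plain accuracy is.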
BEPILD is first compared with the baseline algorithms. In the experiments, the nearest neighbor algorithm (kNN) achieves a higher average AUC with k = 1 neighbor than with other values of k, so only the 1NN results are listed in the table. The best result on each dataset is marked in bold. The results are given in the following table.
As the table shows, BEPILD obtains the highest AUC, that is, the best result, on all datasets.
Next, the average running time (the sum of training time and test time) of the proposed BEPILD and the classic kNN algorithm on the six datasets is recorded in the following table, with the best result on each dataset marked in bold:
The results in the table show that although BEPILD has both a training and a test process, whereas kNN, as a lazy learner, has only a test process, BEPILD is more efficient in terms of time.
Claims (4)
1. An imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm, characterized in that it comprises the following specific steps:
1) Sample collection: according to the description of the specific imbalance problem, the back end converts the collected samples into a vector model that subsequent algorithms can process;
2) First training, obtaining three classification decision surfaces: a classification strategy based on the pseudoinverse algorithm is first applied to the training sample points to mark out an approximate classification decision surface, and two new decision surfaces are then generated that are parallel to this decision surface and pass through the centroids of the two classes of samples; only the samples located in the space between the two centroid decision surfaces are kept as the candidate subset, and the remaining samples are removed;
3) Second training, obtaining two distance vectors: the distance from each candidate majority-class sample point to the decision surface through the majority-class centroid is obtained, the minority class is treated in the same way, and the distances of the samples of each class form two distance vectors;
4) Test stage, computing the probability that the test sample point belongs to each of the two classes of candidate samples and making the final decision: after the distances from the current test sample point to the two centroid decision surfaces are obtained, these two distances are compared with the two class distance vectors generated by the training module, and the final decision is made by judging on which side the test sample point is closer to the centroid decision surface.
2. The imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm according to claim 1, characterized in that the specific steps of the first training stage, which obtains three classification decision surfaces, comprise: first, generating the classification decision surface l_d with the pseudoinverse algorithm; second, through the centroids of the two classes of training samples, constructing two classification surfaces l_pos and l_neg parallel to l_d; then keeping the training samples located between l_pos and l_neg as the candidate subset C and removing the remaining samples; finally, to further simplify computation, determining whether the current dataset is linearly separable; if it is linearly separable, the classification decision surface is obtained directly with the pseudoinverse algorithm; if it is linearly inseparable, proceeding to the next step.
3. The imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm according to claim 1, characterized in that the details of the second training stage, which obtains two distance vectors, are: two distance vectors dis_pos and dis_neg are generated, where the i-th element of dis_pos represents the distance from the i-th minority-class sample in the candidate subset C to the hyperplane l_pos, which passes through the minority-class centroid and is parallel to the decision surface l_d; dis_neg is defined in the same way.
4. The imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm according to claim 1, characterized in that the specific steps of the test stage, which computes the probability that the test sample point belongs to each of the two classes of candidate samples, comprise: first, computing the Euclidean distances from the test sample point z to the two centroid classification planes l_pos and l_neg, the distance to l_pos being denoted d(z, l_pos) and the distance to l_neg being denoted d(z, l_neg); second, comparing d(z, l_pos) one by one with the elements of the distance vector dis_pos, taking the number of elements of dis_pos whose value is greater than d(z, l_pos) as the numerator and the total number of elements of dis_pos as the denominator, and dividing the two to obtain a probability, which is the probability that the test sample point z belongs to the minority class; similarly, comparing dis_neg with d(z, l_neg) to obtain the probability that the test sample belongs to the majority class; finally, comparing the two probabilities, the class label of z being assigned to the side with the larger probability.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710211459.XA CN107092927A (en) | 2017-04-01 | 2017-04-01 | An imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710211459.XA CN107092927A (en) | 2017-04-01 | 2017-04-01 | An imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107092927A (en) | 2017-08-25 |
Family
ID=59646424
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710211459.XA Pending CN107092927A (en) | 2017-04-01 | 2017-04-01 | An imbalanced data classification system based on a boundary-elimination pseudoinverse algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107092927A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492096A (en) * | 2018-10-23 | 2019-03-19 | East China University of Science and Technology | An imbalanced data classification system based on geometric ensemble |
CN110348481A (en) * | 2019-06-05 | 2019-10-18 | East China University of Science and Technology | A network intrusion detection method based on universal gravitation of neighbor samples |
-
2017
- 2017-04-01 CN CN201710211459.XA patent/CN107092927A/en active Pending
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109492096A (en) * | 2018-10-23 | 2019-03-19 | East China University of Science and Technology | An imbalanced data classification system based on geometric ensemble |
CN110348481A (en) * | 2019-06-05 | 2019-10-18 | East China University of Science and Technology | A network intrusion detection method based on universal gravitation of neighbor samples |
CN110348481B (en) * | 2019-06-05 | 2023-04-28 | East China University of Science and Technology | Network intrusion detection method based on universal gravitation of neighbor samples |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170825 |