CN108694413A - Adaptively sampled unbalanced data classification processing method, device, equipment and medium - Google Patents
- Publication number
- CN108694413A (CN201810453102.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Complex Calculations (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an adaptive-sampling unbalanced data classification processing method, comprising: obtaining a target majority sample count and a target minority sample count; and performing adaptive sampling on the unbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count; wherein the adaptive sampling includes oversampling and undersampling. Oversampling and undersampling are combined according to the individual requirements of the user, so that the newly generated sample set meets the data requirements of the classification algorithm and the classification accuracy on unbalanced big data is improved.
Description
Technical field
The present invention relates to the field of unbalanced big data processing, and in particular to an adaptive-sampling unbalanced data classification processing method, device, equipment and medium.
Background art
With continuing advances in technology, including faster internet connections, successive generations of mobile internet, and steady progress in hardware, data acquisition, storage, and processing technologies have developed significantly. Data is growing at an unprecedented rate, and we have entered the era of big data. The characteristics of big data, namely huge scale (volume), high generation speed (velocity), diverse forms (variety), and uncertain quality (veracity), pose unprecedented challenges for traditional data analysis and mining techniques when they are applied to big data.
Data classification is a fundamental algorithm in data analysis and mining; it has a wide range of applications and underlies many other data analysis and mining algorithms. In big data, almost all data sets are unbalanced. An unbalanced data set is one in which at least one class contains far fewer samples than the other classes. Data imbalance is widespread in the real world, especially in big data applications. For example, in internet text classification the classes are unevenly represented, and the data of interest is often the small class, such as sensitive information or emerging topics on the network. In e-commerce applications, the vast majority of customer transaction and behavior data is normal, while the data of interest is often fraud and abnormal behavior, which is submerged in a large volume of normal behavior data; such data forms a typical unbalanced data set. Similar applications include medical diagnosis and satellite remote sensing data classification. Classification of unbalanced big data is therefore a key technical problem that urgently needs to be solved for economic and social development, and it has broad application prospects.
Because the numbers of samples in different classes differ so greatly, traditional classification learning algorithms struggle to achieve good results on unbalanced big data. Fig. 1 shows a prior-art example of unbalanced data classification, in which circles are minority-class samples and triangles are majority-class samples. The imbalance ratio is 3:1, that is, there are 3 times as many majority-class samples as minority-class samples. In real large data sets the imbalance ratio is often 10000:1 or even higher, so the data must be preprocessed before it is classified.
Existing preprocessing methods for unbalanced big data mainly comprise oversampling of the minority class and undersampling of the majority class. Oversampling increases the number of minority-class samples by some method or technique, and undersampling reduces the number of majority-class samples by some method or technique. Both aim to reduce the degree of imbalance of a large data set by adjusting the sample set, and thereby to increase the accuracy of the classification algorithm.
In implementing the embodiments of the present invention, the inventors found the following technical problems in the prior art. Different classification algorithms and different applications have different requirements for the size of an unbalanced data set and for its imbalance ratio. Oversampling increases the size of the training set; in particular, when the imbalance ratio of the original training set is very large, the number of newly synthesized minority-class samples approaches the number of majority-class samples. Suppose the original training set contains 100 minority-class samples and 10000 majority-class samples. Oversampling must then synthesize 9900 new minority-class samples, so the final number of training samples increases significantly. On the one hand, synthesizing so many samples causes many of the new samples to duplicate existing ones; on the other hand, the increase in data volume degrades the performance of the classification algorithm. Undersampling does reduce the size of the data, and when the imbalance ratio of the original training set is very large the data size after undersampling is greatly reduced. However, removing too many majority-class samples in order to reach balance may discard useful information and may significantly degrade classification of the majority class.
Summary of the invention
In view of the above problems, the object of the present invention is to provide an adaptive-sampling unbalanced data classification processing method that combines oversampling and undersampling according to the data requirements of the classification algorithm, so that the newly generated sample set meets those requirements and the classification accuracy on unbalanced big data is improved.
In a first aspect, the present invention provides an adaptive-sampling unbalanced data classification processing method, comprising:
obtaining a target majority sample count and a target minority sample count;
performing adaptive sampling on the unbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count; wherein the adaptive sampling includes oversampling and undersampling.
In a first possible implementation of the first aspect, performing adaptive sampling on the unbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the unbalanced data meets the target majority sample count and the number of minority samples in the unbalanced data meets the target minority sample count, comprises:
when the number of minority samples in the unbalanced data to be processed does not meet the target minority sample count, determining the category of each minority sample according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the categories include noise sample, unstable sample, boundary sample and stable sample;
performing the operation corresponding to the category of each minority sample; wherein the operations include deleting, retaining, replicating and synthesizing;
when the number of majority samples in the unbalanced data to be processed does not meet the target majority sample count, determining the category of each majority sample according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the categories include noise sample, boundary sample and stable sample;
performing the operation corresponding to the category of each majority sample; wherein the operations include deleting, retaining and selectively deleting;
obtaining a final minority sample set and a final majority sample set, wherein the size of the final minority sample set meets the target minority sample count and the size of the final majority sample set meets the target majority sample count.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, performing the operation corresponding to the category of each minority sample comprises:
when the minority sample is a noise sample, deleting the minority sample;
when the minority sample is an unstable sample, adding the minority sample to the minority sample set without replicating it or generating new samples from it, and updating the size of the minority sample set;
when the minority sample is a boundary sample, replicating the minority sample according to the replication ratio c = (target minority sample count - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples) to obtain replicated samples, adding the replicated samples and the minority sample to the minority sample set, and updating the size of the minority sample set; wherein the number of copies is |c - 1|, the absolute value of the replication ratio c minus one;
when the minority sample is a stable sample, synthesizing the minority sample with its neighbor samples to obtain synthesized samples, adding the synthesized samples and the minority sample to the minority sample set, and updating the size of the minority sample set; wherein the number of synthesized samples is |c - 1|, and c is the same replication ratio, c = (target minority sample count - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples).
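The replication ratio can be illustrated with a short Python sketch. The function names are hypothetical, and truncating |c - 1| to a whole number of copies is an assumption of the sketch, since the text does not state how a fractional value is handled.

```python
def replication_ratio(target_minority, n_minority, n_noise, n_unstable):
    """c = (target minority count - unstable) / (minority - noise - unstable)."""
    usable = n_minority - n_noise - n_unstable  # boundary + stable samples
    if usable <= 0:
        raise ValueError("no boundary or stable minority samples to replicate")
    return (target_minority - n_unstable) / usable


def copies_per_sample(c):
    """Copies (or synthesized samples) per boundary/stable sample: |c - 1|.

    Truncating to an integer is an assumption of this sketch.
    """
    return int(abs(c - 1))


# 100 usable minority samples and a target of 500: each sample gets 4 extra copies.
c = replication_ratio(target_minority=500, n_minority=100, n_noise=0, n_unstable=0)
print(c, copies_per_sample(c))
```

With truncation the resulting set can fall short of the target when c is fractional; the shortfall d described in the third possible implementation covers that gap.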
With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing the operation corresponding to the category of each minority sample further comprises:
upon detecting that every minority sample in the unbalanced data to be processed has been traversed, calculating the shortfall d = target minority sample count - current size of the minority sample set;
randomly selecting d stable samples, synthesizing each selected stable sample with its neighbor samples to obtain new synthesized samples, and adding the new synthesized samples and the stable samples to the minority sample set.
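A minimal sketch of this top-up step follows. The text does not specify how a stable sample is synthesized with a neighbor, so the linear interpolation used here (in the style of SMOTE) is an assumption, as are the function names, the sampling with replacement, and the fixed seed.

```python
import random


def synthesize(sample, neighbour, rng):
    """Pick a random point on the segment between a sample and one neighbour.

    The interpolation rule is an assumption; the source only says "synthesize".
    """
    t = rng.random()
    return [s + t * (n - s) for s, n in zip(sample, neighbour)]


def fill_shortfall(stable, neighbours_of, target_minority, current_size, seed=0):
    """Return d new synthesized samples, where d = target_minority - current_size.

    stable:        list of stable minority samples (lists of floats)
    neighbours_of: neighbours_of[i] lists the neighbour samples of stable[i]
    """
    rng = random.Random(seed)
    d = target_minority - current_size
    if d <= 0 or not stable:
        return []
    new_samples = []
    for _ in range(d):
        i = rng.randrange(len(stable))            # random stable sample
        neighbour = rng.choice(neighbours_of[i])  # one of its neighbours
        new_samples.append(synthesize(stable[i], neighbour, rng))
    return new_samples
```

For example, with a target of 10 and a current set size of 7, `fill_shortfall` returns 3 synthesized samples, each lying between a stable sample and one of its neighbors.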
With reference to the first possible implementation of the first aspect, in a fourth possible implementation of the first aspect, performing the operation corresponding to the category of each majority sample comprises:
when the majority sample is a noise sample, deleting the majority sample;
when the majority sample is a boundary sample, retaining the boundary sample and adding the majority sample to the majority sample set;
when the majority sample is a stable sample, setting a deletion probability according to the distance from the majority sample to its k nearest neighbors, performing selective deletion, and adding the majority samples that are not deleted to the majority sample set.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, when the majority sample is a stable sample, performing selective deletion according to the deletion probability set from the distance from the majority sample to its k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, comprises:
calculating the distance d from the majority sample to its k nearest neighbor samples, and setting the deletion probability of the stable sample according to the magnitude of d;
upon detecting that the deletion probability is greater than or equal to a preset value, deleting the stable sample; wherein the smaller the distance d, the larger the deletion probability;
upon detecting that the deletion probability is less than the preset value, retaining the stable sample and adding the majority sample to the majority sample set; wherein the larger the distance d, the smaller the deletion probability.
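The text fixes only the direction of the relationship (a smaller distance gives a larger deletion probability), not its form. The sketch below assumes the mapping p = 1 / (1 + mean distance to the k neighbors) and a hypothetical preset value of 0.5; both are illustrative choices, not the patent's stated formula.

```python
import math


def deletion_probability(sample, neighbours):
    """Map the mean distance to the k neighbours to a probability in (0, 1].

    The form 1 / (1 + mean distance) is an assumption; it only preserves the
    stated rule that smaller distances give larger deletion probabilities.
    """
    mean_distance = sum(math.dist(sample, n) for n in neighbours) / len(neighbours)
    return 1.0 / (1.0 + mean_distance)


def should_delete(sample, neighbours, preset=0.5):
    """Delete a stable majority sample when its probability reaches the preset value."""
    return deletion_probability(sample, neighbours) >= preset
```

Under this mapping a stable sample in a dense region (mean distance 0.1) is deleted, while one in a sparse region (mean distance 3) is retained, so dense areas are thinned first.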
With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, when the majority sample is a stable sample, performing selective deletion according to the deletion probability set from the distance from the majority sample to its k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, further comprises:
obtaining the number to delete, e = number of majority samples in the unbalanced data to be processed - number of noise samples - target majority sample count;
obtaining the number f of majority samples deleted so far;
when f is less than e, performing selective deletion on the stable samples;
adding the stable samples that are not deleted to the majority sample set.
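The deletion budget can be sketched as a single pass over the stable majority samples. The function name, the pairing of each sample with a precomputed deletion probability, and the preset value of 0.5 are assumptions made for illustration.

```python
def undersample_stable(stable_probs, n_majority, n_noise, target_majority, preset=0.5):
    """Selectively delete stable majority samples until the budget e is used up.

    stable_probs: list of (sample_id, deletion_probability) pairs
    Returns the ids that are kept and the number f that were deleted.
    """
    e = n_majority - n_noise - target_majority  # number that should be deleted
    f = 0                                       # number deleted so far
    kept = []
    for sample_id, probability in stable_probs:
        if f < e and probability >= preset:
            f += 1                              # delete this stable sample
        else:
            kept.append(sample_id)              # keep it in the majority set
    return kept, f


kept, deleted = undersample_stable(
    [("a", 0.9), ("b", 0.2), ("c", 0.8), ("d", 0.7)],
    n_majority=10, n_noise=1, target_majority=7,
)
print(kept, deleted)
```

Here e = 10 - 1 - 7 = 2, so "a" and "c" are deleted and "d" is kept even though its probability exceeds the preset value, because the budget is already exhausted.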
In a second aspect, the present invention further provides an adaptive-sampling unbalanced data classification processing device, comprising:
an acquisition module, configured to obtain a target majority sample count and a target minority sample count;
a processing module, configured to perform adaptive sampling on the unbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the unbalanced data meets the target majority sample count and the number of minority samples in the unbalanced data meets the target minority sample count; wherein the adaptive sampling includes oversampling and undersampling.
In a third aspect, an embodiment of the present invention further provides adaptive-sampling unbalanced data classification processing equipment, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor; when executing the computer program, the processor implements the adaptive-sampling unbalanced data classification processing method according to any one of the above.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the equipment on which the computer-readable storage medium resides is controlled to execute the adaptive-sampling unbalanced data classification processing method according to any one of the above.
The above technical solution has the following advantages. A target majority sample count and a target minority sample count are obtained, and adaptive sampling, which includes oversampling and undersampling, is performed on the unbalanced data to be processed according to these counts, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count. A sample set that meets the user's requirements is thus generated adaptively: the user may enter the desired total number of samples and the desired imbalance ratio of the data set, and oversampling and undersampling are combined adaptively according to those requirements, with the minority-class samples oversampled and the majority-class samples undersampled at the same time. The resulting sample set meets the user's requirements and effectively improves the classification accuracy on unbalanced big data.
Brief description of the drawings
Fig. 1 is an example diagram of unbalanced data classification in the prior art;
Fig. 2 is a flow diagram of the adaptive-sampling unbalanced data classification processing method provided by the first embodiment of the present invention;
Fig. 3 is a flow diagram of the adaptive-sampling unbalanced data classification processing method provided by the second embodiment of the present invention;
Fig. 4 is a schematic diagram of the k nearest neighbors of minority samples in the unbalanced data to be processed;
Fig. 5 is a flow diagram of the adaptive-sampling unbalanced data classification processing method provided by the third embodiment of the present invention;
Fig. 6 is a flow diagram of the adaptive-sampling unbalanced data classification processing method provided by the fourth embodiment of the present invention;
Fig. 7 is a structural diagram of an adaptive-sampling unbalanced data classification processing device provided by the fifth embodiment of the present invention;
Fig. 8 is a structural diagram of the adaptive-sampling unbalanced data classification processing equipment provided by the sixth embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment one
Fig. 2 is a flow diagram of the adaptive-sampling unbalanced data classification processing method provided by the first embodiment of the present invention.
It should be noted that, in preprocessing unbalanced big data, existing methods do not adaptively generate a sample set that meets the user's requirements, so neither the size nor the balance ratio of the resulting sample set meets actual needs. Existing methods either use oversampling to increase the number of minority-class samples or use undersampling to delete majority-class samples, and neither oversampling alone nor undersampling alone effectively improves the classification accuracy on unbalanced big data. The number of samples that existing undersampling methods delete is limited; this works for ordinary unbalanced data sets but can hardly meet the classification requirements of unbalanced large data sets.
It should be noted that the adaptive-sampling unbalanced data classification processing method is a preliminary processing step applied to the data before classification, that is, it is used for data preprocessing.
The adaptive-sampling unbalanced data classification processing method provided by this embodiment may be executed by a terminal device, including but not limited to a mobile phone, a laptop, a tablet computer or a desktop computer.
The adaptive-sampling unbalanced data classification processing method comprises the following steps:
S11: Obtain a target majority sample count and a target minority sample count.
In this embodiment, the terminal device may obtain a preset sample set size and a preset imbalance ratio, and then derive the target majority sample count and the target minority sample count from them. The terminal device either receives the preset sample set size and the preset imbalance ratio entered by the user, or directly reads default values stored on the terminal device; the present invention does not specifically limit this. The preset sample set size is the size of the sample set that the user wants to obtain after the unbalanced data has been processed. The preset sample set contains majority samples and minority samples, so the preset sample set size is the sum of the number of majority samples and the number of minority samples. The preset imbalance ratio is the ratio between the number of majority samples and the number of minority samples in the preset sample set; it may be expressed as majority count / minority count or as minority count / majority count, and the present invention does not specifically limit this.
Specifically, after the preset sample set size and the preset imbalance ratio have been obtained, suppose the preset sample set size is X and the preset imbalance ratio = majority proportion / minority proportion = a/b. Then the target majority sample count is A = X*a/(a+b) and the target minority sample count is B = X*b/(a+b).
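The two formulas can be checked with a small Python sketch. The function name is hypothetical, and rounding the minority count and assigning the remainder to the majority class is an assumption, since the text does not say how non-integer results are handled.

```python
from typing import Tuple


def target_counts(total: int, a: float, b: float) -> Tuple[int, int]:
    """Split a desired total X into majority (A) and minority (B) targets.

    A = X*a/(a+b) and B = X*b/(a+b), where a/b is the majority/minority ratio.
    Rounding B and setting A = X - B (so that A + B == X) is an assumption.
    """
    if total <= 0 or a <= 0 or b <= 0:
        raise ValueError("total, a and b must be positive")
    minority = round(total * b / (a + b))
    majority = total - minority
    return majority, minority


print(target_counts(1000, 3, 1))
```

For X = 1000 and a ratio of 3:1 this gives A = 750 and B = 250.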
In this embodiment, the terminal device may also obtain the target majority sample count and the target minority sample count directly: either the terminal device provides an input interface through which the two counts are entered, or it reads pre-saved values and uses them as the default target majority sample count and target minority sample count.
S12: Perform adaptive sampling on the unbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count; wherein the adaptive sampling includes oversampling and undersampling.
In embodiments of the present invention, when the number of majority samples in the unbalanced data to be processed does not meet the target majority sample count, the majority samples in the unbalanced data are undersampled; when the number of minority samples in the unbalanced data to be processed does not meet the target minority sample count, the minority samples in the unbalanced data are oversampled.
It should be noted that the number of majority samples in the processed data meeting the target majority sample count means either that it falls within an error range of the target majority sample count or that it equals the target majority sample count. Likewise, the number of minority samples in the processed data meeting the target minority sample count means either that it falls within an error range of the target minority sample count or that it equals the target minority sample count. The present invention does not specifically limit this.
In embodiments of the present invention, there are four cases. (1) The number of majority samples in the unbalanced data to be processed does not meet the target majority sample count while the number of minority samples meets the target minority sample count: only the majority samples are undersampled. (2) The number of minority samples does not meet the target minority sample count while the number of majority samples meets the target majority sample count: only the minority samples are oversampled. (3) Neither the majority samples nor the minority samples meet their target counts: the majority samples are undersampled and the minority samples are oversampled. (4) Both the majority samples and the minority samples meet their target counts: no operation is performed. In this case the condition for meeting the targets can be reset so that it is no longer merely falling within the error range, which improves the adaptability of unbalanced big data classification and better meets user requirements.
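The four cases reduce to two independent checks, as the following sketch shows. The function name and the step labels are hypothetical, and the `tolerance` parameter stands in for the error range mentioned in the text.

```python
def plan_sampling(n_majority, n_minority, target_majority, target_minority, tolerance=0):
    """Return the sampling steps to run for the current class counts.

    A count "meets" its target when it is within `tolerance` of it; with the
    default tolerance of 0 the counts must match exactly.
    """
    steps = []
    if abs(n_majority - target_majority) > tolerance:
        steps.append("undersample_majority")
    if abs(n_minority - target_minority) > tolerance:
        steps.append("oversample_minority")
    return steps  # an empty list corresponds to case (4): nothing to do


print(plan_sampling(10000, 100, 750, 250))
```

With 10000 majority and 100 minority samples and targets of 750 and 250, both steps are returned, which is case (3).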
Implementing this embodiment has the following beneficial effects:
A target majority sample count and a target minority sample count are obtained, and adaptive sampling, which includes oversampling and undersampling, is performed on the unbalanced data to be processed according to these counts, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count. A sample set that meets the user's requirements is thus generated adaptively: the user may enter the desired total number of samples and the desired imbalance ratio of the data set, and oversampling and undersampling are combined adaptively according to those requirements, with the minority-class samples oversampled and the majority-class samples undersampled at the same time. The resulting sample set meets the user's requirements and effectively improves the classification accuracy on unbalanced big data. By using undersampling and oversampling together to adjust the sample set, the degree of imbalance of the large data set is reduced and the accuracy of the classification algorithm is increased.
Embodiment two
Fig. 3 is a flow diagram of the adaptive-sampling unbalanced data classification processing method provided by the second embodiment of the present invention.
Performing adaptive sampling on the unbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count, comprises:
S21: When the number of minority samples in the unbalanced data to be processed does not meet the target minority sample count, determine the category of each minority sample according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the categories include noise sample, unstable sample, boundary sample and stable sample.
It should be noted that k is an integer greater than 1; its value is determined according to the actual situation, and the present invention does not specifically limit this.
It should be noted that when the overwhelming majority of a minority sample's neighbor samples are majority samples, that is, when the number of majority samples among its neighbors reaches a preset value, the minority sample is considered a noise sample. When majority samples outnumber minority samples among its neighbors, the minority sample is considered an unstable sample. When the numbers of majority and minority samples among its neighbors are comparable, with no significant difference, the minority sample is considered a boundary sample. When majority samples are fewer than minority samples among its neighbors, the minority sample is considered a stable sample.
It should be noted that the error range used in each of these comparisons can be set according to the actual situation. For example, when the numbers of majority and minority samples among a minority sample's neighbors differ by less than 2, they are considered comparable; when more than 2/3 of the neighbors are majority-class samples, the minority sample is considered a noise sample. The error range may be set according to the actual situation, and the present invention does not specifically limit this.
Specifically, when the number of minority samples in the unbalanced data to be processed does not meet the target minority sample count, the numbers of majority-class and minority-class samples among the k nearest neighbors of each minority sample are counted. k is generally set to 10 or less. If k is 8, the 8 neighbors are checked to see how many are minority class and how many are majority class: for example, with 6 or more majority-class neighbors the minority-class sample is considered noise, with 5 to 6 it is an unstable sample, with 4 it is a boundary sample, and with 3 or fewer it is a stable sample. Referring to Fig. 4, triangles represent majority samples and circles represent minority samples. Suppose k is 6. For a minority sample J in the unbalanced data to be processed, the 6 samples nearest to J are examined to count how many are majority samples and how many are minority samples. The 6 nearest neighbors of J are enclosed by a circle; 6 of them are majority samples and 0 are minority samples, so the overwhelming majority of J's neighbors are majority samples and J is considered a noise sample. Take another minority sample M in the unbalanced data to be processed; preferably the same k is used throughout one preprocessing pass, so k is again 6. Among M's neighbors, 4 are majority samples and 2 are minority samples, so majority samples outnumber minority samples and M is an unstable sample.
It should be noted that the k neighbour samples of each minority sample in the unbalanced data to be processed are the k samples nearest to that minority sample. As can be seen from Fig. 4, the circle enclosing minority sample J and the circle enclosing minority sample M are not the same size, which illustrates that what is obtained for each minority sample is its k nearest samples.
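The neighbour-counting rule described above can be sketched as follows. This is only a minimal illustration, not the patented implementation: the function names, the ("maj"/"min") label encoding, the use of Euclidean distance and the three cut-off parameters are all assumptions introduced here, and the caller must choose cut-offs that suit the chosen k.

```python
import math

def k_nearest(point, others, k):
    """Return the k (features, label) pairs in `others` closest to `point`.

    Euclidean distance is an assumption; the source does not name a metric.
    """
    return sorted(others, key=lambda item: math.dist(point, item[0]))[:k]

def classify_minority(point, others, k, noise_min, unstable_min, boundary_min):
    """Label one minority sample from the majority count among its k neighbours.

    `others` holds every other sample as (features, label), label "maj" or "min".
    The three cut-offs are hypothetical parameters, not values fixed by the source.
    """
    neighbours = k_nearest(point, others, k)
    n_major = sum(1 for _, label in neighbours if label == "maj")
    if n_major >= noise_min:
        return "noise"
    if n_major >= unstable_min:
        return "unstable"
    if n_major >= boundary_min:
        return "boundary"
    return "stable"

# A sample like J in Fig. 4: all 6 nearest neighbours are majority samples.
others = [((float(i), 0.0), "maj") for i in range(1, 7)] + [((100.0, 100.0), "min")]
print(classify_minority((0.0, 0.0), others, k=6,
                        noise_min=6, unstable_min=4, boundary_min=3))  # noise
```

Sorting all samples by distance is quadratic over the whole data set; for large data a spatial index would be the usual substitute, which is outside what the source describes.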
S22: performing the operation corresponding to the category of each minority sample, wherein the operation includes deleting, retaining, replicating or synthesizing;
In this embodiment of the present invention, the value of each minority sample is judged according to its category. For example, when the processing of certain unbalanced data requires noise samples to be rejected in order to reduce the influence of noise, the minority samples that are noise samples are deleted; when the processing of certain unbalanced data needs to focus on boundary samples, the minority samples that are boundary samples are retained and replicated. The present invention does not specifically limit this.
S23: when the number of majority samples in the unbalanced data to be processed does not meet the target majority sample number, determining the category of each majority sample according to the number of majority-class samples and the number of minority-class samples among the k nearest neighbours of that majority sample in the unbalanced data to be processed, wherein the category includes noise sample, boundary sample and stable sample;
In this embodiment of the present invention, the principle for determining the category of a majority sample is the same as the principle described above for determining the category of a minority sample, and details are not repeated here.
S24: performing the operation corresponding to the category of each majority sample, wherein the operation includes deleting, retaining and selective deletion;
In this embodiment of the present invention, the principle of performing the operation corresponding to the category of each majority sample is the same as for the minority samples described above: the corresponding operation is performed according to the value of the majority sample, and details are not repeated here.
S25: obtaining a final minority sample set and a final majority sample set, wherein the number of samples in the final minority sample set meets the target minority sample number, and the number of samples in the final majority sample set meets the target majority sample number.
It should be noted that, in this embodiment, the order in which the categories of the minority samples and of the majority samples in the unbalanced data to be processed are determined is not specifically limited.
In this embodiment, the final minority sample set and the final majority sample set are obtained after the corresponding operations have been performed on all majority samples and all minority samples in the unbalanced data to be processed. For example, after the operation corresponding to its category has been performed on every majority sample in the unbalanced data to be processed, that is, after each majority sample has undergone one of deleting, retaining or selective deletion, the final majority sample set is obtained.
Implementing this embodiment has the following beneficial effects:
In the prior art, when minority samples are oversampled and majority samples are undersampled, all samples are processed uniformly, without distinguishing among the minority samples or among the majority samples. In this embodiment, the category of each minority sample is determined by comparing the number of majority-class samples with the number of minority-class samples among the k nearest neighbours of that minority sample in the unbalanced data to be processed, so that the minority samples are divided into noise samples, unstable samples, boundary samples and stable samples, and different categories are processed differently; this effectively improves the quality of the minority sample set after oversampling of unbalanced big data, and thereby improves the accuracy of the classification algorithm. Likewise, the category of each majority sample is determined by comparing the number of majority-class samples with the number of minority-class samples among the k nearest neighbours of that majority sample in the unbalanced data to be processed, so that the majority samples are divided into noise samples, boundary samples and stable samples, and different categories are processed differently; this effectively improves the quality of the majority sample set after undersampling of unbalanced big data, and thereby improves the accuracy of the classification algorithm.
Embodiment three
On the basis of Embodiment two, step S22, performing the operation corresponding to the category of each minority sample, is described in detail:
Referring to Fig. 5, Fig. 5 is a schematic flowchart of the adaptive-sampling unbalanced data classification processing method provided by the third embodiment of the present invention.
Performing the operation corresponding to the category of each minority sample includes:
When the minority sample is a noise sample, the minority sample is deleted;
When the minority sample is an unstable sample, the minority sample is added to the minority sample set, but it is not replicated and no new sample is generated from it, and the minority sample set number is updated;
When the minority sample is a boundary sample, the minority sample is replicated according to the replication ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples) to obtain replicated samples, the replicated samples and the minority sample are added to the minority sample set, and the minority sample set number is updated; wherein the number of copies is the absolute value of the difference between the replication ratio c and one;
When the minority sample is a stable sample, the neighbour samples of the minority sample are synthesized with the minority sample to obtain synthesized samples, the synthesized samples and the minority sample are added to the minority sample set, and the minority sample set number is updated; wherein the number of synthesized samples is the absolute value of the difference between the replication ratio c and one, and the replication ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples).
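The replication ratio can be read as follows: noise samples are removed and unstable samples are kept once each, so the remaining boundary and stable samples must together grow to (target minority sample number - number of unstable samples); if each of them ends up as c samples (itself plus c - 1 additions), the target is reached. The sketch below only illustrates that arithmetic. The function names are hypothetical, and rounding |c - 1| to the nearest integer is an assumption, since the source does not say how a fractional value is handled.

```python
def replication_ratio(target_minority, n_minority, n_noise, n_unstable):
    """c = (target - unstable) / (minority - noise - unstable), as in the text."""
    remaining = n_minority - n_noise - n_unstable  # boundary + stable samples
    if remaining <= 0:
        raise ValueError("no boundary or stable samples left to replicate")
    return (target_minority - n_unstable) / remaining

def additions_per_sample(c):
    """Number of copies (or synthesized samples) per boundary/stable sample.

    The source gives |c - 1|; rounding to an integer is an assumption.
    """
    return round(abs(c - 1))

# 40 minority samples, 5 noise, 5 unstable, target of 100 minority samples.
c = replication_ratio(target_minority=100, n_minority=40, n_noise=5, n_unstable=5)
print(round(c, 3), additions_per_sample(c))  # 3.167 2
```

Because of the rounding, the set built this way can fall short of the target (here 5 + 30 x 3 = 95 samples), which is consistent with the text providing a separate step for the remaining shortfall.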
Preferably, performing the operation corresponding to the category of each minority sample further includes:
When it is detected that every minority sample in the unbalanced data to be processed has been traversed, calculating the missing number d = target minority sample number - current minority sample set number;
In this embodiment of the present invention, after the corresponding operation has been performed on a minority sample in the unbalanced data to be processed, the minority sample is marked and the number of marked minority samples is recorded; when the number of marked minority samples reaches the number of minority samples in the unbalanced data to be processed, every minority sample in the unbalanced data to be processed has been traversed. The present invention does not specifically limit this.
Randomly selecting a corresponding number of stable samples according to the missing number d, synthesizing the neighbour samples of each selected stable sample with that stable sample to obtain new synthesized samples, and adding the new synthesized samples to the minority sample set.
It should be noted that the value of d is a positive integer; when d is less than 1, no new synthesized samples are generated.
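A minimal sketch of this shortfall step is given below. The source only states that neighbour samples are "synthesized" with the stable sample; generating a point by linear interpolation between the stable sample and one of its neighbours, in the style of SMOTE, is an assumption made here, as are the function names, the `neighbours_of` mapping and the choice to draw stable samples with replacement.

```python
import random

def synthesize(sample, neighbour, rng):
    """Assumed synthesis rule: a random point on the segment sample -> neighbour."""
    t = rng.random()
    return tuple(s + t * (n - s) for s, n in zip(sample, neighbour))

def fill_shortfall(minority_set, stable, neighbours_of, target_minority, seed=0):
    """Add synthesized samples until the minority set reaches the target size.

    `stable` lists the stable samples (tuples); `neighbours_of` maps each of
    them to its neighbour samples. Both are hypothetical inputs.
    """
    rng = random.Random(seed)
    d = target_minority - len(minority_set)  # missing number d
    if d < 1 or not stable:
        return list(minority_set)            # nothing to synthesize
    result = list(minority_set)
    # Drawing with replacement lets d exceed the number of stable samples.
    for base in rng.choices(stable, k=d):
        neighbour = rng.choice(neighbours_of[base])
        result.append(synthesize(base, neighbour, rng))
    return result
```

With two stable samples at (0, 0) and (1, 1) that are each other's neighbours and a target of 5, the call returns the two originals plus three points lying on the segment between them.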
Specifically, the category of each minority sample in the unbalanced data to be processed is judged. When the minority sample is a noise sample, the minority sample is deleted. When the minority sample is an unstable sample, the minority sample is added to the minority sample set, but it is not replicated and no new sample is generated from it, and the minority sample set number is updated. When the minority sample is a boundary sample, the minority sample is replicated according to the replication ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples) to obtain replicated samples, the replicated samples and the minority sample are added to the minority sample set, and the minority sample set number is updated; the number of copies is the absolute value of the difference between the replication ratio c and one. When the minority sample is a stable sample, the neighbour samples of the minority sample are synthesized with the minority sample to obtain synthesized samples, the synthesized samples and the minority sample are added to the minority sample set, and the minority sample set number is updated; the number of synthesized samples is the absolute value of the difference between the same replication ratio c and one. After the corresponding operation has been performed on a minority sample in the unbalanced data to be processed, it is detected whether all minority samples in the unbalanced data to be processed have been traversed; for example, after a minority sample in the unbalanced data to be processed has been deleted, it is detected whether, at the current moment, all minority samples in the unbalanced data to be processed have been traversed. When it is detected that every minority sample in the unbalanced data to be processed has been traversed, the missing number d = target minority sample number - current minority sample set number is calculated; a corresponding number of stable samples is randomly selected according to the missing number d, the neighbour samples of each selected stable sample are synthesized with that stable sample to obtain new synthesized samples, and the new synthesized samples and the stable samples are added to the minority sample set.
Implementing this embodiment has the following beneficial effects:
The minority samples in the unbalanced data to be processed are accurately divided into categories: according to the number of majority-class samples among the k nearest neighbours of each minority sample in the unbalanced data to be processed, the minority-class samples are divided into noise samples, unstable samples, boundary samples and stable samples, and the different categories are processed separately. This improves the quality of the minority-class samples after oversampling of unbalanced big data, and thereby improves the accuracy of the classification algorithm for unbalanced big data.
Embodiment four
On the basis of Embodiment two, step S24, performing the operation corresponding to the category of each majority sample, is described in detail:
Referring to Fig. 6, Fig. 6 is a schematic flowchart of the adaptive-sampling unbalanced data classification processing method provided by the fourth embodiment of the present invention.
Performing the operation corresponding to the category of each majority sample includes:
When the majority sample is a noise sample, the majority sample is deleted;
When the majority sample is a boundary sample, the boundary sample is retained and the majority sample is added to the majority sample set;
When the majority sample is a stable sample, a deletion probability is set according to the distance from the majority sample to its k nearest neighbours and selective deletion is performed, and the majority samples that are not deleted are added to the majority sample set.
Preferably, when the majority sample is a stable sample, performing selective deletion with a deletion probability set according to the distance from the majority sample to its k nearest neighbours, and adding the majority samples that are not deleted to the majority sample set, includes:
Calculating the distance d from the majority sample to its k neighbour samples, and setting the deletion probability of the stable sample according to the magnitude of the distance d;
When it is detected that the deletion probability is greater than or equal to a preset value, deleting the stable sample; wherein the smaller the distance d, the greater the deletion probability.
When it is detected that the deletion probability is less than the preset value, retaining the stable sample and adding the majority sample to the majority sample set; wherein the greater the distance d, the smaller the deletion probability.
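The distance-based rule above can be illustrated with the sketch below. The source requires only that the deletion probability decrease as the distance d grows; the particular mapping p = 1 / (1 + d), the use of the mean Euclidean distance to the k nearest neighbours as d, and all function names are assumptions chosen for illustration.

```python
import math

def mean_knn_distance(point, others, k):
    """Assumed definition of d: mean Euclidean distance to the k nearest samples."""
    dists = sorted(math.dist(point, other) for other in others)[:k]
    if not dists:
        raise ValueError("at least one neighbour is required")
    return sum(dists) / len(dists)

def deletion_probability(d):
    """Assumed monotone mapping: smaller distance gives a larger probability."""
    return 1.0 / (1.0 + d)

def should_delete(point, others, k, preset_value):
    """Delete a stable majority sample when its probability reaches the preset value."""
    return deletion_probability(mean_knn_distance(point, others, k)) >= preset_value

dense = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0)]   # crowded neighbourhood
sparse = [(3.0, 0.0), (0.0, 3.0), (-3.0, 0.0)]  # isolated sample
print(should_delete((0.0, 0.0), dense, k=3, preset_value=0.5))   # True
print(should_delete((0.0, 0.0), sparse, k=3, preset_value=0.5))  # False
```

The intuition matching the text is that a stable majority sample in a crowded region is largely redundant, so removing it loses little information, whereas an isolated one is kept.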
Preferably, when the majority sample is a stable sample, performing selective deletion with a deletion probability set according to the distance from the majority sample to its k nearest neighbours, and adding the majority samples that are not deleted to the majority sample set, further includes:
Obtaining the number to be deleted e = number of majority samples in the unbalanced data to be processed - number of noise samples - target majority sample number;
Obtaining the number f of majority samples deleted so far;
When f is less than e, performing selective deletion on the stable samples;
Adding the stable samples that are not deleted to the majority sample set.
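The deletion budget can be sketched as a simple counter, as below. The function name and the idea of passing in precomputed deletion probabilities are hypothetical, and the source does not state what happens when too few stable samples reach the preset value, so this sketch simply stops with f less than e in that case.

```python
def undersample_stable(stable, delete_probs, n_majority, n_noise,
                       target_majority, preset_value):
    """Selectively delete stable majority samples while the budget allows.

    e is the number still to be deleted once noise samples are removed;
    f counts the stable samples deleted so far.
    """
    e = n_majority - n_noise - target_majority
    f = 0
    kept = []
    for sample, p in zip(stable, delete_probs):
        if f < e and p >= preset_value:
            f += 1               # delete this stable sample
        else:
            kept.append(sample)  # goes into the majority sample set
    return kept, f

# 10 majority samples, 2 noise samples, target of 6: budget e = 2.
kept, f = undersample_stable(["a", "b", "c", "d"], [0.9, 0.8, 0.2, 0.95],
                             n_majority=10, n_noise=2, target_majority=6,
                             preset_value=0.5)
print(kept, f)  # ['c', 'd'] 2
```

In the example, sample "d" has a high deletion probability but is kept because the budget is already spent, which is the effect of the f < e check.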
Specifically, the category of each majority sample in the unbalanced data to be processed is judged. When the majority sample is a noise sample, the majority sample is deleted. When the majority sample is a boundary sample, the boundary sample is retained and the majority sample is added to the majority sample set. When the majority sample is a stable sample, a deletion probability is set according to the distance from the majority sample to its k nearest neighbours and selective deletion is performed, and the majority samples that are not deleted are added to the majority sample set; that is, the distance d from the majority sample to its k neighbour samples is calculated, and the deletion probability of the stable sample is set according to the magnitude of the distance d. When it is detected that the deletion probability is greater than or equal to a preset value, the stable sample is deleted, the deletion probability being greater the smaller the distance d is; when it is detected that the deletion probability is less than the preset value, the stable sample is retained and the majority sample is added to the majority sample set, the deletion probability being smaller the greater the distance d is. After the corresponding operation has been performed on each majority sample in the unbalanced data to be processed, the number to be deleted e = number of majority samples in the unbalanced data to be processed - number of noise samples - target majority sample number is obtained; the number f of majority samples deleted so far is obtained; when f is less than e, selective deletion is performed on the stable samples; and the stable samples that are not deleted are added to the majority sample set.
Implementing this embodiment has the following beneficial effects:
The majority samples in the unbalanced data to be processed are accurately divided into categories: according to the number of minority-class samples among the k nearest neighbours of each majority sample in the unbalanced data to be processed, the majority-class samples are divided into noise samples, boundary samples and stable samples, and the different categories are processed separately. This improves the quality of the majority-class samples after undersampling of unbalanced big data, and thereby improves the classification accuracy for unbalanced big data.
Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an adaptive-sampling unbalanced data classification processing apparatus provided by the fifth embodiment of the present invention, including:
An acquisition module 71, configured to obtain a target majority sample number and a target minority sample number;
A processing module 72, configured to perform adaptive sampling data processing on the unbalanced data to be processed according to the target majority sample number and the target minority sample number, so that the number of majority samples in the unbalanced data to be processed meets the target majority sample number and the number of minority samples in the unbalanced data to be processed meets the target minority sample number; wherein the adaptive sampling data processing includes oversampling and undersampling.
Preferably, the processing module 72 includes:
A minority sample category determination unit, configured to, when the number of minority samples in the unbalanced data to be processed does not meet the target minority sample number, determine the category of each minority sample according to the number of majority-class samples and the number of minority-class samples among the k nearest neighbours of that minority sample in the unbalanced data to be processed; wherein the category includes noise sample, unstable sample, boundary sample and stable sample;
A minority operation unit, configured to perform the operation corresponding to the category of each minority sample; wherein the operation includes deleting, retaining, replicating or synthesizing;
A majority sample category determination unit, configured to, when the number of majority samples in the unbalanced data to be processed does not meet the target majority sample number, determine the category of each majority sample according to the number of majority-class samples and the number of minority-class samples among the k nearest neighbours of that majority sample in the unbalanced data to be processed; wherein the category includes noise sample, boundary sample and stable sample;
A majority operation unit, configured to perform the operation corresponding to the category of each majority sample; wherein the operation includes deleting, retaining and selective deletion;
A sample set acquisition unit, configured to obtain a final minority sample set and a final majority sample set, wherein the number of samples in the final minority sample set meets the target minority sample number, and the number of samples in the final majority sample set meets the target majority sample number.
Preferably, the minority operation unit is configured such that:
When the minority sample is a noise sample, the minority sample is deleted;
When the minority sample is an unstable sample, the minority sample is added to the minority sample set, but it is not replicated and no new sample is generated from it, and the minority sample set number is updated;
When the minority sample is a boundary sample, the minority sample is replicated according to the replication ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples) to obtain replicated samples, the replicated samples and the minority sample are added to the minority sample set, and the minority sample set number is updated; wherein the number of copies is the absolute value of the difference between the replication ratio c and one;
When the minority sample is a stable sample, the neighbour samples of the minority sample are synthesized with the minority sample to obtain synthesized samples, the synthesized samples and the minority sample are added to the minority sample set, and the minority sample set number is updated; wherein the number of synthesized samples is the absolute value of the difference between the replication ratio c and one, and the replication ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the unbalanced data to be processed - number of noise samples - number of unstable samples).
Preferably, the minority operation unit further includes:
A detection unit, configured to, when it is detected that every minority sample in the unbalanced data to be processed has been traversed, calculate the missing number d = target minority sample number - current minority sample set number;
A synthesis unit, configured to randomly select a corresponding number of stable samples according to the missing number d, synthesize the neighbour samples of each selected stable sample with that stable sample to obtain new synthesized samples, and add the new synthesized samples and the stable samples to the minority sample set.
Preferably, the majority operation unit includes:
A deletion unit, configured to delete the majority sample when the majority sample is a noise sample;
A retention unit, configured to, when the majority sample is a boundary sample, retain the boundary sample and add the majority sample to the majority sample set;
A selective deletion unit, configured to, when the majority sample is a stable sample, set a deletion probability according to the distance from the majority sample to its k nearest neighbours and perform selective deletion, and add the majority samples that are not deleted to the majority sample set.
Preferably, the selective deletion unit is configured to:
Calculate the distance d from the majority sample to its k neighbour samples, and set the deletion probability of the stable sample according to the magnitude of the distance d;
When it is detected that the deletion probability is greater than or equal to a preset value, delete the stable sample; wherein the smaller the distance d, the greater the deletion probability;
When it is detected that the deletion probability is less than the preset value, retain the stable sample and add the majority sample to the majority sample set; wherein the greater the distance d, the smaller the deletion probability.
Preferably, the selective deletion unit is further configured to:
Obtain the number to be deleted e = number of majority samples in the unbalanced data to be processed - number of noise samples - target majority sample number;
Obtain the number f of majority samples deleted so far;
When f is less than e, perform selective deletion on the stable samples;
Add the stable samples that are not deleted to the majority sample set.
Implementing this embodiment has the following beneficial effects:
A target majority sample number and a target minority sample number are obtained; adaptive sampling data processing is performed on the unbalanced data to be processed according to the target majority sample number and the target minority sample number, so that the number of majority samples in the processed unbalanced data meets the target majority sample number and the number of minority samples in the processed unbalanced data meets the target minority sample number; wherein the adaptive sampling data processing includes oversampling and undersampling. A sample set that meets the requirements is thus generated adaptively according to user demand: the user is allowed to input the required total number of samples and the desired imbalance ratio of the data set, and oversampling and undersampling are adaptively combined according to that demand, with the minority-class samples being oversampled while the majority-class samples are undersampled, to finally generate a sample set that satisfies the user's demand, which effectively improves the classification accuracy for unbalanced big data.
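As a small illustration of how the two user inputs mentioned above could be turned into the target majority sample number and the target minority sample number, consider the sketch below. The source does not define the imbalance ratio; taking it as majority count divided by minority count is an assumption, as are the function name and the rounding.

```python
def target_counts(total, imbalance_ratio):
    """Split `total` into (target_majority, target_minority).

    Assumes imbalance_ratio = majority / minority, which the source does not
    state explicitly; the minority count is rounded to the nearest integer.
    """
    if total < 2 or imbalance_ratio <= 0:
        raise ValueError("need total >= 2 and a positive imbalance ratio")
    minority = round(total / (1 + imbalance_ratio))
    minority = min(max(minority, 1), total - 1)  # keep both classes non-empty
    return total - minority, minority

print(target_counts(1000, 4))  # (800, 200)
```

Comparing these targets with the current class sizes then indicates whether a class has to be enlarged or reduced.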
Referring to Fig. 8, Fig. 8 is a schematic diagram of an adaptive-sampling unbalanced data classification processing device provided by the sixth embodiment of the present invention, which is used to execute the adaptive-sampling unbalanced data classification processing method provided by the embodiments of the present invention. As shown in Fig. 8, the device includes: at least one processor 11, such as a CPU, at least one network interface 14 or other user interface 13, a memory 15, and at least one communication bus 12, the communication bus 12 being used to implement connection and communication among these components. The user interface 13 may optionally include a USB interface, other standard interfaces, and wired interfaces. The network interface 14 may optionally include a Wi-Fi interface and other wireless interfaces. The memory 15 may include a high-speed RAM memory, and may further include a non-volatile memory, for example at least one disk memory. The memory 15 may optionally include at least one storage device located remotely from the aforementioned processor 11.
In some embodiments, the memory 15 stores the following elements, executable modules or data structures, or a subset or a superset of them:
An operating system 151, including various system programs for implementing various basic services and handling hardware-based tasks;
A program 152.
Specifically, the processor 11 is configured to call the program 152 stored in the memory 15 to execute the adaptive-sampling unbalanced data classification processing method described in the above embodiments.
The processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control centre of the adaptive-sampling unbalanced data classification processing device, and connects the various parts of the entire device using various interfaces and lines.
The memory can be used to store the computer program and/or modules. The processor implements the various functions of the electronic device for unbalanced data classification preprocessing by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required by at least one function (such as a sound playback function or a text conversion function), and the data storage area may store data created through use of the device (such as audio data or text message data). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one disk storage device, a flash memory device, or another volatile solid-state storage device.
If the modules for adaptive-sampling unbalanced data classification are implemented in the form of software functional units and are sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the flow of the methods in the above embodiments may also be completed by a computer program instructing the relevant hardware; the computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, a connection between modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
The above is the preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.
It should be noted that, in the above embodiments, the description of each embodiment has its own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the relevant descriptions of the other embodiments. In addition, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
Claims (10)
1. An adaptively sampled unbalanced data classification processing method, characterized by comprising:
obtaining a target majority sample number and a target minority sample number;
performing adaptive sampling data processing on unbalanced data to be processed according to the target majority sample number and the target minority sample number, so that the number of majority samples in the processed unbalanced data meets the target majority sample number and the number of minority samples in the processed unbalanced data meets the target minority sample number; wherein the adaptive sampling data processing comprises over-sampling and under-sampling.
2. The adaptively sampled unbalanced data classification processing method according to claim 1, characterized in that performing adaptive sampling data processing on the unbalanced data to be processed according to the target majority sample number and the target minority sample number, so that the number of majority samples in the processed unbalanced data meets the target majority sample number and the number of minority samples in the processed unbalanced data meets the target minority sample number, comprises:
when the number of minority samples in the unbalanced data to be processed does not meet the target minority sample number, determining the category of each minority sample in the unbalanced data to be processed according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the category is one of noise sample, unstable sample, boundary sample and stable sample;
performing the operation corresponding to the category according to the category of the minority sample; wherein the operation comprises deleting, retaining, replicating or synthesizing;
when the number of majority samples in the unbalanced data to be processed does not meet the target majority sample number, determining the category of each majority sample in the unbalanced data to be processed according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the category is one of noise sample, boundary sample and stable sample;
performing the operation corresponding to the category according to the category of each majority sample; wherein the operation comprises deleting, retaining and selectively deleting;
obtaining a final minority sample set and a final majority sample set, wherein the size of the final minority sample set meets the target minority sample number and the size of the final majority sample set meets the target majority sample number.
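The neighbor-based categorization in claim 2 can be illustrated with a minimal Python sketch. The claim names the categories but does not state their cut-offs in this section, so the thresholds in `categorize_minority`, the function names, and the label strings `"maj"` and `"min"` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def knn_label_counts(index, points, labels, k):
    """Count the class labels among the k nearest neighbours of points[index]."""
    distances = sorted(
        (math.dist(points[index], p), j)
        for j, p in enumerate(points)
        if j != index
    )
    return Counter(labels[j] for _, j in distances[:k])


def categorize_minority(n_majority_neighbors, k):
    """Map a minority sample to a category from its neighbour composition.

    Assumed thresholds (not taken from the claims):
    all neighbours majority -> noise, more than half -> unstable,
    at least one -> boundary, none -> stable.
    """
    if n_majority_neighbors == k:
        return "noise"
    if n_majority_neighbors > k / 2:
        return "unstable"
    if n_majority_neighbors > 0:
        return "boundary"
    return "stable"


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (5.5, 5.5)]
labels = ["min", "min", "min", "maj", "maj", "maj", "min"]
k = 2
for i, label in enumerate(labels):
    if label == "min":
        n_maj = knn_label_counts(i, points, labels, k)["maj"]
        print(points[i], categorize_minority(n_maj, k))
```

The same neighbor count could drive the majority-side categorization (noise, boundary, stable) with the roles of the two classes swapped.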
3. The adaptively sampled unbalanced data classification processing method according to claim 2, characterized in that performing the operation corresponding to the category according to the category of the minority sample comprises:
when the minority sample is a noise sample, deleting the minority sample;
when the minority sample is an unstable sample, adding the minority sample to the minority sample set without replicating it or generating new samples from it, and updating the minority sample set number;
when the minority sample is a boundary sample, replicating the minority sample according to a replication ratio c = (the target minority sample number - the number of unstable samples) / (the number of minority samples in the unbalanced data to be processed - the number of noise samples - the number of unstable samples) to obtain replicated samples, adding the replicated samples and the minority sample to the minority sample set, and updating the minority sample set number; wherein the number of replications is the absolute value of the difference between the replication ratio c and one;
when the minority sample is a stable sample, synthesizing neighbor samples of the minority sample with the minority sample to obtain synthetic samples, adding the synthetic samples and the minority sample to the minority sample set, and updating the minority sample set number; wherein the number of syntheses is the absolute value of the difference between the replication ratio c and one; wherein the replication ratio c = (the target minority sample number - the number of unstable samples) / (the number of minority samples in the unbalanced data to be processed - the number of noise samples - the number of unstable samples).
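As a worked illustration of the replication ratio in claim 3, the following Python sketch computes c and the resulting number of extra copies per sample. The function names are hypothetical, and rounding |c - 1| to an integer is an assumption, since the claim does not say how a fractional value is handled.

```python
def replication_ratio(target_minority, n_minority, n_noise, n_unstable):
    """c = (target - unstable) / (minority - noise - unstable), as in claim 3."""
    denominator = n_minority - n_noise - n_unstable
    if denominator <= 0:
        raise ValueError("no boundary or stable minority samples to replicate")
    return (target_minority - n_unstable) / denominator


def copies_per_sample(c):
    """Number of replications or syntheses per sample: |c - 1|.

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(abs(c - 1))


# Example: 40 minority samples, 5 of them noise and 5 unstable, target of 100.
c = replication_ratio(target_minority=100, n_minority=40, n_noise=5, n_unstable=5)
print(c, copies_per_sample(c))
```

With these example numbers, each of the 30 boundary or stable samples gains two extra samples, which leaves a shortfall that the step in claim 4 is meant to fill.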
4. The adaptively sampled unbalanced data classification processing method according to claim 3, characterized in that performing the operation corresponding to the category according to the category of the minority sample further comprises:
when it is detected that every minority sample in the unbalanced data to be processed has been traversed, calculating a shortfall number d = the target minority sample number - the current minority sample set number;
randomly selecting a corresponding number of stable samples according to the shortfall number d, and synthesizing neighbor samples of each selected stable sample with that stable sample to obtain new synthetic samples, and adding the new synthetic samples and the stable samples to the minority sample set.
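The shortfall step in claim 4 can be sketched as follows. The claim only says that a stable sample and a neighbor are "synthesized"; linear interpolation in the style of SMOTE is swapped in here as an assumption, and `synthesize`, `fill_shortfall` and `neighbors_of` are hypothetical names. The sketch also assumes the stable samples are already in the minority set, so only the new synthetic samples are appended.

```python
import random


def synthesize(sample, neighbor, rng):
    """Create a point on the segment between sample and neighbor (assumed rule)."""
    t = rng.random()
    return tuple(a + t * (b - a) for a, b in zip(sample, neighbor))


def fill_shortfall(minority_set, stable_samples, neighbors_of, target, seed=0):
    """Append d = target - len(minority_set) synthetic samples, if d > 0."""
    rng = random.Random(seed)
    shortfall = target - len(minority_set)
    result = list(minority_set)
    for _ in range(max(shortfall, 0)):
        base = rng.choice(stable_samples)          # random stable sample
        neighbor = rng.choice(neighbors_of[base])  # one of its minority neighbours
        result.append(synthesize(base, neighbor, rng))
    return result


minority_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
stable_samples = [(0.0, 0.0)]
neighbors_of = {(0.0, 0.0): [(1.0, 0.0), (0.0, 1.0)]}
filled = fill_shortfall(minority_set, stable_samples, neighbors_of, target=6)
print(len(filled))
```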
5. The adaptively sampled unbalanced data classification processing method according to claim 2, characterized in that performing the operation corresponding to the category according to the category of each majority sample comprises:
when the majority sample is a noise sample, deleting the majority sample;
when the majority sample is a boundary sample, retaining the boundary sample and adding the majority sample to the majority sample set;
when the majority sample is a stable sample, performing selective deletion according to a deletion probability set from the distance between the majority sample and its surrounding k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set.
6. The adaptively sampled unbalanced data classification processing method according to claim 5, characterized in that, when the majority sample is a stable sample, performing selective deletion according to a deletion probability set from the distance between the majority sample and its surrounding k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, comprises:
calculating a distance d from the majority sample to its k nearest neighbor samples, and setting the deletion probability of the stable sample according to the magnitude of the distance d;
when it is detected that the deletion probability is greater than or equal to a preset value, deleting the stable sample; wherein the smaller the distance d, the larger the deletion probability;
when it is detected that the deletion probability is less than the preset value, retaining the stable sample and adding the majority sample to the majority sample set; wherein the larger the distance d, the smaller the deletion probability.
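A minimal Python sketch of the distance-based deletion rule in claim 6 follows. The claim requires only that the deletion probability decrease as the distance d grows; taking d as the mean distance to the k nearest neighbors and mapping it through 1 / (1 + d) are both assumptions of this sketch, as are the function names and the threshold value.

```python
import math


def mean_knn_distance(sample, others, k):
    """Mean Euclidean distance from sample to its k nearest points in others."""
    nearest = sorted(math.dist(sample, o) for o in others)[:k]
    return sum(nearest) / len(nearest)


def deletion_probability(d):
    """Assumed monotone mapping: smaller distance gives a larger probability."""
    return 1.0 / (1.0 + d)


def keep_stable_majority(sample, others, k, preset):
    """Return True if the stable majority sample is retained under claim 6."""
    return deletion_probability(mean_knn_distance(sample, others, k)) < preset


dense = keep_stable_majority((0, 0), [(0.1, 0), (0, 0.1), (5, 5)], k=2, preset=0.5)
sparse = keep_stable_majority((0, 0), [(3, 0), (0, 4), (9, 9)], k=2, preset=0.5)
print(dense, sparse)
```

A sample in a dense region is redundant with its neighbors and is deleted, while a sample in a sparse region carries more information and is kept.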
7. The adaptively sampled unbalanced data classification processing method according to claim 5, characterized in that, when the majority sample is a stable sample, performing selective deletion according to a deletion probability set from the distance between the majority sample and its surrounding k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, further comprises:
obtaining a deletion number e = the number of majority samples in the unbalanced data to be processed - the number of noise samples - the target majority sample number;
obtaining the number f of majority samples deleted so far;
when f is less than e, performing selective deletion on the stable samples;
adding the stable samples that are not deleted to the majority sample set.
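The deletion quota in claim 7 can be sketched as a simple counter that stops selective deletion once e samples have been removed. The names `deletion_quota`, `undersample_stable` and the `should_delete` predicate are hypothetical; in practice the predicate would be the probability test of claim 6, and clamping e at zero is an assumption.

```python
def deletion_quota(n_majority, n_noise, target_majority):
    """e = majority count - noise count - target majority count (clamped at 0)."""
    return max(n_majority - n_noise - target_majority, 0)


def undersample_stable(stable_samples, should_delete, quota):
    """Delete stable samples flagged by should_delete until quota is reached."""
    kept = []
    deleted = 0  # f in claim 7: majority samples deleted so far
    for sample in stable_samples:
        if deleted < quota and should_delete(sample):
            deleted += 1
        else:
            kept.append(sample)
    return kept


e = deletion_quota(n_majority=100, n_noise=10, target_majority=70)
kept = undersample_stable(list(range(10)), lambda s: s % 2 == 0, quota=3)
print(e, kept)
```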
8. An adaptively sampled unbalanced data classification processing device, characterized by comprising:
an acquisition module, configured to obtain a target majority sample number and a target minority sample number;
a processing module, configured to perform adaptive sampling data processing on unbalanced data to be processed according to the target majority sample number and the target minority sample number, so that the number of majority samples in the unbalanced data to be processed meets the target majority sample number and the number of minority samples in the unbalanced data to be processed meets the target minority sample number; wherein the adaptive sampling data processing comprises over-sampling and under-sampling.
9. Adaptively sampled unbalanced data classification processing equipment, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the adaptively sampled unbalanced data classification processing method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored computer program, wherein, when the computer program runs, the equipment on which the computer-readable storage medium is located is controlled to execute the adaptively sampled unbalanced data classification processing method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810453102.7A CN108694413A (en) | 2018-05-10 | 2018-05-10 | Adaptively sampled unbalanced data classification processing method, device, equipment and medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810453102.7A CN108694413A (en) | 2018-05-10 | 2018-05-10 | Adaptively sampled unbalanced data classification processing method, device, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108694413A true CN108694413A (en) | 2018-10-23 |
Family
ID=63847485
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810453102.7A Pending CN108694413A (en) | 2018-05-10 | 2018-05-10 | Adaptively sampled unbalanced data classification processing method, device, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108694413A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460872A (en) * | 2018-11-14 | 2019-03-12 | 重庆邮电大学 | One kind being lost unbalanced data prediction technique towards mobile communication subscriber |
CN109740750A (en) * | 2018-12-17 | 2019-05-10 | 北京深极智能科技有限公司 | Method of data capture and device |
CN109886337A (en) * | 2019-02-22 | 2019-06-14 | 清华大学 | Based on adaptively sampled depth measure learning method and system |
CN110045197A (en) * | 2019-02-27 | 2019-07-23 | 国网福建省电力有限公司 | A kind of Distribution Network Failure method for early warning |
- 2018-05-10 CN CN201810453102.7A patent/CN108694413A/en active Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460872A (en) * | 2018-11-14 | 2019-03-12 | 重庆邮电大学 | One kind being lost unbalanced data prediction technique towards mobile communication subscriber |
CN109460872B (en) * | 2018-11-14 | 2021-11-16 | 重庆邮电大学 | Mobile communication user loss imbalance data prediction method |
CN109740750A (en) * | 2018-12-17 | 2019-05-10 | 北京深极智能科技有限公司 | Method of data capture and device |
CN109886337A (en) * | 2019-02-22 | 2019-06-14 | 清华大学 | Based on adaptively sampled depth measure learning method and system |
CN109886337B (en) * | 2019-02-22 | 2021-09-14 | 清华大学 | Depth measurement learning method and system based on self-adaptive sampling |
CN110045197A (en) * | 2019-02-27 | 2019-07-23 | 国网福建省电力有限公司 | A kind of Distribution Network Failure method for early warning |
CN110045197B (en) * | 2019-02-27 | 2022-12-13 | 国网福建省电力有限公司 | Distribution network fault early warning method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108694413A (en) | Adaptively sampled unbalanced data classification processing method, device, equipment and medium | |
CN108647727A (en) | Unbalanced data classification lack sampling method, apparatus, equipment and medium | |
CN106156092B (en) | Data processing method and device | |
CN107193750A (en) | A kind of script method for recording and device | |
CN109409528A (en) | Model generating method, device, computer equipment and storage medium | |
CN108491474A (en) | A kind of data classification method, device, equipment and computer readable storage medium | |
CN109388675A (en) | Data analysing method, device, computer equipment and storage medium | |
KR20190075962A (en) | Data processing method and data processing apparatus | |
CN106803799B (en) | Performance test method and device | |
CN107622326A (en) | User's classification, available resources Forecasting Methodology, device and equipment | |
CN107656966A (en) | The method and server of a kind of processing data | |
CN108520471A (en) | It is overlapped community discovery method, device, equipment and storage medium | |
CN107507036A (en) | The method and terminal of a kind of data prediction | |
CN110097170A (en) | Information pushes object prediction model acquisition methods, terminal and storage medium | |
CN107908796A (en) | E-Government duplicate checking method, apparatus and computer-readable recording medium | |
CN115660711A (en) | User ID generation method and device, electronic equipment and readable storage medium | |
CN110298508A (en) | Behavior prediction method, device and equipment | |
CN109978575B (en) | Method and device for mining user flow operation scene | |
CN108647728B (en) | Unbalanced data classification oversampler method, device, equipment and medium | |
CN109255676A (en) | Method of Commodity Recommendation, device, computer equipment and storage medium | |
CN106447397A (en) | Tobacco retail customer pricing method based on decision tree algorithm | |
CN109767333A (en) | Select based method, device, electronic equipment and computer readable storage medium | |
CN106066966B (en) | Frozen application display method, frozen application display device and terminal | |
CN107645583A (en) | A kind of contact sequencing method, mobile terminal and computer-readable recording medium | |
CN111124209A (en) | Interface display adjustment method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20181023 |