CN108694413A

CN108694413A - Adaptive-sampling method, apparatus, device and medium for classification processing of imbalanced data

Info

Publication number: CN108694413A
Application number: CN201810453102.7A
Authority: CN
Inventors: 韩伟红; 李树栋; 王乐; 方滨兴; 贾焰; 黄子中; 周斌; 殷丽华; 田志宏
Original assignee: Guangzhou University
Current assignee: Guangzhou University
Priority date: 2018-05-10
Filing date: 2018-05-10
Publication date: 2018-10-23

Abstract

The invention discloses an adaptive-sampling method for classification processing of imbalanced data, comprising: obtaining a target majority sample count and a target minority sample count; and performing adaptive-sampling data processing on the imbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count; wherein the adaptive-sampling data processing includes oversampling and undersampling. Oversampling and undersampling are combined according to the individual needs of the user, so that the newly generated sample set meets the data requirements of the classification algorithm, which improves the classification accuracy of imbalanced big data.

Description

Adaptive-sampling method, apparatus, device and medium for classification processing of imbalanced data

Technical field

The present invention relates to the field of imbalanced big data processing, and in particular to an adaptive-sampling method, apparatus, device and medium for classification processing of imbalanced data.

Background art

With the continuous progress of technology, including faster Internet speeds, successive generations of mobile Internet and the steady development of hardware, data acquisition, storage and processing technologies have advanced significantly. Data is growing at an unprecedented rate, and we have entered the era of big data. Big data is characterized by huge scale (volume), high-speed generation (velocity), diverse forms (variety) and data uncertainty (veracity), and these characteristics pose unprecedented challenges for traditional data analysis and mining techniques when they are applied to big data.

Data classification is a basic algorithm in data analysis and mining. It has a wide range of applications and is the foundation of many other data analysis and mining algorithms. In big data, almost all data sets are imbalanced. Imbalanced data means that at least one class in the data set contains far fewer samples than the other classes. The data imbalance problem is widespread in the real world, especially in big data applications. For example, in Internet text classification the data of the individual classes is not balanced, and what we care about is often the data of a small class, such as sensitive information on the network or emerging topics. In e-commerce applications, the vast bulk of customer transaction data and behavioral data is normal, while what we care about is often fraud and abnormal behavior; this data is submerged in a large amount of normal behavioral data and forms a highly imbalanced data set. Similar applications include medical diagnosis and satellite remote sensing data classification. Therefore, classification of imbalanced big data is a key technical problem that urgently needs to be solved in national economic and social development, and it has broad application prospects.

Because the numbers of samples in different classes differ so greatly, imbalanced big data makes it difficult for traditional classification learning algorithms to achieve good classification results. Fig. 1 shows an example of imbalanced data classification in the prior art, in which circles are minority-class samples and triangles are majority-class samples. The imbalance ratio is 3:1, that is, there are 3 times as many majority-class samples as minority-class samples. In real large data sets the imbalance ratio is often 10000:1 or even higher, so the data must be preprocessed before classification.

Existing preprocessing methods for imbalanced big data mainly comprise oversampling of the minority class and undersampling of the majority class. Oversampling increases the number of minority-class samples by some method or technique, and undersampling reduces the number of majority-class samples by some method or technique. The purpose of both is to reduce the degree of imbalance of the large data set by adjusting the sample set, and thereby to increase the accuracy of the classification algorithm.

In implementing the embodiments of the present invention, the inventors found the following technical problems in the prior art. Different classification algorithms and different applications place different requirements on the size of the imbalanced data set and on its imbalance ratio. Oversampling increases the size of the training set; in particular, when the imbalance ratio of the original training set is very large, the number of newly synthesized minority-class samples approaches the number of majority-class samples. Suppose the original training set contains 100 minority-class samples and 10000 majority-class samples: oversampling then has to synthesize 9900 new minority-class samples, so the final number of training samples increases substantially. On the one hand, synthesizing so many samples causes the new samples to duplicate existing samples to a large extent; on the other hand, the increase in data volume reduces the performance of the classification algorithm. Undersampling does reduce the size of the data, and when the imbalance ratio of the original training set is very large the data size after undersampling is greatly reduced. However, removing too many majority-class samples in order to reach balance may cause loss of useful information and may significantly degrade classification of the majority class.

Summary of the invention

In view of the above problems, the purpose of the present invention is to provide an adaptive-sampling method for classification processing of imbalanced data that combines oversampling and undersampling according to the data requirements of the classification algorithm, so that the newly generated sample set meets those requirements and the classification accuracy of imbalanced big data is improved.

In a first aspect, the present invention provides an adaptive-sampling method for classification processing of imbalanced data, comprising:

Obtaining a target majority sample count and a target minority sample count;

Performing adaptive-sampling data processing on the imbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count; wherein the adaptive-sampling data processing includes oversampling and undersampling.

In a first possible implementation of the first aspect, performing adaptive-sampling data processing on the imbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the data meets the target majority sample count and the number of minority samples in the data meets the target minority sample count, comprises:

When the number of minority samples in the imbalanced data to be processed does not meet the target minority sample count, determining the category of each minority sample according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the categories include noise sample, unstable sample, boundary sample and stable sample;

Performing, on each minority sample, the operation corresponding to its category; wherein the operations include deleting, retaining, replicating or synthesizing;

When the number of majority samples in the imbalanced data to be processed does not meet the target majority sample count, determining the category of each majority sample according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the categories include noise sample, boundary sample and stable sample;

Performing, on each majority sample, the operation corresponding to its category; wherein the operations include deleting, retaining and selectively deleting;

Obtaining a final minority sample set and a final majority sample set, wherein the size of the final minority sample set meets the target minority sample count and the size of the final majority sample set meets the target majority sample count.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, performing the operation corresponding to the category of the minority sample comprises:

When the minority sample is a noise sample, deleting the minority sample;

When the minority sample is an unstable sample, adding the minority sample to the minority sample set without replicating it or generating new samples from it, and updating the size of the minority sample set;

When the minority sample is a boundary sample, replicating the minority sample according to the replication ratio c = (target minority sample count - number of unstable samples) / (number of minority samples in the imbalanced data to be processed - number of noise samples - number of unstable samples) to obtain replicated samples, adding the replicated samples and the minority sample to the minority sample set, and updating the size of the minority sample set; wherein the number of replications is the absolute value of the replication ratio c minus one, that is, |c - 1|;

When the minority sample is a stable sample, synthesizing the minority sample with its neighbor samples to obtain synthetic samples, adding the synthetic samples and the minority sample to the minority sample set, and updating the size of the minority sample set; wherein the number of syntheses is |c - 1|, and the replication ratio c = (target minority sample count - number of unstable samples) / (number of minority samples in the imbalanced data to be processed - number of noise samples - number of unstable samples).
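
The arithmetic of the replication ratio can be checked with a small sketch. The function names and the example figures below (a target of 250 minority samples, 100 minority samples of which 10 are noise and 20 are unstable) are hypothetical, and truncating |c - 1| to a whole number of copies is an assumption, since the text does not say how a fractional value is handled.

```python
def replication_ratio(target_minor, n_minor, n_noise, n_unstable):
    """c = (target - unstable) / (minority - noise - unstable)."""
    usable = n_minor - n_noise - n_unstable  # boundary + stable samples
    if usable <= 0:
        raise ValueError("no boundary or stable minority samples to expand")
    return (target_minor - n_unstable) / usable


def extra_per_sample(c):
    """Copies (or synthetic samples) generated per boundary/stable sample.

    The text gives |c - 1|; truncating to an integer is an assumption.
    """
    return int(abs(c - 1))


def size_after_pass(target_minor, n_minor, n_noise, n_unstable):
    """Minority set size once every minority sample has been visited."""
    c = replication_ratio(target_minor, n_minor, n_noise, n_unstable)
    usable = n_minor - n_noise - n_unstable
    # Unstable samples are kept once; each usable sample is kept together
    # with its extra copies; noise samples are dropped.
    return n_unstable + usable * (1 + extra_per_sample(c))
```

With these figures c = 230/70, each boundary or stable sample gains 2 extra samples, and the pass ends with 230 samples, 20 short of the target. Under the truncation assumption a shortfall of this kind is what remains to be filled after the traversal.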

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing the operation corresponding to the category of the minority sample further comprises:

When it is detected that every minority sample in the imbalanced data to be processed has been traversed, calculating the missing count d = target minority sample count - current size of the minority sample set;

Randomly selecting a corresponding number of stable samples according to the missing count d, synthesizing each selected stable sample with its neighbor samples to obtain new synthetic samples, and adding the new synthetic samples and the stable samples to the minority sample set.
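
A minimal sketch of this top-up step follows. The text does not specify how a stable sample and a neighbor are combined, so the linear interpolation used here (in the style of SMOTE) is an assumption, as are the names `synthesize` and `top_up` and the data layout. The sketch also assumes the stable samples are already in the minority set, so only the synthetic points are appended.

```python
import random


def synthesize(sample, neighbor, rng):
    """Return a point on the segment between sample and neighbor.

    Linear interpolation is an assumption of this sketch.
    """
    t = rng.random()
    return tuple(s + t * (n - s) for s, n in zip(sample, neighbor))


def top_up(minority_set, stable, neighbors, target_minor, seed=0):
    """Append d = target_minor - len(minority_set) synthetic samples.

    stable[i] is a stable minority sample and neighbors[i] lists its
    neighbor samples (hypothetical layout).
    """
    rng = random.Random(seed)
    d = target_minor - len(minority_set)  # missing count
    if d > 0 and not stable:
        raise ValueError("no stable samples available for synthesis")
    result = list(minority_set)
    for _ in range(max(d, 0)):
        i = rng.randrange(len(stable))      # pick a stable sample at random
        partner = rng.choice(neighbors[i])  # pick one of its neighbors
        result.append(synthesize(stable[i], partner, rng))
    return result
```

Passing a fixed `seed` keeps the random selection reproducible, which is convenient when comparing classifiers trained on the resampled data.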

With reference to the first possible implementation of the first aspect, in a fourth possible implementation of the first aspect, performing the operation corresponding to the category of each majority sample comprises:

When the majority sample is a noise sample, deleting the majority sample;

When the majority sample is a boundary sample, retaining the boundary sample and adding the majority sample to the majority sample set;

When the majority sample is a stable sample, setting a deletion probability according to the distance from the majority sample to its k nearest neighbors and performing selective deletion, and adding the majority samples that are not deleted to the majority sample set.

With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, when the majority sample is a stable sample, performing selective deletion according to the deletion probability set from the distance from the majority sample to its k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, comprises:

Calculating the distance d from the majority sample to its k nearest neighbor samples, and setting the deletion probability of the stable sample according to the magnitude of the distance d;

When it is detected that the deletion probability is greater than or equal to a preset value, deleting the stable sample; wherein the smaller the distance d, the larger the deletion probability;

When it is detected that the deletion probability is less than the preset value, retaining the stable sample and adding the majority sample to the majority sample set; wherein the larger the distance d, the smaller the deletion probability.
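
The text only requires that the deletion probability fall as the distance d grows; it does not give a formula. The sketch below therefore assumes, purely for illustration, that d is the mean Euclidean distance to the k nearest neighbors and that the probability is 1 / (1 + d). The function names and the default preset value of 0.5 are likewise hypothetical.

```python
import math


def mean_knn_distance(point, others, k):
    """Mean Euclidean distance from point to its k nearest neighbors."""
    if k < 1 or k > len(others):
        raise ValueError("k must be between 1 and len(others)")
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k]) / k


def deletion_probability(d):
    """Smaller distance gives larger probability (assumed form 1 / (1 + d))."""
    return 1.0 / (1.0 + d)


def should_delete(d, preset=0.5):
    """Delete when the probability reaches the preset value."""
    return deletion_probability(d) >= preset
```

A stable majority sample sitting in a dense region (small d) is thus more likely to be removed than an isolated one, which keeps the samples that carry the most distinct information.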

With reference to the fifth possible implementation of the first aspect, in a sixth possible implementation of the first aspect, when the majority sample is a stable sample, performing selective deletion according to the deletion probability set from the distance from the majority sample to its k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, further comprises:

Obtaining the number of samples to be deleted, e = number of majority samples in the imbalanced data to be processed - number of noise samples - target majority sample count;

Obtaining the number f of majority samples deleted so far;

When f is less than e, performing selective deletion on the stable samples;

Adding the stable samples that are not deleted to the majority sample set.
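
The role of the budget e and the running count f can be sketched as follows. The function name, the argument layout and the preset value of 0.5 are assumptions for illustration; the deletion probabilities are taken as already computed for each stable sample.

```python
def undersample_stable(stable, probs, n_major, n_noise, target_major, preset=0.5):
    """Selectively delete stable majority samples without exceeding the budget.

    stable: stable majority samples; probs: their deletion probabilities.
    Returns the retained samples and the number f actually deleted.
    """
    e = n_major - n_noise - target_major  # number of samples to delete
    kept, f = [], 0
    for sample, p in zip(stable, probs):
        if f < e and p >= preset:
            f += 1               # delete: the budget is not yet used up
        else:
            kept.append(sample)  # retain: low probability or budget spent
    return kept, f
```

For example, with 10 majority samples, 1 noise sample and a target of 7, the budget is e = 2, so a third sample whose probability exceeds the preset value is still retained once two have been deleted.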

In a second aspect, the present invention further provides an adaptive-sampling apparatus for classification processing of imbalanced data, comprising:

An acquisition module, configured to obtain a target majority sample count and a target minority sample count;

A processing module, configured to perform adaptive-sampling data processing on the imbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the data meets the target majority sample count and the number of minority samples in the data meets the target minority sample count; wherein the adaptive-sampling data processing includes oversampling and undersampling.

In a third aspect, an embodiment of the present invention further provides an adaptive-sampling device for classification processing of imbalanced data, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the adaptive-sampling method for classification processing of imbalanced data according to any one of the above.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium comprising a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the adaptive-sampling method for classification processing of imbalanced data according to any one of the above.

The above technical solution has the following advantages. A target majority sample count and a target minority sample count are obtained, and adaptive-sampling data processing, which includes oversampling and undersampling, is performed on the imbalanced data to be processed according to these two counts, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count. A sample set that meets requirements is thus generated adaptively according to user needs: the user can input the required total number of samples and the desired imbalance ratio of the data set, and oversampling and undersampling are combined adaptively according to those needs, with the minority-class samples oversampled while the majority-class samples are undersampled. The sample set finally generated meets the user's needs, which effectively improves the classification accuracy of imbalanced big data.

Description of the drawings

Fig. 1 is an example diagram of imbalanced data classification in the prior art;

Fig. 2 is a schematic flowchart of the adaptive-sampling method for classification processing of imbalanced data provided by the first embodiment of the present invention;

Fig. 3 is a schematic flowchart of the adaptive-sampling method for classification processing of imbalanced data provided by the second embodiment of the present invention;

Fig. 4 is a schematic diagram of the k nearest neighbors of minority samples in the imbalanced data to be processed;

Fig. 5 is a schematic flowchart of the adaptive-sampling method for classification processing of imbalanced data provided by the third embodiment of the present invention;

Fig. 6 is a schematic flowchart of the adaptive-sampling method for classification processing of imbalanced data provided by the fourth embodiment of the present invention;

Fig. 7 is a schematic structural diagram of an adaptive-sampling apparatus for classification processing of imbalanced data provided by the fifth embodiment of the present invention;

Fig. 8 is a schematic structural diagram of an adaptive-sampling device for classification processing of imbalanced data provided by the sixth embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

Embodiment one

Referring to Fig. 2, a schematic flowchart of the adaptive-sampling method for classification processing of imbalanced data provided by the first embodiment of the present invention is shown.

It should be noted that, in preprocessing imbalanced big data, existing methods do not adaptively generate a sample set that meets requirements according to user needs, so neither the size nor the balance ratio of the resulting sample set meets actual needs. Moreover, existing methods either use oversampling to increase the number of minority-class samples or use undersampling to delete majority-class samples, and neither oversampling alone nor undersampling alone can effectively improve the classification accuracy of imbalanced big data. The number of samples that existing undersampling methods delete is limited; this is effective for ordinary imbalanced data sets but can hardly meet the classification needs of imbalanced large data sets.

It should be noted that the adaptive-sampling method for classification processing of imbalanced data is processing performed on the data in advance, before classification; that is, it is used for data preprocessing.

The adaptive-sampling method for classification processing of imbalanced data provided in this embodiment may be executed by a terminal device, including but not limited to a mobile phone, a laptop, a tablet computer or a desktop computer.

The adaptive-sampling method for classification processing of imbalanced data is specifically as follows:

S11, obtaining a target majority sample count and a target minority sample count.

In this embodiment, the terminal device may obtain a preset sample set value and a preset imbalanced data sample ratio, and then derive the target majority sample count and the target minority sample count from them. The terminal device either receives the preset sample set value and the preset imbalanced data sample ratio input by the user, or directly uses default values of both that it has saved; the present invention does not specifically limit this. The preset sample set value is the size of the sample set that the user wants to obtain after the imbalanced data to be processed has been processed. The preset sample set consists of majority samples and minority samples, so the preset sample set value is the sum of the number of majority samples and the number of minority samples. The preset imbalanced data sample ratio is the ratio between the number of majority samples and the number of minority samples in the preset sample set; it may be expressed as the number of majority samples divided by the number of minority samples, or as the number of minority samples divided by the number of majority samples, and the present invention does not specifically limit this.

Specifically, after the preset sample set value and the preset imbalanced data sample ratio have been obtained, assume that the preset sample set value is X and that the preset imbalanced data sample ratio = majority sample proportion / minority sample proportion = a/b. Then the target majority sample count is A = X*a/(a+b) and the target minority sample count is B = X*b/(a+b).
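
As a rough illustration of this split, the sketch below computes A and B from X and the ratio a:b. The function name `target_counts` and the rounding rule are assumptions made here; the text does not say how fractional counts are handled.

```python
def target_counts(total, a, b):
    """Split a requested sample-set size X (total) by the ratio a:b.

    a:b is the majority:minority proportion. Rounding A to the nearest
    integer and giving the remainder to B is an assumption of this sketch.
    """
    if total <= 0 or a <= 0 or b <= 0:
        raise ValueError("total, a and b must all be positive")
    majority = round(total * a / (a + b))  # A = X*a/(a+b)
    minority = total - majority            # B = X*b/(a+b), so that A + B == X
    return majority, minority
```

For X = 1000 and a ratio of 3:1, this gives A = 750 and B = 250.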

In this embodiment, the terminal device may also obtain the target majority sample count and the target minority sample count directly: either the terminal device provides an input interface through which the two counts are entered, or it reads pre-saved values and uses them as the default target majority sample count and target minority sample count.

S12, performing adaptive-sampling data processing on the imbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count; wherein the adaptive-sampling data processing includes oversampling and undersampling.

In the embodiments of the present invention, when the number of majority samples in the imbalanced data to be processed does not meet the target majority sample count, undersampling is performed on the majority samples in the data; when the number of minority samples in the imbalanced data to be processed does not meet the target minority sample count, oversampling is performed on the minority samples in the data.

It should be noted that the number of majority samples in the processed data meeting the target majority sample count means either that it falls within an error range of the target majority sample count or that it is equal to the target majority sample count. Likewise, the number of minority samples in the processed data meeting the target minority sample count means either that it falls within an error range of the target minority sample count or that it is equal to the target minority sample count. The present invention does not specifically limit this.

In the embodiments of the present invention, there are the following four cases. First, the number of majority samples in the imbalanced data to be processed does not meet the target majority sample count while the number of minority samples meets the target minority sample count; in this case only the majority samples are undersampled. Second, the number of minority samples does not meet the target minority sample count while the number of majority samples meets the target majority sample count; in this case only the minority samples are oversampled. Third, neither the majority samples nor the minority samples meet their corresponding target counts; in this case the majority samples are undersampled and the minority samples are oversampled. Fourth, both the majority samples and the minority samples meet their corresponding target counts; in this case no operation is performed. The condition for meeting a target may then be reset, for example so that falling within the error range no longer suffices, which improves the adaptability of imbalanced big data classification and better matches user needs.
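
The four cases reduce to two independent checks, as the following sketch shows. The function names, the step labels and the `tolerance` parameter (standing in for the error range, with 0 meaning exact equality) are assumptions for illustration.

```python
def meets(count, target, tolerance=0):
    """True when count is within `tolerance` of target (0 means equal)."""
    return abs(count - target) <= tolerance


def plan_sampling(n_major, n_minor, target_major, target_minor, tolerance=0):
    """Return the sampling steps needed for the current class counts."""
    steps = []
    if not meets(n_major, target_major, tolerance):
        steps.append("undersample_majority")
    if not meets(n_minor, target_minor, tolerance):
        steps.append("oversample_minority")
    return steps  # an empty list is the fourth case: nothing to do
```

Calling `plan_sampling(10000, 100, 750, 250)` returns both steps, while `plan_sampling(750, 250, 750, 250)` returns an empty list.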

Implementing this embodiment has the following beneficial effects:

A target majority sample count and a target minority sample count are obtained, and adaptive-sampling data processing, which includes oversampling and undersampling, is performed on the imbalanced data to be processed according to these two counts, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count. A sample set that meets requirements is thus generated adaptively according to user needs: the user can input the required total number of samples and the desired imbalance ratio of the data set, and oversampling and undersampling are combined adaptively according to those needs, with the minority-class samples oversampled while the majority-class samples are undersampled. The sample set finally generated meets the user's needs and effectively improves the classification accuracy of imbalanced big data. By using undersampling and oversampling together, the degree of imbalance of the large data set is reduced through adjustment of the sample set, which increases the accuracy of the classification algorithm.

Embodiment two

Referring to Fig. 3, a schematic flowchart of the adaptive-sampling method for classification processing of imbalanced data provided by the second embodiment of the present invention is shown.

Performing adaptive-sampling data processing on the imbalanced data to be processed according to the target majority sample count and the target minority sample count, so that the number of majority samples in the processed data meets the target majority sample count and the number of minority samples in the processed data meets the target minority sample count, comprises:

S21, when the number of minority samples in the imbalanced data to be processed does not meet the target minority sample count, determining the category of each minority sample according to the number of majority-class samples and the number of minority-class samples among its k nearest neighbors; wherein the categories include noise sample, unstable sample, boundary sample and stable sample;

It should be noted that k is an integer greater than 1 whose value is determined according to the actual situation; the present invention does not specifically limit this.

It should be noted that when the overwhelming majority of the neighbor samples of a minority sample are majority samples, that is, when the number of majority samples among its neighbors reaches a preset value, the minority sample is considered a noise sample. When the majority samples among its neighbors outnumber the minority samples, the minority sample is considered an unstable sample. When the numbers of majority samples and minority samples among its neighbors are comparable, without significant difference, the minority sample is considered a boundary sample. When there are fewer majority samples than minority samples among its neighbors, the minority sample is considered a stable sample.

It should be noted that the tolerance used in each of these comparisons can be set according to the actual situation. For example, when the number of majority samples and the number of minority samples among the neighbors of a minority sample differ by less than 2, the two numbers are considered comparable; when majority-class samples make up more than 2/3 of the neighbors, the minority sample is considered a noise sample. The tolerance may be set according to the actual situation, and the present invention does not specifically limit this.

Specifically, when the number of minority samples in the imbalanced data to be processed does not meet the target minority sample count, the number of majority-class samples and the number of minority-class samples among the k nearest neighbors of each minority sample are counted. The value of k is generally set to no more than 10. If k is set to 8, the 8 neighbors are checked to see how many belong to the minority class and how many to the majority class: for example, with more than 6 majority-class neighbors the minority-class sample is considered noise; with 5 to 6 it is an unstable sample; with 4 it is a boundary sample; and with 3 or fewer it is a stable sample. Referring to Fig. 4, triangles represent majority samples and circles represent minority samples. Assume that k is 6 and consider a minority sample J in the imbalanced data to be processed. The 6 samples nearest to J, enclosed by the circle drawn around J, are examined to see how many are majority samples and how many are minority samples. Here 6 are majority samples and 0 are minority samples, so the overwhelming majority of J's neighbor samples are majority samples and J is considered a noise sample. Now consider a minority sample M in the imbalanced data to be processed. Preferably, the same value of k is used throughout one preprocessing run, so k is again 6. Among M's neighbor samples there are 4 majority samples and 2 minority samples; the majority samples outnumber the minority samples, so M is an unstable sample.

It should be noted that k neighbour's samples of each a small number of samples in the pending unbalanced data are exactly from institute State K nearest sample of a small number of samples, with reference to figure 4 it is found that enclose minority sample J with the circle of a small number of sample M be not as it is big It is small, thus illustrate, in k sample of the arest neighbors for obtaining a small number of samples.
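The neighbour-count rule described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names, the "majority"/"minority" label strings, the use of Euclidean distance, and the default thresholds for k = 8 are all assumptions. In particular, the text assigns 6 majority neighbours to both the noise and the unstable range, and the sketch resolves that overlap in favour of noise.

```python
import math

def count_majority_neighbours(point, others, k):
    """Count majority-class samples among the k nearest neighbours of `point`.

    `others` is a list of (features, label) pairs that excludes `point` itself.
    Labels are the strings "majority" and "minority" (hypothetical encoding).
    Euclidean distance is an assumption; the text does not name a metric.
    """
    nearest = sorted(others, key=lambda item: math.dist(point, item[0]))[:k]
    return sum(1 for _, label in nearest if label == "majority")

def minority_category(n_majority, noise_min=6, unstable_min=5, boundary_min=4):
    """Map a majority-neighbour count to a category (assumed thresholds for k = 8)."""
    if n_majority >= noise_min:
        return "noise"
    if n_majority >= unstable_min:
        return "unstable"
    if n_majority >= boundary_min:
        return "boundary"
    return "stable"

# Usage: a minority sample at the origin with six majority and two minority neighbours.
others = [((float(i), 0.0), "majority") for i in range(1, 7)]
others += [((10.0, 0.0), "minority"), ((11.0, 0.0), "minority")]
n = count_majority_neighbours((0.0, 0.0), others, k=8)
category = minority_category(n)
```

The thresholds are exposed as parameters because the text states that the comparison tolerance can be set according to actual conditions.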

S22, performing the operation corresponding to the category according to the category of the minority sample; wherein the operation includes deleting, retaining, replicating or synthesizing;

In this embodiment of the present invention, the value of a minority sample is judged according to its category. For example, when the processing of certain unbalanced data requires noise samples to be rejected in order to reduce the influence of noise, the minority samples classified as noise are deleted; when the processing of certain unbalanced data needs to focus on boundary samples, the minority samples classified as boundary samples are retained and a replication operation is performed on them. The present invention places no specific limitation on this.

S23, when the number of majority samples in the pending unbalanced data does not meet the target majority sample number, determining the category of each majority sample in the pending unbalanced data according to the number of majority-class samples and the number of minority-class samples among the k neighbours of that majority sample; wherein the categories include noise sample, boundary sample and stable sample;

In this embodiment of the present invention, the principle for determining the category of a majority sample is the same as the principle for determining the category of a minority sample described above, and is not repeated here.

S24, performing the operation corresponding to the category according to the category of each majority sample; wherein the operation includes deleting, retaining and selectively deleting;

In this embodiment of the present invention, the principle of performing the operation corresponding to the category according to the category of each majority sample is the same as for the minority samples above: the corresponding operation is performed according to the value of the majority sample. It is not repeated here.

S25, obtaining a final minority sample set and a final majority sample set, wherein the number of samples in the final minority sample set meets the target minority sample number, and the number of samples in the final majority sample set meets the target majority sample number.

It should be noted that, in this embodiment, the order in which the categories of the minority samples and of the majority samples in the pending unbalanced data are determined is not specifically limited.

In this embodiment, after the corresponding operations have been performed on both the majority samples and the minority samples in the pending unbalanced data, the final minority sample set and the final majority sample set are obtained. For example, after the operation corresponding to its category has been performed on every majority sample in the pending unbalanced data, that is, after each majority sample has been deleted, retained or selectively deleted, the final majority sample set is obtained.

Implementing this embodiment has the following beneficial effects:

In the prior art, when minority samples are over-sampled and majority samples are under-sampled, all samples are processed uniformly and no distinction is made within the minority samples or within the majority samples. In this embodiment, the category of each minority sample is determined by comparing the number of majority-class samples and the number of minority-class samples among the k neighbours of each minority sample in the pending unbalanced data, so that the minority samples are divided into noise samples, boundary samples and stable samples, and different categories are processed differently. This effectively improves the quality of the minority sample set obtained after over-sampling the unbalanced big data, and in turn improves the accuracy of the classification algorithm. Likewise, the category of each majority sample is determined by comparing the number of majority-class samples and the number of minority-class samples among the k neighbours of each majority sample in the pending unbalanced data, so that the majority samples are divided into noise samples, boundary samples and stable samples, and different categories are processed differently. This effectively improves the quality of the majority sample set obtained after under-sampling the unbalanced big data, and in turn improves the accuracy of the classification algorithm.

Embodiment three

On the basis of embodiment two, step S22, performing the operation corresponding to the category according to the category of the minority sample, is described in detail:

Refer to Fig. 5, which is a schematic flowchart of the adaptively sampled unbalanced data classification processing method provided by the third embodiment of the present invention.

Performing the operation corresponding to the category according to the category of the minority sample includes:

Preferably, performing the operation corresponding to the category according to the category of the minority sample further includes:

In this embodiment of the present invention, after the corresponding operation has been performed on a minority sample in the pending unbalanced data, that minority sample is marked and the number of marked minority samples is recorded. When the number of marked minority samples reaches the number of minority samples in the pending unbalanced data, every minority sample in the pending unbalanced data has been traversed. The present invention places no specific limitation on this.

A corresponding number of stable samples is randomly selected according to the missing number d, the neighbour samples of each selected stable sample are synthesized with that stable sample to obtain new synthetic samples, and the new synthetic samples are added to the minority sample set.

It should be noted that d is a positive integer; when d is less than 1, no new synthetic samples are synthesized.
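The fill-up step driven by the missing number d might look like the sketch below. The text does not state how a stable sample and its neighbour are combined, so the linear interpolation used here (in the style of SMOTE) is a swapped-in assumption, as are the function names, the data layout, and the choice to pick stable samples with replacement.

```python
import random

def synthesize(stable, neighbour, rng):
    """Interpolate between a stable sample and one neighbour (assumed SMOTE-style rule)."""
    gap = rng.random()  # random position on the segment, in [0, 1)
    return tuple(s + gap * (n - s) for s, n in zip(stable, neighbour))

def fill_missing(minority_set, stable_with_neighbours, target_minority, rng=None):
    """Add d = target_minority - len(minority_set) synthetic samples.

    `stable_with_neighbours` is a list of (stable_sample, [neighbour, ...]) pairs
    (hypothetical layout). Nothing is synthesized when d is less than 1.
    """
    rng = rng or random.Random(0)
    result = list(minority_set)
    d = target_minority - len(result)
    if d < 1:
        return result
    for _ in range(d):
        stable, neighbours = rng.choice(stable_with_neighbours)
        result.append(synthesize(stable, rng.choice(neighbours), rng))
    return result

# Usage: two minority samples, target of five, so d = 3 synthetic samples are added.
minority = [(0.0, 0.0), (1.0, 1.0)]
pairs = [((0.0, 0.0), [(1.0, 1.0)])]
filled = fill_missing(minority, pairs, target_minority=5)
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a real implementation would decide separately whether selection is with or without replacement.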

Specifically, the category of each minority sample in the pending unbalanced data is judged. When the minority sample is a noise sample, the minority sample is deleted. When the minority sample is an unstable sample, the minority sample is added to the minority sample set, but it is neither replicated nor used to generate new samples, and the size of the minority sample set is updated. When the minority sample is a boundary sample, the minority sample is replicated according to the reproduction ratio c to obtain replicated samples, the replicated samples and the minority sample are added to the minority sample set, and the size of the minority sample set is updated; the number of copies is the absolute value of c minus one. When the minority sample is a stable sample, the neighbour samples of the minority sample are synthesized with the minority sample to obtain synthetic samples, the synthetic samples and the minority sample are added to the minority sample set, and the size of the minority sample set is updated; the number of synthetic samples is the absolute value of c minus one. Here, the reproduction ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the pending unbalanced data - number of noise samples - number of unstable samples).

After the corresponding operation has been performed on a minority sample in the pending unbalanced data, it is detected whether all minority samples in the pending unbalanced data have been traversed; for example, after a minority sample in the pending unbalanced data is deleted, it is detected whether all minority samples in the pending unbalanced data have been traversed at the current moment. When it is detected that every minority sample in the pending unbalanced data has been traversed, the missing number d = target minority sample number - current size of the minority sample set is calculated. A corresponding number of stable samples is randomly selected according to the missing number d, the neighbour samples of each selected stable sample are synthesized with that stable sample to obtain new synthetic samples, and the new synthetic samples and the stable samples are added to the minority sample set.
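As a worked illustration of the reproduction ratio c and the missing number d, the following sketch computes both for made-up counts. The function names and the example numbers are hypothetical, and truncating |c - 1| to an integer is an assumption, since the text does not say how a fractional value is rounded.

```python
def reproduction_ratio(target_minority, n_minority, n_noise, n_unstable):
    """c = (target - unstable) / (minority - noise - unstable), as given in the text."""
    denominator = n_minority - n_noise - n_unstable
    if denominator <= 0:
        raise ValueError("no boundary or stable minority samples to replicate")
    return (target_minority - n_unstable) / denominator

def copies_per_sample(c):
    """Number of copies or synthetic samples per sample: |c - 1|, truncated (assumption)."""
    return int(abs(c - 1))

# Made-up counts: 40 minority samples, 5 of them noise and 5 unstable; target is 100.
target, n_minority, n_noise, n_unstable = 100, 40, 5, 5
c = reproduction_ratio(target, n_minority, n_noise, n_unstable)  # 95 / 30
extra = copies_per_sample(c)                                      # 2 per sample

# Unstable samples are kept once; every boundary or stable sample is kept plus `extra`.
n_expandable = n_minority - n_noise - n_unstable
current_size = n_unstable + n_expandable * (1 + extra)

# Missing number d, to be filled by additional synthesis from stable samples.
d = target - current_size
```

With these counts the per-sample expansion leaves the minority sample set short of the target, which is the situation the missing number d is meant to correct.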

Implementing this embodiment has the following beneficial effects:

The minority samples in the pending unbalanced data are accurately divided into categories. According to the number of majority-class samples among the k neighbours of each minority sample in the pending unbalanced data, the minority-class samples are divided into noise samples, unstable samples, boundary samples and stable samples, and the different categories are processed separately. This improves the quality of the minority-class samples after over-sampling of the unbalanced big data, and in turn improves the accuracy of classification algorithms on unbalanced big data.

Embodiment four

On the basis of embodiment two, step S24, performing the operation corresponding to the category according to the category of each majority sample, is described in detail:

Refer to Fig. 6, which is a schematic flowchart of the adaptively sampled unbalanced data classification processing method provided by the fourth embodiment of the present invention.

Performing the operation corresponding to the category according to the category of each majority sample includes:

When the majority sample is a noise sample, deleting the majority sample;

Preferably, when the majority sample is a stable sample, performing selective deletion with a deletion probability set according to the distance from the majority sample to its surrounding k nearest neighbours, and adding the undeleted majority samples to the majority sample set, includes:

Preferably, when the majority sample is a stable sample, performing selective deletion with a deletion probability set according to the distance from the majority sample to its surrounding k nearest neighbours, and adding the undeleted majority samples to the majority sample set, further includes:

Obtaining the number f of currently deleted majority samples;

Adding the undeleted stable samples to the majority sample set.

Specifically, the category of each majority sample in the pending unbalanced data is judged. When the majority sample is a noise sample, the majority sample is deleted. When the majority sample is a boundary sample, the boundary sample is retained and added to the majority sample set. When the majority sample is a stable sample, selective deletion is performed with a deletion probability set according to the distance from the majority sample to its surrounding k nearest neighbours, and the undeleted majority samples are added to the majority sample set. That is, the distance d from the majority sample to its k neighbour samples is calculated, and the deletion probability of the stable sample is set according to the size of d: the smaller the distance d, the larger the deletion probability, and the larger the distance d, the smaller the deletion probability. When the deletion probability is detected to be greater than or equal to a preset value, the stable sample is deleted; when the deletion probability is detected to be less than the preset value, the stable sample is retained and added to the majority sample set. After the corresponding operation has been performed on each majority sample in the pending unbalanced data, the number to delete e = number of majority samples in the pending unbalanced data - number of noise samples - target majority sample number is obtained, and the number f of currently deleted majority samples is obtained. When f is less than e, selective deletion is performed on the stable samples, and the undeleted stable samples are added to the majority sample set.
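The selective deletion of stable majority samples can be sketched as below. The text only requires that the deletion probability grow as the distance d shrinks, so the mapping 1 / (1 + d), the use of a single precomputed distance per sample, the preset value 0.5, and all function names are assumptions made for illustration.

```python
def deletion_quota(n_majority, n_noise, target_majority):
    """e = majority samples - noise samples - target majority sample number."""
    return n_majority - n_noise - target_majority

def deletion_probability(distance):
    """Assumed monotone mapping: the smaller the distance, the larger the probability."""
    return 1.0 / (1.0 + distance)

def selectively_delete(stable_samples, quota, preset=0.5):
    """Split stable majority samples into kept and deleted lists.

    `stable_samples` is a list of (sample_id, distance_to_k_neighbours) pairs
    (hypothetical layout). A sample is deleted only while the number already
    deleted, f, is below the quota e and its probability reaches the preset value.
    """
    kept, deleted = [], []
    for sample_id, distance in stable_samples:
        f = len(deleted)
        if f < quota and deletion_probability(distance) >= preset:
            deleted.append(sample_id)
        else:
            kept.append(sample_id)
    return kept, deleted

# Usage with made-up numbers: 100 majority samples, 10 noise, target of 88.
e = deletion_quota(100, 10, 88)
stable = [("a", 0.2), ("b", 3.0), ("c", 0.5), ("d", 0.1)]
kept, deleted = selectively_delete(stable, quota=e)
```

Samples in dense regions (small d) are removed first because their neighbours carry much the same information, which matches the stated relation between distance and deletion probability.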

Implementing this embodiment has the following beneficial effects:

The majority samples in the pending unbalanced data are accurately divided into categories. According to the number of minority-class samples among the k neighbours of each majority sample in the pending unbalanced data, the majority-class samples are divided into noise samples, boundary samples and stable samples, and the different categories are processed separately. This improves the quality of the majority-class samples after under-sampling of the unbalanced big data, and in turn improves the classification accuracy on unbalanced big data.

Refer to Fig. 7, which is a schematic structural diagram of an adaptively sampled unbalanced data classification processing apparatus provided by the fifth embodiment of the present invention, including:

An acquisition module 71, configured to obtain a target majority sample number and a target minority sample number;

A processing module 72, configured to perform adaptive sampling data processing on pending unbalanced data according to the target majority sample number and the target minority sample number, so that the number of majority samples in the pending unbalanced data meets the target majority sample number and the number of minority samples in the pending unbalanced data meets the target minority sample number; wherein the adaptive sampling data processing includes over-sampling and under-sampling.

Preferably, the processing module 72 includes:

A minority sample category determination unit, configured to, when the number of minority samples in the pending unbalanced data does not meet the target minority sample number, determine the category of each minority sample in the pending unbalanced data according to the number of majority-class samples and the number of minority-class samples among the k neighbours of that minority sample; wherein the categories include noise sample, unstable sample, boundary sample and stable sample;

A minority operating unit, configured to perform the operation corresponding to the category according to the category of the minority sample; wherein the operation includes deleting, retaining, replicating or synthesizing;

A majority sample category determination unit, configured to, when the number of majority samples in the pending unbalanced data does not meet the target majority sample number, determine the category of each majority sample in the pending unbalanced data according to the number of majority-class samples and the number of minority-class samples among the k neighbours of that majority sample; wherein the categories include noise sample, boundary sample and stable sample;

A majority operating unit, configured to perform the operation corresponding to the category according to the category of each majority sample; wherein the operation includes deleting, retaining and selectively deleting;

A sample set acquiring unit, configured to obtain a final minority sample set and a final majority sample set, wherein the number of samples in the final minority sample set meets the target minority sample number and the number of samples in the final majority sample set meets the target majority sample number.

Preferably, the minority operating unit includes:

Preferably, the minority operating unit further includes:

A detection unit, configured to calculate, when it is detected that every minority sample in the pending unbalanced data has been traversed, the missing number d = target minority sample number - current size of the minority sample set;

A synthesis unit, configured to randomly select a corresponding number of stable samples according to the missing number d, synthesize the neighbour samples of each selected stable sample with that stable sample to obtain new synthetic samples, and add the new synthetic samples and the stable samples to the minority sample set.

Preferably, the majority operating unit includes:

A deleting unit, configured to delete the majority sample when the majority sample is a noise sample;

A retaining unit, configured to, when the majority sample is a boundary sample, retain the boundary sample and add the majority sample to the majority sample set;

A selective deletion unit, configured to, when the majority sample is a stable sample, perform selective deletion with a deletion probability set according to the distance from the majority sample to its surrounding k nearest neighbours, and add the undeleted majority samples to the majority sample set.

Preferably, the selective deletion unit includes:

When the deletion probability is detected to be greater than or equal to a preset value, deleting the stable sample; wherein the smaller the distance d, the larger the deletion probability;

Preferably, the selective deletion unit further includes:

Obtaining the number f of currently deleted majority samples;

Adding the undeleted stable samples to the majority sample set.

Implementing this embodiment has the following beneficial effects:

A target majority sample number and a target minority sample number are obtained, and adaptive sampling data processing is performed on the pending unbalanced data according to the target majority sample number and the target minority sample number, so that the number of majority samples in the processed pending unbalanced data meets the target majority sample number and the number of minority samples in the processed pending unbalanced data meets the target minority sample number; wherein the adaptive sampling data processing includes over-sampling and under-sampling. A sample set that meets the user's needs is thereby generated adaptively: the user is allowed to input the total number of samples required and the desired imbalance ratio of the data set, over-sampling and under-sampling are used in combination adaptively according to the user's needs, with the minority-class samples over-sampled and the majority-class samples under-sampled at the same time, and a sample set that meets the user's needs is finally generated, which effectively improves the classification accuracy on unbalanced big data.

Refer to Fig. 8, which is a schematic diagram of the adaptively sampled unbalanced data classification processing device provided by the sixth embodiment of the present invention, used to execute the adaptively sampled unbalanced data classification processing method provided by the embodiments of the present invention. As shown in Fig. 8, the device includes: at least one processor 11, such as a CPU, at least one network interface 14 or other user interface 13, a memory 15, and at least one communication bus 12, the communication bus 12 being used to implement connection and communication between these components. The user interface 13 may optionally include a USB interface, other standard interfaces and wired interfaces. The network interface 14 may optionally include a Wi-Fi interface and other wireless interfaces. The memory 15 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory. The memory 15 may optionally include at least one storage device located remotely from the aforementioned processor 11.

In some embodiments, the memory 15 stores the following elements, executable modules or data structures, or a subset or superset of them:

An operating system 151, including various system programs, used to implement various basic services and to process hardware-based tasks;

A program 152.

Specifically, the processor 11 is configured to call the program 152 stored in the memory 15 to execute the adaptively sampled unbalanced data classification processing method described in the above embodiments.

The processor may be a central processing unit (Central Processing Unit, CPU), or may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control centre of the adaptively sampled unbalanced data classification processing method, and uses various interfaces and lines to connect the various parts of the entire adaptively sampled unbalanced data classification processing method.

The memory may be used to store the computer program and/or modules. The processor implements the various functions of the electronic device for unbalanced data classification preprocessing by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function or a text conversion function), and the data storage area may store data created according to the use of a mobile phone (such as audio data or text message data). In addition, the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one disk memory, a flash memory device, or another volatile solid-state memory device.

If the adaptively sampled unbalanced data classification module is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, and so on. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.

It should be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.

The above are preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

It should be noted that, in the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. In addition, those skilled in the art should also understand that the embodiments described in this specification are preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

Claims

1. An adaptively sampled unbalanced data classification processing method, characterized by including:

Obtaining a target majority sample number and a target minority sample number;

Performing adaptive sampling data processing on pending unbalanced data according to the target majority sample number and the target minority sample number, so that the number of majority samples in the processed pending unbalanced data meets the target majority sample number and the number of minority samples in the processed pending unbalanced data meets the target minority sample number; wherein the adaptive sampling data processing includes over-sampling and under-sampling.

2. The adaptively sampled unbalanced data classification processing method according to claim 1, characterized in that performing adaptive sampling data processing on the pending unbalanced data according to the target majority sample number and the target minority sample number, so that the number of majority samples in the processed pending unbalanced data meets the target majority sample number and the number of minority samples in the processed pending unbalanced data meets the target minority sample number, includes:

When the number of minority samples in the pending unbalanced data does not meet the target minority sample number, determining the category of each minority sample in the pending unbalanced data according to the number of majority-class samples and the number of minority-class samples among the k neighbours of that minority sample; wherein the categories include noise sample, unstable sample, boundary sample and stable sample;

Performing the operation corresponding to the category according to the category of the minority sample; wherein the operation includes deleting, retaining, replicating or synthesizing;

When the number of majority samples in the pending unbalanced data does not meet the target majority sample number, determining the category of each majority sample in the pending unbalanced data according to the number of majority-class samples and the number of minority-class samples among the k neighbours of that majority sample; wherein the categories include noise sample, boundary sample and stable sample;

Performing the operation corresponding to the category according to the category of each majority sample; wherein the operation includes deleting, retaining and selectively deleting;

3. The adaptively sampled unbalanced data classification processing method according to claim 2, characterized in that performing the operation corresponding to the category according to the category of the minority sample includes:

When the minority sample is an unstable sample, adding the minority sample to the minority sample set without replicating it or generating new samples from it, and updating the size of the minority sample set;

When the minority sample is a boundary sample, replicating the minority sample according to the reproduction ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the pending unbalanced data - number of noise samples - number of unstable samples) to obtain replicated samples, adding the replicated samples and the minority sample to the minority sample set, and updating the size of the minority sample set; wherein the number of copies is the absolute value of the reproduction ratio c minus one;

When the minority sample is a stable sample, synthesizing the neighbour samples of the minority sample with the minority sample to obtain synthetic samples, adding the synthetic samples and the minority sample to the minority sample set, and updating the size of the minority sample set; wherein the number of synthetic samples is the absolute value of the reproduction ratio c minus one; wherein the reproduction ratio c = (target minority sample number - number of unstable samples) / (number of minority samples in the pending unbalanced data - number of noise samples - number of unstable samples).

4. The adaptively sampled unbalanced data classification processing method according to claim 3, characterized in that performing the operation corresponding to the category according to the category of the minority sample further includes:

When it is detected that every minority sample in the pending unbalanced data has been traversed, calculating the missing number d = target minority sample number - current size of the minority sample set;

Randomly selecting a corresponding number of stable samples according to the missing number d, synthesizing the neighbour samples of each selected stable sample with that stable sample to obtain new synthetic samples, and adding the new synthetic samples and the stable samples to the minority sample set.

5. The adaptively sampled unbalanced data classification processing method according to claim 2, characterized in that performing the operation corresponding to the category according to the category of each majority sample includes:

When the majority sample is a noise sample, deleting the majority sample;

When the majority sample is a boundary sample, retaining the boundary sample and adding the majority sample to the majority sample set;

When the majority sample is a stable sample, performing selective deletion with a deletion probability set according to the distance from the majority sample to its surrounding k nearest neighbours, and adding the undeleted majority samples to the majority sample set.

6. adaptively sampled unbalanced data classification processing method according to claim 5, which is characterized in that described in institute When to state most samples be the stable sample, according to the probability of erasure of the distance setting of most samples to surrounding k nearest neighbor into Row selectively removing, most sample sets, which are added, in not deleted most samples includes:

The distance d values that most sample arrives the k neighbours sample are calculated, described in being set according to the size of the distance d values Stablize the probability of erasure of sample;

When it is detected that the deletion probability is greater than or equal to a preset value, deleting the stable sample; wherein the smaller the distance d, the larger the deletion probability;

When it is detected that the deletion probability is less than the preset value, retaining the stable sample and adding the majority sample to the majority sample set; wherein the larger the distance d, the smaller the deletion probability.
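The distance-based rule of claim 6 can be illustrated with the following sketch. The claim only requires that the deletion probability decrease as the distance d grows; the specific mapping 1 / (1 + d), the use of the mean Euclidean distance to the k neighbors as d, the default preset value of 0.5, and both function names are assumptions made for this example.

```python
import math


def deletion_probability(sample, k_neighbors):
    """Map the distance d to the k neighbors to a value in (0, 1].

    Assumptions: d is the mean Euclidean distance to the k neighbors,
    and the probability is 1 / (1 + d), so a smaller d gives a larger value.
    """
    d = sum(math.dist(sample, n) for n in k_neighbors) / len(k_neighbors)
    return 1.0 / (1.0 + d)


def keep_stable_sample(sample, k_neighbors, preset=0.5):
    """Retain the stable sample only if its deletion probability is below preset."""
    return deletion_probability(sample, k_neighbors) < preset
```

With this mapping, a stable sample in a dense region (small d) is deleted, while one in a sparse region (large d) is retained and added to the majority sample set.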

7. The adaptively sampled unbalanced data classification processing method according to claim 5, characterized in that, when the majority sample is the stable sample, performing selective deletion according to the deletion probability set from the distances between the majority sample and its surrounding k nearest neighbors, and adding the majority samples that are not deleted to the majority sample set, further comprises:

Obtaining the number to delete e = the number of majority samples in the pending unbalanced data - the number of noise samples - the target majority sample number;

Obtaining the number f of majority samples deleted so far;

Adding the stable samples that are not deleted to the majority sample set.
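The bookkeeping of claim 7 might look like the sketch below. The claim names the quantities e and f but, as worded here, does not state how they interact; stopping deletion once f reaches e is an assumption of this example, as are the function name `prune_stable_with_budget` and the hypothetical `should_delete` predicate.

```python
def prune_stable_with_budget(stable_samples, should_delete,
                             majority_total, noise_count, target_majority):
    """Delete stable majority samples until the budget e is used up.

    Assumption: deletion stops once f reaches e; should_delete is a
    hypothetical predicate (for example, a deletion-probability test).
    """
    e = majority_total - noise_count - target_majority  # number to delete
    f = 0  # number of majority samples deleted so far
    kept = []
    for sample in stable_samples:
        if f < e and should_delete(sample):
            f += 1  # delete this stable sample
        else:
            kept.append(sample)  # retained for the majority sample set
    return kept, f
```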

8. An adaptively sampled unbalanced data classification processing apparatus, characterized by comprising:

A processing module, configured to perform adaptive sampling data processing on pending unbalanced data according to the target majority sample number and the target minority sample number, so that the number of majority samples in the pending unbalanced data meets the target majority sample number and the number of minority samples in the pending unbalanced data meets the target minority sample number; wherein the adaptive sampling data processing includes over-sampling and under-sampling.

9. An adaptively sampled unbalanced data classification processing device, comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the adaptively sampled unbalanced data classification processing method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program runs, the device on which the computer-readable storage medium is located is controlled to execute the adaptively sampled unbalanced data classification processing method according to any one of claims 1 to 7.