CN113792765B - Oversampling method based on triangle centroid weight - Google Patents

Oversampling method based on triangle centroid weight

Info

Publication number
CN113792765B
CN113792765B (application CN202110976931.5A)
Authority
CN
China
Prior art keywords
sample
centroid
weight
samples
danger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110976931.5A
Other languages
Chinese (zh)
Other versions
CN113792765A (en)
Inventor
周红芳 (Zhou Hongfang)
陈佳琳 (Chen Jialin)
Current Assignee
Wuhan Guanggu Kangfu Information Technology Co ltd
Original Assignee
Wuhan Guanggu Kangfu Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Guanggu Kangfu Information Technology Co ltd filed Critical Wuhan Guanggu Kangfu Information Technology Co ltd
Priority to CN202110976931.5A priority Critical patent/CN113792765B/en
Publication of CN113792765A publication Critical patent/CN113792765A/en
Application granted granted Critical
Publication of CN113792765B publication Critical patent/CN113792765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/24323 Tree-organised classifiers
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses an oversampling method based on triangle centroid weights, comprising the following specific steps: step 1, quantify the samples to be processed into numerical values and compute feature weights; step 2, extract the danger-class samples from the quantized samples; step 3, search for the neighbor samples of each danger-class sample; step 4, for each danger-class sample, randomly pick two of its neighbor samples and compute the centroid of the triangle formed by the three points to obtain a centroid sample; step 5, multiply each feature of the centroid coordinates by the corresponding feature weight to obtain an offset centroid, the offset centroids together forming the centroid offset samples; step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by each offset centroid to finally obtain the new samples. The invention addresses the limitations of traditional methods that interpolate along the straight line between two points, which confines newly synthesized samples to that segment and extracts little information from the samples.

Description

Oversampling method based on triangle centroid weight
Technical Field
The invention belongs to the technical field of data mining and machine learning data processing, and relates to an oversampling method based on triangle centroid weights.
Background
With the advent of the big data age, a vast diversity of data has entered our lives, and imbalanced data is one of its typical representatives. Imbalanced data refers to data whose samples are unevenly distributed among the classes; the associated classification problems are common in artificial intelligence and data mining. In the two-class or multi-class problems affected by imbalance, we call the class with more samples the majority class (or positive class) and the class with fewer samples the minority class (or negative class). Traditional classification algorithms are mainly designed for data with balanced class distributions; when they process imbalanced data, the classifier becomes inefficient and the minority-class samples in the set are difficult to identify. The classification performance of the classifier therefore becomes critical when dealing with imbalanced samples.
Because imbalanced data exist in many fields, the invention can be applied broadly. In the real world, problems such as disease diagnosis and credit assessment demand accurate classification, yet sample imbalance often makes such data very difficult to classify. For example, diagnosing COVID-19 patients involves massive personal data: features such as sex, age, weight, blood pressure and lung information form one sample per person, many people form a data set, and the classes are patient and non-patient. Perhaps only 10 out of 1000 people are ill, i.e. the patients always form a small part, the minority class, while the non-patients form the majority class; if a patient is misclassified as a non-patient, the result can be catastrophic. Similarly, in bank credit assessment, the age, income, purchasing power and so on of an applicant serve as the features of a sample used to judge the applicant's creditworthiness and decide whether to issue a loan; people with low creditworthiness are always a small minority, so the resulting data-imbalance problem must be handled well.
Sampling techniques, such as undersampling, oversampling and mixed sampling, are widely used to process imbalanced samples. Oversampling balances the two classes by increasing the number of minority-class samples, thereby improving classification performance. The traditional random-oversampling idea is to sample the minority class randomly with replacement, but this simply duplicates original samples: little information is extracted from the minority class, the information the model learns is too coarse, and overfitting arises especially easily. Researchers therefore gradually proposed classical oversampling methods such as SMOTE and Borderline-SMOTE on this basis.
SMOTE is an improvement on the random-oversampling algorithm. As shown in fig. 2, it computes the k nearest neighbors of each minority-class sample, sets a sampling ratio according to the class-imbalance ratio to determine the sampling rate, selects a suitable neighbor for each minority-class sample, and synthesizes a new sample according to the following formula; the point on the connecting line is treated as a new sample with minority-class characteristics and added to the training set.
x_new = x + rand(0,1) * (x' - x) (1)
In formula (1), x_new is the finally synthesized sample, x is the input minority-class sample, x' is the selected neighbor of x, and rand(0,1) is a random number between 0 and 1. Through this formula, new samples can be synthesized according to the sampling rate.
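Formula (1) can be sketched in code as follows. This is a minimal illustration of the SMOTE interpolation step only; the function name and the NumPy random generator are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def smote_sample(x, x_neighbor, rng=None):
    """Formula (1): x_new = x + rand(0,1) * (x' - x).

    Synthesizes one new sample on the segment between a minority-class
    sample x and a chosen neighbor x'.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    x_neighbor = np.asarray(x_neighbor, dtype=float)
    return x + rng.random() * (x_neighbor - x)  # rand(0,1) picks a point on the segment
```

Because the random factor multiplies the whole difference vector, every synthesized point lies on the straight line between the two parents, which is exactly the limitation the invention later addresses.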
The SMOTE algorithm still has some problems. On the one hand, a proper number of neighbors, the k value, must be chosen and the neighbors are then selected at random; this parameter cannot be determined analytically and usually has to be found by trial and error. On the other hand, the distribution of the data in the set is fixed and data marginalization easily occurs: some minority-class samples lie at the edge of the minority-class region, so the synthesized samples gradually drift toward the edge, blurring the boundary between the positive and negative classes and increasing the classification difficulty. The Borderline-SMOTE algorithm was proposed to solve this problem.
The Borderline-SMOTE algorithm is a modification of SMOTE and currently has two variants, Borderline-SMOTE1 and Borderline-SMOTE2. The algorithm divides the training set into majority and minority classes, searches the k nearest neighbors of each minority-class sample, and counts the number m of majority-class samples among them (k ≥ m ≥ 0). If m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified and is called danger; if m is less than half of k, the minority-class sample is considered safe and processing of it stops. Borderline-SMOTE1 operates on the danger samples: it finds the neighbors k' of each danger sample within the minority-class set and then randomly selects among them to synthesize new samples according to formula (1) until the required sampling multiple is met.
Disclosure of Invention
The invention aims to provide an oversampling method based on triangle centroid weights that solves the problems of the traditional prior-art methods, in which interpolation along the straight line between two points confines the newly synthesized samples to that segment and extracts little information from the samples.
The technical scheme adopted by the invention is as follows:
The oversampling method based on triangle centroid weights applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics. The method comprises the following specific steps:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger-class sample;
step 4, for each danger-class sample, randomly pick two of its neighbor samples and calculate the centroid coordinates of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each feature of the centroid coordinates in the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by the offset centroids to finally obtain the new samples.
The present invention is also characterized in that,
The feature weights are calculated by the Relief method in step 1.
The danger-class samples are extracted as follows: following the Borderline-SMOTE idea, divide the samples to be processed into majority and minority classes, and search the k nearest neighbors of each minority-class sample to obtain the number m of majority-class samples among them (k ≥ m ≥ 0). If m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified, i.e. a danger-class sample to be acquired; if m is less than half of k, the minority-class sample is considered safe and processing of it stops.
The triangle centroid coordinates are calculated as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
In formula (3), Centroid is the triangle centroid coordinate, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of two neighbor samples of the danger sample.
The new sample is synthesized as follows:
Newsample = Centroid * Feature Weight * Weight Coefficient (4)
Wherein Centroid is the triangle centroid coordinate, Feature Weight is the feature weight, and Weight Coefficient is the weight coefficient.
The weight coefficient is determined by a genetic algorithm: an initial population is generated between 0 and 1, the binary-string weight coefficients converted from decimal numbers are repeatedly selected, crossed, mutated and retained by the genetic algorithm, and the process terminates after the set number of iterations, yielding the optimal weight coefficient and classification result. The initial population size is 10, the algorithm iterates for 20 generations, tournament selection is used, two-point crossover is used with a crossover probability of 0.7, and the mutation probability is set according to one half of the chromosome length.
The beneficial effects of the invention are as follows:
1. The invention applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics.
2. On decision-tree (Gini) and KNN classifiers, with accuracy and the area under the ROC curve (AUC) as evaluation indexes, the comparison results show that the method provided by the invention achieves a better classification effect than Borderline-SMOTE1.
Drawings
FIG. 1 is a flow chart of the oversampling method based on triangle centroid weights of the present invention;
FIG. 2 is a schematic diagram of the SMOTE oversampling method;
FIG. 3 is a flow chart of the weight-coefficient acquisition in the oversampling method based on triangle centroid weights of the present invention;
FIG. 4 is a comparison of AUC values as evaluation criterion using a decision tree classifier;
FIG. 5 is a comparison of AUC values as evaluation criterion using a KNN classifier;
FIG. 6 is a comparison of accuracy as evaluation criterion using a decision tree classifier;
FIG. 7 is a comparison of accuracy as evaluation criterion using a KNN classifier.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention relates to an oversampling method based on triangle centroid weights. As shown in figure 1, it applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics. The method comprises the following specific steps:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger-class sample;
step 4, for each danger-class sample, randomly pick two of its neighbor samples and calculate the centroid coordinates of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each feature of the centroid coordinates in the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by the offset centroids to finally obtain the new samples.
In step 1:
the quantization of the values is done in euclidean space;
Wherein the feature weights are calculated by the Relief method:
The Relief method assigns different weights to features according to their relevance to the class, and finally, according to a set threshold, a weight set centered on that threshold is obtained for the features. The algorithm randomly selects a sample B from the training set, then searches for the nearest neighbor sample H among the samples of the same class as B, called Hit, and the nearest neighbor sample M among the samples of a different class from B, called Miss. If the distance between B and Hit on a feature is less than the distance between B and Miss, the feature is beneficial to classification and its weight is increased; conversely, if the distance between B and Hit on a feature is greater than the distance between B and Miss, the feature affects classification negatively and its weight is reduced.
We set the initial weight to 1 and calculate the weight according to the following formula:
W(A)=W(A)-diff(A,B,H)/m+diff(A,B,M)/m (2)
In formula (2), W(A) is the weight of feature A, m is the number of iterations, and the two diff terms are the distances between the sample and Hit and between the sample and Miss on feature A; the calculation loops over the features A = 1 to N to obtain a first-generation weight value. The above process is repeated m times to finally obtain the average weight of each feature.
Considering that the result is affected by the initial-weight setting, the output average weights are rescaled as percentages so that each feature weight is less than 1 while the ordering among them is unchanged.
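The Relief update of formula (2) and the final rescaling can be sketched as follows. This is a hedged illustration under assumptions not stated in the text: the per-feature diff is an absolute difference, features are pre-scaled to comparable ranges, y is a NumPy label array, and each class contains at least two samples.

```python
import numpy as np

def relief_weights(X, y, m=50, rng=None):
    """Sketch of Relief feature weighting: W(A) = W(A) - diff(A,B,H)/m + diff(A,B,M)/m,
    repeated m times, then rescaled so every weight is below 1 (percentage form)."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    w = np.ones(d)  # initial weight 1, as in the text
    for _ in range(m):
        i = rng.integers(n)
        b = X[i]
        same = np.flatnonzero(y == y[i])
        same = same[same != i]                      # candidates for Hit
        other = np.flatnonzero(y != y[i])           # candidates for Miss
        hit = X[same[np.argmin(np.linalg.norm(X[same] - b, axis=1))]]
        miss = X[other[np.argmin(np.linalg.norm(X[other] - b, axis=1))]]
        w += (-np.abs(b - hit) + np.abs(b - miss)) / m   # formula (2), per feature
    return w / w.sum()  # rescale: each weight < 1, ordering preserved
```

A feature that separates the classes accumulates positive updates (Miss is far, Hit is near) and ends up with a larger normalized weight than an uninformative feature.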
In step 2:
The danger-class samples are extracted as follows: following the Borderline-SMOTE idea, divide the samples to be processed into majority and minority classes, and search the k nearest neighbors of each minority-class sample to obtain the number m of majority-class samples among them (k ≥ m ≥ 0). If m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified, i.e. a danger-class sample to be acquired; if m is less than half of k, the minority-class sample is considered safe and processing of it stops.
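The noise/danger/safe split described above can be sketched as follows. This is a simplified illustration, not the patent's code: the brute-force Euclidean neighbor search and the function name are assumptions.

```python
import numpy as np

def label_minority_samples(X, y, minority_label, k=5):
    """Tag each minority sample 'noise', 'danger', or 'safe' from the number m
    of majority-class points among its k nearest neighbours (k >= m >= 0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    tags = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                      # exclude the point itself
        nn = np.argsort(dist)[:k]             # indices of the k nearest neighbours
        m = int(np.sum(y[nn] != minority_label))
        if m == k:
            tags[i] = "noise"                 # all neighbours are majority class
        elif m >= k / 2:
            tags[i] = "danger"                # easily misclassified; kept for oversampling
        else:
            tags[i] = "safe"
    return tags
```

Only the samples tagged "danger" are carried forward to steps 3 to 6.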
In step 3:
After the danger samples are extracted, their neighbors must be found. The neighbors of each danger sample are searched for among the minority-class samples of the original training set to serve as parent samples for the subsequently synthesized new samples; this both locates the sample points of interest and pulls the synthesis toward minority-class sample points that carry the class trend but are not easily resolved. Here k = 5.
The oversampling method based on triangle centroid weights thus also applies the idea of searching for the neighbors of danger samples among the minority class.
In step 4:
The triangle centroid coordinates are calculated as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
In formula (3), Centroid is the triangle centroid coordinate, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of two neighbor samples of the danger sample.
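Formula (3) in code form, as an illustrative sketch where coordinates are treated as NumPy vectors:

```python
import numpy as np

def triangle_centroid(d_a, d_b, d_c):
    """Formula (3): Centroid = (D(A) + D(B) + D(C)) / 3, where D(A) is the
    danger sample and D(B), D(C) are two of its minority-class neighbors."""
    return np.mean(np.asarray([d_a, d_b, d_c], dtype=float), axis=0)
```

Unlike SMOTE's line interpolation, the centroid lies strictly inside the triangle spanned by the three minority-class points, so the candidate region is two-dimensional rather than a segment.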
In step 5: after the centroid is obtained in step 4, it is multiplied by the feature weights output in step 1, so that the new sample is not limited to a fixed position but is offset to some extent according to the importance of each feature.
The number of new minority-class samples to synthesize is determined by the input sampling rate. Since the purpose of the algorithm is to balance the classes, the number of new samples should equal the difference between the minority and majority sample counts in the training set, and new samples are synthesized cyclically from the danger-class samples until that number is reached.
In step 6:
The concept of a weight coefficient is introduced because the feature weights optimized in step 1 still cannot guide the centroid accurately to a proper position. A weight coefficient is therefore set in the interval (0, 1], open at the lower end and closed at the upper end, and the new sample is finally obtained by multiplying the weight coefficient by the offset centroid, allowing it to reach the most suitable position.
As shown in fig. 3, the weight coefficient is determined by a genetic algorithm. An initial population is generated between 0 and 1, and by repeatedly selecting, crossing and mutating the binary-string weight coefficients converted from decimal numbers, the population with the best evaluation index is retained; the process terminates after the set number of iterations, yielding the best weight coefficient and classification result. The initial population size is set to 10, the algorithm iterates for 20 generations, tournament selection is used, two-point crossover is used with a crossover probability of 0.7, and the mutation probability is set according to one half of the chromosome length, as shown in Table 1.
Table 1 genetic algorithm parameter settings
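The genetic-algorithm loop with the Table 1 parameters can be sketched as follows. This is a hedged, minimal illustration: the 16-bit encoding length, the 1/L bit-flip mutation rate, and the abstract fitness callable are assumptions; in the invention the fitness would be the classifier score obtained with the decoded weight coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 16  # chromosome length in bits (an assumed encoding length)

def decode(bits):
    """Map a binary chromosome to a weight coefficient in (0, 1]."""
    return (int("".join(map(str, bits)), 2) + 1) / 2 ** L

def evolve(fitness, pop_size=10, generations=20, p_cross=0.7):
    """Minimal GA with the Table 1 parameters: population 10, 20 generations,
    tournament selection, two-point crossover with probability 0.7."""
    pop = rng.integers(0, 2, size=(pop_size, L))
    for _ in range(generations):
        scores = np.array([fitness(decode(ind)) for ind in pop])
        # tournament selection (tournament size 2)
        winners = []
        for _ in range(pop_size):
            a, b = rng.integers(pop_size, size=2)
            winners.append(pop[a if scores[a] >= scores[b] else b].copy())
        pop = np.array(winners)
        # two-point crossover on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                lo, hi = sorted(rng.integers(1, L, size=2))
                pop[i, lo:hi], pop[i + 1, lo:hi] = (
                    pop[i + 1, lo:hi].copy(), pop[i, lo:hi].copy())
        # bit-flip mutation (rate 1/L here; the text ties the rate to chromosome length)
        pop[rng.random(pop.shape) < 1.0 / L] ^= 1
    scores = np.array([fitness(decode(ind)) for ind in pop])
    return decode(pop[int(np.argmax(scores))])
```

Calling `evolve` with a classifier-based fitness would return the coefficient whose synthesized samples give the best evaluation index after the final generation.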
The new sample is synthesized as follows:
Newsample = Centroid * Feature Weight * Weight Coefficient (4)
wherein Centroid is the triangle centroid coordinate, Feature Weight is the weight of each feature, and Weight Coefficient is the weight coefficient.
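Formula (4) combines the outputs of steps 1 to 6 by element-wise products; a sketch under the assumption that the centroid and the feature weights are vectors of equal length (the function name is illustrative):

```python
import numpy as np

def synthesize(centroid, feature_weights, weight_coefficient):
    """Formula (4): Newsample = Centroid * FeatureWeight * WeightCoefficient.
    The centroid is shifted feature-by-feature by the Relief weights and then
    scaled by the GA-selected weight coefficient."""
    return (np.asarray(centroid, dtype=float)
            * np.asarray(feature_weights, dtype=float)
            * float(weight_coefficient))
```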
The classification accuracy and the area under the ROC curve (AUC) on 7 data sets were evaluated with 10-fold cross-validation, using decision tree (DT) and KNN as classifiers respectively. k-fold cross-validation divides the selected data set evenly into k groups; one fold serves as the test set and the other k-1 folds as the training set, and so on, so that k trained models and test results are obtained in total, whose average accuracy and AUC values are taken as the final result. The invention adopts 10-fold cross-validation and, on top of that, repeats each 10-fold cross-validation 10 times, taking the result of the 100 evaluations as the final result.
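The k-fold split described above reduces to index bookkeeping; a sketch (classifier training calls omitted, and the stratification sometimes used with KEEL data sets is not shown):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Split n sample indices into k folds: each fold is the test set exactly
    once, and the remaining k-1 folds form the corresponding training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]
```

Averaging the per-fold accuracy and AUC over the 10 folds, then over 10 repetitions with different seeds, would give the 100-evaluation result described in the text.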
Accuracy and the area under the receiver operating characteristic (ROC) curve (AUC) are classifier evaluation indicators commonly used in imbalanced-data classification.
The accuracy is the most commonly used classifier evaluation index and is calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
wherein:
TP (True Positive) is the number of instances that are actually positive and are classified as positive.
TN (True Negative) is the number of instances that are actually negative and are classified as negative.
FP (False Positive) is the number of instances that are actually negative but are classified as positive.
FN (False Negative) is the number of instances that are actually positive but are classified as negative.
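The accuracy defined from these four confusion-matrix counts, as a one-line sketch:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), using the counts above."""
    return (tp + tn) / (tp + tn + fp + fn)
```

On imbalanced data a high accuracy can hide poor minority-class recall, which is why the AUC is reported alongside it.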
The receiver operating characteristic (ROC) curve is obtained by plotting the false-alarm probability on the abscissa against the hit probability on the ordinate under different classification thresholds; within the unit square of the first quadrant, the closer the curve is to the upper-left corner, the better the classifier's performance is generally considered. However, to avoid the ambiguity caused by intersecting curves, the area under the ROC curve (AUC) is used as the evaluation index: the larger the AUC value, the better the classification performance.
To verify the validity of the invention, KEEL data sets are used as the samples to be processed. The KEEL data sets are described in Table 2, where Dataset is the name of the data set, Instance is the number of samples, Features is the number of features, and Classes is the number of classes.
The invention applies the method to each of the 8 data sets, with decision tree and KNN as the two classification algorithms. Experiments show that, compared with Borderline-SMOTE1, the proposed oversampling method based on triangle centroid weights obtains higher accuracy and AUC values on the classification of the 8 data sets.
Table 2 KEEL dataset description
Table 3 is a pairwise comparison with classification accuracy as the evaluation criterion, and Table 4 is a pairwise comparison with AUC as the evaluation criterion. "My Method" in the two tables is the method provided by the invention, DT denotes a decision tree as the base classification algorithm, and KNN denotes the k-nearest-neighbor classification algorithm as the base.
Table 3 Comparison of classification with accuracy as the evaluation criterion (%)
Table 4 Comparison of classification with AUC as the evaluation criterion
Figs. 4 to 7 in the specification are experimental comparisons of classification under different evaluation indexes after oversampling with the two algorithms; the left bars are the Borderline-SMOTE1 algorithm and the right bars are the method proposed by the invention.
Fig. 4 is a comparative bar graph of the AUC values of the two algorithms with DT as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher AUC values than the Borderline-SMOTE1 algorithm on the 8 data sets.
Fig. 5 is a comparative bar graph of the AUC values of the two algorithms with KNN as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher AUC values than the Borderline-SMOTE1 algorithm on the 8 data sets.
Fig. 6 is a comparative bar graph of the accuracy of the two algorithms with DT as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher accuracy than the Borderline-SMOTE1 algorithm on the 8 data sets.
Fig. 7 is a comparative bar graph of the accuracy of the two algorithms with KNN as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher accuracy than the Borderline-SMOTE1 algorithm on the 8 data sets.

Claims (6)

1. An oversampling method based on triangle centroid weight, characterized in that the initial operation of the Borderline-SMOTE method is applied to divide the data into three parts, noise, danger and safe, then same-class neighbors of the danger samples are selected, and the final position of each new sample is determined according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics, the method comprising the following specific steps:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger-class sample;
step 4, for each danger-class sample, randomly pick two of its neighbor samples and calculate the centroid coordinates of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each feature of the centroid coordinates in the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by the offset centroids to finally obtain the new samples.
2. The method of claim 1, wherein the feature weights are calculated in step 1 by a Relief method.
3. The oversampling method based on triangle centroid weight of claim 1, wherein the danger-class samples are extracted as follows: following the Borderline-SMOTE idea, divide the samples to be processed into majority and minority classes, and search the k nearest neighbors of each minority-class sample to obtain the number m of majority-class samples among them (k ≥ m ≥ 0); if m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified, i.e. a danger-class sample to be acquired; if m is less than half of k, the minority-class sample is considered safe and processing of it stops.
4. The oversampling method based on triangle centroid weight according to claim 1, wherein the triangle centroid coordinates are calculated by:
Centroid=(D(A)+D(B)+D(C))/3 (3)
In formula (3), Centroid is the triangle centroid coordinate, D(A) is the coordinate of the danger-class sample point, and D(B) and D(C) are the coordinates of two neighbor samples of the danger-class sample.
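Formula (3) is simply the componentwise mean of the three sample points; a one-line sketch (the function name is illustrative):

```python
def triangle_centroid(a, b, c):
    """Componentwise mean of three sample points, per formula (3)."""
    return [(ai + bi + ci) / 3 for ai, bi, ci in zip(a, b, c)]

centroid = triangle_centroid((0, 0), (3, 0), (0, 3))
```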
5. The oversampling method based on triangle centroid weight according to claim 1, wherein the synthesis method of the new sample is:
NewSample = Centroid * FeatureWeight * WeightCoefficient (4)
wherein Centroid is the triangle centroid coordinates, FeatureWeight is the weight of each feature, and WeightCoefficient is the weight coefficient.
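Formula (4) is applied elementwise per feature; a minimal sketch (names are illustrative):

```python
def synthesize(centroid, feature_weights, coefficient):
    """NewSample = Centroid * FeatureWeight * WeightCoefficient, elementwise."""
    return [ci * wi * coefficient for ci, wi in zip(centroid, feature_weights)]

new_sample = synthesize([1.0, 2.0], [0.5, 2.0], 0.5)
```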
6. The method for oversampling based on triangle centroid weight according to claim 1, wherein the weight coefficient is determined by a genetic algorithm: an initial population is generated in the interval (0, 1); the weight coefficients, encoded as binary strings converted from decimal numbers, repeatedly undergo selection, crossover, mutation, and population preservation in the genetic algorithm, and the process terminates after a set number of iterations, yielding the optimal weight coefficient and the corresponding classification result; wherein the initial population size is 10, the algorithm is iterated for 20 generations, selection uses the tournament method, crossover uses the two-point mode with a crossover probability of 0.7, and the reciprocal of the chromosome length is used as the mutation probability.
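A hand-rolled sketch of the described genetic algorithm, under stated assumptions: a 16-bit chromosome (the length is not given in the claim), mutation probability taken as 1/length, and a simple quadratic stand-in for the classification-based fitness the patent would actually evaluate:

```python
import random

BITS = 16          # assumed chromosome length; not stated in the claim

def decode(bits):
    """Binary string -> weight coefficient in [0, 1)."""
    return int("".join(map(str, bits)), 2) / (1 << BITS)

def evolve(fitness, pop_size=10, gens=20, cx_prob=0.7, seed=1):
    rng = random.Random(seed)
    mut_prob = 1.0 / BITS                  # mutation prob = 1 / chromosome length
    pop = [[rng.randint(0, 1) for _ in range(BITS)] for _ in range(pop_size)]
    best = max(pop, key=lambda c: fitness(decode(c)))
    for _ in range(gens):
        nxt = [best[:]]                    # population preservation (elitism)
        while len(nxt) < pop_size:
            # tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=lambda c: fitness(decode(c)))
            p2 = max(rng.sample(pop, 3), key=lambda c: fitness(decode(c)))
            child = p1[:]
            if rng.random() < cx_prob:     # two-point crossover
                i, j = sorted(rng.sample(range(BITS), 2))
                child = p1[:i] + p2[i:j] + p1[j:]
            child = [b ^ (rng.random() < mut_prob) for b in child]  # bit flip
            nxt.append(child)
        pop = nxt
        best = max(pop, key=lambda c: fitness(decode(c)))
    return decode(best)

# stand-in fitness: in the patent this would be the classification result
best_w = evolve(lambda w: -(w - 0.6) ** 2)
```

Keeping the best chromosome in every new population guarantees the best fitness never decreases across generations.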
CN202110976931.5A 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight Active CN113792765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110976931.5A CN113792765B (en) 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110976931.5A CN113792765B (en) 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight

Publications (2)

Publication Number Publication Date
CN113792765A CN113792765A (en) 2021-12-14
CN113792765B true CN113792765B (en) 2024-06-14

Family

ID=79182293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110976931.5A Active CN113792765B (en) 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight

Country Status (1)

Country Link
CN (1) CN113792765B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5799114A (en) * 1993-05-05 1998-08-25 Liberty Technologies, Inc. System and method for stable analysis of sampled transients arbitrarily aligned with their sample points
CN107563435A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Higher-dimension unbalanced data sorting technique based on SVM
CN111259964B (en) * 2020-01-17 2023-04-07 上海海事大学 Over-sampling method for unbalanced data set
CN111931853A (en) * 2020-08-12 2020-11-13 桂林电子科技大学 Oversampling method based on hierarchical clustering and improved SMOTE
CN112633337A (en) * 2020-12-14 2021-04-09 哈尔滨理工大学 Unbalanced data processing method based on clustering and boundary points

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Imbalanced data set classification algorithm based on a synthetic minority oversampling technique improved by a genetic algorithm; Huo Yudan; Gu Qiong; Cai Zhihua; Yuan Lei; Journal of Computer Applications (Issue 01); full text *
Research on classification algorithms for imbalanced data sets with improved SMOTE; Zhao Qinghua; Zhang Yihao; Ma Jianfen; Duan Qianqian; Computer Engineering and Applications (Issue 18); full text *

Also Published As

Publication number Publication date
CN113792765A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111400180B (en) Software defect prediction method based on feature set division and ensemble learning
De Amorim Feature relevance in ward’s hierarchical clustering using the L p norm
Isa et al. Using the self organizing map for clustering of text documents
CN106202952A (en) A kind of Parkinson disease diagnostic method based on machine learning
CN110674846A (en) Genetic algorithm and k-means clustering-based unbalanced data set oversampling method
CN111626336A (en) Subway fault data classification method based on unbalanced data set
CN115048988B (en) Unbalanced data set classification fusion method based on Gaussian mixture model
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN112801140A (en) XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
Zhang et al. A hybrid feature selection algorithm for classification unbalanced data processsing
CN111709460A (en) Mutual information characteristic selection method based on correlation coefficient
CN113792765B (en) Oversampling method based on triangle centroid weight
CN117407732A (en) Unconventional reservoir gas well yield prediction method based on antagonistic neural network
Jiang et al. Undersampling of approaching the classification boundary for imbalance problem
CN112183598A (en) Feature selection method based on genetic algorithm
Gao et al. An ensemble classifier learning approach to ROC optimization
CN115116619A (en) Intelligent analysis method and system for stroke data distribution rule
CN114358191A (en) Gene expression data clustering method based on depth automatic encoder
CN113392908A (en) Unbalanced data oversampling algorithm based on boundary density
Kurniawati et al. Model optimisation of class imbalanced learning using ensemble classifier on over-sampling data
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
CN111488903A (en) Decision tree feature selection method based on feature weight
Rosales-Pérez et al. Genetic selection of fuzzy model for acute leukemia classification
Hwang Identification of a Gaussian fuzzy classifier
Suksut et al. Support vector machine with restarting genetic algorithm for classifying imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240417

Address after: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Wanzhida Technology Co.,Ltd.

Country or region after: China

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

Country or region before: China

TA01 Transfer of patent application right

Effective date of registration: 20240512

Address after: 430000, 1st floor, Building B9, Optics Valley Biotech City Innovation Park, No. 666 Gaoxin Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant after: Wuhan Guanggu Kangfu Information Technology Co.,Ltd.

Country or region after: China

Address before: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant before: Shenzhen Wanzhida Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant