CN113792765B - Oversampling method based on triangle centroid weight - Google Patents

Oversampling method based on triangle centroid weight

Info

Publication number
CN113792765B
CN113792765B (application CN202110976931.5A)
Authority
CN
China
Prior art keywords
sample
centroid
weight
samples
danger
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110976931.5A
Other languages
Chinese (zh)
Other versions
CN113792765A (en)
Inventor
周红芳 (Zhou Hongfang)
陈佳琳 (Chen Jialin)
Current Assignee
Wuhan Guanggu Kangfu Information Technology Co ltd
Original Assignee
Wuhan Guanggu Kangfu Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Wuhan Guanggu Kangfu Information Technology Co ltd filed Critical Wuhan Guanggu Kangfu Information Technology Co ltd
Priority to CN202110976931.5A priority Critical patent/CN113792765B/en
Publication of CN113792765A publication Critical patent/CN113792765A/en
Application granted granted Critical
Publication of CN113792765B publication Critical patent/CN113792765B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06F18/24323 Tree-organised classifiers
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming


Abstract

The invention discloses an oversampling method based on triangle centroid weights, comprising the following specific steps: step 1, quantify the samples to be processed into numerical values and compute feature weights; step 2, extract the danger-class samples from the quantized samples; step 3, search for the neighbor samples of each danger-class sample; step 4, for each danger-class sample, randomly pick two of its neighbor samples and compute the centroid of the triangle formed by the three points to obtain a centroid sample; step 5, multiply each feature of the centroid coordinates by the corresponding feature weight to obtain an offset centroid, the offset centroids together forming the centroid offset samples; step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by each offset centroid to finally obtain the new samples. The invention addresses the limitations of traditional methods that interpolate along the straight line between two points, which confines newly synthesized samples to that segment and extracts little information from the samples.

Description

Oversampling method based on triangle centroid weight
Technical Field
The invention belongs to the technical field of data mining and machine learning data processing, and relates to an oversampling method based on triangle centroid weights.
Background
With the advent of the big data age, a vast diversity of data has entered our lives, and imbalanced data is one of its typical representatives. Imbalanced data refers to data whose samples are unevenly distributed among the classes; the associated classification problems are common in artificial intelligence and data mining. In the two-class or multi-class problems affected by imbalance, we call the class with more samples the majority class (or positive class) and the class with fewer samples the minority class (or negative class). Traditional classification algorithms are mainly designed for data with balanced class distributions; when they process imbalanced data, the classifier becomes inefficient and the minority-class samples in the set are difficult to identify. The classification performance of the classifier therefore becomes critical when dealing with imbalanced samples.
Because imbalanced data exist in many fields, the invention can be applied broadly. In the real world, problems such as disease diagnosis and credit assessment demand accurate classification, yet sample imbalance often makes such data very difficult to classify. For example, diagnosing COVID-19 patients involves massive personal data: features such as sex, age, weight, blood pressure and lung information form one sample per person, many people form a data set, and the classes are patient and non-patient. Perhaps only 10 out of 1000 people are ill, i.e. the patients always form a small part, the minority class, while the non-patients form the majority class; if a patient is misclassified as a non-patient, the result can be catastrophic. Similarly, in bank credit assessment, the age, income, purchasing power and so on of an applicant serve as the features of a sample used to judge the applicant's creditworthiness and decide whether to issue a loan; people with low creditworthiness are always a small minority, so the resulting data-imbalance problem must be handled well.
Sampling techniques, such as undersampling, oversampling and mixed sampling, are widely used to process imbalanced samples. Oversampling balances the two classes by increasing the number of minority-class samples, thereby improving classification performance. The traditional random-oversampling idea is to sample the minority class randomly with replacement, but this simply duplicates original samples: little information is extracted from the minority class, the information the model learns is too coarse, and overfitting arises especially easily. Researchers therefore gradually proposed classical oversampling methods such as SMOTE and Borderline-SMOTE on this basis.
SMOTE is an improvement on the random-oversampling algorithm. As shown in fig. 2, it computes the k nearest neighbors of each minority-class sample, sets a sampling ratio according to the class-imbalance ratio to determine the sampling rate, selects a suitable neighbor for each minority-class sample, and synthesizes a new sample according to the following formula; the point on the connecting line is treated as a new sample with minority-class characteristics and added to the training set.
x_new = x + rand(0,1) * (x' - x) (1)
In formula (1), x_new is the finally synthesized sample, x is the input minority-class sample, x' is the selected neighbor of x, and rand(0,1) is a random number between 0 and 1. Through this formula, new samples can be synthesized according to the sampling rate.
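Formula (1) can be sketched in code as follows. This is a minimal illustration of the SMOTE interpolation step only; the function name and the NumPy random generator are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def smote_sample(x, x_neighbor, rng=None):
    """Formula (1): x_new = x + rand(0,1) * (x' - x).

    Synthesizes one new sample on the segment between a minority-class
    sample x and a chosen neighbor x'.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x, dtype=float)
    x_neighbor = np.asarray(x_neighbor, dtype=float)
    return x + rng.random() * (x_neighbor - x)  # rand(0,1) picks a point on the segment
```

Because the random factor multiplies the whole difference vector, every synthesized point lies on the straight line between the two parents, which is exactly the limitation the invention later addresses.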
The SMOTE algorithm still has some problems. On the one hand, a proper number of neighbors, the k value, must be chosen and the neighbors are then selected at random; this parameter cannot be determined analytically and usually has to be found by trial and error. On the other hand, the distribution of the data in the set is fixed and data marginalization easily occurs: some minority-class samples lie at the edge of the minority-class region, so the synthesized samples gradually drift toward the edge, blurring the boundary between the positive and negative classes and increasing the classification difficulty. The Borderline-SMOTE algorithm was proposed to solve this problem.
The Borderline-SMOTE algorithm is a modification of SMOTE and currently has two variants, Borderline-SMOTE1 and Borderline-SMOTE2. The algorithm divides the training set into majority and minority classes, searches the k nearest neighbors of each minority-class sample, and counts the number m of majority-class samples among them (k ≥ m ≥ 0). If m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified and is called danger; if m is less than half of k, the minority-class sample is considered safe and processing of it stops. Borderline-SMOTE1 operates on the danger samples: it finds the neighbors k' of each danger sample within the minority-class set and then randomly selects among them to synthesize new samples according to formula (1) until the required sampling multiple is met.
Disclosure of Invention
The invention aims to provide an oversampling method based on triangle centroid weights that solves the problems of the traditional prior-art methods, in which interpolation along the straight line between two points confines the newly synthesized samples to that segment and extracts little information from the samples.
The technical scheme adopted by the invention is as follows:
The oversampling method based on triangle centroid weights applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics. The method comprises the following specific steps:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger-class sample;
step 4, for each danger-class sample, randomly pick two of its neighbor samples and calculate the centroid coordinates of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each feature of the centroid coordinates in the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by the offset centroids to finally obtain the new samples.
The present invention is also characterized in that,
The feature weights are calculated by the Relief method in step 1.
The danger-class samples are extracted as follows: following the Borderline-SMOTE idea, divide the samples to be processed into majority and minority classes, and search the k nearest neighbors of each minority-class sample to obtain the number m of majority-class samples among them (k ≥ m ≥ 0). If m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified, i.e. a danger-class sample to be acquired; if m is less than half of k, the minority-class sample is considered safe and processing of it stops.
The triangle centroid coordinates are calculated as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
In formula (3), Centroid is the triangle centroid coordinate, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of two neighbor samples of the danger sample.
The new sample is synthesized as follows:
Newsample = Centroid * Feature Weight * Weight Coefficient (4)
Wherein Centroid is the triangle centroid coordinate, Feature Weight is the feature weight, and Weight Coefficient is the weight coefficient.
The weight coefficient is determined by a genetic algorithm: an initial population is generated between 0 and 1, the binary-string weight coefficients converted from decimal numbers are repeatedly selected, crossed, mutated and retained by the genetic algorithm, and the process terminates after the set number of iterations, yielding the optimal weight coefficient and classification result. The initial population size is 10, the algorithm iterates for 20 generations, tournament selection is used, two-point crossover is used with a crossover probability of 0.7, and the mutation probability is set according to one half of the chromosome length.
The beneficial effects of the invention are as follows:
1. The invention applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics.
2. On decision-tree (Gini) and KNN classifiers, with accuracy and the area under the ROC curve (AUC) as evaluation indexes, the comparison results show that the method provided by the invention achieves a better classification effect than Borderline-SMOTE1.
Drawings
FIG. 1 is a flow chart of the oversampling method based on triangle centroid weights of the present invention;
FIG. 2 is a schematic diagram of the SMOTE oversampling method;
FIG. 3 is a flow chart of the weight-coefficient acquisition in the oversampling method based on triangle centroid weights of the present invention;
FIG. 4 is a comparison of AUC values as evaluation criterion using a decision tree classifier;
FIG. 5 is a comparison of AUC values as evaluation criterion using a KNN classifier;
FIG. 6 is a comparison of accuracy as evaluation criterion using a decision tree classifier;
FIG. 7 is a comparison of accuracy as evaluation criterion using a KNN classifier.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention relates to an oversampling method based on triangle centroid weights. As shown in figure 1, it applies the initial operation of the Borderline-SMOTE method to divide the data into three parts, noise, danger and safe, then selects same-class neighbors of the danger samples and determines the final position of each new sample according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics. The method comprises the following specific steps:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger-class sample;
step 4, for each danger-class sample, randomly pick two of its neighbor samples and calculate the centroid coordinates of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each feature of the centroid coordinates in the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by the offset centroids to finally obtain the new samples.
In step 1:
the quantization of the values is done in euclidean space;
Wherein the feature weights are calculated by the Relief method:
The Relief method assigns different weights to features according to their relevance to the class, and finally, according to a set threshold, a weight set centered on that threshold is obtained for the features. The algorithm randomly selects a sample B from the training set, then searches for the nearest neighbor sample H among the samples of the same class as B, called Hit, and the nearest neighbor sample M among the samples of a different class from B, called Miss. If the distance between B and Hit on a feature is less than the distance between B and Miss, the feature is beneficial to classification and its weight is increased; conversely, if the distance between B and Hit on a feature is greater than the distance between B and Miss, the feature affects classification negatively and its weight is reduced.
We set the initial weight to 1 and calculate the weight according to the following formula:
W(A)=W(A)-diff(A,B,H)/m+diff(A,B,M)/m (2)
In formula (2), W(A) is the weight of feature A, m is the number of iterations, and the two diff terms are the distances between the sample and Hit and between the sample and Miss on feature A; the calculation loops over the features A = 1 to N to obtain a first-generation weight value. The above process is repeated m times to finally obtain the average weight of each feature.
Considering that the result is affected by the initial-weight setting, the output average weights are rescaled as percentages so that each feature weight is less than 1 while the ordering among them is unchanged.
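The Relief update of formula (2) and the final rescaling can be sketched as follows. This is a hedged illustration under assumptions not stated in the text: the per-feature diff is an absolute difference, features are pre-scaled to comparable ranges, y is a NumPy label array, and each class contains at least two samples.

```python
import numpy as np

def relief_weights(X, y, m=50, rng=None):
    """Sketch of Relief feature weighting: W(A) = W(A) - diff(A,B,H)/m + diff(A,B,M)/m,
    repeated m times, then rescaled so every weight is below 1 (percentage form)."""
    rng = rng or np.random.default_rng(0)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    w = np.ones(d)  # initial weight 1, as in the text
    for _ in range(m):
        i = rng.integers(n)
        b = X[i]
        same = np.flatnonzero(y == y[i])
        same = same[same != i]                      # candidates for Hit
        other = np.flatnonzero(y != y[i])           # candidates for Miss
        hit = X[same[np.argmin(np.linalg.norm(X[same] - b, axis=1))]]
        miss = X[other[np.argmin(np.linalg.norm(X[other] - b, axis=1))]]
        w += (-np.abs(b - hit) + np.abs(b - miss)) / m   # formula (2), per feature
    return w / w.sum()  # rescale: each weight < 1, ordering preserved
```

A feature that separates the classes accumulates positive updates (Miss is far, Hit is near) and ends up with a larger normalized weight than an uninformative feature.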
In step 2:
The danger-class samples are extracted as follows: following the Borderline-SMOTE idea, divide the samples to be processed into majority and minority classes, and search the k nearest neighbors of each minority-class sample to obtain the number m of majority-class samples among them (k ≥ m ≥ 0). If m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified, i.e. a danger-class sample to be acquired; if m is less than half of k, the minority-class sample is considered safe and processing of it stops.
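The noise/danger/safe split described above can be sketched as follows. This is a simplified illustration, not the patent's code: the brute-force Euclidean neighbor search and the function name are assumptions.

```python
import numpy as np

def label_minority_samples(X, y, minority_label, k=5):
    """Tag each minority sample 'noise', 'danger', or 'safe' from the number m
    of majority-class points among its k nearest neighbours (k >= m >= 0)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    tags = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                      # exclude the point itself
        nn = np.argsort(dist)[:k]             # indices of the k nearest neighbours
        m = int(np.sum(y[nn] != minority_label))
        if m == k:
            tags[i] = "noise"                 # all neighbours are majority class
        elif m >= k / 2:
            tags[i] = "danger"                # easily misclassified; kept for oversampling
        else:
            tags[i] = "safe"
    return tags
```

Only the samples tagged "danger" are carried forward to steps 3 to 6.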
In step 3:
After the danger samples are extracted, their neighbors must be found. The neighbors of each danger sample are searched for among the minority-class samples of the original training set to serve as parent samples for the subsequently synthesized new samples; this both locates the sample points of interest and pulls the synthesis toward minority-class sample points that carry the class trend but are not easily resolved. Here k = 5.
The oversampling method based on triangle centroid weights thus also applies the idea of searching for the neighbors of danger samples among the minority class.
In step 4:
The triangle centroid coordinates are calculated as follows:
Centroid=(D(A)+D(B)+D(C))/3 (3)
In formula (3), Centroid is the triangle centroid coordinate, D(A) is the coordinate of the danger sample, and D(B) and D(C) are the coordinates of two neighbor samples of the danger sample.
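Formula (3) in code form, as an illustrative sketch where coordinates are treated as NumPy vectors:

```python
import numpy as np

def triangle_centroid(d_a, d_b, d_c):
    """Formula (3): Centroid = (D(A) + D(B) + D(C)) / 3, where D(A) is the
    danger sample and D(B), D(C) are two of its minority-class neighbors."""
    return np.mean(np.asarray([d_a, d_b, d_c], dtype=float), axis=0)
```

Unlike SMOTE's line interpolation, the centroid lies strictly inside the triangle spanned by the three minority-class points, so the candidate region is two-dimensional rather than a segment.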
In step 5: after the centroid is obtained in step 4, it is multiplied by the feature weights output in step 1, so that the new sample is not limited to a fixed position but is offset to some extent according to the importance of each feature.
The number of new minority-class samples to synthesize is determined by the input sampling rate. Since the purpose of the algorithm is to balance the classes, the number of new samples should equal the difference between the minority and majority sample counts in the training set, and new samples are synthesized cyclically from the danger-class samples until that number is reached.
In step 6:
The concept of a weight coefficient is introduced because the feature weights optimized in step 1 still cannot guide the centroid accurately to a proper position. A weight coefficient is therefore set in the interval (0, 1], open at the lower end and closed at the upper end, and the new sample is finally obtained by multiplying the weight coefficient by the offset centroid, allowing it to reach the most suitable position.
As shown in fig. 3, the weight coefficient is determined by a genetic algorithm. An initial population is generated between 0 and 1, and by repeatedly selecting, crossing and mutating the binary-string weight coefficients converted from decimal numbers, the population with the best evaluation index is retained; the process terminates after the set number of iterations, yielding the best weight coefficient and classification result. The initial population size is set to 10, the algorithm iterates for 20 generations, tournament selection is used, two-point crossover is used with a crossover probability of 0.7, and the mutation probability is set according to one half of the chromosome length, as shown in Table 1.
Table 1 genetic algorithm parameter settings
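The genetic-algorithm loop with the Table 1 parameters can be sketched as follows. This is a hedged, minimal illustration: the 16-bit encoding length, the 1/L bit-flip mutation rate, and the abstract fitness callable are assumptions; in the invention the fitness would be the classifier score obtained with the decoded weight coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 16  # chromosome length in bits (an assumed encoding length)

def decode(bits):
    """Map a binary chromosome to a weight coefficient in (0, 1]."""
    return (int("".join(map(str, bits)), 2) + 1) / 2 ** L

def evolve(fitness, pop_size=10, generations=20, p_cross=0.7):
    """Minimal GA with the Table 1 parameters: population 10, 20 generations,
    tournament selection, two-point crossover with probability 0.7."""
    pop = rng.integers(0, 2, size=(pop_size, L))
    for _ in range(generations):
        scores = np.array([fitness(decode(ind)) for ind in pop])
        # tournament selection (tournament size 2)
        winners = []
        for _ in range(pop_size):
            a, b = rng.integers(pop_size, size=2)
            winners.append(pop[a if scores[a] >= scores[b] else b].copy())
        pop = np.array(winners)
        # two-point crossover on consecutive pairs
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                lo, hi = sorted(rng.integers(1, L, size=2))
                pop[i, lo:hi], pop[i + 1, lo:hi] = (
                    pop[i + 1, lo:hi].copy(), pop[i, lo:hi].copy())
        # bit-flip mutation (rate 1/L here; the text ties the rate to chromosome length)
        pop[rng.random(pop.shape) < 1.0 / L] ^= 1
    scores = np.array([fitness(decode(ind)) for ind in pop])
    return decode(pop[int(np.argmax(scores))])
```

Calling `evolve` with a classifier-based fitness would return the coefficient whose synthesized samples give the best evaluation index after the final generation.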
The new sample is synthesized as follows:
Newsample = Centroid * Feature Weight * Weight Coefficient (4)
wherein Centroid is the triangle centroid coordinate, Feature Weight is the weight of each feature, and Weight Coefficient is the weight coefficient.
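Formula (4) combines the outputs of steps 1 to 6 by element-wise products; a sketch under the assumption that the centroid and the feature weights are vectors of equal length (the function name is illustrative):

```python
import numpy as np

def synthesize(centroid, feature_weights, weight_coefficient):
    """Formula (4): Newsample = Centroid * FeatureWeight * WeightCoefficient.
    The centroid is shifted feature-by-feature by the Relief weights and then
    scaled by the GA-selected weight coefficient."""
    return (np.asarray(centroid, dtype=float)
            * np.asarray(feature_weights, dtype=float)
            * float(weight_coefficient))
```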
The classification accuracy and the area under the ROC curve (AUC) on 7 data sets were evaluated with 10-fold cross-validation, using decision tree (DT) and KNN as classifiers respectively. k-fold cross-validation divides the selected data set evenly into k groups; one fold serves as the test set and the other k-1 folds as the training set, and so on, so that k trained models and test results are obtained in total, whose average accuracy and AUC values are taken as the final result. The invention adopts 10-fold cross-validation and, on top of that, repeats each 10-fold cross-validation 10 times, taking the result of the 100 evaluations as the final result.
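The k-fold split described above reduces to index bookkeeping; a sketch (classifier training calls omitted, and the stratification sometimes used with KEEL data sets is not shown):

```python
import numpy as np

def kfold_indices(n, k=10, seed=0):
    """Split n sample indices into k folds: each fold is the test set exactly
    once, and the remaining k-1 folds form the corresponding training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]
```

Averaging the per-fold accuracy and AUC over the 10 folds, then over 10 repetitions with different seeds, would give the 100-evaluation result described in the text.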
Accuracy and the area under the receiver operating characteristic (ROC) curve (AUC) are classifier evaluation indicators commonly used in imbalanced-data classification.
The accuracy is the most commonly used classifier evaluation index and is calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
wherein:
TP (True Positive) is the number of instances that are actually positive and are classified as positive.
TN (True Negative) is the number of instances that are actually negative and are classified as negative.
FP (False Positive) is the number of instances that are actually negative but are classified as positive.
FN (False Negative) is the number of instances that are actually positive but are classified as negative.
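The accuracy defined from these four confusion-matrix counts, as a one-line sketch:

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN), using the counts above."""
    return (tp + tn) / (tp + tn + fp + fn)
```

On imbalanced data a high accuracy can hide poor minority-class recall, which is why the AUC is reported alongside it.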
The receiver operating characteristic (ROC) curve is obtained by plotting the false-alarm probability on the abscissa against the hit probability on the ordinate under different classification thresholds; within the unit square of the first quadrant, the closer the curve is to the upper-left corner, the better the classifier's performance is generally considered. However, to avoid the ambiguity caused by intersecting curves, the area under the ROC curve (AUC) is used as the evaluation index: the larger the AUC value, the better the classification performance.
To verify the validity of the invention, KEEL data sets are used as the samples to be processed. The KEEL data sets are described in Table 2, where Dataset is the name of the data set, Instance is the number of samples, Features is the number of features, and Classes is the number of classes.
The invention applies the method to each of the 8 data sets, with decision tree and KNN as the two classification algorithms. Experiments show that, compared with Borderline-SMOTE1, the proposed oversampling method based on triangle centroid weights obtains higher accuracy and AUC values on the classification of the 8 data sets.
Table 2 KEEL dataset description
Table 3 is a pairwise comparison with classification accuracy as the evaluation criterion, and Table 4 is a pairwise comparison with AUC as the evaluation criterion. "My Method" in the two tables is the method provided by the invention, DT denotes a decision tree as the base classification algorithm, and KNN denotes the k-nearest-neighbor classification algorithm as the base.
Table 3 Comparison of classification with accuracy as the evaluation criterion (%)
Table 4 Comparison of classification with AUC as the evaluation criterion
Figs. 4 to 7 in the specification are experimental comparisons of classification under different evaluation indexes after oversampling with the two algorithms; the left bars are the Borderline-SMOTE1 algorithm and the right bars are the method proposed by the invention.
Fig. 4 is a comparative bar graph of the AUC values of the two algorithms with DT as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher AUC values than the Borderline-SMOTE1 algorithm on the 8 data sets.
Fig. 5 is a comparative bar graph of the AUC values of the two algorithms with KNN as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher AUC values than the Borderline-SMOTE1 algorithm on the 8 data sets.
Fig. 6 is a comparative bar graph of the accuracy of the two algorithms with DT as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher accuracy than the Borderline-SMOTE1 algorithm on the 8 data sets.
Fig. 7 is a comparative bar graph of the accuracy of the two algorithms with KNN as the base classification algorithm; the experiments show that the proposed oversampling algorithm attains higher accuracy than the Borderline-SMOTE1 algorithm on the 8 data sets.

Claims (6)

1. An oversampling method based on triangle centroid weight, characterized in that the initial operation of the Borderline-SMOTE method is applied to divide the data into three parts, noise, danger and safe, then same-class neighbors of the danger samples are selected, and the final position of each new sample is determined according to the feature weights and the weight coefficient so as to strengthen the relevant sample characteristics, the method comprising the following specific steps:
step 1, quantize the samples to be processed into numerical values and calculate the feature weights of the quantized samples;
step 2, extract the danger-class samples from the quantized samples;
step 3, search for the neighbor samples of each danger-class sample;
step 4, for each danger-class sample, randomly pick two of its neighbor samples and calculate the centroid coordinates of the triangle formed by the three points to obtain a centroid sample;
step 5, multiply each feature of the centroid coordinates in the centroid sample by the feature weight obtained in step 1 to obtain an offset centroid, all the offset centroids forming the centroid offset samples;
step 6, determine a weight coefficient for the centroid offset samples with a genetic algorithm and multiply it by the offset centroids to finally obtain the new samples.
2. The method of claim 1, wherein the feature weights are calculated in step 1 by a Relief method.
3. The oversampling method based on triangle centroid weight of claim 1, wherein the danger-class samples are extracted as follows: following the Borderline-SMOTE idea, divide the samples to be processed into majority and minority classes, and search the k nearest neighbors of each minority-class sample to obtain the number m of majority-class samples among them (k ≥ m ≥ 0); if m = k, all k neighbors are majority-class samples, the sample is considered noise, and processing of it stops; if m is at least half of k, the minority-class sample is considered easily misclassified, i.e. a danger-class sample to be acquired; if m is less than half of k, the minority-class sample is considered safe and processing of it stops.
4. The oversampling method based on triangle centroid weight according to claim 1, wherein the triangle centroid coordinates are calculated by:
Centroid=(D(A)+D(B)+D(C))/3 (3)
In formula (3), Centroid is the triangle centroid coordinate, D(A) is the coordinate of the danger-class sample point, and D(B) and D(C) are the coordinates of two neighbor samples of the danger-class sample.
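Formula (3) is simply the componentwise mean of the three sample points; a one-line sketch (the function name is illustrative):

```python
def triangle_centroid(a, b, c):
    """Componentwise mean of three sample points, per formula (3)."""
    return [(ai + bi + ci) / 3 for ai, bi, ci in zip(a, b, c)]

centroid = triangle_centroid((0, 0), (3, 0), (0, 3))
```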
5. The oversampling method based on triangle centroid weight according to claim 1, wherein the synthesis method of the new sample is:
NewSample = Centroid * FeatureWeight * WeightCoefficient (4)
wherein Centroid is the triangle centroid coordinates, FeatureWeight is the weight of each feature, and WeightCoefficient is the weight coefficient.
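Formula (4) is applied elementwise per feature; a minimal sketch (names are illustrative):

```python
def synthesize(centroid, feature_weights, coefficient):
    """NewSample = Centroid * FeatureWeight * WeightCoefficient, elementwise."""
    return [ci * wi * coefficient for ci, wi in zip(centroid, feature_weights)]

new_sample = synthesize([1.0, 2.0], [0.5, 2.0], 0.5)
```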
6. The method for oversampling based on triangle centroid weight according to claim 1, wherein the weight coefficient is determined by a genetic algorithm: an initial population is generated in the interval (0, 1); the weight coefficients, encoded as binary strings converted from decimal numbers, repeatedly undergo selection, crossover, mutation, and population preservation in the genetic algorithm, and the process terminates after a set number of iterations, yielding the optimal weight coefficient and the corresponding classification result; wherein the initial population size is 10, the algorithm is iterated for 20 generations, selection uses the tournament method, crossover uses the two-point mode with a crossover probability of 0.7, and the reciprocal of the chromosome length is used as the mutation probability.
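A hand-rolled sketch of the described genetic algorithm, under stated assumptions: a 16-bit chromosome (the length is not given in the claim), mutation probability taken as 1/length, and a simple quadratic stand-in for the classification-based fitness the patent would actually evaluate:

```python
import random

BITS = 16          # assumed chromosome length; not stated in the claim

def decode(bits):
    """Binary string -> weight coefficient in [0, 1)."""
    return int("".join(map(str, bits)), 2) / (1 << BITS)

def evolve(fitness, pop_size=10, gens=20, cx_prob=0.7, seed=1):
    rng = random.Random(seed)
    mut_prob = 1.0 / BITS                  # mutation prob = 1 / chromosome length
    pop = [[rng.randint(0, 1) for _ in range(BITS)] for _ in range(pop_size)]
    best = max(pop, key=lambda c: fitness(decode(c)))
    for _ in range(gens):
        nxt = [best[:]]                    # population preservation (elitism)
        while len(nxt) < pop_size:
            # tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=lambda c: fitness(decode(c)))
            p2 = max(rng.sample(pop, 3), key=lambda c: fitness(decode(c)))
            child = p1[:]
            if rng.random() < cx_prob:     # two-point crossover
                i, j = sorted(rng.sample(range(BITS), 2))
                child = p1[:i] + p2[i:j] + p1[j:]
            child = [b ^ (rng.random() < mut_prob) for b in child]  # bit flip
            nxt.append(child)
        pop = nxt
        best = max(pop, key=lambda c: fitness(decode(c)))
    return decode(best)

# stand-in fitness: in the patent this would be the classification result
best_w = evolve(lambda w: -(w - 0.6) ** 2)
```

Keeping the best chromosome in every new population guarantees the best fitness never decreases across generations.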
CN202110976931.5A 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight Active CN113792765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110976931.5A CN113792765B (en) 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110976931.5A CN113792765B (en) 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight

Publications (2)

Publication Number Publication Date
CN113792765A CN113792765A (en) 2021-12-14
CN113792765B true CN113792765B (en) 2024-06-14

Family

ID=79182293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110976931.5A Active CN113792765B (en) 2021-08-24 2021-08-24 Oversampling method based on triangle centroid weight

Country Status (1)

Country Link
CN (1) CN113792765B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5799114A (en) * 1993-05-05 1998-08-25 Liberty Technologies, Inc. System and method for stable analysis of sampled transients arbitrarily aligned with their sample points
CN107563435A (en) * 2017-08-30 2018-01-09 哈尔滨工业大学深圳研究生院 Higher-dimension unbalanced data sorting technique based on SVM
CN111259964B (en) * 2020-01-17 2023-04-07 上海海事大学 Over-sampling method for unbalanced data set
CN111931853A (en) * 2020-08-12 2020-11-13 桂林电子科技大学 Oversampling method based on hierarchical clustering and improved SMOTE
CN112633337A (en) * 2020-12-14 2021-04-09 哈尔滨理工大学 Unbalanced data processing method based on clustering and boundary points

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Imbalanced data set classification algorithm based on a synthetic minority oversampling technique improved by a genetic algorithm; Huo Yudan; Gu Qiong; Cai Zhihua; Yuan Lei; Journal of Computer Applications (Issue 01); full text *
Research on classification algorithms for imbalanced data sets with improved SMOTE; Zhao Qinghua; Zhang Yihao; Ma Jianfen; Duan Qianqian; Computer Engineering and Applications (Issue 18); full text *

Also Published As

Publication number Publication date
CN113792765A (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN111400180B (en) Software defect prediction method based on feature set division and ensemble learning
De Amorim Feature relevance in ward’s hierarchical clustering using the L p norm
Isa et al. Using the self organizing map for clustering of text documents
CN106202952A (en) A kind of Parkinson disease diagnostic method based on machine learning
CN110674846A (en) Genetic algorithm and k-means clustering-based unbalanced data set oversampling method
CN111626336A (en) Subway fault data classification method based on unbalanced data set
CN115048988B (en) Unbalanced data set classification fusion method based on Gaussian mixture model
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN112801140A (en) XGboost breast cancer rapid diagnosis method based on moth fire suppression optimization algorithm
Zhang et al. A hybrid feature selection algorithm for classification unbalanced data processsing
CN111709460A (en) Mutual information characteristic selection method based on correlation coefficient
CN113792765B (en) Oversampling method based on triangle centroid weight
CN117407732A (en) Unconventional reservoir gas well yield prediction method based on antagonistic neural network
Jiang et al. Undersampling of approaching the classification boundary for imbalance problem
CN112183598A (en) Feature selection method based on genetic algorithm
Gao et al. An ensemble classifier learning approach to ROC optimization
CN115116619A (en) Intelligent analysis method and system for stroke data distribution rule
CN114358191A (en) Gene expression data clustering method based on depth automatic encoder
CN113392908A (en) Unbalanced data oversampling algorithm based on boundary density
Kurniawati et al. Model optimisation of class imbalanced learning using ensemble classifier on over-sampling data
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
CN111488903A (en) Decision tree feature selection method based on feature weight
Rosales-Pérez et al. Genetic selection of fuzzy model for acute leukemia classification
Hwang Identification of a Gaussian fuzzy classifier
Suksut et al. Support vector machine with restarting genetic algorithm for classifying imbalanced data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240417

Address after: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant after: Shenzhen Wanzhida Technology Co.,Ltd.

Country or region after: China

Address before: 710048 Shaanxi province Xi'an Beilin District Jinhua Road No. 5

Applicant before: XI'AN University OF TECHNOLOGY

Country or region before: China

TA01 Transfer of patent application right

Effective date of registration: 20240512

Address after: 430000, 1st floor, Building B9, Optics Valley Biotech City Innovation Park, No. 666 Gaoxin Avenue, Donghu New Technology Development Zone, Wuhan City, Hubei Province

Applicant after: Wuhan Guanggu Kangfu Information Technology Co.,Ltd.

Country or region after: China

Address before: 518000 1002, Building A, Zhiyun Industrial Park, No. 13, Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province

Applicant before: Shenzhen Wanzhida Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant