CN113344075A - High-dimensional unbalanced data classification method based on feature learning and ensemble learning - Google Patents

High-dimensional unbalanced data classification method based on feature learning and ensemble learning Download PDF

Info

Publication number
CN113344075A
CN113344075A (application CN202110615623.XA)
Authority
CN
China
Prior art keywords
feature
samples
sample
data set
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110615623.XA
Other languages
Chinese (zh)
Inventor
陈佐
张志刚
杨胜刚
杨捷琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Huda Jinke Technology Development Co ltd
Original Assignee
Hunan Huda Jinke Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Huda Jinke Technology Development Co ltd filed Critical Hunan Huda Jinke Technology Development Co ltd
Priority to CN202110615623.XA priority Critical patent/CN113344075A/en
Publication of CN113344075A publication Critical patent/CN113344075A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24143Distances to neighbourhood prototypes, e.g. restricted Coulomb energy networks [RCEN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/285Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/10Machine learning using kernel methods, e.g. support vector machines [SVM]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high-dimensional unbalanced data classification method based on feature learning and ensemble learning, which comprises the following steps: performing equalization processing on an original data set with a hybrid sampling algorithm to obtain an equalized data set, the original data set being a set of high-dimensional unbalanced data; performing feature selection on the equalized data set with a feature selection algorithm to obtain an optimal feature set; and classifying the optimal feature set with a trained Stacking ensemble learning model. The invention aims to provide a high-dimensional unbalanced data classification method based on feature learning and ensemble learning so as to improve the classification accuracy of high-dimensional unbalanced data.

Description

High-dimensional unbalanced data classification method based on feature learning and ensemble learning
Technical Field
The invention relates to the technical field of high-dimensional data classification, in particular to a high-dimensional unbalanced data classification method based on feature learning and ensemble learning.
Background
With the explosive growth of the Internet and mobile intelligent terminals, large volumes of data have accumulated in fields such as e-commerce, financial technology, medical technology, government departments and the Internet industry. Behind this data lies a large amount of valuable information, so Data Mining (DM) has become very important. Effective classification of data is one of the important directions of data mining. Classification means training a classification model on data whose features and classes are already known and, after the model has been continuously optimized, using it to assign classes to unlabeled data.
High-dimensional unbalanced data are very common in big data. High-dimensional data are data with very many features: some data have thousands or even tens of thousands of features, and as the dimensionality grows the classification model becomes harder to train and its performance keeps dropping. There are usually certain associations between features and classes and among the features themselves. If the features can be screened by some method so that only the features relevant to the class are selected and the irrelevant ones are all removed, dimensionality reduction of the high-dimensional features can be achieved; the feature selection method is one such approach. A feature selection method deletes irrelevant features and redundant features by analyzing the relevance between features and classes and the redundancy among features. Statistical methods are typically used to measure correlation and redundancy, including spatial distance measures, consistency metrics, and the like. When high-dimensional unbalanced data are classified, traditional machine learning classification algorithms can effectively classify low-dimensional unbalanced data, but they cannot adapt to data that are both high-dimensional and unbalanced: the high-dimensional features increase the difficulty of classification, and the imbalance of the data biases the classification algorithm toward the majority class, so the classification accuracy of the minority class is low. In practical applications, the minority class often contains information of extremely high value. For example, in medical diagnosis, although healthy people far outnumber cancer patients, it is the diagnosis of cancer patients that people care about; in face recognition, although the technology is well developed, blurred faces still cannot be recognized effectively. Misclassifying the minority class often causes inestimable losses, yet current research on classifying high-dimensional unbalanced data rarely reaches an ideal state: most work addresses only the high-dimensionality problem or only the imbalance problem, and although some algorithms classify well, their models generalize poorly and cannot adapt to most situations.
Disclosure of Invention
The invention aims to provide a high-dimensional unbalanced data classification method based on feature learning and ensemble learning so as to improve the classification accuracy of high-dimensional unbalanced data.
The invention is realized by the following technical scheme:
the high-dimensional unbalanced data classification method based on feature learning and ensemble learning comprises the following steps:
s1: carrying out equalization processing on the original data set by using a mixed sampling algorithm to obtain an equalized data set;
the original data set is a set of a plurality of high-dimensional unbalanced data;
s2: carrying out feature selection on the equalized data set by using a feature selection algorithm to obtain an optimal feature set;
s3: and classifying the optimal feature set by using the trained Stacking ensemble learning model.
Preferably, the S1 includes the following substeps:
s11: removing noise samples in the original data set by using a NENN algorithm to obtain a noiseless data set;
s12: oversampling boundary samples in the noiseless data set to obtain an oversampled data set;
s13: under-sampling a plurality of types of samples in the noiseless data set to obtain an under-sampled data set;
s14: and merging the over-sampling data set and the under-sampling data set to obtain the equalized data set.
Preferably, the S12 includes the following substeps:
s121: acquiring a majority sample set A and a minority sample set B in the boundary samples;
s122: acquiring a majority class sample set A1 and a minority class sample set B1 in the minority class sample set B;
s123: merging the majority sample set A1 and the majority sample set A to obtain a majority sample boundary set;
s124: acquiring a few class samples B2 in a majority class sample boundary set;
s125: merging the minority sample set B1 and the minority sample set B2 to obtain a minority sample boundary set;
s126: randomly selecting two samples from the minority sample boundary set, and generating a new minority sample according to the following formula;
X_new = x_1 + random(0,1) * (y_i - x_2), i = 1, 2, ..., N;
where X_new denotes the newly generated minority-class sample, x_1 and x_2 denote the two samples randomly selected from the minority class, y_i denotes the i-th same-class sample randomly selected from the k same-class nearest neighbours according to the sampling rate, and N is a natural number;
s127: repeating the step S126 n times to obtain the oversampled data set; n is a natural number.
Preferably, the S13 includes the following substeps:
s131: dividing most samples in the noiseless data set into K samples by using a K-means clustering algorithm, and obtaining the sample center of each sample;
s132: calculating the average sample number of the K samples and the average distance between the center of each sample and all samples;
s133: calculating the distance between the center of the sample and the sample in each class;
s134: and if the distance between the center of the sample and the sample exceeds the average distance and the number of samples in the class is less than the average number of samples, rejecting the sample to obtain the undersampled data set.
Preferably, the S2 includes the following substeps:
s21: removing irrelevant features and redundant features in the equalized data set to obtain a candidate feature subset;
s22: and performing accurate measurement on the features in the candidate feature subset by using a C4.5 classifier, and removing the features with reduced accuracy to obtain an optimal feature subset.
Preferably, the S21 includes the following substeps:
s211: calculating the MIC value of each feature and the category of each feature in the equalized data set;
s212: removing the features corresponding to the MIC values smaller than the MIC threshold value from the equalized data set to obtain a feature subset;
s213: calculating the SU value and the Pearson value of each feature and the rest of the features in the feature subset;
s214: calculating the average value for the corresponding feature from the SU value and the Pearson value;
s215: and removing the corresponding features of which the average values are higher than the threshold average value from the feature subset to obtain the candidate feature subset.
Preferably, the S22 includes the following substeps:
s221: obtaining a complementarity value of each feature in the candidate feature subset;
s222: performing descending order arrangement on the features in the candidate feature subset according to the magnitude of the corresponding complementarity value;
s223: according to the arrangement sequence, one characteristic is sequentially selected and subjected to accurate rate measurement by using a C4.5 algorithm;
s224: if the feature is such that the accuracy is increased or unchanged, retaining the feature; otherwise, the feature is removed from the candidate feature subset to obtain the optimal feature subset.
Preferably, the Stacking ensemble learning model comprises a base model layer and a meta model layer;
the base model layer is used for pre-classifying the optimal feature set to obtain a predicted classification result;
and the meta model layer is used for obtaining an actual classification result according to the predicted classification result.
Preferably, the base model layer comprises a support vector machine, a decision tree, a random forest and adaptive boosting (AdaBoost).
Preferably, the meta-model layer comprises an extreme gradient boosting tree (XGBoost).
This scheme addresses both the high-dimensional and the unbalanced characteristics of the data, working from three aspects: hybrid sampling, feature selection and ensemble learning. First, to alleviate the degree of imbalance in the data, a hybrid sampling algorithm based on BN-SMOTE and KCUS is proposed. Comparison experiments with other sampling algorithms show that the algorithm works well for classifying minority-class samples, although the overall classification accuracy is still not ideal. The redundancy and correlation of the features are then analyzed further, and the scheme proposes a high-dimensional feature selection algorithm based on MIC and FCBF. In the Filter stage, irrelevant and redundant features are removed using the MIC coefficient and the FCBF algorithm; in the Wrapper stage, features with high complementarity are selected and added to the candidate feature subset, the feature subsets are evaluated by the classification performance of a C4.5 classifier, and the optimal feature subset is selected; comparison experiments show the superiority of the algorithm in feature selection. On the basis of hybrid sampling and feature selection, and in order to improve the generalization ability and classification accuracy of the model, a multi-model fusion algorithm based on Stacking ensemble learning is proposed: an SVM (support vector machine), a decision tree, a random forest and AdaBoost are selected as base-model classifiers, and XGBoost is selected as the meta-model classifier. These classifiers differ greatly from one another, classify well and run fast, so the stability and accuracy of the fused model can be greatly improved.
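For orientation, the sketch below wires the three stages together in Python using off-the-shelf stand-ins (imbalanced-learn's SMOTE for the hybrid sampler, mutual-information filtering for HFS-MF, and scikit-learn's StackingClassifier for the fusion model); it is only a rough approximation of the scheme under those substitutions, not the patented algorithms themselves.

```python
# Minimal end-to-end sketch of steps S1-S3 using library stand-ins; the real
# HSA hybrid sampler, HFS-MF selector and Stacking setup are described below.
from imblearn.over_sampling import SMOTE
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.ensemble import StackingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def classify(X_train, y_train, X_test, k_features=50):
    # S1: equalize the raw data set (SMOTE as a stand-in for the hybrid sampler)
    X_bal, y_bal = SMOTE().fit_resample(X_train, y_train)
    # S2: keep the features most related to the class (stand-in for HFS-MF)
    selector = SelectKBest(mutual_info_classif, k=k_features).fit(X_bal, y_bal)
    # S3: Stacking ensemble with the base and meta learners named in the text
    base = [("svm", SVC(probability=True)), ("dt", DecisionTreeClassifier()),
            ("rf", RandomForestClassifier()), ("ada", AdaBoostClassifier())]
    model = StackingClassifier(estimators=base, final_estimator=XGBClassifier())
    model.fit(selector.transform(X_bal), y_bal)
    return model.predict(selector.transform(X_test))
```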
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. Aiming at the problem of repeated, abnormal, redundant and other noise in a high-dimensional unbalanced data set, a new noise-removal algorithm, NENN, is designed. Aiming at the characteristics of the unbalanced data distribution, the BN-SMOTE and KCUS algorithms are combined: minority-class samples at the fuzzy decision boundary are oversampled to synthesize new minority-class samples, and majority-class samples far from the cluster centres are removed based on the K-means clustering principle, which alleviates the degree of imbalance of the data and improves the accuracy of the classification model;
2. Aiming at the high-dimensional feature characteristic of a high-dimensional unbalanced data set, a new high-dimensional feature selection algorithm, HFS-MF, is designed: irrelevant and redundant features are eliminated with the MIC coefficient and the FCBF algorithm, and a C4.5 classifier is used to evaluate the feature subsets to obtain the optimal feature subset with the highest accuracy;
3. In the training framework based on Stacking ensemble learning, an SVM, a decision tree, a random forest and AdaBoost are selected as the classifiers of the base model layer. These classifiers differ greatly and classify well. The XGBoost algorithm, which runs fast and classifies strongly, is selected as the meta-learner, so the stability and accuracy of the fused model can be greatly improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a diagram of a Stacking ensemble learning model-based fusion framework according to the present invention;
FIG. 2 is a flow chart of the HSA-KSR algorithm of the present invention;
FIG. 3 is a sample distribution diagram of three types of the present invention;
FIG. 4 is a flow chart of the KCUS algorithm of the present invention;
FIG. 5 is a sample distribution diagram of a few classes according to the present invention;
FIG. 6 is a flow chart of the HFS-MF algorithm of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Examples
The high-dimensional unbalanced data classification method based on feature learning and ensemble learning, as shown in fig. 1, includes the following steps:
s1: performing equalization processing on the original data set by using a hybrid sampling algorithm to obtain an equalized data set, as shown in fig. 2;
the original data set in this embodiment is a set of multiple high-dimensional unbalanced data.
Noisy data in a data set are unavoidable; if the noisy data are not processed, classification becomes more difficult and less precise. In high-dimensional unbalanced data in particular, minority-class boundary samples and noise samples sometimes overlap, so in this step the whole original data set is first preprocessed to remove noise samples.
Specifically, the data set includes three classes of Boundary samples (Boundary), Noise samples (Noise), and Safety samples (Safety). The boundary samples are samples at the decision boundary, the number of the samples is small, but a few types of samples at the boundary have decisive effect on classification; the noise samples are data mixed between correct samples or data located in a boundary region, and the noise samples interfere with the classification effect; the safety samples are samples outside the decision boundary, and the safety samples are large in number and have small influence on the boundary. As shown in fig. 3, the data outside the two dotted lines are safety samples, the samples on the dotted lines are boundary samples, and the samples between the dotted lines are noise samples.
In order to eliminate the influence of the noise samples, they are removed according to the distribution space of the data set with a Nearest-Neighbour Noise removal algorithm (NENN). Since the noise samples are mostly concentrated in the boundary region or inside samples of another class, the K nearest-neighbour samples of each sample point are obtained by computing the Euclidean distance between sample points; the classes of these K nearest neighbours are then examined, and if at least 2/3 of them belong to a different class, the sample point is treated as a noise sample and deleted. The specific algorithm steps are as follows:
Input: original data set S, nearest-neighbour number K, noise sample set N = [ ]
Output: noiseless data set S_e
Process:
Traverse the whole original data set S;
for each sample point x_i ∈ S:
compute the K nearest-neighbour samples of x_i using the Euclidean distance formula;
if at least 2/3 of the K nearest-neighbour samples have the same class as sample point x_i, retain x_i; otherwise N = N + x_i;
S_e = S - N;
return S_e
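A minimal Python sketch of such a nearest-neighbour denoising step is shown below; it uses scikit-learn only for the neighbour search and applies the 2/3 criterion described above, so it is an illustration rather than the exact NENN implementation.

```python
# Illustrative NENN-style denoising: a point is treated as noise when at least
# 2/3 of its K nearest neighbours carry a different class label.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def nenn_denoise(X, y, k=5):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own neighbour
    _, idx = nn.kneighbors(X)
    keep = np.ones(len(X), dtype=bool)
    for i, neigh in enumerate(idx):
        diff = np.sum(y[neigh[1:]] != y[i])            # neighbours of a different class
        if diff >= 2 * k / 3:                          # mostly different class -> noise
            keep[i] = False
    return X[keep], y[keep]
```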
After the noise samples are removed, the minority-class samples that are hard to classify are oversampled with the BN-SMOTE oversampling algorithm in order to increase the number of minority-class samples.
The BN-SMOTE oversampling algorithm in this scheme is an improvement of Borderline-SMOTE. Its core idea is to divide the samples into safety samples, boundary samples and noise samples, following the Borderline-SMOTE idea; since the noise samples have already been removed with the NENN algorithm before oversampling, only safety samples and boundary samples remain at this step. The boundary samples are further divided into a majority-class set A and a minority-class set B. The nearest neighbours of the minority-class set B are then computed, and B is further divided into a minority-class set B1 and a majority-class set A1. Taking the union of the majority-class sets A and A1 gives the majority-class boundary set. The nearest-neighbour samples of the majority-class boundary set are then computed, yielding the minority-class samples B2 contained in it, and the union of B1 and B2 gives the minority-class sample set that is hardest to classify at the boundary. Two samples are randomly selected from this set N times, and each time a new minority-class sample is generated according to the following formula, thereby enlarging the number of minority-class samples and obtaining the oversampled sample set;
X_new = x_1 + random(0,1) * (y_i - x_2), i = 1, 2, ..., N;
where X_new denotes the newly generated minority-class sample, x_1 and x_2 denote the two samples randomly selected from the minority class, and y_i denotes the i-th same-class sample; the sampling rate is determined according to the number of minority-class samples that actually need to be synthesized.
The specific steps of the BN-SMOTE algorithm are as follows:
Input: original data set S, nearest-neighbour number K, noise sample set N = [ ]
Output: oversampled sample set S_out
Process:
Divide the boundary samples into a majority class and a minority class;
For each minority-class sample x_i ∈ S, compute its nearest-neighbour majority-class sample set S_max(x_i), containing the K nearest samples;
Take the union of the sets S_max(x_i) to obtain the majority-class boundary set S_maxj:
S_maxj = ∪_i S_max(x_i);
For each majority-class sample y_i ∈ S_maxj, compute its nearest-neighbour minority-class sample set S_imaxj(y_i), containing the K1 nearest samples;
Take the union of the sets S_imaxj(y_i) to obtain the minority-class boundary set S_min that is hardest to distinguish:
S_min = ∪_i S_imaxj(y_i);
Randomly select two samples from S_min and generate a new sample X_new according to the formula X_new = x + random(0,1) * (y_i - x), i = 1, 2, ..., N; add X_new to S_out; repeat N times;
Return S_out
In this step the BN-SMOTE algorithm is used for oversampling. Compared with traditional oversampling algorithms, BN-SMOTE has obvious advantages at the decision boundary: it can distinguish the minority-class samples to the greatest extent, its time complexity is low, its way of selecting samples is simple and effective, in particular the nearest-neighbour parameter K can be determined, and it can achieve the goal of data balance.
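The sketch below gives one possible Python rendering of this boundary-oriented oversampling; the borderline test (at least half of the neighbours majority-class) and the helper names are simplifying assumptions rather than the patent's exact A/A1/B1/B2 construction.

```python
# Simplified boundary-focused oversampling in the spirit of BN-SMOTE (illustrative only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def borderline_oversample(X, y, minority_label, k=5, n_new=100, seed=0):
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X_min)
    # Minority points whose neighbourhood is dominated (but not swamped) by the majority class
    n_maj = np.array([(y[i[1:]] != minority_label).sum() for i in idx])
    border = X_min[(n_maj >= k / 2) & (n_maj < k)]
    if len(border) < 2:
        return X, y
    nn_min = NearestNeighbors(n_neighbors=min(k, len(border))).fit(border)
    new = []
    for _ in range(n_new):
        i1, i2 = rng.choice(len(border), size=2, replace=False)
        x1, x2 = border[i1], border[i2]
        _, ni = nn_min.kneighbors(x1.reshape(1, -1))
        yi = border[rng.choice(ni[0])]                  # same-class nearest neighbour of x1
        new.append(x1 + rng.random() * (yi - x2))       # X_new = x1 + rand(0,1) * (y_i - x2)
    X_new = np.vstack(new)
    y_new = np.full(len(X_new), minority_label)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])
```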
Finally, the edge samples of the majority class are removed with the KCUS undersampling algorithm and the fuzzy boundary samples are cleaned, so that the boundary is clearer during classification.
The KCUS algorithm is an undersampling algorithm based on K-means clustering. Its core idea is to select and delete samples according to the distribution characteristics of the data set. Looking at the distribution of the whole data set, samples of the same class are generally concentrated together: their similarity is high, their degree of aggregation is high, and the important information they contain is similar, whereas edge samples that do not aggregate with their class have a low degree of aggregation and may resemble the attributes of other nearby classes. To eliminate interference with classification, these outliers and edge points are deleted, ensuring that the sample points belonging to the same class are aggregated together. The specific implementation process is shown in FIG. 4.
As can be seen from FIG. 4, the time complexity of the KCUS algorithm is relatively high: every time a sample centre is selected, a calculation has to be made against every sample point, and then the sample centre with the highest degree of aggregation is selected again. Edge points are likewise selected by computing the distance between each point and the sample centre and averaging all of these distances; if the distance between a sample point and the sample centre exceeds the average distance and the number of samples in its cluster is less than the average number of samples, the point is selected as an edge point and deleted.
Specifically, in this step, a K-means clustering algorithm is first used to divide the majority-class samples in the noiseless data set into K clusters and obtain the centre of each cluster. Next, the average number of samples of the K clusters and the average distance between each cluster centre and all of its samples are calculated. The distance between the cluster centre and each sample within the cluster is then calculated; if the distance between the sample and the cluster centre exceeds the average distance and the number of samples in the cluster is less than the average number of samples, the sample is rejected, otherwise it is retained, thereby obtaining the undersampled data set.
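A minimal Python sketch of this K-means based rejection rule is given below, assuming the majority-class samples are passed in as a NumPy array; it illustrates the rule described above rather than the patent's exact KCUS implementation.

```python
# Illustrative K-means based undersampling in the spirit of KCUS: drop majority
# samples that are far from their cluster centre and sit in a small cluster.
import numpy as np
from sklearn.cluster import KMeans

def kcus_undersample(X_maj, k=10, seed=0):
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_maj)
    labels, centers = km.labels_, km.cluster_centers_
    sizes = np.bincount(labels, minlength=k)
    avg_size = sizes.mean()
    keep = np.ones(len(X_maj), dtype=bool)
    for c in range(k):
        members = np.where(labels == c)[0]
        dist = np.linalg.norm(X_maj[members] - centers[c], axis=1)
        # Reject points beyond the average distance when the cluster is smaller than average
        if sizes[c] < avg_size:
            keep[members[dist > dist.mean()]] = False
    return X_maj[keep]
```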
For the classification of unbalanced data, most researchers pay more attention to the unbalanced distribution of the minority class and to enlarging the number of minority-class samples, because the minority class has important classification value for the classification algorithm, especially the samples that lie in the boundary region and are few in number. For research on unbalanced data, how to ensure that minority-class samples are neither lost nor turned into interference while the data can still be classified effectively is therefore a key point and a difficulty faced by current oversampling techniques. Oversampling and undersampling bring the whole data set to a certain balance by adding minority-class samples and reducing majority-class samples, which provides effective information for the subsequent learning and training of the classification algorithm and improves the recognition rate. The algorithms themselves have limitations: if only a single method is used, the overall number of samples changes, and both the newly added samples and the deleted samples may affect the distribution of the entire data set.
A large number of experiments show that the distribution positions of minority-class samples are no more than the following situations:
First, distributed inside the majority-class samples; this situation is generally treated directly as noise data and has no research value;
Second, at the boundary between the majority class and the minority class, which is the most common and most complex case. If mishandled, such samples are easily treated as interference terms and directly deleted, or mixed with other minority-class samples, and they easily cause interference when oversampling is performed.
Third, at the edges of the majority class, without affecting the whole data set; such samples are generally judged by examining the classes of their nearest neighbours, and if no interference is caused, sample synthesis is carried out directly by an appropriate method. The three distribution situations of minority-class samples are shown in FIG. 5.
in order to solve the problem of few class samples at the boundary, the BN-SMOTE algorithm is adopted to perform new addition synthesis of the few class samples, the algorithm has the most outstanding advantage that the few class samples which are difficult to learn can be distinguished through several nearest neighbor divisions, particularly the few class samples which are positioned in a boundary decision area, the algorithm can effectively avoid the few class samples with rare quantity from being directly deleted, and the problems of overlapping, inter-class imbalance, overfitting and the like of the few class samples are effectively solved.
Aiming at the distribution characteristics of most types of data sets, most of the same type of samples are concentrated together, the scheme selects a K-means clustering undersampling algorithm (KCUS) based on the distribution characteristics of the samples, the algorithm firstly selects K sample centers from the samples at random, the samples are divided into K types by calculating the distance between each sample point and the sample center, and the sample center is recalculated until the sample center is determined. Secondly, the average distance between the sample center and each sample point is calculated, and if the distance between the sample point and the sample center exceeds the average distance and the number of the sample points is less than the average number of the sample points, the sample point is directly deleted.
The HSA-BSK hybrid sampling algorithm is obtained by combining the BN-SMOTE algorithm and the KCUS algorithm. It can effectively handle minority-class samples at the boundary and majority-class samples at the edge, alleviating the imbalance of the data and improving classification performance. At the same time, the algorithm overcomes the drawbacks of using oversampling or undersampling alone: data balance is achieved by reconstructing the data, adding minority-class samples and removing majority-class samples, which benefits the performance of the classifier.
S2: carrying out feature selection on the equalized data set by using a feature selection algorithm to obtain an optimal feature set;
For the classification of high-dimensional data, an excess of attributes not only interferes with the classification result but also greatly reduces the training efficiency of the classifier, so selecting the most representative features is the key to classifying high-dimensional data. From the relationship between a feature and the class it belongs to, one can tell whether the feature is relevant to the class; from the degree of correlation between features, the strongly correlated features can be screened out, so one can tell whether a feature is redundant. An irrelevant feature is one that has no association with the final classification; a redundant feature is one that is highly similar to other features, so that part of its information is duplicated. Based on this, this embodiment proposes a High-dimensional Feature Selection algorithm based on MIC and FCBF (HFS-MF for short). The algorithm has two stages, Filter and Wrapper. In the Filter stage, the MIC correlation coefficient is used as the evaluation index, and irrelevant and weakly relevant features are filtered out to obtain a candidate feature subset; in the Wrapper stage, complementarity is used as the evaluation index, the complementarity value of each feature in the candidate feature subset is calculated, and the candidate feature subsets are classified with a classifier to obtain the optimal feature subset with the highest classification accuracy.
MIC is the maximal information coefficient. The most important problem in classifying high-dimensional data is how to reduce its dimensionality. Many attributes have important relationships that cannot be found by eye, but statistical methods can be used to measure the attributes, find the relationships between them and select the best attributes according to certain criteria, so as to perform feature selection on the high-dimensional data. Such a statistical method needs to satisfy two principles, adaptability and fairness: fairness means that when the same amount of noise is added to relationships of different types, the statistic yields roughly the same value; adaptability means that the statistic can measure functional and non-functional, linear and non-linear, monotonic and non-monotonic relationships, and so on.
The Maximal Information Coefficient (MIC) can effectively measure the correlation between variables and satisfies the principles of adaptability and fairness. The core idea of MIC is to combine Mutual Information (MI) with a grid-partitioning method: if there is a correlation between two variables, they are scattered in a two-dimensional space and represented by a scatter plot; the two-dimensional space is divided into a certain number of intervals in the x and y directions, and one then checks how the scatter points fall into each cell, which amounts to computing the joint probability. This sidesteps the difficulty of estimating the joint probability in mutual information. The specific calculation is as follows:
MIC(D) = max_{x*y < B(n)} { I(D, x, y) / ln(min(x, y)) }
where D is the set of ordered pairs, x denotes dividing the value range of the feature f into x intervals (and y likewise for the second variable), x*y < B(n) restricts the number of grid cells to be less than B(n), I(D, x, y) denotes the maximum MI value under the different grid partitions, and ln(min(x, y)) normalizes the maximum MI value.
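A short Python sketch of computing per-feature MIC scores is shown below; it assumes the third-party minepy package (one common MIC implementation), with its default parameter values, and is not part of the patent text.

```python
# Sketch of per-feature MIC scores against the class labels (assumes minepy).
import numpy as np
from minepy import MINE

def mic_scores(X, y):
    mine = MINE(alpha=0.6, c=15)
    scores = []
    for j in range(X.shape[1]):
        mine.compute_score(X[:, j], y)   # grid search maximizing normalized MI
        scores.append(mine.mic())
    return np.asarray(scores)
```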
The MIC method was proposed on the basis of MI and overcomes MI's inability to measure relationships between continuous variables. Compared with other evaluation indices, the MIC coefficient has better stability, universality and fairness, and is suitable for feature selection on high-dimensional data.
FCBF is a fast correlation-based filter algorithm. It uses a heuristic backward sequential search to find the optimal feature subset quickly, adopts the Symmetric Uncertainty (SU) as the evaluation index, and each time it obtains the best feature it deletes all the features redundant with it. SU is widely used to measure the closeness of two non-linear variables, such as the correlation between features and their classes or between features, and is calculated as follows:
H(X) = -Σ_i P(x_i) log2 P(x_i);
H(X|Y) = -Σ_j P(y_j) Σ_i P(x_i|y_j) log2 P(x_i|y_j);
I(X|Y) = H(X) - H(X|Y) = H(Y) - H(Y|X);
SU(X, Y) = 2 * I(X|Y) / (H(X) + H(Y));
where H(X) denotes the information entropy of the variable X, whose value reflects its uncertainty; P(x_i) denotes the probability of the i-th value of the event X; H(X|Y) denotes the conditional entropy, the remaining uncertainty of X given the known condition Y, i.e. the uncertainty of X when Y is known; P(y_j) denotes the probability of the j-th value of the event Y; P(x_i|y_j) denotes the probability of x_i given y_j; I(X|Y) denotes the mutual information, which measures the degree of association between X and Y, i.e. how much information they share; and SU(X, Y) measures the shared uncertainty of X and Y. SU takes values in [0, 1], and the larger the value, the stronger the association; SU normalizes MI and overcomes the bias of IG (information gain) in feature selection.
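For concreteness, a small Python sketch of the SU computation for discrete (or pre-discretized) variables is shown below, following the formulas above; binning of continuous features is an assumption not spelled out here.

```python
# Minimal sketch: SU(X, Y) = 2 * I(X;Y) / (H(X) + H(Y)) for discrete variables.
import numpy as np

def entropy(values):
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def symmetric_uncertainty(x, y):
    h_x, h_y = entropy(x), entropy(y)
    pairs = np.stack([np.asarray(x), np.asarray(y)], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    h_xy = -np.sum(p * np.log2(p))        # joint entropy H(X, Y)
    mi = h_x + h_y - h_xy                 # mutual information I(X; Y)
    return 2.0 * mi / (h_x + h_y) if (h_x + h_y) > 0 else 0.0
```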
The degree of association between feature i and class c is denoted SU_{i,c}, and the degree of association between feature i and feature j is denoted SU_{i,j}. The FCBF algorithm has two main stages: removing irrelevant features and removing redundant features. The SU value, i.e. the degree of correlation, between each feature and the class is computed over the original feature set, the features whose value exceeds a threshold are kept, and they are arranged in descending order of SU value. Then, starting from the first feature of the subset, if SU_{i,c} > SU_{i+1,i}, the features following that feature are directly removed, and the process is repeated until the feature set is empty. It can be seen from this analysis that the stronger the relevance between a feature and the class, the more likely the feature is to be retained, and features that are strongly correlated with a retained feature are treated as redundant.
Specifically, in this step, the MIC value between each feature and the class it belongs to, together with the redundancy between features, is computed first, and the irrelevant features and part of the redundant features are filtered out to obtain a candidate feature subset. The complementarity of the candidate features is then calculated, the candidate feature set is arranged in descending order, and each time the first feature is selected and added to the optimal candidate subset; the classification accuracy before and after the addition is computed, and if the accuracy decreases the feature is removed, while if it increases or stays unchanged the feature is added to the optimal feature subset. This process is repeated until the candidate feature subset is empty. The specific implementation of the HFS-MF algorithm is as follows, as shown in FIG. 6:
1) To eliminate the effect of the differing magnitudes of the features, all features in the equalized data set are normalized (Min-Max normalization), i.e. the values are mapped into the [0, 1] range, with the specific formula:
X'_{i,j} = (X_{i,j} - X_{j,min}) / (X_{j,max} - X_{j,min});
where X'_{i,j} denotes the normalized value of feature j of the i-th sample in the equalized data set, X_{i,j} denotes the original value of feature j of the i-th sample in the equalized data set, and X_{j,max} and X_{j,min} denote the maximum and minimum values of feature j, respectively;
2) Removing features with low correlation: the MIC value between each feature in the equalized data set and the class is calculated, the features whose MIC value is lower than the MIC threshold are eliminated, and the remaining features are then arranged in descending order of MIC value;
the MIC threshold is set as follows:
σ = 0.4 * (Max(M_f) - Min(M_f));
where σ denotes the MIC threshold, and Max(M_f) and Min(M_f) denote the maximum and minimum values of the feature-class correlation, respectively; the larger the difference between the maximum and minimum, the more irrelevant and weakly correlated features exist in the original feature set, the larger the threshold becomes, and the more features are deleted;
3) Eliminating features with high redundancy: following the sorted order and starting from the first feature, the SU value and the Pearson value between each feature and the remaining features are calculated, the two correlation coefficients are averaged, and the features whose average value is higher than the threshold (the threshold average is set to 0.7 in this embodiment) are eliminated, thereby obtaining the candidate feature subset;
4) The first two steps remove most of the irrelevant and redundant features; some poorly performing features are then removed as well: the complementarity value of each feature in the candidate feature subset is calculated first (the complementarity value is the MIC value divided by the average redundancy), the features are sorted in descending order of complementarity, and the accuracy is then measured with the C4.5 algorithm; if the accuracy decreases after a feature is added, the feature is removed directly and the next feature is measured, thereby obtaining the optimal feature subset, as sketched in the example below.
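The Python sketch below illustrates the Wrapper stage just described; scikit-learn's entropy-based decision tree stands in for C4.5 (scikit-learn has no exact C4.5), and the candidate indices and complementarity values are assumed to come from the Filter stage.

```python
# Greedy Wrapper stage: try candidates in descending complementarity order and
# keep a feature only if cross-validated accuracy does not drop.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_select(X, y, candidate_idx, complementarity, cv=5):
    order = [candidate_idx[i] for i in np.argsort(-np.asarray(complementarity))]
    selected, best_acc = [], 0.0
    clf = DecisionTreeClassifier(criterion="entropy")   # C4.5-like stand-in
    for f in order:
        trial = selected + [f]
        acc = cross_val_score(clf, X[:, trial], y, cv=cv).mean()
        if acc >= best_acc:               # accuracy increased or unchanged: keep the feature
            selected, best_acc = trial, acc
    return selected
```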
In this scheme, in order to obtain the optimal feature subset, the best feature subset is the one with the greatest Complementarity, that is, the features it contains have a large Correlation with the class they belong to while the Redundancy among the features is small. The MIC coefficient is used as the evaluation index for the correlation measure, the mean of the Pearson coefficient and SU is used as the evaluation index for the redundancy measure, and the quotient of the correlation and the redundancy is used as the evaluation index for the complementarity measure. The specific calculation is as follows:
R_f = (P_f + SU_f) / 2;
C_f = M_f / R_f;
where C_f denotes the complementarity, M_f denotes the MIC coefficient, R_f denotes the redundancy, P_f denotes the Pearson coefficient, and SU_f denotes the symmetric uncertainty; C_f measures the importance of a candidate feature in terms of both relevance and redundancy.
The larger M_f and the smaller R_f, the larger C_f, which means the feature is more strongly correlated with the class it belongs to and less redundant with the candidate feature subset, so it has a stronger complementary effect.
In the prior art, Filter-based feature selection algorithms are independent of the classifier: they can quickly delete some of the features that are irrelevant or only weakly relevant to the class, and the selected features generalize well across different data and can adapt to various kinds of high-dimensional data. However, Filter methods only look at the relationship between a single feature and the class and ignore the interrelationships between features, so important features may be deleted and the accuracy of the final classification result is relatively low. Wrapper methods evaluate the quality of a feature subset by the classification precision of a classifier; they can screen out part of the redundant features and obtain a better feature subset with high final classification accuracy, but because they analyse the redundancy of the features one by one, their adaptability to high-dimensional data is poor, the efficiency of the algorithm is low, the time performance of classification on massive data is very poor, and the overfitting phenomenon is serious.
The FCBF algorithm is simple to implement and efficient, and can effectively delete redundant features and irrelevant features, but it treats a feature as redundant whenever it is strongly correlated with another feature; the optimal feature subset it obtains and its classification accuracy are not as good as those of the ReliefF algorithm, and the adaptability of the SU coefficient in screening out irrelevant features is not as strong as that of the MIC coefficient.
In order to eliminate the features that are irrelevant or weakly relevant to the class as well as the redundant features among the features, while ensuring both the running speed of the algorithm and the final classification effect, this scheme proposes a high-dimensional feature selection algorithm based on MIC and FCBF. The algorithm exploits the speed of the Filter approach and the high accuracy of the Wrapper approach, and combines the MIC, SU and Pearson coefficient measures to delete irrelevant and redundant features efficiently and obtain the optimal feature subset. Compared with other single algorithms and algorithms of the same type, it performs better in classification accuracy and stability.
S3: and classifying the optimal feature set by using a Stacking integration algorithm.
The Stacking integration algorithm comprises two layers of models: a base-learner model and a meta-learner model. In this scheme, the optimal feature subset is trained by the base learners to output a predicted classification result, this predicted result is then used as the input of the meta learner, and the final output is the final classification result. For the Stacking integration algorithm to have the best prediction effect, the base learners need to be highly independent, differ greatly from one another, have strong learning ability and classify well, while the meta learner needs to be stable, generalize strongly and classify well. Therefore, in this scheme, the classifiers of the base model layer are a Support Vector Machine (SVM), a decision tree (C4.5), a Random Forest (RF) and the Adaptive Boosting (AdaBoost) algorithm, and the meta model layer uses the extreme gradient boosting tree (XGBoost) algorithm as the meta learner, as shown in FIG. 1.
The SVM algorithm classifies linearly separable data well; by introducing a kernel function, high-dimensional data can be simplified, so it also performs well on non-linear data. The C4.5 algorithm uses the information gain ratio as its splitting index, which strengthens the recognition rate of minority-class samples and avoids overfitting, so it works well on unbalanced data; it also introduces pruning, which reduces the interference of noisy data, simplifies the structure of the tree and greatly improves classification accuracy. The random forest algorithm is composed of many decision trees; it enlarges the data subsets by bootstrap sampling, so the individual learners differ greatly and the model generalizes strongly. Because no feature selection is needed, it can handle high-dimensional data effectively and can rank the importance of features; and since random forests are trained in parallel based on the Bagging idea, the training speed of the model is extremely fast. The AdaBoost algorithm learns recursively, reducing the weights of correctly classified samples and increasing the weights of misclassified samples, so minority-class samples and hard-to-classify samples receive larger weights, and it performs excellently when classifying extremely unbalanced data sets. The XGBoost algorithm introduces a regularization term to reduce model complexity, performs a second-order Taylor expansion of the cost function during optimization and introduces the second derivative to prevent overfitting, and uses column sampling to speed up computation, so it is strong in classification performance, running speed, fault tolerance and stability.
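A minimal Python sketch of this two-layer construction is shown below; it assumes a binary task and the xgboost package, and builds the meta-learner's training input from out-of-fold base-model probabilities, which is one common way to realize the Stacking idea described here.

```python
# Two-layer Stacking sketch: out-of-fold predictions from the four base learners
# become the training input of an XGBoost meta-learner.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from xgboost import XGBClassifier

def stacking_fit_predict(X_train, y_train, X_test, cv=5):
    bases = [SVC(probability=True), DecisionTreeClassifier(),
             RandomForestClassifier(), AdaBoostClassifier()]
    # Meta-features: out-of-fold positive-class probabilities (binary task assumed)
    meta_train = np.column_stack([
        cross_val_predict(b, X_train, y_train, cv=cv, method="predict_proba")[:, 1]
        for b in bases])
    meta_test = np.column_stack([
        b.fit(X_train, y_train).predict_proba(X_test)[:, 1] for b in bases])
    meta = XGBClassifier()
    meta.fit(meta_train, y_train)
    return meta.predict(meta_test)
```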
A certain fusion strategy is needed when a plurality of learners are combined for use, and the combining strategies of the learners at present are mainly three types: voting, averaging, and learning.
In the voting method, the training set is first trained by the individual learners to produce class predictions; for each prediction, every individual learner casts an independent vote, and the result with the most votes is finally selected as the predicted class. Voting can be subdivided into absolute majority voting, relative majority voting and weighted voting. The idea of absolute majority voting is that a class is chosen only if its number of votes is the highest and exceeds half of the total.
The relative majority voting refers to that if the number of votes of a certain category is the highest, the prediction result of the sample is used as a classification category, and if a plurality of labels with the highest number of votes exist, one label is directly selected randomly.
And the weighted voting refers to that if weighted assignment is carried out according to the importance degree of the learner before voting, the weight and the vote number are added to obtain the category with the highest score as the final category.
The averaging method is to take the average value of the prediction result of each individual learner and output the average value as the final prediction result, and the averaging method can be divided into a simple averaging method (simple averaging) and a weighted averaging method (weighted averaging). The simple average method regards all the individual learners as equally important, while the weighted average method assigns different weights according to different prediction results, so that the individual learners with better learning effect have higher importance.
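As a small illustration of the weighted variants just described, the Python sketch below combines one sample's per-learner predictions; the argument layout (one entry or probability row per learner) is an assumption for the example.

```python
# Sketch of weighted voting and weighted averaging over individual learners.
import numpy as np

def weighted_vote(pred_labels, weights):
    """pred_labels[i] is learner i's predicted label; weights[i] is its importance."""
    pred_labels, weights = np.asarray(pred_labels), np.asarray(weights, dtype=float)
    classes = np.unique(pred_labels)
    scores = {c: weights[pred_labels == c].sum() for c in classes}
    return max(scores, key=scores.get)

def weighted_average(pred_probs, weights):
    """pred_probs is a (learners x classes) array of per-learner class probabilities."""
    return np.average(np.asarray(pred_probs), axis=0, weights=np.asarray(weights, dtype=float))
```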
The learning method is a stronger fusion strategy: instead of processing the predictions of the individual learners directly with a voting or averaging method, the outputs of the individual learners are used as inputs and training is continued with another, stronger learner. Therefore, in this embodiment, the learning method is adopted as the fusion strategy of the Stacking integration algorithm, so as to improve its stability and generalization ability.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. The high-dimensional unbalanced data classification method based on feature learning and ensemble learning is characterized by comprising the following steps of:
s1: carrying out equalization processing on the original data set by using a mixed sampling algorithm to obtain an equalized data set;
the original data set is a set of a plurality of high-dimensional unbalanced data;
s2: carrying out feature selection on the equalized data set by using a feature selection algorithm to obtain an optimal feature set;
s3: and classifying the optimal feature set by using the trained Stacking ensemble learning model.
2. The feature learning and ensemble learning based high-dimensional imbalance data classification method according to claim 1, wherein the S1 includes the following sub-steps:
s11: removing noise samples in the original data set by using a NENN algorithm to obtain a noiseless data set;
s12: oversampling boundary samples in the noiseless data set to obtain an oversampled data set;
s13: under-sampling a plurality of types of samples in the noiseless data set to obtain an under-sampled data set;
s14: and merging the over-sampling data set and the under-sampling data set to obtain the equalized data set.
3. The feature learning and ensemble learning based high-dimensional imbalance data classification method according to claim 2, wherein the S12 includes the following sub-steps:
s121: acquiring a majority sample set A and a minority sample set B in the boundary samples;
s122: acquiring a majority class sample set A1 and a minority class sample set B1 in the minority class sample set B;
s123: merging the majority sample set A1 and the majority sample set A to obtain a majority sample boundary set;
s124: acquiring a few class samples B2 in a majority class sample boundary set;
s125: merging the minority sample set B1 and the minority sample set B2 to obtain a minority sample boundary set;
s126: randomly selecting two samples from the minority sample boundary set, and generating a new minority sample according to the following formula;
X_new = x_1 + random(0,1) * (y_i - x_2), i = 1, 2, ..., N;
where X_new denotes the newly generated minority-class sample, x_1 and x_2 denote the two samples randomly selected from the minority class, y_i denotes the i-th same-class sample randomly selected from the k same-class nearest neighbours according to the sampling rate, and N is a natural number;
s127: repeating the step S126 n times to obtain the oversampled data set; n is a natural number.
4. The feature learning and ensemble learning based high-dimensional imbalance data classification method according to claim 2, wherein the S13 includes the following sub-steps:
s131: dividing most samples in the noiseless data set into K samples by using a K-means clustering algorithm, and obtaining the sample center of each sample;
s132: calculating the average sample number of the K samples and the average distance between the center of each sample and all samples;
s133: calculating the distance between the center of the sample and the sample in each class;
s134: and if the distance between the center of the sample and the sample exceeds the average distance and the number of samples in the class is less than the average number of samples, rejecting the sample to obtain the undersampled data set.
5. The high-dimensional imbalanced data classification method based on feature learning and ensemble learning according to claim 1, wherein S2 comprises the following sub-steps:
S21: removing irrelevant and redundant features from the balanced data set to obtain a candidate feature subset;
S22: measuring the classification accuracy contributed by the features in the candidate feature subset with a C4.5 classifier, and removing the features that reduce accuracy to obtain the optimal feature subset.
6. The high-dimensional imbalanced data classification method based on feature learning and ensemble learning according to claim 5, wherein S21 comprises the following sub-steps:
S211: calculating the MIC value between each feature in the balanced data set and the class label;
S212: removing from the balanced data set the features whose MIC values are smaller than the MIC threshold to obtain a feature subset;
S213: calculating the SU value and the Pearson value between each feature and the remaining features in the feature subset;
S214: calculating the average value for the corresponding feature from the SU value and the Pearson value;
S215: removing from the feature subset the features whose average values are higher than the threshold average to obtain the candidate feature subset.
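A sketch of the claim-6 relevance/redundancy filter. Here mutual_info_classif is used only as a stand-in for the MIC score (the real MIC would typically come from a dedicated library), symmetrical uncertainty is computed from binned features, and both thresholds are illustrative assumptions.

```python
# Sketch of S211-S215: relevance filter (MIC stand-in) plus redundancy filter
# (average of symmetrical uncertainty and absolute Pearson correlation).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def symmetrical_uncertainty(a, b, bins=10):
    a_d = np.digitize(a, np.histogram_bin_edges(a, bins))
    b_d = np.digitize(b, np.histogram_bin_edges(b, bins))
    mi = mutual_info_score(a_d, b_d)
    h_a = mutual_info_score(a_d, a_d)   # entropy of a (MI of a variable with itself)
    h_b = mutual_info_score(b_d, b_d)   # entropy of b
    return 2.0 * mi / (h_a + h_b) if (h_a + h_b) > 0 else 0.0

def filter_features(X, y, relevance_thr=0.05, redundancy_thr=0.6):
    # S211-S212: keep features whose relevance to the class exceeds the threshold
    relevance = mutual_info_classif(X, y, random_state=0)     # MIC stand-in
    kept = [j for j in range(X.shape[1]) if relevance[j] >= relevance_thr]
    # S213-S215: drop features whose average SU/Pearson redundancy is too high
    candidate = []
    for j in kept:
        others = [k for k in kept if k != j]
        su = np.mean([symmetrical_uncertainty(X[:, j], X[:, k]) for k in others])
        pearson = np.mean([abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) for k in others])
        if (su + pearson) / 2.0 <= redundancy_thr:
            candidate.append(j)
    return candidate
```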
7. The high-dimensional imbalanced data classification method based on feature learning and ensemble learning according to claim 5, wherein S22 comprises the following sub-steps:
S221: obtaining a complementarity value for each feature in the candidate feature subset;
S222: arranging the features in the candidate feature subset in descending order of their complementarity values;
S223: selecting the features one by one in that order and measuring the accuracy with the C4.5 algorithm;
S224: if a feature increases the accuracy or leaves it unchanged, retaining the feature; otherwise, removing the feature from the candidate feature subset, thereby obtaining the optimal feature subset.
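A sketch of the claim-7 wrapper stage. scikit-learn's DecisionTreeClassifier with the entropy criterion is used as a stand-in for C4.5, cross-validated accuracy stands in for the accuracy measurement, and the complementarity scores are taken as a given input (the claim does not define their computation here); the keep-if-accuracy-does-not-drop rule of S224 is followed.

```python
# Sketch of S221-S224: complementarity-ordered forward selection with a
# decision tree (stand-in for C4.5) as the accuracy probe.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def wrapper_select(X, y, candidate_idx, complementarity, cv=5):
    # S222: sort candidate features by complementarity, descending
    order = [candidate_idx[i] for i in np.argsort(complementarity)[::-1]]
    selected, best_acc = [], 0.0
    clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
    for j in order:                                   # S223: evaluate one feature at a time
        trial = selected + [j]
        acc = cross_val_score(clf, X[:, trial], y, cv=cv).mean()
        if acc >= best_acc:                           # S224: keep if accuracy does not drop
            selected, best_acc = trial, acc
    return selected
```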
8. The high-dimensional imbalanced data classification method based on feature learning and ensemble learning according to claim 5, wherein the Stacking ensemble learning model comprises a base-model layer and a meta-model layer;
the base-model layer is used for pre-classifying the optimal feature set to obtain predicted classification results;
and the meta-model layer is used for obtaining the actual classification result from the predicted classification results.
9. The high-dimensional imbalanced data classification method based on feature learning and ensemble learning according to claim 8, wherein the base-model layer comprises a support vector machine, a decision tree, a random forest and an adaptive boosting (AdaBoost) model.
10. The high-dimensional imbalanced data classification method based on feature learning and ensemble learning according to claim 8, wherein the meta-model layer comprises an extreme gradient boosting tree (XGBoost).
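A sketch of the Stacking structure of claims 8-10 with scikit-learn and XGBoost: SVM, decision tree, random forest and AdaBoost form the base-model layer, and an XGBoost classifier serves as the meta-model layer. The hyper-parameters and the use of predicted probabilities as the meta-features are illustrative assumptions.

```python
# Sketch of claims 8-10: Stacking with SVM/DT/RF/AdaBoost base models and an
# XGBoost meta model.
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

def build_stacking_model():
    base_models = [
        ("svm", SVC(probability=True)),
        ("dt", DecisionTreeClassifier()),
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("ada", AdaBoostClassifier()),
    ]
    meta_model = XGBClassifier(eval_metric="logloss")
    return StackingClassifier(estimators=base_models,
                              final_estimator=meta_model,
                              stack_method="predict_proba", cv=5)

# Usage: model = build_stacking_model(); model.fit(X_train, y_train)
```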
CN202110615623.XA 2021-06-02 2021-06-02 High-dimensional unbalanced data classification method based on feature learning and ensemble learning Pending CN113344075A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110615623.XA CN113344075A (en) 2021-06-02 2021-06-02 High-dimensional unbalanced data classification method based on feature learning and ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110615623.XA CN113344075A (en) 2021-06-02 2021-06-02 High-dimensional unbalanced data classification method based on feature learning and ensemble learning

Publications (1)

Publication Number Publication Date
CN113344075A true CN113344075A (en) 2021-09-03

Family

ID=77475219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110615623.XA Pending CN113344075A (en) 2021-06-02 2021-06-02 High-dimensional unbalanced data classification method based on feature learning and ensemble learning

Country Status (1)

Country Link
CN (1) CN113344075A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114400026A (en) * 2022-01-30 2022-04-26 燕山大学 Parkinson disease patient UPDRS score prediction method based on voice feature selection
CN114612255B (en) * 2022-04-08 2023-11-07 湖南提奥医疗科技有限公司 Insurance pricing method based on electronic medical record data feature selection
CN114612255A (en) * 2022-04-08 2022-06-10 重庆邮电大学 Insurance pricing method based on electronic medical record data feature selection
CN114912372A (en) * 2022-06-17 2022-08-16 山东黄金矿业科技有限公司充填工程实验室分公司 High-precision filling pipeline fault early warning method based on artificial intelligence algorithm
CN114912372B (en) * 2022-06-17 2024-01-26 山东黄金矿业科技有限公司充填工程实验室分公司 High-precision filling pipeline fault early warning method based on artificial intelligence algorithm
CN115174170A (en) * 2022-06-23 2022-10-11 东北电力大学 VPN encrypted flow identification method based on ensemble learning
CN115174170B (en) * 2022-06-23 2023-05-09 东北电力大学 VPN encryption flow identification method based on ensemble learning
CN115331752A (en) * 2022-07-22 2022-11-11 中国地质大学(北京) Method capable of adaptively predicting quartz forming environment
CN115331752B (en) * 2022-07-22 2024-03-05 中国地质大学(北京) Method capable of adaptively predicting quartz forming environment
CN115543638B (en) * 2022-12-01 2023-03-14 中南大学 Uncertainty-based edge calculation data collection and analysis method, system and equipment
CN115543638A (en) * 2022-12-01 2022-12-30 中南大学 Uncertainty-based edge calculation data collection and analysis method, system and equipment
CN117556233A (en) * 2023-12-29 2024-02-13 巢湖学院 Feature selection system and method based on unbalanced data environment
CN117556233B (en) * 2023-12-29 2024-03-26 巢湖学院 Feature selection system and method based on unbalanced data environment
CN117574329A (en) * 2024-01-15 2024-02-20 南京信息工程大学 Nitrogen dioxide refined space distribution method based on ensemble learning
CN117574329B (en) * 2024-01-15 2024-04-30 南京信息工程大学 Nitrogen dioxide refined space distribution method based on ensemble learning

Similar Documents

Publication Publication Date Title
CN113344075A (en) High-dimensional unbalanced data classification method based on feature learning and ensemble learning
Isa et al. Using the self organizing map for clustering of text documents
Anyanwu et al. Comparative analysis of serial decision tree classification algorithms
Apté et al. Data mining with decision trees and decision rules
CN111524606A (en) Tumor data statistical method based on random forest algorithm
CN103020643B (en) Classification method based on kernel feature extraction early prediction multivariate time series category
Qureshi et al. Adaptive discriminant wavelet packet transform and local binary patterns for meningioma subtype classification
CN112001788B (en) Credit card illegal fraud identification method based on RF-DBSCAN algorithm
CN108280236A (en) A kind of random forest visualization data analysing method based on LargeVis
CN112115992A (en) Data resampling method based on clustering oversampling and example hardness threshold
US6563952B1 (en) Method and apparatus for classification of high dimensional data
WO2024131524A1 (en) Depression diet management method based on food image segmentation
CN113256409A (en) Bank retail customer attrition prediction method based on machine learning
Çetin et al. A comprehensive review on data preprocessing techniques in data analysis
CN111275206A (en) Integrated learning method based on heuristic sampling
Barandela et al. Restricted decontamination for the imbalanced training sample problem
CN113569920A (en) Second neighbor anomaly detection method based on automatic coding
Sari et al. Performance Analysis of Resampling and Ensemble Learning Methods on Diabetes Detection as Imbalanced Dataset
Sivaselvan Data mining: Techniques and trends
CN115859115A (en) Intelligent resampling technology based on Gaussian distribution
CN113159132A (en) Hypertension grading method based on multi-model fusion
Abugessaisa Knowledge discovery in road accidents database-integration of visual and automatic data mining methods
Himmelspach Fuzzy clustering of incomplete data
Xiang et al. An Improved SMOTE Algorithm Using Clustering
Mozharovskyi Anomaly detection using data depth: multivariate case

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210903)