CN111400180B - Software defect prediction method based on feature set division and ensemble learning - Google Patents

Software defect prediction method based on feature set division and ensemble learning

Info

Publication number
CN111400180B
Authority
CN
China
Prior art keywords
data set
feature
defect prediction
population
software defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010177397.7A
Other languages
Chinese (zh)
Other versions
CN111400180A (en)
Inventor
李璐璐
任洪敏
朱云龙
卢晓喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202010177397.7A
Publication of CN111400180A
Application granted
Publication of CN111400180B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a software defect prediction method based on feature set division and ensemble learning. The method divides an original data set into a training data set and a test data set, and divides the training data set into several feature subsets; selects K base classifiers for ensemble learning and synthesizes an ensemble classifier for each feature subset from the base classifiers and their corresponding weights; selects the feature subset most similar to an input instance and uses the ensemble classifier of that subset to perform defect prediction on the instance, thereby establishing a software defect prediction model; divides the test data set and finds the feature subset most similar to the input instance; and searches for the optimal values of the centroid set and the weight set, optimizing the software defect prediction model in combination with the most similar feature subset of the test data set. Advantages: the method removes redundant features from the defect prediction data set, reduces the search space of the algorithm, and effectively alleviates the high dimensionality of the software defect historical data features.

Description

Software defect prediction method based on feature set division and ensemble learning
Technical Field
The invention relates to the technical field of software defect prediction, in particular to a software defect prediction method based on feature set division and ensemble learning.
Background
The purpose of software defect prediction is to use related techniques to classify a software module as defective or non-defective based on historical software defect information; software defect prediction is therefore essentially a binary classification problem. Defective modules can be effectively identified through defect prediction, reducing the risks and hazards caused by software defects. Many machine learning algorithms have already been used to build prediction models. For example, the classification rules generated by the C4.5 decision tree algorithm are easy to understand and fast to learn, so C4.5 is often used as a baseline algorithm for model building; models built with the naive Bayes algorithm are insensitive to class imbalance and show excellent prediction performance; other machine learning algorithms such as neural networks and support vector machines have also been used to construct software defect prediction models and have achieved good prediction performance in specific application domains.
However, in the real world, a software defect prediction model is affected by many adverse factors that reduce its prediction accuracy or stability. The two most influential factors are the class imbalance problem in the data set and the high dimensionality of the features in the data set.
The class imbalance problem means that, in many software defect data sets, the number of samples of non-defective modules is much larger than the number of samples of defective modules. In the actual modeling process a conventional classifier tends to be biased toward the non-defective modules, so that it produces poor classification results for the defective modules.
Researchers in China and abroad have proposed various methods for handling the data imbalance problem, which can be broadly divided into data-level and algorithm-level approaches. Data-level approaches start at the data preparation stage and mainly use various sampling methods to adjust the originally imbalanced defect data into balanced data. Algorithm-level approaches mainly include cost-sensitive learning, classification-threshold moving, and ensemble learning. Although data-level methods can reduce the imbalance problem to a certain extent, the data set has to be preprocessed before modeling, which increases the computation and time cost of defect prediction.
Besides the data imbalance problem, the high dimensionality of the features in the defect data set is another important factor affecting the complexity of defect prediction. In defect prediction, the metric vector formed by the software metric elements of the defect data set is used as the input. Because the data scale of such data sets is huge, the original feature space corresponding to a data set usually has high dimensionality and contains a large amount of redundant data, which greatly increases the difficulty of software defect prediction.
Disclosure of Invention
The invention aims to provide a software defect prediction method based on feature set division and ensemble learning. On one hand, by dividing the data set into feature subsets, the method removes redundant features from the defect prediction data set, reduces the search space of the algorithm, and effectively alleviates the high dimensionality of the software defect historical data features; on the other hand, by integrating the classification results of different base classifiers through ensemble learning, the method effectively alleviates the low prediction accuracy for defective modules caused by class imbalance in the data set.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a software defect prediction method based on feature set division and ensemble learning comprises the following steps:
s1, acquiring an original data set from historical software data, and dividing the original data set into a training data set and a testing data set;
S2, dividing the training data set into h mutually exclusive feature subsets, wherein each feature subset is represented by a centroid C_h, and the set of the centroids of all feature subsets in the training data set is the centroid set C;
S3, selecting K base classifiers for ensemble learning, wherein for an input instance x each base classifier classifies x into class y_i according to its own discriminant function F_k(y_i, x), a weight w_k(y_i) is used to weight the discriminant function F_k(y_i, x) of the k-th base classifier, and the set of all weights w_k(y_i) in the training data set is the weight set W, where y_i is a class label;
s4, fusing the K base classifiers according to the K selected base classifiers and the weights corresponding to the K base classifiers respectively, and synthesizing an integrated classifier for each feature subset respectively;
s5, selecting a feature subset most similar to the input instance x, selecting an integrated classifier corresponding to the feature subset to carry out defect prediction on the input instance x, and establishing a software defect prediction model;
s6, dividing the test data set, repeating the operation from the step S3 to the step S4, and selecting a feature subset which is most similar to the input example x in the test data set;
S7, searching for the optimal values of the centroid set C used in dividing the training data set into feature subsets and of the weight set W used in ensemble learning, and optimizing the software defect prediction model in combination with the feature subset of the test data set obtained in step S6.
Preferably, in step S1, a ten-fold cross-validation method is adopted to divide the original data set into the training data set and the test data set.
Preferably, in step S2, the training data set is vertically divided into h mutually exclusive feature subsets; each divided feature subset has the same number of samples as the training data set and contains a subset of the features of the original data set.
Preferably, in step S4, K base classifiers are integrated and fused by a weighted fusion method, where the integration rule is:
F(y_i, x) = Σ_{k=1}^{K} w_k(y_i) · F_k(y_i, x);
M_1, M_2, ..., M_h denote the integrated classifiers corresponding to the h feature subsets, and the classification decision rule of the integrated classifier M_h is: if
F(y_i, x) = max_{y_j ∈ Y} F(y_j, x),
then M_h = y_i, where Y is the set of class labels y_i.
Preferably, in step S5, the feature subset most similar to the input instance x and the corresponding ensemble classifier are selected according to a distance metric.
Preferably, the index minH of the feature subset most similar to the input instance x is returned, the integrated classifier corresponding to that feature subset is selected to perform software module defect prediction on the input instance x, and the software defect prediction model is established.
Preferably, in step S7, a genetic algorithm is used to search for optimal values in the centroid set C and the weight set W, and the software defect prediction model is optimized in combination with the feature subset in the test data set obtained in step S6.
Preferably, the step S7 specifically includes:
t1, setting various parameters in a genetic algorithm;
t2, coding chromosomes, wherein each chromosome individual consists of a centroid set C and a weight set W of the feature subset in the training data set, and binary coding is simultaneously carried out on the centroid set C and the weight set W;
t3, generating initial individuals according to the test data set, and randomly initializing the values of the initial individuals to generate an initial population;
t4, performing corresponding binary decoding operation according to the coding mode to obtain a feature subset division mode and a weight distribution mode of the base classifier;
t5, calculating the fitness value of the population individual, namely performing defect prediction by using a software defect prediction model according to the parameter value obtained by decoding, and obtaining the fitness value of the population individual according to a defect prediction result and a fitness function;
T6, performing the selection operation according to the fitness values of the population individuals: using a tournament selection strategy, a certain number of individuals are drawn from the population each time, the best of them is added to the new population, and this is repeated until the new population reaches the set population size;
T7, performing the crossover operation: the selected new population individuals are randomly paired, two crossover points are randomly determined in each individual code string using a two-point crossover strategy, and partial gene exchange is then performed to form two new individuals that are added to the offspring population;
T8, performing the mutation operation: a mutation point is determined for each individual in the population based on the mutation probability, the mutation operation is performed, and the new individual obtained by mutation is added to the offspring population;
T9, performing on the mutated population individuals the same decoding operation as in step T4 to obtain the corresponding parameter values, performing defect prediction according to the defect prediction methods of step T1 and step T2, and calculating the fitness values of the mutated population individuals after the prediction results are obtained;
t10, judging whether iteration is terminated, if the fitness value of the population individual is not improved any more or reaches the maximum iteration times after multiple iterations, stopping the algorithm, and outputting the individual with the maximum fitness obtained in the evolution process as an optimal solution; otherwise, the genetic operation of the step T6 to the step T9 is repeatedly executed, and the individuals are continuously updated to obtain a new population until the termination condition is met.
Preferably, the various parameters in step T1 include: population size, crossover probability, mutation probability, and the maximum number of iterations.
Preferably, the step T2 is specifically:
setting the total coding length of an initial chromosome as a + b;
the front a bits of the code represent the division of the feature subsets: every 3 binary bits correspond to the index of the feature subset in which one metric element feature of the training data set is located, with index values in the range (0, h); with m metric element features in the training data set, the total length of the front part of the code is a = 3m bits;
the last b bits of the code represent the assignment of the base classifier weights: every 4 binary bits represent the weight w_k(y_i) with which the discriminant function of one base classifier in a feature subset classifies the input instance x into class y_i; each feature subset therefore corresponds to a weight set W_h coded with 4 × K × 2 bits, and the total number of code bits allocated to all feature subset weights is b = 4 × K × 2 × h bits.
Compared with the prior art, the invention has the following advantages:
the software defect prediction method based on feature set division and ensemble learning fully utilizes the maximum correlation between different feature sets and the optimal classifier combination, optimizes the division of the feature sets and the weight distribution of the base classifiers in the discriminant function of the collective decision method at the same time, dynamically adjusts the data set division and the classifier weights, and can fully exert the local classification capability of the given base classifier; in addition, the method uses a genetic algorithm to search for an optimal solution, so that the method has good global search capability, can quickly search out the whole solution in a solution space, and cannot get into a quick descending trap of a local optimal solution; in addition, when the method is used for processing unbalanced data, a data-level method is not needed to modify a training set, and the extra cost of the algorithm in the aspect of time is greatly reduced.
Drawings
FIG. 1 is a frame diagram of a software defect prediction method based on feature set partitioning and ensemble learning according to the present invention;
FIG. 2 is a schematic process diagram of a software defect prediction method based on feature set partitioning and ensemble learning according to the present invention;
FIG. 3 is a schematic diagram of the vertical partitioning of the training data set according to the present invention;
FIG. 4 is a flow chart of a genetic algorithm customized to the software defect prediction problem of the present invention;
FIG. 5 is a schematic diagram of the total length of the coding sequence and the bit allocation of the genetic algorithm chromosome.
Detailed Description
The present invention will now be further described by way of the following detailed description of a preferred embodiment thereof, taken in conjunction with the accompanying drawings.
As shown in fig. 1 and fig. 2, a frame diagram and a process diagram of a software defect prediction method based on feature set partitioning and ensemble learning according to the present invention are shown, and the method includes:
s1, acquiring a software defect sample original data set D from historical software data, and dividing the original data set D into a training data set (TS) and a testing data set (VS).
The original data set D = {(x_1, y_1), ..., (x_n, y_n)} is a set of samples of n software modules, where x_n is the metric-attribute vector of software module n, each vector containing m metric attributes (also called metric elements), i.e. x_n = (a_1, ..., a_m); y_n ∈ Y denotes the class label of the n-th software module. In the invention a software module has only two classes, i.e. Y = {y_1, y_2}, where y_1 denotes the defective class and y_2 denotes the non-defective class; therefore y_n = y_1 or y_2.
In this embodiment, a ten-fold cross-validation method is adopted to divide the original data set into the training data set and the testing data set, which are used for training and testing, respectively. The ten-fold cross validation is to divide the original data set into ten parts at random, nine parts of the original data set are taken as a training data set each time, the rest part of the original data set is taken as a test data set, and the process is repeated for 10 times to ensure that each part of data is used as the test data set at least once.
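Purely as an illustration, the sketch below shows this ten-fold split using scikit-learn's KFold; the arrays X and y and the fold loop are placeholders and not part of the patented method.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: n samples, m metric attributes, binary defect labels.
X = np.random.rand(200, 21)
y = np.random.randint(0, 2, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]   # nine parts: training data set TS
    X_test, y_test = X[test_idx], y[test_idx]       # remaining part: test data set VS
    # ... build and evaluate the defect prediction model on this fold
```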
By dividing the feature set of the sample data set, the feature space complexity of the samples can be reduced as much as possible while ensuring that the accuracy of the classification algorithm does not decrease, or decreases only minimally; the optimal feature subsets are selected, and the generalization ability of the model and the efficiency of the algorithm are improved.
S2, vertically dividing the training data set into h mutually exclusive feature subsets, each of which is represented by its centroid C_h; the set of the centroids of all feature subsets in the training data set is the centroid set C.
Each divided feature subset has the same number of samples as the training data set and contains a subset of the features of the original data set. As shown in FIG. 3, the training data set is divided vertically into h mutually exclusive feature subsets, each feature subset being represented by its centroid C_h, and the centroids of all feature subsets constitute the centroid set C = {C_1, ..., C_h}, where
C_h = (c_1, c_2, ..., c_m), with c_j = (1/n) · Σ_{i=1}^{n} a_ij,
and m represents the dimension of the feature subset, i.e. the number of its metric attributes. The centroid is defined by analogy with the physical centroid: the mean value of the sample features is taken as the centroid.
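A minimal sketch of this vertical (column-wise) partition and centroid computation follows, assuming the partition is given by a per-column subset index (drawn at random here; in the method it is the quantity the genetic algorithm later optimizes). All names are illustrative.

```python
import numpy as np

def vertical_partition(X, subset_index, h):
    """Split the columns (metric elements) of X into h mutually exclusive feature
    subsets according to subset_index[j] in {0, ..., h-1}, and compute the centroid
    of each subset as the mean of the sample values over its columns."""
    subsets = [np.where(subset_index == s)[0] for s in range(h)]   # column indices per subset
    centroids = [X[:, cols].mean(axis=0) for cols in subsets]      # C_1, ..., C_h
    return subsets, centroids

# Illustrative use: 21 metric columns assigned at random to h = 4 subsets.
X = np.random.rand(200, 21)
subset_index = np.random.randint(0, 4, size=21)
subsets, C = vertical_partition(X, subset_index, h=4)
```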
In general, ensemble learning is one of the important technical means for addressing the class imbalance problem and has been very successful in processing imbalanced data sets; ensemble learning can obtain better classification performance and generalization ability than a single classifier, and overfitting is less likely to occur. The invention therefore adopts ensemble learning to handle the imbalance problem.
S3, selecting K base classifiers for ensemble learning and setting the weight distribution corresponding to the base classifiers. To obtain better ensemble performance, the K base classifiers should preferably be as different from each other as possible, since the diversity of the base classifiers improves the classification accuracy.
For an input instance x (x may be arbitrary), each base classifier classifies x into class y_i (defective or non-defective) according to its own discriminant function F_k(y_i, x). Within each feature subset, a weight w_k(y_i) is applied to the discriminant function F_k(y_i, x) of the k-th base classifier, so the weight assignment W_h corresponding to the h-th feature subset is represented as
W_h = [[w_1(y_1), w_1(y_2)], ..., [w_K(y_1), w_K(y_2)]]^T   (1)
and satisfies
Σ_{k=1}^{K} w_k(y_i) = 1,   0 ≤ w_k(y_i) ≤ 1,
for every y_i ∈ Y. The h weight assignments in the training data set form the weight set W = {W_1, ..., W_h}.
And S4, fusing the K base classifiers according to the selected K base classifiers and the weights corresponding to the K base classifiers respectively, and synthesizing an integrated classifier for each feature subset respectively.
In this embodiment, according to the discriminant functions and the weight distribution in step S3, the K base classifiers are fused within each feature subset by a weighted fusion method, and the integrated classifier of each feature subset is synthesized, where the integration rule is:
F(y_i, x) = Σ_{k=1}^{K} w_k(y_i) · F_k(y_i, x).
M_1, M_2, ..., M_h denote the integrated classifiers corresponding to the h feature subsets, and the classification decision rule of the integrated classifier M_h is: if
F(y_i, x) = max_{y_j ∈ Y} F(y_j, x),
then M_h = y_i, where Y is the set of class labels y_i.
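The weighted fusion rule and arg-max decision above can be sketched as follows; it assumes that each base classifier exposes a per-class score (e.g. the predict_proba of a fitted scikit-learn classifier) playing the role of F_k(y_i, x), and that the weights are laid out as in W_h. This is an illustrative reading, not the patent's reference implementation.

```python
import numpy as np

def ensemble_predict(x_cols, base_classifiers, weights):
    """Weighted fusion inside one feature subset.
    base_classifiers: list of K fitted classifiers trained on the subset's columns.
    weights: array of shape (K, 2) with weights[k, i] = w_k(y_i).
    Returns the class index i that maximizes the fused score F(y_i, x)."""
    fused = np.zeros(2)
    for k, clf in enumerate(base_classifiers):
        scores = clf.predict_proba(x_cols.reshape(1, -1))[0]   # F_k(y_i, x) for both classes
        fused += weights[k] * scores                           # sum_k w_k(y_i) * F_k(y_i, x)
    return int(np.argmax(fused))                               # M_h outputs the arg-max class
```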
S5, selecting a feature subset which is most similar to the input instance x, selecting an integrated classifier corresponding to the feature subset to carry out defect prediction on the input instance x, and establishing a software defect prediction model.
In this embodiment, the feature subset most similar to the input instance x and its corresponding integrated classifier are selected according to a distance metric. The centroid distance represents the degree of separation between the input instance x and a feature subset: the smaller the distance, the more similar the samples. The feature subset is selected by measuring the Euclidean distance between the input instance x and each feature subset.
As shown in formula (3) below, the index minH of the feature subset most similar to (i.e. at the smallest distance from) the input instance x is returned, the integrated classifier corresponding to that feature subset is selected to perform software module defect prediction on the input instance x, and the software defect prediction model is established.
minH = arg min_{1 ≤ H ≤ h} d(x, C_H)   (3)
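A minimal sketch of the selection by Euclidean distance in formula (3); it assumes the distance is computed between the instance restricted to a subset's columns and that subset's centroid, and reuses the subsets/centroids structure from the partition sketch above.

```python
import numpy as np

def most_similar_subset(x, subsets, centroids):
    """Return the index minH of the feature subset whose centroid is closest
    (in Euclidean distance) to the input instance x on that subset's columns."""
    distances = [np.linalg.norm(x[cols] - c) for cols, c in zip(subsets, centroids)]
    return int(np.argmin(distances))   # minH; its integrated classifier M_minH does the prediction
```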
S6, dividing the test data set, repeating the operations of steps S3 to S4, and selecting the feature subset in the test data set most similar to the input instance x. In this embodiment, the test data set is divided and the most similar feature subset is selected in the same way as for the training data set.
S7, using a genetic algorithm to search for the optimal values of the centroid set C used in dividing the training data set into feature subsets and of the weight set W used in ensemble learning, and optimizing the software defect prediction model in combination with the feature subset of the test data set obtained in step S6.
After the software defect prediction model has been established, it can be optimized by finding the optimal parameters of the model, so as to minimize the prediction error of the model. In the invention, a genetic algorithm is used to search simultaneously for the optimal values of the centroid set C and the weight set W. The prediction error is calculated using the standard mean squared error, so the objective function and fitness function of the genetic algorithm are:
f = (1/n) · Σ (y_n − ŷ_n)², summed over the n samples,
where y_n is the class label of the n-th sample and ŷ_n is its predicted label; minimizing the value of this objective function is the final optimization goal.
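A sketch of the fitness evaluation follows, assuming the fitness is derived from the mean squared error over the test instances as stated above; predict_one stands for the prediction model built in steps S2–S5, and the 1/(1 + MSE) transformation (turning "smaller error" into "larger fitness") is an assumption, not taken from the patent.

```python
import numpy as np

def fitness(individual, X_test, y_test, predict_one):
    """Predict every test instance with the model induced by this chromosome
    and score the chromosome by mean squared error."""
    preds = np.array([predict_one(individual, x) for x in X_test])
    mse = np.mean((y_test - preds) ** 2)   # objective function to minimize
    return 1.0 / (1.0 + mse)               # assumed monotone mapping: lower MSE -> higher fitness
```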
As shown in fig. 4, the step S7 specifically includes:
T1, setting the various parameters in the genetic algorithm. In this embodiment, these parameters include: population size, crossover probability, mutation probability, the maximum number of iterations, and so on.
And T2, coding the chromosome. Each chromosome individual is composed of a centroid set C and a weight set W of the feature subset in the training data set, and binary coding is simultaneously carried out on the centroid set C and the weight set W.
As shown in fig. 5, step T2 is specifically as follows: (1) the total coding length of an initial chromosome is set to a + b; (2) the front a bits of the code represent the division of the feature subsets: every 3 binary bits correspond to the index of the feature subset in which one metric element feature (i.e. metric attribute) of the training data set is located, with index values in the range (0, h); with m metric element features in the training data set, the total length of the front part of the code is a = 3m bits; (3) the last b bits of the code represent the assignment of the base classifier weights: every 4 binary bits represent the weight w_k(y_i) with which the discriminant function of one base classifier in a feature subset classifies the input instance x into class y_i; each feature subset therefore corresponds to a weight set W_h coded with 4 × K × 2 bits, and the total number of code bits allocated to all feature subset weights is b = 4 × K × 2 × h bits.
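The bit layout of step T2 can be sketched as below: a = 3m bits for the feature-subset indices and b = 4 × K × 2 × h bits for the weights. The random initialization corresponds to step T3, and the helper name is illustrative.

```python
import numpy as np

def random_chromosome(m, K, h, rng=np.random.default_rng(0)):
    """Binary chromosome of total length a + b, with a = 3*m (3 bits per metric element
    encoding its feature-subset index) and b = 4*K*2*h (4 bits per weight w_k(y_i))."""
    a = 3 * m
    b = 4 * K * 2 * h
    return rng.integers(0, 2, size=a + b)
```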
And T3, generating initial individuals according to the test data set, and randomly initializing the values of the initial individuals to generate an initial population.
And T4, carrying out corresponding binary decoding operation according to the coding mode to obtain a feature subset division mode and a weight distribution mode of the base classifier.
Step T4 is specifically as follows: the feature subsets are divided according to the division scheme obtained by decoding, and each feature subset is represented by its corresponding centroid; the 4-bit binary number corresponding to each weight w_k(y_i) is converted into a decimal integer Q_k(y_i), and each weight w_k(y_i) is then calculated from the corresponding decimal integers as
w_k(y_i) = Q_k(y_i) / Σ_{j=1}^{K} Q_j(y_i).
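A sketch of the decoding of step T4 under the layout above; the modulo that keeps subset indices inside (0, h) and the normalization w_k(y_i) = Q_k(y_i) / Σ_j Q_j(y_i) are assumptions consistent with the stated index range and the sum-to-one constraint of step S3.

```python
import numpy as np

def decode(chrom, m, K, h):
    """Decode a chromosome into (subset index per metric element, weight tensor per subset)."""
    a = 3 * m
    # Front a bits: every 3 bits give the feature-subset index of one metric element.
    idx_bits = chrom[:a].reshape(m, 3)
    subset_index = idx_bits.dot([4, 2, 1]) % h                 # assumption: fold into 0..h-1

    # Back b bits: every 4 bits give an integer Q_k(y_i); normalize over k per class.
    w_bits = chrom[a:].reshape(h, K, 2, 4)
    Q = w_bits.dot([8, 4, 2, 1]).astype(float)                 # shape (h, K, 2)
    W = Q / np.maximum(Q.sum(axis=1, keepdims=True), 1e-9)     # assumption: w_k = Q_k / sum_j Q_j
    return subset_index, W
```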
And T5, calculating the fitness value of the population individual, namely performing defect prediction by using a software defect prediction model according to the parameter value obtained by decoding, and obtaining the fitness value of the population individual according to a defect prediction result and a fitness function.
And T6, carrying out selection operation according to the individual fitness value of the population, using a tournament selection strategy, taking out a certain number of individuals from the population each time, then selecting the best one of the individuals to add into a new population, and repeating the operation until the new population reaches the set population size.
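A minimal sketch of the tournament selection of step T6; the tournament size of 3 is an illustrative choice, not taken from the patent.

```python
import numpy as np

def tournament_select(population, fitnesses, new_size, tour_size=3,
                      rng=np.random.default_rng(0)):
    """Repeatedly draw tour_size individuals at random and copy the fittest of them
    into the new population until it reaches the set population size."""
    new_population = []
    while len(new_population) < new_size:
        contenders = rng.choice(len(population), size=tour_size, replace=False)
        best = max(contenders, key=lambda i: fitnesses[i])
        new_population.append(population[best].copy())
    return new_population
```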
T7, performing the crossover operation: the selected new population individuals are randomly paired, two crossover points are randomly determined in each individual code string using a two-point crossover strategy, and partial gene exchange is then performed to form two new individuals that are added to the offspring population.
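A sketch of the two-point crossover of step T7 on a pair of binary chromosomes; the random pairing and the crossover-probability test are left to the surrounding loop.

```python
import numpy as np

def two_point_crossover(parent1, parent2, rng=np.random.default_rng(0)):
    """Pick two cut points at random and swap the segment between them."""
    c1, c2 = sorted(rng.choice(len(parent1), size=2, replace=False))
    child1, child2 = parent1.copy(), parent2.copy()
    child1[c1:c2], child2[c1:c2] = parent2[c1:c2].copy(), parent1[c1:c2].copy()
    return child1, child2
```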
T8, performing the mutation operation: a mutation point is determined for each individual in the population based on the mutation probability, the mutation operation is performed, and the new individual obtained by mutation is added to the offspring population.
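A sketch of the mutation of step T8, written as the common independent bit-flip variant (each gene flips with the mutation probability); the patent text speaks of determining a mutation point per individual, so treat this variant as an assumption.

```python
import numpy as np

def mutate(chrom, p_mut=0.01, rng=np.random.default_rng(0)):
    """Flip each bit independently with probability p_mut and return the offspring."""
    child = chrom.copy()
    flip = rng.random(len(chrom)) < p_mut
    child[flip] ^= 1          # 0 <-> 1 at the mutation points
    return child
```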
T9, performing on the mutated population individuals the same decoding operation as in step T4 to obtain the corresponding parameter values, performing defect prediction according to the defect prediction methods of step T1 and step T2, and calculating the fitness values of the mutated population individuals after the prediction results are obtained;
t10, judging whether iteration is terminated, if the fitness value of the population individual is not improved any more or reaches the maximum iteration times after multiple iterations, stopping the algorithm, and outputting the individual with the maximum fitness obtained in the evolution process as an optimal solution; otherwise, the genetic operation of the step T6 to the step T9 is repeatedly executed, and the individuals are continuously updated to obtain new populations until the termination condition is met.
In summary, the software defect prediction method based on feature set division and ensemble learning of the present invention obtains an original data set from historical software data and divides it into a training data set and a test data set; vertically divides the training data set into several mutually exclusive feature subsets, each represented by its centroid; selects K base classifiers for ensemble learning, where for a given input instance each base classifier classifies the instance as defective or non-defective according to its discriminant function and each discriminant function is assigned a corresponding weight; fuses the base classifiers within each feature subset by a weighted method according to the selected base classifiers and their weights, synthesizing an integrated classifier for each feature subset; selects the feature subset most similar to the input instance according to a distance metric and uses the integrated classifier of that subset to predict defects for the instance, thereby establishing a software defect prediction model; divides the test data set and finds the feature subset most similar to the input instance; and uses a genetic algorithm to search simultaneously for the optimal values of the centroid set used in feature subset division and of the weight set used in ensemble learning, optimizing the software defect prediction model in combination with the most similar feature subset selected from the test data set, so as to minimize the prediction error of the defect prediction model. The method makes full use of the strongest association between different feature sets and the optimal classifier combination, simultaneously optimizes the division of the feature sets and the weight distribution of the base classifiers in the discriminant function of the collective decision method, can give full play to the local classification ability of the given base classifiers, improves the prediction accuracy of the software defect prediction model, and reduces the influence of the data imbalance problem and the high feature dimensionality problem on the performance of the prediction model in software defect prediction.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A software defect prediction method based on feature set division and ensemble learning is characterized by comprising the following steps:
s1, acquiring an original data set from historical software data, and dividing the original data set into a training data set and a testing data set;
S2, dividing the training data set into h mutually exclusive feature subsets, wherein each feature subset is represented by a centroid C_h, and the set of the centroids of all feature subsets in the training data set is the centroid set C;
S3, selecting K base classifiers for ensemble learning, wherein for an input instance x each base classifier classifies x into class y_i according to its own discriminant function F_k(y_i, x), a weight w_k(y_i) is used to weight the discriminant function F_k(y_i, x) of the k-th base classifier, and the set of all weights w_k(y_i) in the training data set is the weight set W, where y_i is a class label;
s4, fusing the K base classifiers according to the K selected base classifiers and the weights corresponding to the K base classifiers respectively, and synthesizing an integrated classifier for each feature subset respectively;
s5, selecting a feature subset which is most similar to the input instance x, selecting an integrated classifier corresponding to the feature subset to carry out defect prediction on the input instance x, and establishing a software defect prediction model;
s6, dividing the test data set, repeating the operation from the step S3 to the step S4, and selecting a feature subset which is most similar to the input example x in the test data set;
S7, searching for the optimal values of the centroid set C used in dividing the training data set into feature subsets and of the weight set W used in ensemble learning, and optimizing the software defect prediction model in combination with the feature subset of the test data set obtained in step S6.
2. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in the step S1, a ten-fold cross-validation method is adopted to divide the original data set into the training data set and the test data set.
3. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in step S2, the training data set is vertically divided into h mutually exclusive feature subsets; each divided feature subset has the same number of samples as the training data set and contains a subset of the features of the original data set.
4. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in the step S4, K base classifiers are integrated and fused by a weighted fusion method, and the integration rule is as follows:
F(y_i, x) = Σ_{k=1}^{K} w_k(y_i) · F_k(y_i, x);
M_1, M_2, ..., M_h denote the integrated classifiers corresponding to the h feature subsets, and the classification decision rule of the integrated classifier M_h is: if
F(y_i, x) = max_{y_j ∈ Y} F(y_j, x),
then M_h = y_i, where Y is the set of class labels y_i.
5. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in step S5, the feature subset most similar to the input instance x and the corresponding ensemble classifier are selected according to the distance metric.
6. The software defect prediction method based on feature set partitioning and ensemble learning of claim 5,
the index minH of the feature subset most similar to the input instance x is returned, the integrated classifier corresponding to that feature subset is selected to perform software module defect prediction on the input instance x, and the software defect prediction model is established.
7. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in the step S7, the optimal values in the centroid set C and the weight set W are searched by adopting a genetic algorithm, and a software defect prediction model is optimized by combining the characteristic subset in the test data set obtained in the step S6.
8. The software defect prediction method based on feature set partitioning and ensemble learning according to claim 7, wherein said step S7 specifically comprises:
t1, setting various parameters in a genetic algorithm;
t2, coding chromosomes, wherein each chromosome individual consists of a centroid set C and a weight set W of the feature subset in the training data set, and binary coding is simultaneously carried out on the centroid set C and the weight set W;
t3, generating an initial individual according to the test data set, and randomly initializing the value of the initial individual to generate an initial population;
t4, performing corresponding binary decoding operation according to the coding mode to obtain a feature subset division mode and a weight distribution mode of the base classifier;
t5, calculating the fitness value of the population individual, namely performing defect prediction by using a software defect prediction model according to the parameter value obtained by decoding, and obtaining the fitness value of the population individual according to a defect prediction result and a fitness function;
T6, performing the selection operation according to the fitness values of the population individuals: using a tournament selection strategy, a certain number of individuals are drawn from the population each time, the best of them is added to the new population, and this is repeated until the new population reaches the set population size;
T7, performing the crossover operation: the selected new population individuals are randomly paired, two crossover points are randomly determined in each individual code string using a two-point crossover strategy, and partial gene exchange is then performed to form two new individuals that are added to the offspring population;
T8, performing the mutation operation: a mutation point is determined for each individual in the population based on the mutation probability, the mutation operation is performed, and the new individual obtained by mutation is added to the offspring population;
T9, performing on the mutated population individuals the same decoding operation as in step T4 to obtain the corresponding parameter values, performing defect prediction according to the defect prediction methods of step T1 and step T2, and calculating the fitness values of the mutated population individuals after the prediction results are obtained;
t10, judging whether iteration is terminated, if the fitness value of the population individual is not improved any more or reaches the maximum iteration times after multiple iterations, stopping the algorithm, and outputting the individual with the maximum fitness obtained in the evolution process as an optimal solution; otherwise, the genetic operation of the step T6 to the step T9 is repeatedly executed, and the individuals are continuously updated to obtain a new population until the termination condition is met.
9. The software defect prediction method based on feature set partitioning and ensemble learning of claim 8,
the various parameters in step T1 include: population size, crossover probability, mutation probability, and the maximum number of iterations.
10. The software defect prediction method based on feature set partitioning and ensemble learning according to claim 8 or 9, wherein the step T2 specifically includes:
setting the total coding length of an initial chromosome as a + b;
the front a bits of the code represent the division of the feature subsets: every 3 binary bits correspond to the index of the feature subset in which one metric element feature is located in the training data set, with index values in the range (0, h); with m metric element features in the training data set, the total length of the front part of the code is a = 3m bits;
the last b bits of the code represent the assignment of the base classifier weights: every 4 binary bits represent the weight w_k(y_i) with which the discriminant function of one base classifier in a feature subset classifies the input instance x into class y_i; each feature subset therefore corresponds to a weight set W_h coded with 4 × K × 2 bits, and the total number of code bits allocated to all feature subset weights is b = 4 × K × 2 × h bits.
CN202010177397.7A 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning Active CN111400180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010177397.7A CN111400180B (en) 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010177397.7A CN111400180B (en) 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning

Publications (2)

Publication Number Publication Date
CN111400180A CN111400180A (en) 2020-07-10
CN111400180B true CN111400180B (en) 2023-03-10

Family

ID=71434785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010177397.7A Active CN111400180B (en) 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning

Country Status (1)

Country Link
CN (1) CN111400180B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131089B (en) * 2020-09-29 2022-08-23 九江学院 Software defect prediction method, classifier, computer device and storage medium
CN112269732B (en) * 2020-10-14 2024-01-05 北京轩宇信息技术有限公司 Software defect prediction feature selection method and device
CN112258251B (en) * 2020-11-18 2022-12-27 北京理工大学 Grey correlation-based integrated learning prediction method and system for electric vehicle battery replacement demand
CN112380132B (en) * 2020-11-20 2024-03-29 北京轩宇信息技术有限公司 Countermeasure verification method and device based on unbalanced defect dataset of spaceflight software
CN112990255B (en) * 2020-12-23 2024-05-28 中移(杭州)信息技术有限公司 Device failure prediction method, device, electronic device and storage medium
CN113326182B (en) * 2021-03-31 2022-09-02 南京邮电大学 Software defect prediction method based on sampling and ensemble learning
CN113204482B (en) * 2021-04-21 2022-09-13 武汉大学 Heterogeneous defect prediction method and system based on semantic attribute subset division and metric matching
CN113268434B (en) * 2021-07-08 2022-07-26 北京邮电大学 Software defect prediction method based on Bayes model and particle swarm optimization
CN113626315B (en) * 2021-07-27 2024-04-12 江苏大学 Double-integration software defect prediction method combined with neural network
CN113852204B (en) * 2021-10-13 2022-06-07 北京智盟信通科技有限公司 Transformer substation three-dimensional panoramic monitoring system and method based on digital twinning
CN114706780B (en) * 2022-04-13 2024-07-19 北京理工大学 Software defect prediction method based on Stacking integrated learning
CN115276666B (en) * 2022-09-28 2022-12-20 汉达科技发展集团有限公司 Efficient data transmission method for equipment training simulator
CN117472789B (en) * 2023-12-28 2024-03-12 成都工业学院 Software defect prediction model construction method and device based on ensemble learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810102A (en) * 2014-02-19 2014-05-21 北京理工大学 Method and system for predicting software defects
WO2017181286A1 (en) * 2016-04-22 2017-10-26 Lin Tan Method for determining defects and vulnerabilities in software code
CN109977028A (en) * 2019-04-08 2019-07-05 燕山大学 A kind of Software Defects Predict Methods based on genetic algorithm and random forest

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810102A (en) * 2014-02-19 2014-05-21 北京理工大学 Method and system for predicting software defects
WO2017181286A1 (en) * 2016-04-22 2017-10-26 Lin Tan Method for determining defects and vulnerabilities in software code
CN109977028A (en) * 2019-04-08 2019-07-05 燕山大学 A kind of Software Defects Predict Methods based on genetic algorithm and random forest

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Software defect prediction based on fuzzy clustering and non-negative matrix factorization; 常瑞花 et al.; 《宇航学报》 (Journal of Astronautics); 2011-09-30 (No. 09); full text *
Feature selection method based on cluster analysis in software defect prediction; 刘望舒 et al.; 《中国科学:信息科学》 (Scientia Sinica Informationis); 2016-09-20 (No. 09); full text *

Also Published As

Publication number Publication date
CN111400180A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111400180B (en) Software defect prediction method based on feature set division and ensemble learning
CN108228716B (en) SMOTE _ Bagging integrated sewage treatment fault diagnosis method based on weighted extreme learning machine
CN111626336B (en) Subway fault data classification method based on unbalanced data set
CN105844300A (en) Optimized classification method and optimized classification device based on random forest algorithm
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
CN110674846A (en) Genetic algorithm and k-means clustering-based unbalanced data set oversampling method
CN108345904A (en) A kind of Ensemble Learning Algorithms of the unbalanced data based on the sampling of random susceptibility
CN113344019A (en) K-means algorithm for improving decision value selection initial clustering center
CN113344113B (en) Yolov3 anchor frame determination method based on improved k-means clustering
CN111709460A (en) Mutual information characteristic selection method based on correlation coefficient
CN109390032A (en) A method of SNP relevant with disease is explored in the data of whole-genome association based on evolution algorithm and is combined
CN117407732A (en) Unconventional reservoir gas well yield prediction method based on antagonistic neural network
CN115437960A (en) Regression test case sequencing method, device, equipment and storage medium
CN112560900B (en) Multi-disease classifier design method for sample imbalance
CN114328221A (en) Cross-project software defect prediction method and system based on feature and instance migration
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
CN111488903A (en) Decision tree feature selection method based on feature weight
CN105912887B (en) A kind of modified gene expression programming-fuzzy C-mean algorithm crop data sorting technique
CN115017125B (en) Data processing method and device for improving KNN method
CN108921021A (en) A kind of entropy-discriminate integrated model of multi-angle of view
CN113011589B (en) Co-evolution-based hyperspectral image band selection method and system
Banerjee Robust Density-Based Data Clustering Using a Quantum-Inspired Genetic Algorithm
CN117852711B (en) Rock burst intensity prediction method and system based on BOA-ensemble learning
Triguero et al. Prototype generation for nearest neighbor classification: Survey of methods
CN116756542A (en) Feature selection method, device and medium of unbalanced data for intrusion detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant