CN111400180B - Software defect prediction method based on feature set division and ensemble learning - Google Patents

Software defect prediction method based on feature set division and ensemble learning

Info

Publication number
CN111400180B
Authority
CN
China
Prior art keywords
data set
feature
defect prediction
population
software defect
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010177397.7A
Other languages
Chinese (zh)
Other versions
CN111400180A (en)
Inventor
李璐璐
任洪敏
朱云龙
卢晓喆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN202010177397.7A
Publication of CN111400180A
Application granted
Publication of CN111400180B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/3604Software analysis for verifying properties of programs
    • G06F11/3608Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/12Computing arrangements based on biological models using genetic models
    • G06N3/126Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Physiology (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a software defect prediction method based on feature set division and ensemble learning. The method divides an original data set into a training data set and a test data set, and divides the training data set into several feature subsets; selects K base classifiers for ensemble learning and synthesizes an ensemble classifier for each feature subset from the base classifiers and their corresponding weights; selects the feature subset most similar to an input instance and uses the ensemble classifier of that subset to perform defect prediction on the instance, thereby establishing a software defect prediction model; divides the test data set and finds the feature subset most similar to the input instance; and searches for the optimal values of the centroid set and the weight set, optimizing the software defect prediction model in combination with the most similar feature subset of the test data set. Advantages: the method removes redundant features from the defect prediction data set, reduces the search space of the algorithm, and effectively alleviates the high dimensionality of the software defect historical data features.

Description

Software defect prediction method based on feature set division and ensemble learning
Technical Field
The invention relates to the technical field of software defect prediction, in particular to a software defect prediction method based on feature set division and ensemble learning.
Background
The purpose of software defect prediction is to use related techniques to classify a software module as defective or non-defective based on historical software defect information; software defect prediction is therefore essentially a binary classification problem. Defective modules can be effectively identified through defect prediction, reducing the risks and hazards caused by software defects. Many machine learning algorithms have already been used to build prediction models. For example, the classification rules generated by the C4.5 decision tree algorithm are easy to understand and fast to learn, so C4.5 is often used as a baseline algorithm for model building; models built with the naive Bayes algorithm are insensitive to class imbalance and show excellent prediction performance; other machine learning algorithms such as neural networks and support vector machines have also been used to construct software defect prediction models and have achieved good prediction performance in specific application domains.
However, in the real world, a software defect prediction model is affected by many adverse factors that reduce its prediction accuracy or stability. The two most influential factors are the class imbalance problem in the data set and the high dimensionality of the features in the data set.
The class imbalance problem means that, in many software defect data sets, the number of samples of non-defective modules is much larger than the number of samples of defective modules. In the actual modeling process a conventional classifier tends to be biased toward the non-defective modules, so that it produces poor classification results for the defective modules.
Researchers in China and abroad have proposed various methods for handling the data imbalance problem, which can be broadly divided into data-level and algorithm-level approaches. Data-level approaches start at the data preparation stage and mainly use various sampling methods to adjust the originally imbalanced defect data into balanced data. Algorithm-level approaches mainly include cost-sensitive learning, classification-threshold moving, and ensemble learning. Although data-level methods can reduce the imbalance problem to a certain extent, the data set has to be preprocessed before modeling, which increases the computation and time cost of defect prediction.
Besides the data imbalance problem, the high dimensionality of the features in the defect data set is another important factor affecting the complexity of defect prediction. In defect prediction, the metric vector formed by the software metric elements of the defect data set is used as the input. Because the data scale of such data sets is huge, the original feature space corresponding to a data set usually has high dimensionality and contains a large amount of redundant data, which greatly increases the difficulty of software defect prediction.
Disclosure of Invention
The invention aims to provide a software defect prediction method based on feature set division and ensemble learning. On one hand, by dividing the data set into feature subsets, the method removes redundant features from the defect prediction data set, reduces the search space of the algorithm, and effectively alleviates the high dimensionality of the software defect historical data features; on the other hand, by integrating the classification results of different base classifiers through ensemble learning, the method effectively alleviates the low prediction accuracy for defective modules caused by class imbalance in the data set.
In order to achieve the purpose, the invention is realized by the following technical scheme:
a software defect prediction method based on feature set division and ensemble learning comprises the following steps:
s1, acquiring an original data set from historical software data, and dividing the original data set into a training data set and a testing data set;
S2, dividing the training data set into h mutually exclusive feature subsets, wherein each feature subset is represented by a centroid C_h, and the set of the centroids of all feature subsets in the training data set is the centroid set C;
S3, selecting K base classifiers for ensemble learning, wherein for an input instance x each base classifier classifies x into class y_i according to its own discriminant function F_k(y_i, x), a weight w_k(y_i) is used to weight the discriminant function F_k(y_i, x) of the k-th base classifier, and the set of all weights w_k(y_i) in the training data set is the weight set W, where y_i is a class label;
s4, fusing the K base classifiers according to the K selected base classifiers and the weights corresponding to the K base classifiers respectively, and synthesizing an integrated classifier for each feature subset respectively;
s5, selecting a feature subset most similar to the input instance x, selecting an integrated classifier corresponding to the feature subset to carry out defect prediction on the input instance x, and establishing a software defect prediction model;
s6, dividing the test data set, repeating the operation from the step S3 to the step S4, and selecting a feature subset which is most similar to the input example x in the test data set;
S7, searching for the optimal values of the centroid set C used in dividing the training data set into feature subsets and of the weight set W used in ensemble learning, and optimizing the software defect prediction model in combination with the feature subset of the test data set obtained in step S6.
Preferably, in step S1, a ten-fold cross-validation method is adopted to divide the original data set into the training data set and the test data set.
Preferably, in step S2, the training data set is vertically divided into h mutually exclusive feature subsets; each divided feature subset has the same number of samples as the training data set and contains a subset of the features of the original data set.
Preferably, in step S4, K base classifiers are integrated and fused by a weighted fusion method, where the integration rule is:
F(y_i, x) = Σ_{k=1}^{K} w_k(y_i) · F_k(y_i, x);
M_1, M_2, ..., M_h denote the integrated classifiers corresponding to the h feature subsets, and the classification decision rule of the integrated classifier M_h is: if
F(y_i, x) = max_{y_j ∈ Y} F(y_j, x),
then M_h = y_i, where Y is the set of class labels y_i.
Preferably, in step S5, the feature subset most similar to the input instance x and the corresponding ensemble classifier are selected according to a distance metric.
Preferably, the index minH of the feature subset most similar to the input instance x is returned, the integrated classifier corresponding to that feature subset is selected to perform software module defect prediction on the input instance x, and the software defect prediction model is established.
Preferably, in step S7, a genetic algorithm is used to search for optimal values in the centroid set C and the weight set W, and the software defect prediction model is optimized in combination with the feature subset in the test data set obtained in step S6.
Preferably, the step S7 specifically includes:
t1, setting various parameters in a genetic algorithm;
t2, coding chromosomes, wherein each chromosome individual consists of a centroid set C and a weight set W of the feature subset in the training data set, and binary coding is simultaneously carried out on the centroid set C and the weight set W;
t3, generating initial individuals according to the test data set, and randomly initializing the values of the initial individuals to generate an initial population;
t4, performing corresponding binary decoding operation according to the coding mode to obtain a feature subset division mode and a weight distribution mode of the base classifier;
t5, calculating the fitness value of the population individual, namely performing defect prediction by using a software defect prediction model according to the parameter value obtained by decoding, and obtaining the fitness value of the population individual according to a defect prediction result and a fitness function;
T6, performing the selection operation according to the fitness values of the population individuals: using a tournament selection strategy, a certain number of individuals are drawn from the population each time, the best of them is added to the new population, and this is repeated until the new population reaches the set population size;
T7, performing the crossover operation: the selected new population individuals are randomly paired, two crossover points are randomly determined in each individual code string using a two-point crossover strategy, and partial gene exchange is then performed to form two new individuals that are added to the offspring population;
T8, performing the mutation operation: a mutation point is determined for each individual in the population based on the mutation probability, the mutation operation is performed, and the new individual obtained by mutation is added to the offspring population;
T9, performing on the mutated population individuals the same decoding operation as in step T4 to obtain the corresponding parameter values, performing defect prediction according to the defect prediction methods of step T1 and step T2, and calculating the fitness values of the mutated population individuals after the prediction results are obtained;
t10, judging whether iteration is terminated, if the fitness value of the population individual is not improved any more or reaches the maximum iteration times after multiple iterations, stopping the algorithm, and outputting the individual with the maximum fitness obtained in the evolution process as an optimal solution; otherwise, the genetic operation of the step T6 to the step T9 is repeatedly executed, and the individuals are continuously updated to obtain a new population until the termination condition is met.
Preferably, the various parameters in step T1 include: population size, crossover probability, mutation probability, and the maximum number of iterations.
Preferably, the step T2 is specifically:
setting the total coding length of an initial chromosome as a + b;
the front a bits of the code represent the division of the feature subsets: every 3 binary bits correspond to the index of the feature subset in which one metric element feature of the training data set is located, with index values in the range (0, h); with m metric element features in the training data set, the total length of the front part of the code is a = 3m bits;
the last b bits of the code represent the assignment of the base classifier weights: every 4 binary bits represent the weight w_k(y_i) with which the discriminant function of one base classifier in a feature subset classifies the input instance x into class y_i; each feature subset therefore corresponds to a weight set W_h coded with 4 × K × 2 bits, and the total number of code bits allocated to all feature subset weights is b = 4 × K × 2 × h bits.
Compared with the prior art, the invention has the following advantages:
the software defect prediction method based on feature set division and ensemble learning fully utilizes the maximum correlation between different feature sets and the optimal classifier combination, optimizes the division of the feature sets and the weight distribution of the base classifiers in the discriminant function of the collective decision method at the same time, dynamically adjusts the data set division and the classifier weights, and can fully exert the local classification capability of the given base classifier; in addition, the method uses a genetic algorithm to search for an optimal solution, so that the method has good global search capability, can quickly search out the whole solution in a solution space, and cannot get into a quick descending trap of a local optimal solution; in addition, when the method is used for processing unbalanced data, a data-level method is not needed to modify a training set, and the extra cost of the algorithm in the aspect of time is greatly reduced.
Drawings
FIG. 1 is a frame diagram of a software defect prediction method based on feature set partitioning and ensemble learning according to the present invention;
FIG. 2 is a schematic process diagram of a software defect prediction method based on feature set partitioning and ensemble learning according to the present invention;
FIG. 3 is a schematic diagram of the vertical partitioning of the training data set according to the present invention;
FIG. 4 is a flow chart of a genetic algorithm customized to the software defect prediction problem of the present invention;
FIG. 5 is a schematic diagram of the total length of the coding sequence and the bit allocation of the genetic algorithm chromosome.
Detailed Description
The present invention will now be further described by way of the following detailed description of a preferred embodiment thereof, taken in conjunction with the accompanying drawings.
As shown in fig. 1 and fig. 2, a frame diagram and a process diagram of a software defect prediction method based on feature set partitioning and ensemble learning according to the present invention are shown, and the method includes:
s1, acquiring a software defect sample original data set D from historical software data, and dividing the original data set D into a training data set (TS) and a testing data set (VS).
The original data set D = {(x_1, y_1), ..., (x_n, y_n)} is a set of samples of n software modules, where x_n is the metric-attribute vector of software module n, each vector containing m metric attributes (also called metric elements), i.e. x_n = (a_1, ..., a_m); y_n ∈ Y denotes the class label of the n-th software module. In the invention a software module has only two classes, i.e. Y = {y_1, y_2}, where y_1 denotes the defective class and y_2 denotes the non-defective class; therefore y_n = y_1 or y_2.
In this embodiment, a ten-fold cross-validation method is adopted to divide the original data set into the training data set and the testing data set, which are used for training and testing, respectively. The ten-fold cross validation is to divide the original data set into ten parts at random, nine parts of the original data set are taken as a training data set each time, the rest part of the original data set is taken as a test data set, and the process is repeated for 10 times to ensure that each part of data is used as the test data set at least once.
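Purely as an illustration, the sketch below shows this ten-fold split using scikit-learn's KFold; the arrays X and y and the fold loop are placeholders and not part of the patented method.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: n samples, m metric attributes, binary defect labels.
X = np.random.rand(200, 21)
y = np.random.randint(0, 2, size=200)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]   # nine parts: training data set TS
    X_test, y_test = X[test_idx], y[test_idx]       # remaining part: test data set VS
    # ... build and evaluate the defect prediction model on this fold
```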
By dividing the feature set of the sample data set, the feature space complexity of the samples can be reduced as much as possible while ensuring that the accuracy of the classification algorithm does not decrease, or decreases only minimally; the optimal feature subsets are selected, and the generalization ability of the model and the efficiency of the algorithm are improved.
S2, vertically dividing the training data set into h mutually exclusive feature subsets, each of which is represented by its centroid C_h; the set of the centroids of all feature subsets in the training data set is the centroid set C.
Each divided feature subset has the same number of samples as the training data set and contains a subset of the features of the original data set. As shown in FIG. 3, the training data set is divided vertically into h mutually exclusive feature subsets, each feature subset being represented by its centroid C_h, and the centroids of all feature subsets constitute the centroid set C = {C_1, ..., C_h}, where
C_h = (c_1, c_2, ..., c_m), with c_j = (1/n) · Σ_{i=1}^{n} a_ij,
and m represents the dimension of the feature subset, i.e. the number of its metric attributes. The centroid is defined by analogy with the physical centroid: the mean value of the sample features is taken as the centroid.
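A minimal sketch of this vertical (column-wise) partition and centroid computation follows, assuming the partition is given by a per-column subset index (drawn at random here; in the method it is the quantity the genetic algorithm later optimizes). All names are illustrative.

```python
import numpy as np

def vertical_partition(X, subset_index, h):
    """Split the columns (metric elements) of X into h mutually exclusive feature
    subsets according to subset_index[j] in {0, ..., h-1}, and compute the centroid
    of each subset as the mean of the sample values over its columns."""
    subsets = [np.where(subset_index == s)[0] for s in range(h)]   # column indices per subset
    centroids = [X[:, cols].mean(axis=0) for cols in subsets]      # C_1, ..., C_h
    return subsets, centroids

# Illustrative use: 21 metric columns assigned at random to h = 4 subsets.
X = np.random.rand(200, 21)
subset_index = np.random.randint(0, 4, size=21)
subsets, C = vertical_partition(X, subset_index, h=4)
```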
In general, ensemble learning is one of the important technical means for addressing the class imbalance problem and has been very successful in processing imbalanced data sets; ensemble learning can obtain better classification performance and generalization ability than a single classifier, and overfitting is less likely to occur. The invention therefore adopts ensemble learning to handle the imbalance problem.
S3, selecting K base classifiers for ensemble learning and setting the weight distribution corresponding to the base classifiers. To obtain better ensemble performance, the K base classifiers should preferably be as different from each other as possible, since the diversity of the base classifiers improves the classification accuracy.
For an input instance x (x may be arbitrary), each base classifier classifies x into class y_i (defective or non-defective) according to its own discriminant function F_k(y_i, x). Within each feature subset, a weight w_k(y_i) is applied to the discriminant function F_k(y_i, x) of the k-th base classifier, so the weight assignment W_h corresponding to the h-th feature subset is represented as
W_h = [[w_1(y_1), w_1(y_2)], ..., [w_K(y_1), w_K(y_2)]]^T   (1)
and satisfies
Σ_{k=1}^{K} w_k(y_i) = 1,   0 ≤ w_k(y_i) ≤ 1,
for every y_i ∈ Y. The h weight assignments in the training data set form the weight set W = {W_1, ..., W_h}.
And S4, fusing the K base classifiers according to the selected K base classifiers and the weights corresponding to the K base classifiers respectively, and synthesizing an integrated classifier for each feature subset respectively.
In this embodiment, according to the discriminant functions and the weight distribution in step S3, the K base classifiers are fused within each feature subset by a weighted fusion method, and the integrated classifier of each feature subset is synthesized, where the integration rule is:
F(y_i, x) = Σ_{k=1}^{K} w_k(y_i) · F_k(y_i, x).
M_1, M_2, ..., M_h denote the integrated classifiers corresponding to the h feature subsets, and the classification decision rule of the integrated classifier M_h is: if
F(y_i, x) = max_{y_j ∈ Y} F(y_j, x),
then M_h = y_i, where Y is the set of class labels y_i.
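The weighted fusion rule and arg-max decision above can be sketched as follows; it assumes that each base classifier exposes a per-class score (e.g. the predict_proba of a fitted scikit-learn classifier) playing the role of F_k(y_i, x), and that the weights are laid out as in W_h. This is an illustrative reading, not the patent's reference implementation.

```python
import numpy as np

def ensemble_predict(x_cols, base_classifiers, weights):
    """Weighted fusion inside one feature subset.
    base_classifiers: list of K fitted classifiers trained on the subset's columns.
    weights: array of shape (K, 2) with weights[k, i] = w_k(y_i).
    Returns the class index i that maximizes the fused score F(y_i, x)."""
    fused = np.zeros(2)
    for k, clf in enumerate(base_classifiers):
        scores = clf.predict_proba(x_cols.reshape(1, -1))[0]   # F_k(y_i, x) for both classes
        fused += weights[k] * scores                           # sum_k w_k(y_i) * F_k(y_i, x)
    return int(np.argmax(fused))                               # M_h outputs the arg-max class
```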
S5, selecting a feature subset which is most similar to the input instance x, selecting an integrated classifier corresponding to the feature subset to carry out defect prediction on the input instance x, and establishing a software defect prediction model.
In this embodiment, the feature subset most similar to the input instance x and its corresponding integrated classifier are selected according to a distance metric. The centroid distance represents the degree of separation between the input instance x and a feature subset: the smaller the distance, the more similar the samples. The feature subset is selected by measuring the Euclidean distance between the input instance x and each feature subset.
As shown in formula (3) below, the index minH of the feature subset most similar to (i.e. at the smallest distance from) the input instance x is returned, the integrated classifier corresponding to that feature subset is selected to perform software module defect prediction on the input instance x, and the software defect prediction model is established.
minH = arg min_{1 ≤ H ≤ h} d(x, C_H)   (3)
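A minimal sketch of the selection by Euclidean distance in formula (3); it assumes the distance is computed between the instance restricted to a subset's columns and that subset's centroid, and reuses the subsets/centroids structure from the partition sketch above.

```python
import numpy as np

def most_similar_subset(x, subsets, centroids):
    """Return the index minH of the feature subset whose centroid is closest
    (in Euclidean distance) to the input instance x on that subset's columns."""
    distances = [np.linalg.norm(x[cols] - c) for cols, c in zip(subsets, centroids)]
    return int(np.argmin(distances))   # minH; its integrated classifier M_minH does the prediction
```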
S6, dividing the test data set, repeating the operations of steps S3 to S4, and selecting the feature subset in the test data set most similar to the input instance x. In this embodiment, the test data set is divided and the most similar feature subset is selected in the same way as for the training data set.
S7, using a genetic algorithm to search for the optimal values of the centroid set C used in dividing the training data set into feature subsets and of the weight set W used in ensemble learning, and optimizing the software defect prediction model in combination with the feature subset of the test data set obtained in step S6.
After the software defect prediction model has been established, it can be optimized by finding the optimal parameters of the model, so as to minimize the prediction error of the model. In the invention, a genetic algorithm is used to search simultaneously for the optimal values of the centroid set C and the weight set W. The prediction error is calculated using the standard mean squared error, so the objective function and fitness function of the genetic algorithm are:
f = (1/n) · Σ (y_n − ŷ_n)², summed over the n samples,
where y_n is the class label of the n-th sample and ŷ_n is its predicted label; minimizing the value of this objective function is the final optimization goal.
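A sketch of the fitness evaluation follows, assuming the fitness is derived from the mean squared error over the test instances as stated above; predict_one stands for the prediction model built in steps S2–S5, and the 1/(1 + MSE) transformation (turning "smaller error" into "larger fitness") is an assumption, not taken from the patent.

```python
import numpy as np

def fitness(individual, X_test, y_test, predict_one):
    """Predict every test instance with the model induced by this chromosome
    and score the chromosome by mean squared error."""
    preds = np.array([predict_one(individual, x) for x in X_test])
    mse = np.mean((y_test - preds) ** 2)   # objective function to minimize
    return 1.0 / (1.0 + mse)               # assumed monotone mapping: lower MSE -> higher fitness
```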
As shown in fig. 4, the step S7 specifically includes:
T1, setting the various parameters in the genetic algorithm. In this embodiment, these parameters include: population size, crossover probability, mutation probability, the maximum number of iterations, and so on.
And T2, coding the chromosome. Each chromosome individual is composed of a centroid set C and a weight set W of the feature subset in the training data set, and binary coding is simultaneously carried out on the centroid set C and the weight set W.
As shown in fig. 5, step T2 is specifically as follows: (1) the total coding length of an initial chromosome is set to a + b; (2) the front a bits of the code represent the division of the feature subsets: every 3 binary bits correspond to the index of the feature subset in which one metric element feature (i.e. metric attribute) of the training data set is located, with index values in the range (0, h); with m metric element features in the training data set, the total length of the front part of the code is a = 3m bits; (3) the last b bits of the code represent the assignment of the base classifier weights: every 4 binary bits represent the weight w_k(y_i) with which the discriminant function of one base classifier in a feature subset classifies the input instance x into class y_i; each feature subset therefore corresponds to a weight set W_h coded with 4 × K × 2 bits, and the total number of code bits allocated to all feature subset weights is b = 4 × K × 2 × h bits.
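The bit layout of step T2 can be sketched as below: a = 3m bits for the feature-subset indices and b = 4 × K × 2 × h bits for the weights. The random initialization corresponds to step T3, and the helper name is illustrative.

```python
import numpy as np

def random_chromosome(m, K, h, rng=np.random.default_rng(0)):
    """Binary chromosome of total length a + b, with a = 3*m (3 bits per metric element
    encoding its feature-subset index) and b = 4*K*2*h (4 bits per weight w_k(y_i))."""
    a = 3 * m
    b = 4 * K * 2 * h
    return rng.integers(0, 2, size=a + b)
```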
And T3, generating initial individuals according to the test data set, and randomly initializing the values of the initial individuals to generate an initial population.
And T4, carrying out corresponding binary decoding operation according to the coding mode to obtain a feature subset division mode and a weight distribution mode of the base classifier.
Step T4 is specifically as follows: the feature subsets are divided according to the division scheme obtained by decoding, and each feature subset is represented by its corresponding centroid; the 4-bit binary number corresponding to each weight w_k(y_i) is converted into a decimal integer Q_k(y_i), and each weight w_k(y_i) is then calculated from the corresponding decimal integers as
w_k(y_i) = Q_k(y_i) / Σ_{j=1}^{K} Q_j(y_i).
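A sketch of the decoding of step T4 under the layout above; the modulo that keeps subset indices inside (0, h) and the normalization w_k(y_i) = Q_k(y_i) / Σ_j Q_j(y_i) are assumptions consistent with the stated index range and the sum-to-one constraint of step S3.

```python
import numpy as np

def decode(chrom, m, K, h):
    """Decode a chromosome into (subset index per metric element, weight tensor per subset)."""
    a = 3 * m
    # Front a bits: every 3 bits give the feature-subset index of one metric element.
    idx_bits = chrom[:a].reshape(m, 3)
    subset_index = idx_bits.dot([4, 2, 1]) % h                 # assumption: fold into 0..h-1

    # Back b bits: every 4 bits give an integer Q_k(y_i); normalize over k per class.
    w_bits = chrom[a:].reshape(h, K, 2, 4)
    Q = w_bits.dot([8, 4, 2, 1]).astype(float)                 # shape (h, K, 2)
    W = Q / np.maximum(Q.sum(axis=1, keepdims=True), 1e-9)     # assumption: w_k = Q_k / sum_j Q_j
    return subset_index, W
```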
And T5, calculating the fitness value of the population individual, namely performing defect prediction by using a software defect prediction model according to the parameter value obtained by decoding, and obtaining the fitness value of the population individual according to a defect prediction result and a fitness function.
And T6, carrying out selection operation according to the individual fitness value of the population, using a tournament selection strategy, taking out a certain number of individuals from the population each time, then selecting the best one of the individuals to add into a new population, and repeating the operation until the new population reaches the set population size.
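A minimal sketch of the tournament selection of step T6; the tournament size of 3 is an illustrative choice, not taken from the patent.

```python
import numpy as np

def tournament_select(population, fitnesses, new_size, tour_size=3,
                      rng=np.random.default_rng(0)):
    """Repeatedly draw tour_size individuals at random and copy the fittest of them
    into the new population until it reaches the set population size."""
    new_population = []
    while len(new_population) < new_size:
        contenders = rng.choice(len(population), size=tour_size, replace=False)
        best = max(contenders, key=lambda i: fitnesses[i])
        new_population.append(population[best].copy())
    return new_population
```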
T7, performing the crossover operation: the selected new population individuals are randomly paired, two crossover points are randomly determined in each individual code string using a two-point crossover strategy, and partial gene exchange is then performed to form two new individuals that are added to the offspring population.
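A sketch of the two-point crossover of step T7 on a pair of binary chromosomes; the random pairing and the crossover-probability test are left to the surrounding loop.

```python
import numpy as np

def two_point_crossover(parent1, parent2, rng=np.random.default_rng(0)):
    """Pick two cut points at random and swap the segment between them."""
    c1, c2 = sorted(rng.choice(len(parent1), size=2, replace=False))
    child1, child2 = parent1.copy(), parent2.copy()
    child1[c1:c2], child2[c1:c2] = parent2[c1:c2].copy(), parent1[c1:c2].copy()
    return child1, child2
```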
T8, performing the mutation operation: a mutation point is determined for each individual in the population based on the mutation probability, the mutation operation is performed, and the new individual obtained by mutation is added to the offspring population.
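A sketch of the mutation of step T8, written as the common independent bit-flip variant (each gene flips with the mutation probability); the patent text speaks of determining a mutation point per individual, so treat this variant as an assumption.

```python
import numpy as np

def mutate(chrom, p_mut=0.01, rng=np.random.default_rng(0)):
    """Flip each bit independently with probability p_mut and return the offspring."""
    child = chrom.copy()
    flip = rng.random(len(chrom)) < p_mut
    child[flip] ^= 1          # 0 <-> 1 at the mutation points
    return child
```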
T9, performing on the mutated population individuals the same decoding operation as in step T4 to obtain the corresponding parameter values, performing defect prediction according to the defect prediction methods of step T1 and step T2, and calculating the fitness values of the mutated population individuals after the prediction results are obtained;
t10, judging whether iteration is terminated, if the fitness value of the population individual is not improved any more or reaches the maximum iteration times after multiple iterations, stopping the algorithm, and outputting the individual with the maximum fitness obtained in the evolution process as an optimal solution; otherwise, the genetic operation of the step T6 to the step T9 is repeatedly executed, and the individuals are continuously updated to obtain new populations until the termination condition is met.
In summary, the software defect prediction method based on feature set division and ensemble learning of the present invention obtains an original data set from historical software data and divides it into a training data set and a test data set; vertically divides the training data set into several mutually exclusive feature subsets, each represented by its centroid; selects K base classifiers for ensemble learning, where for a given input instance each base classifier classifies the instance as defective or non-defective according to its discriminant function and each discriminant function is assigned a corresponding weight; fuses the base classifiers within each feature subset by a weighted method according to the selected base classifiers and their weights, synthesizing an integrated classifier for each feature subset; selects the feature subset most similar to the input instance according to a distance metric and uses the integrated classifier of that subset to predict defects for the instance, thereby establishing a software defect prediction model; divides the test data set and finds the feature subset most similar to the input instance; and uses a genetic algorithm to search simultaneously for the optimal values of the centroid set used in feature subset division and of the weight set used in ensemble learning, optimizing the software defect prediction model in combination with the most similar feature subset selected from the test data set, so as to minimize the prediction error of the defect prediction model. The method makes full use of the strongest association between different feature sets and the optimal classifier combination, simultaneously optimizes the division of the feature sets and the weight distribution of the base classifiers in the discriminant function of the collective decision method, can give full play to the local classification ability of the given base classifiers, improves the prediction accuracy of the software defect prediction model, and reduces the influence of the data imbalance problem and the high feature dimensionality problem on the performance of the prediction model in software defect prediction.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (10)

1. A software defect prediction method based on feature set division and ensemble learning is characterized by comprising the following steps:
s1, acquiring an original data set from historical software data, and dividing the original data set into a training data set and a testing data set;
S2, dividing the training data set into h mutually exclusive feature subsets, wherein each feature subset is represented by a centroid C_h, and the set of the centroids of all feature subsets in the training data set is the centroid set C;
S3, selecting K base classifiers for ensemble learning, wherein for an input instance x each base classifier classifies x into class y_i according to its own discriminant function F_k(y_i, x), a weight w_k(y_i) is used to weight the discriminant function F_k(y_i, x) of the k-th base classifier, and the set of all weights w_k(y_i) in the training data set is the weight set W, where y_i is a class label;
s4, fusing the K base classifiers according to the K selected base classifiers and the weights corresponding to the K base classifiers respectively, and synthesizing an integrated classifier for each feature subset respectively;
s5, selecting a feature subset which is most similar to the input instance x, selecting an integrated classifier corresponding to the feature subset to carry out defect prediction on the input instance x, and establishing a software defect prediction model;
s6, dividing the test data set, repeating the operation from the step S3 to the step S4, and selecting a feature subset which is most similar to the input example x in the test data set;
S7, searching for the optimal values of the centroid set C used in dividing the training data set into feature subsets and of the weight set W used in ensemble learning, and optimizing the software defect prediction model in combination with the feature subset of the test data set obtained in step S6.
2. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in the step S1, a ten-fold cross-validation method is adopted to divide the original data set into the training data set and the test data set.
3. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in step S2, the training data set is vertically divided into h mutually exclusive feature subsets; each divided feature subset has the same number of samples as the training data set and contains a subset of the features of the original data set.
4. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in the step S4, K base classifiers are integrated and fused by a weighted fusion method, and the integration rule is as follows:
F(y_i, x) = Σ_{k=1}^{K} w_k(y_i) · F_k(y_i, x);
M_1, M_2, ..., M_h denote the integrated classifiers corresponding to the h feature subsets, and the classification decision rule of the integrated classifier M_h is: if
F(y_i, x) = max_{y_j ∈ Y} F(y_j, x),
then M_h = y_i, where Y is the set of class labels y_i.
5. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in step S5, the feature subset most similar to the input instance x and the corresponding ensemble classifier are selected according to the distance metric.
6. The software defect prediction method based on feature set partitioning and ensemble learning of claim 5,
the index minH of the feature subset most similar to the input instance x is returned, the integrated classifier corresponding to that feature subset is selected to perform software module defect prediction on the input instance x, and the software defect prediction model is established.
7. The software defect prediction method based on feature set partitioning and ensemble learning of claim 1,
in the step S7, the optimal values in the centroid set C and the weight set W are searched by adopting a genetic algorithm, and a software defect prediction model is optimized by combining the characteristic subset in the test data set obtained in the step S6.
8. The software defect prediction method based on feature set partitioning and ensemble learning according to claim 7, wherein said step S7 specifically comprises:
t1, setting various parameters in a genetic algorithm;
t2, coding chromosomes, wherein each chromosome individual consists of a centroid set C and a weight set W of the feature subset in the training data set, and binary coding is simultaneously carried out on the centroid set C and the weight set W;
t3, generating an initial individual according to the test data set, and randomly initializing the value of the initial individual to generate an initial population;
t4, performing corresponding binary decoding operation according to the coding mode to obtain a feature subset division mode and a weight distribution mode of the base classifier;
t5, calculating the fitness value of the population individual, namely performing defect prediction by using a software defect prediction model according to the parameter value obtained by decoding, and obtaining the fitness value of the population individual according to a defect prediction result and a fitness function;
T6, performing the selection operation according to the fitness values of the population individuals: using a tournament selection strategy, a certain number of individuals are drawn from the population each time, the best of them is added to the new population, and this is repeated until the new population reaches the set population size;
T7, performing the crossover operation: the selected new population individuals are randomly paired, two crossover points are randomly determined in each individual code string using a two-point crossover strategy, and partial gene exchange is then performed to form two new individuals that are added to the offspring population;
T8, performing the mutation operation: a mutation point is determined for each individual in the population based on the mutation probability, the mutation operation is performed, and the new individual obtained by mutation is added to the offspring population;
T9, performing on the mutated population individuals the same decoding operation as in step T4 to obtain the corresponding parameter values, performing defect prediction according to the defect prediction methods of step T1 and step T2, and calculating the fitness values of the mutated population individuals after the prediction results are obtained;
t10, judging whether iteration is terminated, if the fitness value of the population individual is not improved any more or reaches the maximum iteration times after multiple iterations, stopping the algorithm, and outputting the individual with the maximum fitness obtained in the evolution process as an optimal solution; otherwise, the genetic operation of the step T6 to the step T9 is repeatedly executed, and the individuals are continuously updated to obtain a new population until the termination condition is met.
9. The software defect prediction method based on feature set partitioning and ensemble learning of claim 8,
the various parameters in step T1 include: population size, crossover probability, mutation probability, and the maximum number of iterations.
10. The software defect prediction method based on feature set partitioning and ensemble learning according to claim 8 or 9, wherein the step T2 specifically includes:
setting the total coding length of an initial chromosome as a + b;
the front a bits of the code represent the division of the feature subsets: every 3 binary bits correspond to the index of the feature subset in which one metric element feature is located in the training data set, with index values in the range (0, h); with m metric element features in the training data set, the total length of the front part of the code is a = 3m bits;
the last b bits of the code represent the assignment of the base classifier weights: every 4 binary bits represent the weight w_k(y_i) with which the discriminant function of one base classifier in a feature subset classifies the input instance x into class y_i; each feature subset therefore corresponds to a weight set W_h coded with 4 × K × 2 bits, and the total number of code bits allocated to all feature subset weights is b = 4 × K × 2 × h bits.
CN202010177397.7A 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning Active CN111400180B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010177397.7A CN111400180B (en) 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010177397.7A CN111400180B (en) 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning

Publications (2)

Publication Number Publication Date
CN111400180A CN111400180A (en) 2020-07-10
CN111400180B true CN111400180B (en) 2023-03-10

Family

ID=71434785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010177397.7A Active CN111400180B (en) 2020-03-13 2020-03-13 Software defect prediction method based on feature set division and ensemble learning

Country Status (1)

Country Link
CN (1) CN111400180B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112131089B (en) * 2020-09-29 2022-08-23 九江学院 Software defect prediction method, classifier, computer device and storage medium
CN112269732B (en) * 2020-10-14 2024-01-05 北京轩宇信息技术有限公司 Software defect prediction feature selection method and device
CN112258251B (en) * 2020-11-18 2022-12-27 北京理工大学 Grey correlation-based integrated learning prediction method and system for electric vehicle battery replacement demand
CN112380132B (en) * 2020-11-20 2024-03-29 北京轩宇信息技术有限公司 Countermeasure verification method and device based on unbalanced defect dataset of spaceflight software
CN112990255B (en) * 2020-12-23 2024-05-28 中移(杭州)信息技术有限公司 Device failure prediction method, device, electronic device and storage medium
CN113326182B (en) * 2021-03-31 2022-09-02 南京邮电大学 Software defect prediction method based on sampling and ensemble learning
CN113204482B (en) * 2021-04-21 2022-09-13 武汉大学 Heterogeneous defect prediction method and system based on semantic attribute subset division and metric matching
CN113268434B (en) * 2021-07-08 2022-07-26 北京邮电大学 Software defect prediction method based on Bayes model and particle swarm optimization
CN113626315B (en) * 2021-07-27 2024-04-12 江苏大学 Double-integration software defect prediction method combined with neural network
CN113852204B (en) * 2021-10-13 2022-06-07 北京智盟信通科技有限公司 Transformer substation three-dimensional panoramic monitoring system and method based on digital twinning
CN114706780B (en) * 2022-04-13 2024-07-19 北京理工大学 Software defect prediction method based on Stacking integrated learning
CN115276666B (en) * 2022-09-28 2022-12-20 汉达科技发展集团有限公司 Efficient data transmission method for equipment training simulator
CN117472789B (en) * 2023-12-28 2024-03-12 成都工业学院 Software defect prediction model construction method and device based on ensemble learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810102A (en) * 2014-02-19 2014-05-21 北京理工大学 Method and system for predicting software defects
WO2017181286A1 (en) * 2016-04-22 2017-10-26 Lin Tan Method for determining defects and vulnerabilities in software code
CN109977028A (en) * 2019-04-08 2019-07-05 燕山大学 A kind of Software Defects Predict Methods based on genetic algorithm and random forest

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103810102A (en) * 2014-02-19 2014-05-21 北京理工大学 Method and system for predicting software defects
WO2017181286A1 (en) * 2016-04-22 2017-10-26 Lin Tan Method for determining defects and vulnerabilities in software code
CN109977028A (en) * 2019-04-08 2019-07-05 燕山大学 A kind of Software Defects Predict Methods based on genetic algorithm and random forest

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Software defect prediction based on fuzzy clustering and non-negative matrix factorization; 常瑞花 et al.; 《宇航学报》 (Journal of Astronautics); 2011-09-30 (No. 09); full text *
Feature selection method based on cluster analysis in software defect prediction; 刘望舒 et al.; 《中国科学:信息科学》 (Scientia Sinica Informationis); 2016-09-20 (No. 09); full text *

Also Published As

Publication number Publication date
CN111400180A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111400180B (en) Software defect prediction method based on feature set division and ensemble learning
CN108228716B (en) SMOTE _ Bagging integrated sewage treatment fault diagnosis method based on weighted extreme learning machine
CN111626336B (en) Subway fault data classification method based on unbalanced data set
CN105844300A (en) Optimized classification method and optimized classification device based on random forest algorithm
JP2018181290A (en) Filter type feature selection algorithm based on improved information measurement and ga
CN110674846A (en) Genetic algorithm and k-means clustering-based unbalanced data set oversampling method
CN108345904A (en) A kind of Ensemble Learning Algorithms of the unbalanced data based on the sampling of random susceptibility
CN113344019A (en) K-means algorithm for improving decision value selection initial clustering center
CN113344113B (en) Yolov3 anchor frame determination method based on improved k-means clustering
CN111709460A (en) Mutual information characteristic selection method based on correlation coefficient
CN109390032A (en) A method of SNP relevant with disease is explored in the data of whole-genome association based on evolution algorithm and is combined
CN117407732A (en) Unconventional reservoir gas well yield prediction method based on antagonistic neural network
CN115437960A (en) Regression test case sequencing method, device, equipment and storage medium
CN112560900B (en) Multi-disease classifier design method for sample imbalance
CN114328221A (en) Cross-project software defect prediction method and system based on feature and instance migration
CN112308160A (en) K-means clustering artificial intelligence optimization algorithm
CN111488903A (en) Decision tree feature selection method based on feature weight
CN105912887B (en) A kind of modified gene expression programming-fuzzy C-mean algorithm crop data sorting technique
CN115017125B (en) Data processing method and device for improving KNN method
CN108921021A (en) A kind of entropy-discriminate integrated model of multi-angle of view
CN113011589B (en) Co-evolution-based hyperspectral image band selection method and system
Banerjee Robust Density-Based Data Clustering Using a Quantum-Inspired Genetic Algorithm
CN117852711B (en) Rock burst intensity prediction method and system based on BOA-ensemble learning
Triguero et al. Prototype generation for nearest neighbor classification: Survey of methods
CN116756542A (en) Feature selection method, device and medium of unbalanced data for intrusion detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant