CN113127342B - Defect prediction method and device based on power grid information system feature selection - Google Patents

Info

Publication number
CN113127342B
Authority
CN
China
Prior art keywords
software
data set
tested
module
training set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110339177.4A
Other languages
Chinese (zh)
Other versions
CN113127342A (en)
Inventor
沈伍强
龙震岳
张小陆
曾纪钧
梁哲恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202110339177.4A priority Critical patent/CN113127342B/en
Publication of CN113127342A publication Critical patent/CN113127342A/en
Application granted granted Critical
Publication of CN113127342B publication Critical patent/CN113127342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3668 Software testing
    • G06F11/3672 Test management
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The invention provides a defect prediction method and device based on power grid information system feature selection. The method comprises the following steps: acquiring a historical version data set and a to-be-tested version data set of the software under test and normalizing them; calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set, according to the similarity, the k instances nearest to each instance in the to-be-tested version data set, and constructing a training set; performing class-imbalance processing on the training set; performing feature selection on the class-balanced training set and the normalized to-be-tested version data set; and, based on the feature-selected training set, predicting the defect status of each module of the software under test with a defect prediction model built by a classification algorithm, obtaining a defect prediction result for each module. Because the method accounts for the feature differences and data distribution differences between software versions, it improves the efficiency and precision of software defect prediction.

Description

Defect prediction method and device based on power grid information system feature selection
Technical Field
The invention relates to the field of software testing, in particular to a method and a device for predicting testing defects of a power grid information system.
Background
During software development and operation, changes in requirements, performance improvements, defect fixes, code refactoring and the like cause the software to change. As a result the software grows in scale, its functions become more complex, the relationships between functional modules become more intricate, and defects become unavoidable. Software testing ensures software quality by executing a program to discover as many defects as possible. It is the most time- and resource-consuming part of software engineering, and it must test all programs with limited test resources. With the continued development of distributed power sources, incremental power distribution networks and the like, power information systems (power grid information systems) are updated more frequently, with greater complexity and tighter deadlines, which places higher demands on software test defect prediction. In defect prediction for projects that are updated iteratively by version, the source data set and the target data set may contain irrelevant features, and their data distributions may differ. Existing defect detection methods do not consider these feature differences and data distribution differences, and they allocate test resources poorly, so defect prediction performance and efficiency are low.
Disclosure of Invention
The invention aims to: aiming at the defects of the prior art, the invention provides a defect prediction method based on the characteristic selection of a power grid information system, which realizes more effective allocation of test resources and improves the efficiency and quality of software test.
Another object of the present invention is to provide a defect prediction device based on the feature selection of the grid information system.
The technical scheme is as follows: according to a first aspect of the present invention, there is provided a test defect prediction method based on power grid information system feature selection, comprising the following steps:
(1) Acquiring a historical version data set and a to-be-tested version data set of the software under test and normalizing them;
(2) Calculating the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, selecting from the historical version data set, according to the similarity, the k instances nearest to each instance in the to-be-tested version data set, and constructing a training set;
(3) Performing class-imbalance processing on the training set to obtain a class-balanced training set;
(4) Performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
(5) Based on the feature-selected training set and test set, predicting the defect status of each module of the software under test with a defect prediction model built by a classification algorithm, and obtaining the defect prediction result of each module of the software under test.
According to a second aspect of the present invention, there is provided a defect prediction device based on power grid information system feature selection, comprising:
a data acquisition module, configured to acquire a historical version data set and a to-be-tested version data set of the software under test and normalize them;
a training set construction module, configured to calculate the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set, and to select from the historical version data set, according to the similarity, the k instances nearest to each instance in the to-be-tested version data set to construct a training set;
a training set processing module, configured to perform class-imbalance processing on the training set to obtain a class-balanced training set;
a feature selection module, configured to perform feature selection on the class-balanced training set and the normalized to-be-tested version data set;
a prediction module, configured to predict, based on the feature-selected training set and test set, the defect status of each module of the software under test with a defect prediction model built by a classification algorithm, and to obtain the defect prediction result of each module of the software under test.
The beneficial effects are as follows. In the defect prediction method and device based on power grid information system feature selection, preprocessing the data set improves data quality; instance selection makes the data distribution of the historical version data set consistent with that of the current to-be-tested version data set; feature selection keeps the features strongly related to defects and removes irrelevant ones, improving the performance and efficiency of defect prediction; and an improved classification algorithm builds a defect prediction model on the historical version data set to predict the defect proneness of each module of the current version under test, achieving accurate and effective prediction. The corresponding parameters of the prediction model are also recorded and updated to serve as support data for predicting test defects of the power grid information system. The invention helps software testers identify potentially defective software modules before testing, so that test resources are allocated more effectively and the efficiency and quality of software testing improve.
Drawings
FIG. 1 is a general schematic diagram of a defect prediction method based on grid information system feature selection of the present invention;
FIG. 2 is a flow chart of a defect prediction method based on grid information system feature selection of the present invention.
Detailed Description
As power grid information systems mature, the various information systems of a power grid accumulate more and more historical versions. For a continuously developed software project with historical versions, the defect data of those versions can be mined before testing, guided by test experience, and a defect prediction model can be built with data mining and machine learning algorithms to predict the defect status of each module of a subsequent version. Software defects are not randomly distributed; their distribution follows regular patterns. By mining historical defect data and analyzing these patterns, defect-prone software modules can be predicted accurately, and most test resources can be allocated to them rather than spent on modules that are not defect-prone. Test resources are thus allocated effectively and the efficiency of software testing improves markedly, without compromising test quality.
For the project software code that is continuously developed and applied in power grid information systems, the embodiment of the invention takes into account, before software testing, the influence of the data distribution of the data sets on defect prediction, and provides a defect prediction method based on power grid information system feature selection. In general terms, the method recommends similar historical version data so that the data distribution of the historical version data set is consistent with that of the to-be-tested version data set, and looks for the distribution pattern of software defects in the historical versions. The method further selects the features strongly related to defects with a feature selection algorithm, builds a defect prediction model with an improved AdaBoost classification algorithm for training and analysis, and records and updates the corresponding parameters of the prediction model as support data for predicting test defects of the power grid information system.
A specific description of the steps for carrying out the method of the invention is given below with reference to the accompanying drawings. It should be noted that the steps described below are for the purpose of illustrating the invention only and are not limiting of the invention.
As shown in FIG. 1 and FIG. 2, in step 1, a historical version data set and a to-be-tested version data set are constructed.
In an embodiment, the historical version data set may be constructed from the features and instances (number of defective and defect-free instances) of the historical-version tests of the information systems used by the power grid. Features are the software metrics of the information system software; they include code metrics and process metrics. For example, while the informatization test projects of the power grid's informatization department are running, data mining is performed on all code repositories, version control systems and the like of projects that are developed in Java, target different applications, and have several consecutive versions. The classes in each historical version module are recorded, defect-related software metrics are designed (such as cyclomatic complexity and the number of changed code lines), the historical version modules are measured, and each is labeled as defective or defect-free. Software metrics are indicators and parameters that describe the characteristics of a software product, and can also be understood as software features. At present they are mainly divided into code metrics and process metrics. Code metrics, chiefly cyclomatic complexity, describe the complexity of the code structure. Process metrics are based on code changes, developer information, and the development process; they mainly include the number of changes, the number of developers, and the number of changed code lines. Whether a historical version module instance is labeled defective or defect-free may be determined from the metric values of that instance together with experience such as historical test records.
The constructed historical version data set is expressed as $DATA = \{(a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i), \ldots, (a_n, b_n)\}$ with $a_i = (f_{i,1}, f_{i,2}, \ldots, f_{i,j}, \ldots, f_{i,d})$, where $a_i$ denotes a software module instance, $b_i \in Y$ denotes the class of that instance, $Y = \{\text{defective}, \text{defect-free}\}$, $n$ is the number of instances, $f_{i,j}$ is the value of the $j$-th software metric of instance $a_i$, and $d$ is the number of software metrics.
For the to-be-tested version data set, the feature indicators and parameters of the version under test, that is, the value of each software metric in each software module instance, are obtained using the same metrics.
The resulting historical version data set and to-be-tested version data set are collectively referred to as the original data set.
In step 2, the data in the constructed historical version data set and to-be-tested version data set are preprocessed.
The data recorded in the original data set are preprocessed as follows: data consistency is checked, the data are normalized, obviously distorted data are removed, and the results are organized and stored. Different software metrics have different value ranges. Missing values are filled using a random forest, chosen because different feature values influence defects to different degrees, and the value range of each software metric is normalized to [0,1] by the max-min method, which removes the influence of differing value ranges on the defect prediction result. The normalization formula is:
$$p_{i,j} = \frac{q_{i,j} - \min(q_j)}{\max(q_j) - \min(q_j)}$$
where $p_{i,j}$ is the value of the $j$-th software metric of the $i$-th software module after normalization, $q_{i,j}$ is the value of the $j$-th software metric of the $i$-th software module before normalization, $\min(q_j)$ is the minimum value of the $j$-th software metric over all software modules, and $\max(q_j)$ is the corresponding maximum value. In the description of the present invention, the terms software module, software module instance, and instance are used interchangeably.
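As an illustration of the max-min normalization described above, the following is a minimal sketch in plain Python. The function name `min_max_normalize`, the sample values, and the handling of a constant column (mapped to 0.0) are assumptions made for the example; the description does not state how a metric with identical minimum and maximum is treated, nor whether the two data sets are scaled jointly or separately.

```python
from typing import List


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column (one software metric per column) into [0, 1]."""
    if not rows:
        return []
    n_cols = len(rows[0])
    col_min = [min(r[j] for r in rows) for j in range(n_cols)]
    col_max = [max(r[j] for r in rows) for j in range(n_cols)]
    normalized = []
    for r in rows:
        scaled = []
        for j in range(n_cols):
            span = col_max[j] - col_min[j]
            # Assumption: a constant metric carries no information, so map it to 0.0
            # instead of dividing by zero.
            scaled.append((r[j] - col_min[j]) / span if span > 0 else 0.0)
        normalized.append(scaled)
    return normalized


# Hypothetical metric values: three modules, two metrics each.
modules = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
normalized = min_max_normalize(modules)
```

Each column is scaled independently, so a metric with a large raw range (for example, changed code lines) no longer dominates distance computations in later steps.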
In step 3, a training set and a test set are constructed from the preprocessed data.
Instance recommendation is optional. If developers, the development environment and the like did not change during project development, that is, if the data distribution of the historical version data set is consistent with that of the current to-be-tested version data set, instance recommendation can be omitted; otherwise it is performed. As a software project continues to evolve, the developers, development environment and the like differ between versions, and the data distribution of the data set changes accordingly. Instance recommendation selects the useful data and improves prediction performance.
Specifically, for each instance in the current to-be-tested version data set, the k nearest instances by Euclidean distance are selected from the historical version data set. An instance that appears among the k nearest neighbors of several instances is taken only once, which yields a new data set. After repeated tests of the effect of k on the algorithm, k is set to 8. The Euclidean distance formula is prior art and is not described here.
The historical version data set obtained through this processing is used as the training set, and the to-be-tested version data set is used as the test set.
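The instance recommendation step can be sketched as follows. This is an illustrative sketch only: the names `select_training_instances`, `history`, and `target`, the tuple layout of an instance, and the toy data are assumptions, and the default k = 8 simply mirrors the value reported in the text.

```python
import math
from typing import List, Sequence, Tuple

# (metric vector, label) where label 1 = defective, 0 = defect-free (assumed encoding)
Instance = Tuple[Sequence[float], int]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_training_instances(
    history: List[Instance], target: List[Sequence[float]], k: int = 8
) -> List[Instance]:
    """For every target instance keep its k nearest historical instances, each once."""
    chosen = set()  # indices into `history`; a set removes repeated neighbours
    for t in target:
        ranked = sorted(range(len(history)), key=lambda i: euclidean(history[i][0], t))
        chosen.update(ranked[:k])
    return [history[i] for i in sorted(chosen)]


# Hypothetical normalized data: four historical modules, two modules under test.
history = [([0.0, 0.0], 0), ([0.1, 0.1], 1), ([0.9, 0.9], 0), ([1.0, 1.0], 1)]
target = [[0.05, 0.05], [0.0, 0.1]]
training_set = select_training_instances(history, target, k=2)
```

In this toy case both modules under test sit near the first two historical modules, so only those two are kept and the distant ones are dropped from the training set.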
In step 4, class-imbalance processing is performed on the training set.
In most cases defect-free module instances far outnumber defective module instances, so the training set suffers from class imbalance. When classifying an imbalanced data set, correctly classifying minority class samples is often more important than correctly classifying majority class samples. The invention applies the SMOTE method to the training set so that the number of defective module instances (minority class samples) balances the number of defect-free module instances (majority class samples), yielding a class-balanced training set.
SMOTE sampling processes the minority class to generate additional minority class data and thereby balance the data set. The algorithm improves on random oversampling. First, the k nearest minority class neighbors of a minority class sample x are found; this k need not equal the k used for neighbor selection in step 3. A sampling rate N is set according to the imbalance ratio of the data. Letting $x_n$ be a minority class sample among the k nearest neighbors of x, a new sample is generated according to the following formula:
$$x_{new} = x + \mathrm{rand}(0,1) \cdot |x - x_n|$$
The complete steps are as follows:
Step 1. For a random minority class instance p, the Euclidean distance from p to every other minority class instance is measured to obtain its k nearest neighbors.
Step 2. R < k neighbors are drawn at random with replacement.
Step 3. Each of the R drawn instances forms a straight line with instance p; a point is taken at random on that line to generate a new sample. Repeating this for every drawn instance generates R new instances in total.
Step 4. The new points are added to the sample set.
Simple random oversampling increases the minority class by randomly copying minority samples, so the samples it adds are blind and of limited value. SMOTE instead uses linear interpolation and synthesizes new minority class samples according to specific rules. It therefore increases the number of minority class samples while avoiding the shrinking decision region that duplicated samples cause, which limits overfitting to some extent and improves classifier performance.
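The four SMOTE steps can be illustrated with a short standard-library sketch. It follows the line-segment description of Step 3, interpolating as p + gap * (neighbor - p) with a single random gap per new sample, which is the usual SMOTE interpolation. The function name `smote`, its parameters, the seed, and the toy minority set are assumptions for the example.

```python
import math
import random
from typing import List, Sequence


def smote(
    minority: List[Sequence[float]], n_new: int, k: int = 5, r: int = 2, seed: int = 0
) -> List[List[float]]:
    """Generate n_new synthetic minority samples by interpolating towards neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority instances")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic: List[List[float]] = []
    while len(synthetic) < n_new:
        # Step 1: pick a random minority instance p and find its k nearest neighbours.
        i = rng.randrange(len(minority))
        p = minority[i]
        others = [q for j, q in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda q: math.dist(p, q))[:k]
        # Step 2: draw r neighbours with replacement.
        for _ in range(r):
            q = rng.choice(neighbours)
            # Step 3: take a random point on the segment between p and q.
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(p, q)])
    # Step 4: the caller appends these points to the training set.
    return synthetic[:n_new]


# Hypothetical defective-module instances with two normalized metrics.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote(minority, n_new=4, k=2, r=2, seed=42)
```

Because every synthetic point lies on a segment between two real minority instances, the new samples stay inside the region the minority class already occupies rather than duplicating existing points.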
In step 5, feature selection is performed according to the class balance training set and the data set of the current version to be tested after normalization processing.
The features of the training set are ranked with a feature ranking method; features strongly related to defects are selected and irrelevant features are removed. Feature selection is optional. If it is performed, a ranked feature list is obtained with the ReliefF algorithm (RF). Given the configured number of features to select, that many top-ranked features are taken from the list and displayed by serial number and name. Finally, the selected features are kept in both the class-balanced training set and the normalized current to-be-tested version data set and the remaining features are removed, yielding the feature-selected training set and the feature-selected test set.
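A simplified two-class version of ReliefF ranking is sketched below to show how a ranked feature list and the final column selection might look. It is an assumption-laden sketch: it visits every instance instead of a random sample, assumes metrics already normalized to [0, 1] so that the absolute difference serves as the feature distance, and uses hypothetical names (`relieff_weights`, `top_features`, `keep_columns`) and toy data.

```python
import math
from typing import List, Sequence


def relieff_weights(X: List[Sequence[float]], y: List[int], k: int = 3) -> List[float]:
    """Score features: reward separation from nearest misses, penalize spread among hits."""
    n, d = len(X), len(X[0])
    weights = [0.0] * d
    for i in range(n):
        by_distance = lambda j: math.dist(X[i], X[j])
        hits = sorted((j for j in range(n) if j != i and y[j] == y[i]), key=by_distance)[:k]
        misses = sorted((j for j in range(n) if y[j] != y[i]), key=by_distance)[:k]
        for f in range(d):
            if hits:
                weights[f] -= sum(abs(X[i][f] - X[j][f]) for j in hits) / (len(hits) * n)
            if misses:
                weights[f] += sum(abs(X[i][f] - X[j][f]) for j in misses) / (len(misses) * n)
    return weights


def top_features(weights: List[float], m: int) -> List[int]:
    """Indices of the m highest-scoring features, best first."""
    return sorted(range(len(weights)), key=lambda f: weights[f], reverse=True)[:m]


def keep_columns(X: List[Sequence[float]], keep: List[int]) -> List[List[float]]:
    """Apply the same column selection to the training set and the test set."""
    return [[row[f] for f in keep] for row in X]


# Hypothetical data: metric 0 tracks the label, metric 1 is noise.
X_train = [[0.0, 0.3], [0.1, 0.9], [0.9, 0.2], [1.0, 0.8]]
y_train = [0, 0, 1, 1]
weights = relieff_weights(X_train, y_train, k=1)
selected = top_features(weights, m=1)
X_train_selected = keep_columns(X_train, selected)
```

The same `selected` index list would be passed to `keep_columns` for the to-be-tested version data set, so that the training set and test set end up with identical feature columns.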
In step 6, a classification algorithm is used to build a defect prediction model on the historical version data set after instance recommendation and feature selection. The feature-selected test set is input, the defect status of each module of the current version under test is predicted, and the defect prediction result of each module is returned.
The classification algorithm is improved to further raise the precision and recall of minority class identification. Most traditional classification algorithms assume equal misclassification costs and take overall classification accuracy as the final goal. On an imbalanced data set they therefore tend to classify minority class samples as the majority class, because doing so raises overall accuracy. However, when classifying an imbalanced data set, correctly classifying minority class samples is often more important than correctly classifying majority class samples. Cost-sensitive learning is built on this idea: misclassifying a minority class sample carries a higher cost. In this embodiment, the processed data set is classified by a defect prediction model built with an improved AdaBoost classification algorithm, which improves the classifier's performance on minority class samples.
Unlike the traditional AdaBoost algorithm, the method introduces a cost matrix into the weight update formula, changing how AdaBoost updates weights so that misclassified minority class samples receive higher weights while the weights of correctly classified samples are reduced. Specifically, the sample weight update formula in AdaBoost,
Figure BDA0002998620810000061
is changed to
Figure BDA0002998620810000062
Here β(i) is the cost-sensitive function for a given cost matrix. The power grid data set processed in the present invention has no well-defined cost matrix, so β(i) is implemented by directly assigning a coefficient K (K > 1). When the weak classifier classifies the sample correctly, β(i) = 1 and the sample weight decreases as usual; when a minority class sample is classified as the majority class, β(i) = K and the sample weight increases at a faster rate; when a majority class sample is classified as the minority class, β(i) = 1 and the sample weight increases as usual. β(i) is called the cost-sensitive compensation parameter of the i-th instance. In this way the weights of misclassified minority class samples are raised and the recognition rate of minority class samples improves more quickly.
The specific flow of the improved AdaBoost algorithm is as follows:
Input: the training set after feature processing; the number of iterations T; a base learning algorithm.
Output: the combined classifier.
Step 1. Initialize the sample weights in the training set to $D_1(i) = 1/n$.
Step 2. For t = 1, …, T, train the t-th weak classifier $h_t(x)$ on the training set and calculate the error rate $\varepsilon_t$ (also called the error) of the classifier in the t-th iteration:
Figure BDA0002998620810000071
Step 3. Evaluate the error. If $\varepsilon_t > 0.5$ or $\varepsilon_t = 0$, the classifier is unqualified and the iteration terminates; otherwise the sample weights are updated according to the following formula:
Figure BDA0002998620810000072
$\alpha_t$ is the weight of the weak classifier,
Figure BDA0002998620810000073
for the normalization constant,
Figure BDA0002998620810000074
Step 4. Output the combined classifier:
Figure BDA0002998620810000075
The cost of misclassifying minority class samples is added to the weight update formula so that misclassified minority class samples receive more weight. In this way the prediction accuracy on minority class samples improves more within the same number of iterations.
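The flow above can be sketched in plain Python under explicit assumptions, since the patent's formula images are not available in this text. The sketch assumes labels +1 (defective, minority) and -1 (defect-free), decision stumps as the base learner, the textbook AdaBoost quantities (weighted error, $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$, sign of the weighted vote), and that β(i) enters as a multiplicative factor on the unnormalized weight. The exact placement of β(i) in the patented formula may differ, and all names (`train`, `predict`, `update_weights`, `K`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(X, y, w) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = (0, 0.0, 1), float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w) if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def update_weights(w, y, preds, alpha: float, K: float) -> List[float]:
    """Assumed update: beta = K only when a minority (+1) sample is predicted as -1."""
    raw = []
    for wi, yi, pi in zip(w, y, preds):
        beta = K if (yi == 1 and pi == -1) else 1.0
        raw.append(wi * beta * math.exp(-alpha * yi * pi))
    z = sum(raw)  # normalization constant
    return [v / z for v in raw]


def train(X, y, rounds: int = 10, K: float = 2.0) -> List[Tuple[float, Stump]]:
    w = [1.0 / len(X)] * len(X)  # Step 1: uniform initial weights
    ensemble: List[Tuple[float, Stump]] = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)  # Step 2
        if err > 0.5 or err < 1e-12:  # Step 3: stop on an unqualified classifier
            if err < 1e-12 and not ensemble:
                ensemble.append((1.0, stump))  # assumption: keep a perfect first stump
            break
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        w = update_weights(w, y, [stump_predict(stump, xi) for xi in X], alpha, K)
    return ensemble


def predict(ensemble, x: Sequence[float]) -> int:
    """Step 4: sign of the alpha-weighted vote (ties resolve to -1)."""
    return 1 if sum(a * stump_predict(s, x) for a, s in ensemble) > 0 else -1


# Hypothetical one-metric data sets.
X_demo, y_demo = [[0.1], [0.2], [0.8], [0.9]], [-1, -1, 1, 1]
model = train(X_demo, y_demo, rounds=5, K=2.0)
X_hard = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8]]
y_hard = [-1, -1, 1, -1, -1, -1, 1, 1]
hard_model = train(X_hard, y_hard, rounds=5, K=2.0)
```

With K > 1, a defective module that a weak classifier misses gains weight faster than a defect-free module that is wrongly flagged, so later weak classifiers concentrate on the missed defective modules.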
On the one hand, the method helps software testers predict which software modules are likely to be defective before testing, provides the corresponding data support, guides the effective allocation of test resources, and improves test efficiency. On the other hand, it helps analyze the causes of software defects, improve the software development process, and raise the development quality of subsequent versions.
After instance selection and class-imbalance processing of the historical version data set, key features are chosen by feature selection to obtain the feature-selected test set and training set. To further improve defect prediction for the minority class, the AdaBoost prediction model is optimized, and prediction models built with a decision tree, the AdaBoost algorithm, and the improved AdaBoost algorithm are compared experimentally. Taking into account both the false positives and the missed detections of the prediction model on minority class instances, the F1 score of the improved classification model proposed by the invention is about 5% higher than that of the decision tree and of AdaBoost. In other words, the method improves the precision of minority class prediction while maintaining recall, and effectively improves prediction performance.
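For reference, the F1 score used in this comparison combines precision (which penalizes false alarms) and recall (which penalizes missed defective modules). The sketch below computes it from confusion counts for the defective class; the function name and the counts are purely illustrative and are not results from the patent.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for the defective (minority) class from true/false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 defective modules found, 2 false alarms, 2 missed.
example_f1 = f1_score(tp=8, fp=2, fn=2)
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well by flagging every module as defective or by flagging none.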
In view of the characteristics of power information systems, and based on the continuous software code with multiple historical versions found in typical power information system applications, the invention uses software defect detection technology to mine the defect data of historical software versions before testing, and uses data mining and machine learning algorithms to build a defect prediction model that effectively predicts the defect status of each module of subsequent versions. The invention can be used in software defect detection and software defect prediction solutions.
In another embodiment, a defect prediction apparatus based on grid information system feature selection is provided, comprising:
the data acquisition module is used for acquiring a historical version data set and a version data set of the software to be tested and performing normalization processing;
the training set construction module is used for calculating the similarity between each instance in the version data set to be tested and each instance in the historical version data set, and selecting k instances nearest to each instance in the version data set to be tested from the historical version data set according to the similarity to construct a training set;
the training set processing module is used for performing class-imbalance processing on the training set to obtain a class-balanced training set;
the feature selection module is used for performing feature selection on the class-balanced training set and the normalized to-be-tested version data set;
the prediction module is used for predicting, based on the feature-selected training set and test set, the defect status of each module of the software under test with a defect prediction model built by a classification algorithm, and obtaining the defect prediction result of each module of the software under test.
Wherein, the data acquisition module includes:
a historical version data set acquisition unit for recording the classes in the software modules of the historical version, measuring, according to the defect-related software metrics, whether the software modules of the historical version are defective, and obtaining a historical version data set expressed as DATA = {(a_1, b_1), (a_2, b_2), …, (a_i, b_i), …, (a_n, b_n)}, a_i = (f_{i,1}, f_{i,2}, …, f_{i,j}, …, f_{i,d}), wherein a_i represents a software module instance, b_i represents the class of the instance, b_i ∈ Y, Y = {defective, non-defective}, n represents the number of instances, f_{i,j} represents the value of the j-th software metric of instance a_i, and d represents the number of software metrics;
a to-be-tested version data set acquisition unit for acquiring the classes in the software modules of the version to be tested, and obtaining the value of each software metric in each software module instance according to the defect-related software metrics;
a normalization processing unit for filling missing values using a random forest and normalizing the value range of each software metric by the min-max method, according to the following formula:
$$p_{i,j} = \frac{q_{i,j} - \min(q_j)}{\max(q_j) - \min(q_j)}$$
wherein p_{i,j} represents the value of the j-th software metric of the i-th software module after normalization, q_{i,j} represents the value of the j-th software metric of the i-th software module before normalization, min(q_j) represents the minimum value of the j-th software metric over all software modules, and max(q_j) represents the maximum value of the j-th software metric over all software modules.
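The min-max normalization described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `min_max_normalize` is hypothetical, and mapping a constant-valued metric to 0 is an assumption made here to avoid division by zero. Missing-value filling is not shown.

```python
import numpy as np

def min_max_normalize(Q):
    """Scale every column (software metric) of Q into the range [0, 1]."""
    Q = np.asarray(Q, dtype=float)
    col_min = Q.min(axis=0)            # min(q_j) over all software modules
    col_max = Q.max(axis=0)            # max(q_j) over all software modules
    span = col_max - col_min
    # A constant metric would divide by zero; map it to 0 (assumption).
    safe_span = np.where(span == 0, 1.0, span)
    return (Q - col_min) / safe_span

# Hypothetical metric table: 3 modules, 2 metrics (the second is constant).
Q = [[10, 2], [20, 2], [30, 2]]
P = min_max_normalize(Q)
```

The text does not state whether the minimum and maximum are taken separately for the historical and to-be-tested data sets or over both together, so the sketch normalizes whatever table it is given.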
As a preferred embodiment, the training set processing module performs class imbalance processing on the training set by using the SMOTE sampling algorithm, and the training set processing module specifically includes:
a minority class neighbor determining unit for measuring, for a random minority class instance p, the Euclidean distance from p to all other minority class instances, to obtain its k nearest neighbor instances;
a neighbor extraction unit for randomly drawing R (R < k) neighbors with replacement;
a new sample generation unit for forming, for each of the R instances drawn by the neighbor extraction unit, a line between that instance and instance p, and randomly taking a point on the line as a new sample, so as to generate R new samples in total; and
a training set updating unit for adding the new samples generated by the new sample generation unit to the training set to obtain a class-balanced training set.
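The SMOTE steps above can be sketched for a single minority class instance p. This is a minimal illustration under stated assumptions: the function name `smote_for_instance` and its parameters are hypothetical, and the new point is taken on the segment between p and the drawn neighbor, as in the usual SMOTE formulation.

```python
import numpy as np

def smote_for_instance(minority, p_index, k, R, rng):
    """Generate R synthetic samples around the minority instance at p_index."""
    minority = np.asarray(minority, dtype=float)
    p = minority[p_index]
    # Euclidean distance from p to every minority instance.
    dists = np.linalg.norm(minority - p, axis=1)
    dists[p_index] = np.inf                 # exclude p itself
    neighbors = np.argsort(dists)[:k]       # indices of the k nearest neighbors
    # Draw R neighbors with replacement.
    chosen = rng.choice(neighbors, size=R, replace=True)
    # Pick a random position on the segment between p and each drawn neighbor.
    gaps = rng.random(R)[:, None]
    return p + gaps * (minority[chosen] - p)

rng = np.random.default_rng(0)
# Hypothetical minority class instances (already normalized metrics).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]
new_samples = smote_for_instance(minority, p_index=0, k=3, R=2, rng=rng)
```

Repeating this for further minority instances and appending the results to the training set, each labeled with the minority class, yields the class-balanced training set.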
As a preferred embodiment, the feature selection module obtains a ranked feature list through the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains those features in the class-balanced training set and in the normalized data set of the version to be tested, and removes the remaining features, thereby obtaining a feature-selected training set and a feature-selected test set.
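The feature ranking and selection step can be sketched as below. This is a simplified two-class ReliefF, not necessarily the exact variant used in the patent: it visits every training instance instead of a random sample, uses Manhattan distance on normalized metrics, and assumes each class has more than k instances. The function names `relieff_scores` and `select_top_features` are hypothetical.

```python
import numpy as np

def relieff_scores(X, y, k=1):
    """Score features: reward separation from misses, penalize distance to hits."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    weights = np.zeros(d)
    idx = np.arange(n)
    for i in range(n):
        diffs = np.abs(X - X[i])            # per-feature differences to instance i
        dists = diffs.sum(axis=1)           # Manhattan distance to instance i
        same = np.where((y == y[i]) & (idx != i))[0]
        other = np.where(y != y[i])[0]
        hits = same[np.argsort(dists[same])[:k]]       # k nearest same-class
        misses = other[np.argsort(dists[other])[:k]]   # k nearest other-class
        weights -= diffs[hits].mean(axis=0) / n
        weights += diffs[misses].mean(axis=0) / n
    return weights

def select_top_features(train_X, test_X, scores, num_features):
    """Keep the same top-ranked feature columns in both data sets."""
    ranked = np.argsort(scores)[::-1]
    kept = np.sort(ranked[:num_features])
    return np.asarray(train_X)[:, kept], np.asarray(test_X)[:, kept], kept

# Hypothetical data: metric 0 separates the classes, metric 1 does not.
train_X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.5], [0.8, 0.4], [0.9, 0.8], [1.0, 0.6]]
train_y = [0, 0, 0, 1, 1, 1]
test_X = [[0.15, 0.7], [0.85, 0.2]]
scores = relieff_scores(train_X, train_y, k=1)
train_sel, test_sel, kept = select_top_features(train_X, test_X, scores, 1)
```

The ranking is computed on the training set only and the same columns are then kept in the test set, so both sets end up with identical features.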
As a preferred embodiment, the prediction module comprises a defect prediction model construction unit and a defect prediction unit, wherein the defect prediction model construction unit constructs and trains a defect prediction model using an improved Adaboost algorithm, and the defect prediction unit applies the trained defect prediction model to the feature-selected test set data to predict the defect condition of each module of the software to be tested, obtaining a defect prediction result for each module of the software to be tested;
the defect prediction model construction unit comprises:
an initialization unit for initializing the sample weights in the training set to D_1(i) = 1/n, n being the number of instances;
an iterative execution unit for iteratively, for t = 1, …, T, training the t-th weak classifier h_t(x) on the training set and calculating the error rate ε_t of the classifier of the t-th iteration:
Figure BDA0002998620810000091
wherein T is the number of iterations and y_i is the class of the i-th instance in the training set;
wherein, when ε_t is smaller than a preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
Figure BDA0002998620810000092
wherein α_t is the weight of the weak classifier,
Figure BDA0002998620810000101
is the normalization constant, and β(i) is a cost-sensitive compensation parameter,
Figure BDA0002998620810000102
an output unit for outputting the combined classifier:
Figure BDA0002998620810000103
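The boosting loop can be illustrated with a sketch of standard discrete AdaBoost over decision stumps. The error-rate, weight-update and β(i) formulas of the improved algorithm survive only as figure references in this text, so the cost-sensitive compensation term is deliberately omitted and the stopping rule is the conventional one (stop when the weighted error reaches `max_error`); both are assumptions, as are all function names. Labels are assumed to be +1 (defective) and -1 (non-defective).

```python
import numpy as np

def stump_predict(X, j, thr, polarity):
    """Predict -polarity where feature j <= thr, +polarity elsewhere."""
    return np.where(X[:, j] <= thr, -polarity, polarity)

def fit_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with minimal weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                err = w[stump_predict(X, j, thr, polarity) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, T=10, max_error=0.5):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = len(y)
    D = np.full(n, 1.0 / n)                 # D_1(i) = 1/n
    ensemble = []
    for _ in range(T):
        err, j, thr, pol = fit_stump(X, y, D)
        if err >= max_error:                # weak learner no better than chance
            break
        err = max(err, 1e-10)               # guard against log of infinity
        alpha = 0.5 * np.log((1 - err) / err)   # weight of the weak classifier
        pred = stump_predict(X, j, thr, pol)
        D = D * np.exp(-alpha * y * pred)   # raise weights of misclassified samples
        D /= D.sum()                        # divide by the normalization constant
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(ensemble, X):
    """Sign of the alpha-weighted vote of all weak classifiers."""
    X = np.asarray(X, dtype=float)
    score = sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in ensemble)
    return np.sign(score)

# Hypothetical one-metric training set.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [-1, -1, -1, 1, 1, 1]
ensemble = adaboost_fit(X, y, T=5)
predictions = adaboost_predict(ensemble, X)
```

A cost-sensitive variant would typically scale the weight update per sample, for instance so that misclassified defective modules gain weight faster, but the exact form of β(i) cannot be recovered from this text.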
As a preferred embodiment, the defect prediction device further includes an optimization module for taking the feature-selected data set of the version to be tested as a test set and optimizing and updating the prediction model.
It should be understood that the defect prediction device based on the feature selection of the grid information system in the embodiment of the present invention may implement all the technical solutions in the above method embodiments, and the functions of each functional module may be specifically implemented according to the method in the above method embodiments, and specific implementation processes and calculation formulas that are not described in detail in the device embodiments may refer to the relevant descriptions in the above embodiments.
Based on the same technical concept as the method embodiment, according to another embodiment of the present invention, there is provided a computer apparatus including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the programs, when executed by the processors, implement the steps in the method embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the claims.

Claims (6)

1. A defect prediction method based on grid information system feature selection, the method comprising the steps of:
(1) Acquiring a historical version data set and a to-be-tested version data set of the software to be tested and performing normalization processing, wherein acquiring the historical version data set of the software to be tested comprises: recording the classes in the software modules of the historical version, measuring, according to the defect-related software metrics, whether the software modules of the historical version are defective, and obtaining a historical version data set expressed as DATA = {(a_1, b_1), (a_2, b_2), …, (a_i, b_i), …, (a_n, b_n)}, a_i = (f_{i,1}, f_{i,2}, …, f_{i,j}, …, f_{i,d}), wherein a_i represents a software module instance, b_i represents the class of the instance, b_i ∈ Y, Y = {defective, non-defective}, n represents the number of instances, f_{i,j} represents the value of the j-th software metric of instance a_i, and d represents the number of software metrics;
acquiring the to-be-tested version data set comprises: acquiring the classes in the software modules of the version to be tested, and obtaining the value of each software metric in each software module instance according to the defect-related software metrics;
the normalization processing comprises: filling missing values using a random forest, and normalizing the value range of each software metric by the min-max method, according to the following formula:
$$p_{i,j} = \frac{q_{i,j} - \min(q_j)}{\max(q_j) - \min(q_j)}$$
wherein p_{i,j} represents the value of the j-th software metric of the i-th software module after normalization, q_{i,j} represents the value of the j-th software metric of the i-th software module before normalization, min(q_j) represents the minimum value of the j-th software metric over all software modules, and max(q_j) represents the maximum value of the j-th software metric over all software modules;
(2) Calculating the similarity between each instance in the version data set to be tested and each instance in the historical version data set, selecting k instances nearest to each instance in the version data set to be tested from the historical version data set according to the similarity, and constructing a training set;
(3) Performing class imbalance processing on the training set to obtain a class-balanced training set;
(4) Performing feature selection on the class-balanced training set and the normalized data set of the version to be tested, comprising: obtaining a ranked feature list through the ReliefF algorithm, selecting a specified number of top-ranked features from the list, retaining those features in the class-balanced training set and in the normalized data set of the version to be tested, and removing the remaining features, to obtain a feature-selected training set and a feature-selected test set;
(5) Based on the feature-selected training set and test set, predicting the defect condition of each module of the software to be tested by using a defect prediction model constructed by a classification algorithm, to obtain a defect prediction result for each module of the software to be tested, wherein the defect prediction model is constructed using an improved Adaboost algorithm, comprising:
(5-1) initializing the sample weights in the training set to D_1(i) = 1/n, n being the number of instances;
(5-2) for t = 1, …, T, iteratively training the t-th weak classifier h_t(x) on the training set and calculating the error rate ε_t of the classifier of the t-th iteration:
Figure FDA0004207012420000021
wherein T is the number of iterations and y_i is the class of the i-th instance in the training set;
(5-3) when ε_t is smaller than a preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, updating the sample weights according to the following formula:
Figure FDA0004207012420000022
wherein α_t is the weight of the weak classifier,
Figure FDA0004207012420000023
is the normalization constant, and β(i) is a cost-sensitive compensation parameter,
Figure FDA0004207012420000024
(5-4) outputting a combined classifier:
Figure FDA0004207012420000025
2. The defect prediction method based on grid information system feature selection according to claim 1, wherein in step (2) the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set is calculated using the Euclidean distance.
3. The defect prediction method based on grid information system feature selection according to claim 1, wherein step (3) performs class imbalance processing on the training set by using the SMOTE sampling algorithm, comprising the following steps:
(3-1) for a random minority class instance p, measuring the Euclidean distance from p to all other minority class instances to obtain its k nearest neighbor instances;
(3-2) randomly drawing R (R < k) neighbors with replacement;
(3-3) for each of the R instances, forming a line between that instance and instance p, and randomly taking a point on the line as a new sample, so as to generate R new samples in total;
(3-4) adding the newly generated samples to the training set to obtain a class-balanced training set.
4. A defect prediction device based on grid information system feature selection, comprising:
the data acquisition module is used for acquiring a historical version data set and a version data set of the software to be tested and performing normalization processing;
the training set construction module is used for calculating the similarity between each instance in the version data set to be tested and each instance in the historical version data set, and selecting k instances nearest to each instance in the version data set to be tested from the historical version data set according to the similarity to construct a training set;
the training set processing module is used for performing class imbalance processing on the training set to obtain a class-balanced training set;
the feature selection module is used for performing feature selection on the class-balanced training set and the normalized data set of the version to be tested;
the prediction module is used for predicting, based on the feature-selected training set and test set, the defect condition of each module of the software to be tested by using a defect prediction model constructed by a classification algorithm, to obtain a defect prediction result for each module of the software to be tested;
wherein, the data acquisition module includes:
a historical version data set acquisition unit for recording the classes in the software modules of the historical version, measuring, according to the defect-related software metrics, whether the software modules of the historical version are defective, and obtaining a historical version data set expressed as DATA = {(a_1, b_1), (a_2, b_2), …, (a_i, b_i), …, (a_n, b_n)}, a_i = (f_{i,1}, f_{i,2}, …, f_{i,j}, …, f_{i,d}), wherein a_i represents a software module instance, b_i represents the class of the instance, b_i ∈ Y, Y = {defective, non-defective}, n represents the number of instances, f_{i,j} represents the value of the j-th software metric of instance a_i, and d represents the number of software metrics;
a to-be-tested version data set acquisition unit for acquiring the classes in the software modules of the version to be tested, and obtaining the value of each software metric in each software module instance according to the defect-related software metrics;
a normalization processing unit for filling missing values using a random forest and normalizing the value range of each software metric by the min-max method, according to the following formula:
$$p_{i,j} = \frac{q_{i,j} - \min(q_j)}{\max(q_j) - \min(q_j)}$$
wherein p_{i,j} represents the value of the j-th software metric of the i-th software module after normalization, q_{i,j} represents the value of the j-th software metric of the i-th software module before normalization, min(q_j) represents the minimum value of the j-th software metric over all software modules, and max(q_j) represents the maximum value of the j-th software metric over all software modules;
the feature selection module obtains a ranked feature list through the ReliefF algorithm, selects a specified number of top-ranked features from the list, retains those features in the class-balanced training set and in the normalized data set of the version to be tested, and removes the remaining features, to obtain a feature-selected training set and a feature-selected test set;
the prediction module comprises a defect prediction model construction unit and a defect prediction unit, wherein the defect prediction model construction unit constructs and trains a defect prediction model using an improved Adaboost algorithm, and the defect prediction unit uses the trained defect prediction model to predict, based on the feature-selected test set data, the defect condition of each module of the software to be tested, obtaining a defect prediction result for each module of the software to be tested,
the defect prediction model construction unit includes:
an initialization unit for initializing the sample weights in the training set to D_1(i) = 1/n, n being the number of instances;
an iterative execution unit for iteratively, for t = 1, …, T, training the t-th weak classifier h_t(x) on the training set and calculating the error rate ε_t of the classifier of the t-th iteration:
Figure FDA0004207012420000041
wherein T is the number of iterations and y_i is the class of the i-th instance in the training set;
wherein, when ε_t is smaller than a preset threshold, the classifier is unqualified and the iteration is terminated; otherwise, the sample weights are updated according to the following formula:
Figure FDA0004207012420000042
wherein α_t is the weight of the weak classifier,
Figure FDA0004207012420000043
is the normalization constant, and β(i) is a cost-sensitive compensation parameter,
Figure FDA0004207012420000044
an output unit for outputting the combined classifier:
Figure FDA0004207012420000045
5. The defect prediction device based on grid information system feature selection according to claim 4, wherein the training set construction module calculates the similarity between each instance in the to-be-tested version data set and each instance in the historical version data set using the Euclidean distance.
6. The defect prediction device based on grid information system feature selection according to claim 4, wherein the training set processing module performs class imbalance processing on the training set by using the SMOTE sampling algorithm, and the training set processing module specifically includes:
a minority class neighbor determining unit for measuring, for a random minority class instance p, the Euclidean distance from p to all other minority class instances, to obtain its k nearest neighbor instances;
a neighbor extraction unit for randomly drawing R (R < k) neighbors with replacement;
a new sample generation unit for forming, for each of the R instances drawn by the neighbor extraction unit, a line between that instance and instance p, and randomly taking a point on the line as a new sample, so as to generate R new samples in total; and
a training set updating unit for adding the new samples generated by the new sample generation unit to the training set to obtain a class-balanced training set.
CN202110339177.4A 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection Active CN113127342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110339177.4A CN113127342B (en) 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110339177.4A CN113127342B (en) 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection

Publications (2)

Publication Number Publication Date
CN113127342A CN113127342A (en) 2021-07-16
CN113127342B true CN113127342B (en) 2023-06-09

Family

ID=76774868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110339177.4A Active CN113127342B (en) 2021-03-30 2021-03-30 Defect prediction method and device based on power grid information system feature selection

Country Status (1)

Country Link
CN (1) CN113127342B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114356641B (en) * 2022-03-04 2022-05-27 中南大学 Incremental software defect prediction method, system, equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677564A (en) * 2016-01-04 2016-06-15 中国石油大学(华东) Adaboost software defect unbalanced data classification method based on improvement
WO2017131263A1 (en) * 2016-01-29 2017-08-03 한국과학기술원 Hybrid instance selection method using nearest neighboring point for cross project defect prediction
CN108563556A (en) * 2018-01-10 2018-09-21 江苏工程职业技术学院 Software defect prediction optimization method based on differential evolution algorithm
CN109977028A (en) * 2019-04-08 2019-07-05 燕山大学 A kind of Software Defects Predict Methods based on genetic algorithm and random forest
AU2020100709A4 (en) * 2020-05-05 2020-06-11 Bao, Yuhang Mr A method of prediction model based on random forest algorithm
WO2020199345A1 (en) * 2019-04-02 2020-10-08 广东石油化工学院 Semi-supervised and heterogeneous software defect prediction algorithm employing github
CN112465040A (en) * 2020-12-01 2021-03-09 杭州电子科技大学 Software defect prediction method based on class imbalance learning algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180210944A1 (en) * 2017-01-26 2018-07-26 Agt International Gmbh Data fusion and classification with imbalanced datasets

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
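Software defect prediction based on transfer learning; 程铭 (Cheng Ming); 毋国庆 (Wu Guoqing); 袁梦霆 (Yuan Mengting); Acta Electronica Sinica (电子学报), No. 01; full text *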

Also Published As

Publication number Publication date
CN113127342A (en) 2021-07-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant