CN106991049A - Software defect prediction method and prediction system
- Publication number: CN106991049A
- Application number: CN201710212286.3A
- Authority: CN (China)
- Prior art keywords: sample, node, adjacency graph, weight, class
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3608—Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/192—Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
- G06V30/194—References adjustable by an adaptive method, e.g. learning
Abstract
The invention discloses a software defect prediction method. The method processes class-labeled and unlabeled samples together by applying semi-supervised learning to Laplacian eigenmaps (LE), thereby improving the LE method. In addition, to avoid mapping samples of different classes into a small low-dimensional neighborhood, and in particular to avoid mapping defective samples into the neighborhood of defect-free samples, cost-sensitive information is introduced when the LE algorithm computes the distances between sample points, which improves the mapping accuracy of LE. The method effectively improves the discriminability of the extracted features. The invention also provides a software defect prediction system. Experiments on the NASA datasets verify the effectiveness of the proposed method, which shows some improvement in classification performance over the compared methods.
Description
Technical field
The present invention relates to a software defect prediction method and prediction system, and belongs to the field of software engineering.
Background art
Software defect prediction comprises four processes: data preprocessing, feature extraction, training of the prediction model, and identification. Feature extraction is one of the most fundamental problems in software defect prediction: extracting effective features is the primary task for completing identification.
Existing feature extraction methods can be divided into traditional dimensionality reduction methods and manifold learning dimensionality reduction methods. Traditional dimensionality reduction methods include principal component analysis (PCA) and multidimensional scaling (MDS). Manifold learning dimensionality reduction methods include isometric mapping (ISOMAP), Laplacian eigenmaps (LE), locality preserving projection (LPP), and others.
(1) Principal component analysis (PCA): its core idea is to map the original sample data linearly into a lower-dimensional space such that the features of the projected data are linearly uncorrelated. After projection, high-dimensional data are mapped to low-dimensional data in the new space, which achieves data reduction. The method ultimately seeks the vectors $v$ that satisfy:
$$S_t v = \lambda v \qquad (1)$$
where $S_t$ denotes the total scatter matrix and $\lambda$ is the eigenvalue corresponding to $v$.
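Equation (1) can be illustrated with a minimal sketch. It assumes NumPy, samples stored as rows, and a hypothetical helper name `pca_project`; none of these come from the source.

```python
import numpy as np

def pca_project(X, r):
    """Project the rows of X (n samples x d features) onto the r leading
    eigenvectors of the total scatter matrix S_t (illustrative helper)."""
    Xc = X - X.mean(axis=0)                 # center the samples
    S_t = Xc.T @ Xc                         # total scatter matrix (d x d)
    eigvals, eigvecs = np.linalg.eigh(S_t)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:r]   # indices of the r largest eigenvalues
    V = eigvecs[:, order]                   # columns satisfy S_t v = lambda v
    return Xc @ V, eigvals[order], V
```

The projected columns are uncorrelated because the eigenvectors of the symmetric matrix $S_t$ are orthogonal.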
(2) Multidimensional scaling (MDS): MDS mines the hidden structure in data by analyzing similar data. Given the distances between the original samples, the algorithm reconstructs samples in a low-dimensional space so that the distances between samples in the low-dimensional space are as close as possible to the distances between the original samples in the high-dimensional space. The discrepancy between the distances of the reconstructed low-dimensional samples and the distances of the original high-dimensional samples is expressed by an error function, and solving this error function yields the mapped data matrix $Z$, where $Z = \{z_i, l_i\}$, $i = 1, \ldots, n$, and $d_{i,j} = d(x_i, x_j) = (x_i - x_j)^T (x_i - x_j)$.
(3) Isometric mapping (ISOMAP): an improvement on MDS. ISOMAP replaces the Euclidean distance used in MDS with the geodesic distance, which yields a globally optimal geometric structure and preserves the low-dimensional manifold structure. The geodesic distance is computed in two cases. If two samples are close to each other, that is, they are neighbors, the Euclidean distance is used directly as an approximation of the geodesic distance. If two samples are far apart, that is, they are not neighbors, the geodesic distance is represented by the shortest path on the neighborhood graph. Once the geodesic distances are obtained, the MDS method is applied to perform the isometric mapping and obtain the low-dimensional representation of the data.
ISOMAP can reflect the intrinsic properties of nonlinear data, but it does not consider the local relationships among data samples, and during dimensionality reduction it may produce the "Elbow" error. The main cause of this shortcoming is the geodesic distance measure.
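The two-case geodesic distance described above can be sketched as follows. The sketch assumes a k-nearest-neighbor graph and the Floyd-Warshall shortest-path algorithm; the source names neither, and `geodesic_distances` is a hypothetical helper.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances: Euclidean distance between neighbors,
    shortest path on the neighborhood graph otherwise (illustrative helper)."""
    n = len(points)

    def dist(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    D = [[dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    G = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        G[i][i] = 0.0
        others = sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])
        for j in others[:k]:              # neighbors keep their Euclidean distance
            G[i][j] = G[j][i] = D[i][j]
    for m in range(n):                    # Floyd-Warshall for non-neighbors
        for i in range(n):
            for j in range(n):
                if G[i][m] + G[m][j] < G[i][j]:
                    G[i][j] = G[i][m] + G[m][j]
    return G
```

For three points placed on an L shape, the geodesic distance between the two end points follows the corner and is therefore longer than their straight-line distance.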
(4) Laplacian eigenmaps (LE): LE constructs the neighborhood relationships between data points from a local perspective, so that the local structure in the high-dimensional space is preserved in the mapping. Its dimensionality reduction objective is also intuitive and easy to understand.
To ensure that samples that are neighbors in the high-dimensional space remain neighbors in the projected space, that is, to preserve the neighborhood structure of the samples, let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space. The LE algorithm then solves the following minimization problem:
$$\min_y \sum_{i,j} \|y_i - y_j\|^2 W_{ij} \qquad (3)$$
where $W$ is the weight matrix. This problem can finally be converted into the generalized eigenvalue problem:
$$L f = \lambda D f \qquad (4)$$
where $D$ is a diagonal matrix whose entries are the column sums (or row sums) of $W$, i.e. $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the Laplacian matrix, which is symmetric and positive semidefinite.
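A minimal sketch of solving $L f = \lambda D f$ follows. It assumes NumPy, a symmetric weight matrix, and a connected graph in which every node has positive degree; the function name `laplacian_eigenmaps` is illustrative only. The problem is symmetrized as $D^{-1/2} L D^{-1/2} u = \lambda u$ with $f = D^{-1/2} u$.

```python
import numpy as np

def laplacian_eigenmaps(W, r):
    """Return the r smallest non-trivial generalized eigenpairs of L f = lambda D f.
    Assumes W is symmetric and every node has positive degree."""
    d = W.sum(axis=1)                       # degrees, D_ii
    L = np.diag(d) - W                      # graph Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, U = np.linalg.eigh(L_sym)      # ascending eigenvalues
    F = d_inv_sqrt[:, None] * U             # f = D^{-1/2} u
    # Skip the first pair (lambda = 0, constant vector), which carries no structure.
    return eigvals[1:r + 1], F[:, 1:r + 1]
```

Row $i$ of the returned matrix is the low-dimensional coordinate of sample $i$.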
(5) Locality preserving projection (LPP): LPP is a linear version of the LE algorithm. LPP computes a transformation matrix $P$ that projects the input samples $X = [x_1, x_2, \ldots, x_n]$ from the high-dimensional space into a low-dimensional subspace in which the local structure of the input samples is preserved. The transformation $P$ is obtained from:
$$\min_P \sum_{i,j} \|y_i - y_j\|^2 S_{ij} \qquad (5)$$
where $y_i = P^T x_i$ and the weight matrix $S$ is constructed from the neighborhood graph.
The problem is finally converted into the generalized eigenvalue problem:
$$X L X^T P = \lambda X D X^T P \qquad (6)$$
where $D$ is a diagonal matrix with $D_{ii} = \sum_j S(i, j)$, and $L = D - S$.
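Equation (6) can be sketched in the same way. The sketch assumes NumPy, samples stored as the columns of $X$ as in the text, and a small ridge term `reg` added to $X D X^T$ so that a Cholesky factorization exists; the ridge term and the name `lpp` are assumptions, not part of the source.

```python
import numpy as np

def lpp(X, S, r, reg=1e-9):
    """Solve X L X^T p = lambda X D X^T p for the r smallest eigenvalues.
    X is d x n (columns are samples); reg is an assumed ridge term."""
    D = np.diag(S.sum(axis=1))
    L = D - S
    A = X @ L @ X.T
    Bm = X @ D @ X.T + reg * np.eye(X.shape[0])
    C = np.linalg.cholesky(Bm)              # Bm = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ A @ C_inv.T                 # symmetric standard eigenproblem
    eigvals, U = np.linalg.eigh(0.5 * (M + M.T))
    P = C_inv.T @ U[:, :r]                  # map back: p = C^{-T} u
    return eigvals[:r], P
```

A new sample $x$ is then reduced with $y = P^T x$, which is the practical advantage of LPP over LE: the mapping is explicit and linear.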
Traditional dimensionality reduction methods have shortcomings. Principal component analysis (PCA) has a well-developed theory and is computationally efficient, and it reduces dimensionality well for data sets whose internal structure is linear, but when faced with linearly inseparable data PCA cannot reflect the nonlinear nature of the data. Multidimensional scaling (MDS) likewise cannot handle sample data with nonlinear structure well. Among the manifold learning dimensionality reduction methods, isometric mapping (ISOMAP) can reflect the intrinsic properties of nonlinear data, but like MDS it is a global dimensionality reduction algorithm: it does not consider the local relationships among data samples, and it may produce the "Elbow" error during dimensionality reduction. Laplacian eigenmaps (LE) and locality preserving projection (LPP) can handle the local information of samples, but they do not take the class information of the samples into account.
Summary of the invention
The technical problem to be solved by the invention is to provide, in view of the deficiencies of the prior art, a method and a system for software defect prediction based on cost-sensitive semi-supervised Laplacian eigenmaps (CSSLE), which can effectively improve the discriminability of the extracted features.
To solve the above technical problem, the invention adopts the following technical scheme.
The invention provides a software defect prediction method comprising the following steps:
Step 1: perform dimensionality reduction on the training sample set to obtain the training sample data set projected into a low-dimensional space, specifically:
(1) Divide the samples in the sample set into labeled samples and unlabeled samples, and further divide the labeled samples into defective samples and defect-free samples; then build three classes of adjacency graphs:
For the first-class adjacency graph, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to the same class and are neighbors;
For the second-class adjacency graph, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to different classes and are neighbors;
For the third-class adjacency graph, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes are both unlabeled samples and are neighbors;
(2) For each adjacency graph, determine the distance weights between sample points according to the connections between nodes, where cost-sensitive information is introduced when computing the distance weights for the second-class adjacency graph;
(3) Following the principle of the Laplacian eigenmaps algorithm, establish an objective function from the distance weights determined in step (2) and the distances between the mapped sample points, convert the objective function into a generalized eigenvalue equation, solve the equation to obtain the eigenvector matrix, and from it obtain the sample set projected into the low-dimensional space;
Step 2: perform dimensionality reduction on the test sample set following the procedure of Step 1 to obtain the reduced test sample data set;
Step 3: using a naive Bayes classifier, train a prediction model on the training sample data set obtained in Step 1, predict the classes of the test sample data set obtained in Step 2, and obtain the software defect prediction result.
Further, in the software defect prediction method of the invention, in sub-step (1) of Step 1:
The training sample set is $X = \{x_i, l\}$, where $x_i \in R^d$ is a training sample, $d$ is the dimension of the training samples, $i = 1, 2, \ldots, n$, $n$ is the number of samples, and $l \in \{0, -1, 1\}$ is the class label of a sample: $-1$ denotes an unlabeled sample, $0$ a defect-free sample, and $1$ a defective sample. Two nodes are same-class samples if both are defective samples or both are defect-free samples; they are different-class samples if one is a defective sample and the other is a defect-free sample.
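The three adjacency graphs can be sketched as a partition of the neighbor pairs by label. This is a minimal sketch: the function name and the edge-set representation are assumptions, while the labels follow the text (-1 unlabeled, 0 defect-free, 1 defective).

```python
def build_adjacency_graphs(neighbor_edges, labels):
    """Split neighbor pairs (i, j) into the three adjacency graphs.
    labels[i] is -1 (unlabeled), 0 (defect-free) or 1 (defective)."""
    same_class, different_class, unlabeled = set(), set(), set()
    for i, j in neighbor_edges:
        li, lj = labels[i], labels[j]
        if li == -1 and lj == -1:
            unlabeled.add((i, j))            # third-class graph
        elif li != -1 and lj != -1:
            if li == lj:
                same_class.add((i, j))       # first-class graph
            else:
                different_class.add((i, j))  # second-class graph
        # A pair with exactly one unlabeled node matches none of the three rules.
    return same_class, different_class, unlabeled
```

Under these rules, a neighbor pair consisting of one labeled and one unlabeled sample contributes no edge to any graph.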
Further, in the software defect prediction method of the invention, in sub-step (2) of Step 1 the distance weights between sample points are determined as follows:
For the first-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$; otherwise $W_{ij} = 0$, where $t$ is the heat kernel width;
For the second-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is $B_{ij} = C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}$; otherwise $B_{ij} = 0$, where $C_{a,b}$ is the cost-sensitive parameter;
For the third-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is $S_{ij} = e^{-\|x_i - x_j\|^2 / t}$; otherwise $S_{ij} = 0$.
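A minimal sketch of the three weight matrices follows. It assumes the heat kernel has the usual form $e^{-\|x_i - x_j\|^2 / t}$ and that the cost-sensitive parameter multiplies the kernel value in the second-class graph; both forms, the `cost` dictionary keyed by `(a, b)`, and the function names are assumptions made for illustration.

```python
import math

def heat_kernel(xi, xj, t):
    """Heat kernel weight with width t (assumed form)."""
    sq = sum((p - q) ** 2 for p, q in zip(xi, xj))
    return math.exp(-sq / t)

def weight_matrices(X, labels, same, diff, unlabeled, t, cost):
    """Build W, B, S from the three edge sets. cost[(a, b)] is the assumed
    cost of misclassifying a class-a sample as class b."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for i, j in same:
        W[i][j] = W[j][i] = heat_kernel(X[i], X[j], t)
    for i, j in unlabeled:
        S[i][j] = S[j][i] = heat_kernel(X[i], X[j], t)
    for i, j in diff:
        h = heat_kernel(X[i], X[j], t)
        B[i][j] = cost[(labels[i], labels[j])] * h  # cost depends on direction
        B[j][i] = cost[(labels[j], labels[i])] * h
    return W, B, S
```

Because the two misclassification costs differ, $B$ built this way is not symmetric; a sum of the form $\sum_{i,j} \|y_i - y_j\|^2 B_{ij}$ depends only on the symmetric part $(B + B^T)/2$, so symmetrizing $B$ before an eigen-decomposition does not change such an objective.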
Further, in the software defect prediction method of the invention, sub-step (3) of Step 1 is specifically as follows:
A. Establish the objective function: let $y = [y_1, y_2, \ldots, y_n]$ be the samples projected into the low-dimensional space; the following maximization problem is then solved:
$$\max_y \frac{\sum_{i,j} \|y_i - y_j\|^2 B_{ij}}{\sum_{i,j} \|y_i - y_j\|^2 (W_{ij} + \alpha S_{ij})}$$
where $\alpha$ is a regulation parameter and $y_i$, $y_j$ are the mapped sample points, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$;
B. Convert the objective function of step A into the generalized eigenvalue problem $L_B a = \lambda L_T a$. Solving it gives the matrix $A = \{a_1, a_2, \ldots, a_r\}$, where $\lambda$ is an eigenvalue, $L_B = D_B - B$ and $L_T = D_T - T$, with $D_{B,ii} = \sum_j B_{ij}$ and $D_{T,ii} = \sum_j T_{ij}$; that is, $D_B$ and $D_T$ are diagonal matrices whose diagonal entries are the row (or column) sums of $B$ and $T$ respectively. $B$ is the weight matrix built from the second-class adjacency graph, and $T = W + \alpha S$, where $W$ is the weight matrix built from the first-class adjacency graph and $S$ is the weight matrix built from the third-class adjacency graph;
C. Obtain the projected samples $y$ from matrix $A$: $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$, $y_i = [f_1^i, f_2^i, \ldots, f_r^i]^T$, where $r$ is the number of eigenvectors in $A$ and $f_j^i$ is the $i$-th component of vector $f_j$, $j = 1, 2, \ldots, r$.
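Steps B and C can be sketched as follows, assuming NumPy. Two choices here are assumptions rather than part of the source: $B$ and $T$ are symmetrized, and a small ridge term `reg` is added to $L_T$, which is otherwise singular because a graph Laplacian annihilates the constant vector. The name `cssle_embed` is illustrative.

```python
import numpy as np

def cssle_embed(W, B, S, alpha, r, reg=1e-8):
    """Solve L_B a = lambda L_T a and keep the r largest eigenvalues.
    Returns the eigenvalues and the n x r matrix whose i-th row is y_i."""
    B = 0.5 * (B + B.T)                     # symmetric part of B (assumption)
    T = W + alpha * S
    T = 0.5 * (T + T.T)
    L_B = np.diag(B.sum(axis=1)) - B
    L_T = np.diag(T.sum(axis=1)) - T + reg * np.eye(len(T))  # assumed ridge
    C = np.linalg.cholesky(L_T)             # L_T = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ L_B @ C_inv.T               # equivalent symmetric problem
    eigvals, U = np.linalg.eigh(0.5 * (M + M.T))
    top = np.argsort(eigvals)[::-1][:r]     # indices of the r largest
    A = C_inv.T @ U[:, top]                 # columns a_j = C^{-T} u_j
    return eigvals[top], A
```

The constant vector satisfies $L_B a = 0$, so it corresponds to eigenvalue 0 and is automatically excluded when the largest eigenvalues are kept.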
Further, in the software defect prediction method of the invention, in sub-step (1) of Step 1, connecting edges are created according to the neighborhood relation between nodes using the ε-neighborhood method: if nodes $i$ and $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge is connected between nodes $i$ and $j$, where $\varepsilon$ is a preset value.
Further, in the software defect prediction method of the invention, in sub-step (1) of Step 1, connecting edges are created according to the neighborhood relation between nodes using the n-nearest-neighbor method: if node $i$ is among the $n$ nearest neighbors of node $j$, or node $j$ is among the $n$ nearest neighbors of node $i$, nodes $i$ and $j$ are connected by an edge.
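The two neighborhood rules can be sketched as below. The sketch reads the ε condition as a bound on the squared Euclidean distance, which is an interpretation of the source notation, and the function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two samples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def epsilon_neighbors(X, eps):
    """Pairs (i, j), i < j, whose squared distance is below eps."""
    n = len(X)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if sq_dist(X[i], X[j]) < eps}

def knn_neighbors(X, k):
    """Pairs (i, j), i < j, where i is among the k nearest neighbors of j
    or j is among the k nearest neighbors of i (symmetric by construction)."""
    n = len(X)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(X[i], X[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges
```

The nearest-neighbor rule guarantees that every node has at least one edge, whereas the ε rule can leave distant samples isolated if ε is chosen too small.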
Further, in the software defect prediction method of the invention, in sub-step (2) of Step 1 the distance weights between sample points are determined as follows:
For the first-class and third-class adjacency graphs, if nodes $i$ and $j$ are connected by an edge, the weight is set to 1; otherwise it is set to 0;
For the second-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is set to $C_{a,b}$; otherwise it is set to 0, where $C_{a,b}$ is the cost-sensitive parameter.
Further, in the software defect prediction method of the invention, the cost-sensitive parameter $C_{a,b}$ denotes the cost of misclassifying a class-$a$ sample as a class-$b$ sample, and its value is set experimentally. Either class $a$ refers to defective samples and class $b$ to defect-free samples, or class $a$ refers to defect-free samples and class $b$ to defective samples.
The invention also provides a software defect prediction system comprising a data preprocessing module, a dimensionality reduction module, and a training and prediction module.
The data preprocessing module obtains the training sample data set and the test sample data set and divides the samples in each sample set into labeled samples and unlabeled samples, the labeled samples being further divided into defective samples and defect-free samples;
The dimensionality reduction module performs dimensionality reduction on a sample set to obtain the sample data set projected into the low-dimensional space;
The training and prediction module uses a naive Bayes classifier to train a prediction model on the reduced training sample data set and to predict the classes of the reduced test sample data set, producing the software defect prediction result;
The dimensionality reduction module further comprises:
An adjacency graph construction unit for building the three classes of adjacency graphs, which comprises:
A first construction unit for building the first-class adjacency graph: all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to the same class and are neighbors;
A second construction unit for building the second-class adjacency graph: all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to different classes and are neighbors;
A third construction unit for building the third-class adjacency graph: all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes are both unlabeled samples and are neighbors;
A distance weight calculation unit that, for each adjacency graph, determines the distance weights between sample points according to the connections between nodes, introducing cost-sensitive information when computing the distance weights for the second-class adjacency graph;
A Laplacian eigenmaps unit that, following the principle of the Laplacian eigenmaps algorithm, establishes an objective function from the distance weights determined by the distance weight calculation unit and the distances between the mapped sample points, converts the objective function into a generalized eigenvalue equation, solves the equation to obtain the eigenvector matrix, and from it obtains the sample set projected into the low-dimensional space.
As a further refinement of the above software defect prediction system, the distance weight calculation unit comprises:
A first calculation unit for computing the distance weights between sample points in the first-class adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight is $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$; otherwise $W_{ij} = 0$, where $t$ is the heat kernel width;
A second calculation unit for computing the distance weights between sample points in the second-class adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight is $B_{ij} = C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}$; otherwise $B_{ij} = 0$, where $C_{a,b}$ is the cost-sensitive parameter;
A third calculation unit for computing the distance weights between sample points in the third-class adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight is $S_{ij} = e^{-\|x_i - x_j\|^2 / t}$; otherwise $S_{ij} = 0$.
By adopting the above technical scheme, the invention achieves the following beneficial effects compared with the prior art:
The invention applies semi-supervised learning to Laplacian eigenmaps for feature extraction. This not only preserves the local neighborhood structure of the samples while reducing their dimensionality and removing redundant features, but also exploits the class information in the samples by processing labeled and unlabeled sample data together, which improves the discriminative ability of the prediction model. In addition, to avoid mapping samples of different classes into a small low-dimensional neighborhood, and in particular to avoid mapping defective samples into the neighborhood of defect-free samples, cost-sensitive information is introduced when the LE algorithm computes the distances between sample points, which improves the mapping accuracy of LE. Experiments on the NASA datasets verify the effectiveness of the proposed method, which shows some improvement in classification performance over the compared methods.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the method of the invention.
Detailed description of the embodiments
The technical scheme of the invention is explained below with reference to the accompanying drawing.
As shown in Fig. 1, the invention comprises the following steps:
1. Build the adjacency graphs from the training sample set X: the samples in X are divided into defective samples, defect-free samples and unlabeled samples, and three adjacency graphs a, b and c are built.
For adjacency graph a, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to the same class and are neighbors;
For adjacency graph b, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to different classes and are neighbors;
For adjacency graph c, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes are both unlabeled samples and are neighbors.
Two methods can be used to define whether two nodes are neighbors:
(a) ε-neighborhood: if nodes $i$ and $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge can be connected between nodes $i$ and $j$.
(b) n nearest neighbors: nodes $i$ and $j$ are connected by an edge when node $i$ is among the $n$ nearest neighbors of node $j$ or node $j$ is among the $n$ nearest neighbors of node $i$; this relation is symmetric.
2. Select the weights: for graph a, if nodes $i$ and $j$ are connected by an edge, $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$; otherwise $W_{ij} = 0$. For graph b, the cost-sensitive parameter $C_{a,b}$ is added: if nodes $i$ and $j$ are connected by an edge, $B_{ij} = C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}$; otherwise $B_{ij} = 0$. For graph c, if nodes $i$ and $j$ are connected by an edge, $S_{ij} = e^{-\|x_i - x_j\|^2 / t}$; otherwise $S_{ij} = 0$. Here $e$ is the base of the natural logarithm.
3. Establish the objective function: let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space; the following maximization problem is then solved:
$$\max_y \frac{\sum_{i,j} \|y_i - y_j\|^2 B_{ij}}{\sum_{i,j} \|y_i - y_j\|^2 (W_{ij} + \alpha S_{ij})}$$
4. The objective function of step 3 can be converted into the generalized eigenvalue problem $L_B a = \lambda L_T a$. Solving it gives the matrix $A = \{a_1, a_2, \ldots, a_r\}$, whose vectors are the eigenvectors corresponding to the $r$ largest eigenvalues of the generalized eigenvalue equation.
5. The projected samples $y$ are obtained from $A$, where $y_i = [f_1^i, f_2^i, \ldots, f_r^i]^T$, $r$ is the number of eigenvectors in $A$, and $f_j^i$ is the $i$-th component of vector $f_j$, $j = 1, 2, \ldots, r$.
6. Following steps 1 to 5, perform dimensionality reduction on the test sample set Z to obtain the reduced sample set z.
7. Using a naive Bayes (NB) classifier, train a prediction model on the data set y obtained in step 5, predict the classes of the data set z obtained in step 6, and obtain the prediction result.
In steps 1 to 7: $X = \{x_i, l\}$, $i = 1, 2, \ldots, n$, where $x_i \in R^d$ is a training sample, $d$ is the dimension of the training samples, and $l \in \{0, -1, 1\}$ is the class label of a sample, with $-1$ denoting an unlabeled sample, $0$ a defect-free sample and $1$ a defective sample; $\alpha$ is a regulation parameter. Once matrix $A$ is obtained, the projected matrix $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$.
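The classification in step 7 can be sketched with a small naive Bayes classifier. The source names only "naive Bayes"; the Gaussian likelihood per reduced feature, the variance floor `var_floor`, and the function names below are assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(Y, labels, var_floor=1e-9):
    """Estimate a log prior plus per-feature mean and variance for each class.
    Y is a list of reduced feature vectors; labels are 0 or 1."""
    groups = defaultdict(list)
    for y, l in zip(Y, labels):
        groups[l].append(y)
    model = {}
    for l, rows in groups.items():
        cols = list(zip(*rows))
        means = [sum(c) / len(c) for c in cols]
        variances = [sum((v - m) ** 2 for v in c) / len(c) + var_floor
                     for c, m in zip(cols, means)]
        model[l] = (math.log(len(rows) / len(Y)), means, variances)
    return model

def predict_gaussian_nb(model, y):
    """Return the class with the highest log posterior for feature vector y."""
    def log_posterior(params):
        log_prior, means, variances = params
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(y, means, variances))
    return max(model, key=lambda l: log_posterior(model[l]))
```

Only labeled training samples (labels 0 and 1) would be passed to `fit_gaussian_nb`; the unlabeled samples influence the result through the dimensionality reduction alone.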
The principle of the invention is explained in further detail below:
1. Build the adjacency graphs
Let the sample set be $X = \{x_i, l\}$, $i = 1, 2, \ldots, n$, where $x_i \in R^d$, $d$ is the dimension of the samples, and $l \in \{0, -1, 1\}$ is the class label of a sample, with $-1$ denoting an unlabeled sample, $0$ a defect-free sample and $1$ a defective sample. Three adjacency graphs a, b and c are built. In adjacency graph a, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to the same class and are neighbors. In adjacency graph b, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes belong to different classes and are neighbors. In adjacency graph c, all samples in the sample set serve as the nodes of the graph, and a connecting edge is created if two nodes are both unlabeled samples and are neighbors.
Two methods can be used to define whether two nodes are neighbors. (a) ε-neighborhood: if $i$ and $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge can be connected between nodes $i$ and $j$. (b) n nearest neighbors: nodes $i$ and $j$ are connected by an edge when node $i$ is among the $n$ nearest neighbors of node $j$ or node $j$ is among the $n$ nearest neighbors of node $i$; this relation is symmetric. The invention uses the n-nearest-neighbor method.
2. Select the weights. Two methods can be used to select the weight of an edge:
(a) Heat kernel: if nodes $i$ and $j$ are connected by an edge, the weight is $e^{-\|x_i - x_j\|^2 / t}$; otherwise it is 0.
(b) Simple definition: if nodes $i$ and $j$ are connected by an edge, the weight is 1; otherwise it is 0. This simple method avoids having to choose the heat kernel width $t$.
In addition, software defect prediction involves two types of misclassification: a Type I error classifies a defective sample as defect-free, and a Type II error classifies a defect-free sample as defective. In software engineering practice, the cost of a Type I error is greater than that of a Type II error. The cost of a Type I error made when classifying a software module is denoted $C_{1,0}$, and the cost of a Type II error is denoted $C_{0,1}$; from the above analysis, $C_{1,0} > C_{0,1}$. The invention makes full use of the cost information of the samples to avoid different-class sample points that are far apart being embedded into a small neighborhood during dimensionality reduction. Because all nodes connected by an edge in the second-class adjacency graph are different-class samples, the cost-sensitive factor $C_{a,b}$ is introduced when building the distance weights of that graph, in order to improve the mapping accuracy of the method. Here $C_{a,b}$ denotes the cost of misclassifying a class-$a$ sample as a class-$b$ sample, and its value is set experimentally. Classes $a$ and $b$ each refer to defective samples or defect-free samples, with $a \neq b$: either class $a$ refers to defective samples and class $b$ to defect-free samples, or class $a$ refers to defect-free samples and class $b$ to defective samples.
The three kinds of weights are then defined as follows:
$$W_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t}, & \text{if nodes } i, j \text{ are connected in graph a} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$
$$B_{ij} = \begin{cases} C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}, & \text{if nodes } i, j \text{ are connected in graph b} \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$
$$S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t}, & \text{if nodes } i, j \text{ are connected in graph c} \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$
3. Construct the objective function
To ensure that samples that are neighbors in the high-dimensional space remain neighbors in the projected space, that is, to preserve the neighborhood structure of the samples, while keeping same-class samples as close together as possible and different-class samples as far apart as possible, let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space. The following maximization problem is then solved:
$$\max_y \frac{\sum_{i,j} \|y_i - y_j\|^2 B_{ij}}{\sum_{i,j} \|y_i - y_j\|^2 (W_{ij} + \alpha S_{ij})} \qquad (10)$$
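4. Obtain the projected samples y
Let $T_{ij} = W_{ij} + \alpha S_{ij}$. The objective function can then be converted into the following function to be solved:
$$\max_y \frac{\sum_{i,j} \|y_i - y_j\|^2 B_{ij}}{\sum_{i,j} \|y_i - y_j\|^2 T_{ij}} \qquad (11)$$
For a single projection vector $a = (a_1, \ldots, a_n)^T$, whose $i$-th component is the coordinate assigned to sample $i$, the numerator and denominator of formula (11) can be derived as follows:
$$\sum_{i,j} (a_i - a_j)^2 B_{ij} = 2 \sum_i a_i^2 D_{B,ii} - 2 \sum_{i,j} a_i a_j B_{ij} = 2\, a^T (D_B - B)\, a \qquad (12)$$
$$\sum_{i,j} (a_i - a_j)^2 T_{ij} = 2 \sum_i a_i^2 D_{T,ii} - 2 \sum_{i,j} a_i a_j T_{ij} = 2\, a^T (D_T - T)\, a \qquad (13)$$
where $D_{B,ii} = \sum_j B_{ij}$ and $D_{T,ii} = \sum_j T_{ij}$; that is, $D_B$ and $D_T$ are diagonal matrices whose diagonal entries are the row (or column) sums of $B$ and $T$ respectively. With $L_B = D_B - B$ and $L_T = D_T - T$, formula (11) can therefore be converted into the following objective function:
$$\max_a \frac{a^T L_B a}{a^T L_T a} \qquad (14)$$
Formula (14) can be converted into the following generalized eigenvalue problem:
$$L_B a = \lambda L_T a \qquad (15)$$
Solving (15) gives the matrix $A = \{a_1, a_2, \ldots, a_r\}$, whose vectors are the eigenvectors corresponding to the $r$ largest eigenvalues of the generalized eigenvalue equation. The projected samples $y$ are obtained from $A$: $y_i = [f_1^i, f_2^i, \ldots, f_r^i]^T$, where $f_j^i$ denotes the $i$-th component of vector $f_j$.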
The invention likewise provides a software defect prediction system comprising a data preprocessing module, a dimensionality reduction module, and a training and prediction module, where the dimensionality reduction module further comprises an adjacency graph construction unit, a distance weight calculation unit, and a Laplacian eigenmaps unit. The software defect prediction system provided by the embodiment of the invention corresponds to the software defect prediction method described above; for its specific working principle and usage, refer to the related content of the method embodiment, which is not repeated here.
The method of the invention was tested on the NASA datasets, and the experimental results were compared with those of related feature extraction methods such as PCA, LE and LPP.
The NASA database contains 10 software project sets, each collected from a software system or subproject of the U.S. National Aeronautics and Space Administration (NASA). In this experiment, 5 project sets in the database (CM1, MW1, PC1, PC3 and PC4) are selected for testing.
Method | CM1 | MW1 | PC1 | PC3 | PC4 |
PCA | 0.23 | 0.29 | 0.23 | 0.28 | 0.24 |
LE | 0.33 | 0.36 | 0.32 | 0.34 | 0.44 |
LPP | 0.38 | 0.43 | 0.29 | 0.40 | 0.46 |
CSSLE | 0.52 | 0.52 | 0.49 | 0.48 | 0.45 |
Table 1: Average Pd of each method on the 5 projects of the NASA data set
Method | CM1 | MW1 | PC1 | PC3 | PC4 |
PCA | 0.03 | 0.07 | 0.03 | 0.14 | 0.02 |
LE | 0.06 | 0.05 | 0.02 | 0.07 | 0.10 |
LPP | 0.05 | 0.07 | 0.04 | 0.16 | 0.13 |
CSSLE | 0.08 | 0.06 | 0.06 | 0.04 | 0.07 |
Table 2: Average Pf of each method on the 5 projects of the NASA data set
Method | CM1 | MW1 | PC1 | PC3 | PC4 |
PCA | 0.33 | 0.31 | 0.29 | 0.25 | 0.35 |
LE | 0.39 | 0.41 | 0.42 | 0.37 | 0.42 |
LPP | 0.44 | 0.43 | 0.34 | 0.32 | 0.41 |
CSSLE | 0.51 | 0.52 | 0.47 | 0.54 | 0.48 |
Table 3: Average F-measure of each method on the 5 projects of the NASA data set
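The three indicators reported in the tables are not defined in the text. The sketch below assumes the definitions that are standard in defect prediction: Pd is the probability of detection (recall on defective samples), Pf is the probability of false alarm, and F-measure is the harmonic mean of precision and Pd. The function name is illustrative.

```python
def defect_metrics(y_true, y_pred):
    """Compute (Pd, Pf, F-measure) with label 1 = defective, 0 = defect-free.
    Definitions are the standard ones, assumed here rather than taken from the source."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if tp + fn else 0.0          # detected defects
    pf = fp / (fp + tn) if fp + tn else 0.0          # false alarms
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * pd / (precision + pd)
                 if precision + pd else 0.0)
    return pd, pf, f_measure
```

Under these definitions a good predictor has high Pd and F-measure and low Pf, which is how the tables are read in the conclusions that follow.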
A close look at the results in the three tables above leads to the following conclusions:
(1) The method proposed by the present invention is generally higher than all comparison methods on the Pd and F-measure indices and generally lower than all comparison methods on the Pf index, which indicates that its prediction performance is generally better than that of all comparison methods.
(2) The proposed CSSLE method improves markedly over the LE method, which shows that introducing semi-supervised learning and cost sensitivity is effective: with both, the samples after feature extraction become more discriminative, which improves classification.
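One way to summarize the three tables is to average each index over the five projects. The sketch below does only that, using the values exactly as printed in Tables 1 to 3; the variable names are illustrative, and averaging across projects is an assumption of this sketch rather than a step described in the text.

```python
from statistics import mean

# Per-project values in the order CM1, MW1, PC1, PC3, PC4 (copied from Tables 1-3).
TABLES = {
    "Pd": {
        "PCA": [0.23, 0.29, 0.23, 0.28, 0.24],
        "LE": [0.33, 0.36, 0.32, 0.34, 0.44],
        "LPP": [0.38, 0.43, 0.29, 0.40, 0.46],
        "CSSLE": [0.52, 0.52, 0.49, 0.48, 0.45],
    },
    "Pf": {
        "PCA": [0.03, 0.07, 0.03, 0.14, 0.02],
        "LE": [0.06, 0.05, 0.02, 0.07, 0.10],
        "LPP": [0.05, 0.07, 0.04, 0.16, 0.13],
        "CSSLE": [0.08, 0.06, 0.06, 0.04, 0.07],
    },
    "F-measure": {
        "PCA": [0.33, 0.31, 0.29, 0.25, 0.35],
        "LE": [0.39, 0.41, 0.42, 0.37, 0.42],
        "LPP": [0.44, 0.43, 0.34, 0.32, 0.41],
        "CSSLE": [0.51, 0.52, 0.47, 0.54, 0.48],
    },
}

# Mean of each method over the five projects, rounded to three decimals.
averages = {
    index: {method: round(mean(values), 3) for method, values in rows.items()}
    for index, rows in TABLES.items()
}
best_pd = max(averages["Pd"], key=averages["Pd"].get)
best_f = max(averages["F-measure"], key=averages["F-measure"].get)
```

On these averages CSSLE leads on Pd (0.492) and F-measure (0.504). Its average Pf (0.062) is close to PCA (0.058) and LE (0.060) and below LPP (0.090), so the Pf comparison is best read project by project, for example on PC3.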
The above describes only some embodiments of the present invention, and the invention is not limited to the field of software engineering. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the invention. Beyond software, the method is equally applicable to other high-dimensional samples, such as face recognition and palmprint images, and such applications should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A software defect prediction method, characterized by comprising the following steps:
Step 1: performing dimensionality reduction on the training sample set to obtain a training sample data set projected into a low-dimensional space, specifically including:
(1) dividing the samples in the sample set into labeled samples and unlabeled samples, the labeled samples being further divided into defective samples and defect-free samples, and then constructing three classes of adjacency graphs, specifically:
for the first-class adjacency graph, all samples in the sample set are taken as the nodes of the graph, and a connecting edge is established between two nodes if they belong to the same class and are neighbors;
for the second-class adjacency graph, all samples in the sample set are taken as the nodes of the graph, and a connecting edge is established between two nodes if they belong to different classes and are neighbors;
for the third-class adjacency graph, all samples in the sample set are taken as the nodes of the graph, and a connecting edge is established between two nodes if both are unlabeled samples and are neighbors;
(2) for each adjacency graph, determining the distance weights between sample points according to the connections between nodes, wherein for the second-class adjacency graph cost-sensitive information is introduced when computing the distance weights;
(3) following the principle of the Laplacian eigenmaps algorithm, establishing an objective function from the distance weights between sample points determined in step (2) and the distances between the sample points after mapping, converting the objective function into a generalized eigenvalue equation, solving the equation to obtain an eigenvector matrix, and thereby obtaining the sample set projected into the low-dimensional space;
Step 2: performing dimensionality reduction on the test sample set according to the procedure of Step 1 to obtain the reduced test sample data set;
Step 3: using a naive Bayes classifier, training a prediction model on the training sample data set obtained in Step 1 and predicting the classes of the test sample data set obtained in Step 2, to obtain the software defect prediction result.
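Step 3 of claim 1 can be illustrated with a small classifier. The claim names only "a naive Bayes classifier"; the sketch below assumes the Gaussian variant and assumes that only labeled, already-projected training samples are passed in. The function names, the variance floor and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels, var_floor=1e-9):
    """Estimate the prior and per-feature mean/variance of each class."""
    grouped = defaultdict(list)
    for x, label in zip(samples, labels):
        grouped[label].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the largest log-posterior for one sample."""
    def log_posterior(params):
        prior, means, variances = params
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy 2-D "projected" samples: 0 = defect-free, 1 = defective.
train_y = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0], [1.1, 0.9]]
train_l = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(train_y, train_l)
predictions = [predict_gaussian_nb(model, y) for y in [[0.05, 0.1], [1.0, 1.0]]]
```

The classifier works on the low-dimensional samples produced by Steps 1 and 2, so its input dimension equals the number of retained eigenvectors rather than the original metric count.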
2. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1:
the training sample set is $X = \{x_i, l\}$, where $x_i$ denotes a training sample, $x_i \in R^d$, $d$ is the dimension of the training samples, $i = 1, 2, \ldots, n$, $n$ is the number of samples, and $l$ is the class label of a sample, $l \in \{0, -1, 1\}$, where $-1$ denotes an unlabeled sample, $0$ a defect-free sample and $1$ a defective sample; same-class samples means that two nodes are both defective samples or both defect-free samples, and different-class samples means that one of the two nodes is a defective sample and the other is a defect-free sample.
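The label convention of claim 2 and the three edge conditions of claim 1 can be sketched as follows. The constant and function names are hypothetical, and treating a pair made of one labeled and one unlabeled sample as belonging to none of the three graphs is this sketch's reading of the claims.

```python
from typing import Dict, List, Optional

UNLABELED, DEFECT_FREE, DEFECTIVE = -1, 0, 1

def split_by_label(labels: List[int]) -> Dict[str, List[int]]:
    """Group sample indices into defective, defect-free and unlabeled."""
    groups: Dict[str, List[int]] = {"defective": [], "defect_free": [], "unlabeled": []}
    names = {DEFECTIVE: "defective", DEFECT_FREE: "defect_free", UNLABELED: "unlabeled"}
    for index, label in enumerate(labels):
        groups[names[label]].append(index)
    return groups

def pair_type(label_i: int, label_j: int) -> Optional[str]:
    """Which adjacency graph a neighbouring pair may get an edge in, if any."""
    if label_i == UNLABELED and label_j == UNLABELED:
        return "third"                       # both unlabeled
    if UNLABELED in (label_i, label_j):
        return None                          # mixed pair: assumed to get no edge
    return "first" if label_i == label_j else "second"

groups = split_by_label([1, 0, -1, 1, -1])
```

An edge is only created when the pair is also a neighbor, so `pair_type` decides the graph while the neighborhood rule decides whether the edge exists.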
3. The software defect prediction method according to claim 1, characterized in that, in sub-step (2) of Step 1, the distance weights between sample points are determined as follows:
for the first-class adjacency graph, if nodes $i$, $j$ are connected by an edge, the weight $W_{ij}$ is given by [formula not reproduced in the source text]; otherwise $W_{ij} = 0$; $t$ is the heat kernel width;
for the second-class adjacency graph, if nodes $i$, $j$ are connected by an edge, the weight $B_{ij}$ is given by [formula not reproduced in the source text]; otherwise $B_{ij} = 0$; where $C_{a,b}$ is the cost-sensitive parameter;
for the third-class adjacency graph, if nodes $i$, $j$ are connected by an edge, the weight $S_{ij}$ is given by [formula not reproduced in the source text]; otherwise $S_{ij} = 0$.
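The weight formulas of claim 3 are not legible in this text, so the following is only a sketch under stated assumptions: it uses the heat kernel $\exp(-\|x_i - x_j\|^2 / t)$ that is customary for Laplacian eigenmaps, and it assumes that the cost-sensitive parameter $C_{a,b}$ simply scales that kernel in the second-class graph. Neither form is confirmed by the source, and all names and cost values are hypothetical.

```python
import math

def heat_kernel(xi, xj, t):
    """Assumed form exp(-||xi - xj||^2 / t); not taken from the source."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / t)

def heat_kernel_weights(samples, labels, neighbors, t, cost):
    """Return (W, B, S) for the first-, second- and third-class graphs.

    labels: 1 = defective, 0 = defect-free, -1 = unlabeled.
    neighbors[i][j]: True if i and j are neighbours (any neighbourhood rule).
    cost[(a, b)]: cost-sensitive parameter C_{a,b} (hypothetical values).
    """
    n = len(samples)
    W = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or not neighbors[i][j]:
                continue
            k = heat_kernel(samples[i], samples[j], t)
            li, lj = labels[i], labels[j]
            if li == -1 and lj == -1:
                S[i][j] = k                      # both unlabeled
            elif li == -1 or lj == -1:
                continue                         # mixed pair: no edge assumed
            elif li == lj:
                W[i][j] = k                      # same class
            else:
                B[i][j] = cost[(li, lj)] * k     # assumed: cost scales the kernel
    return W, B, S

samples = [[0.0], [1.0], [2.0], [3.0], [4.0]]
labels = [1, 1, 0, -1, -1]
neighbors = [[i != j for j in range(5)] for i in range(5)]   # everyone is a neighbour
W, B, S = heat_kernel_weights(samples, labels, neighbors, t=1.0,
                              cost={(1, 0): 5.0, (0, 1): 1.0})
```

With direction-dependent costs the matrix `B` is not symmetric; whether the patent intends that is not stated here.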
4. The software defect prediction method according to claim 3, characterized in that sub-step (3) of Step 1 is specifically as follows:
A. Establishing the objective function: assuming $y = [y_1, y_2, \ldots, y_n]$ are the samples projected into the low-dimensional space, the following maximization problem must be solved:
[objective function not reproduced in the source text]
where $\alpha$ denotes a regulation parameter and $y_i$, $y_j$ denote the sample points after mapping, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$;
B. Converting the objective function of step A into a generalized eigenvalue problem: $L_B a = \lambda L_T a$; solving it yields the matrix $A = \{a_1, a_2, \ldots, a_r\}$, where $\lambda$ denotes an eigenvalue; $L_B = D_B - B$ and $L_T = D_T - T$, where $D_B$ and $D_T$ are diagonal matrices whose diagonal elements are the sums of each row (or each column) of $B$ and $T$ respectively, i.e. $(D_B)_{ii} = \sum_j B_{ij}$ and $(D_T)_{ii} = \sum_j T_{ij}$; $B$ is the weight matrix built from the second-class adjacency graph, and $T = W + \alpha S$, where $W$ is the weight matrix built from the first-class adjacency graph and $S$ is the weight matrix built from the third-class adjacency graph;
C. Obtaining the projected samples $y$ from the matrix $A$: $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$ [formula not reproduced in the source text], where $r$ denotes the number of eigenvectors in $A$ and each entry of $y_i$ is the $i$-th component of the vector $f_j$, $j = 1, 2, \ldots, r$.
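Steps B and C of claim 4 can be sketched with NumPy. The sketch builds $T = W + \alpha S$, forms $L_B = D_B - B$ and $L_T = D_T - T$, and solves $L_B a = \lambda L_T a$ for the $r$ largest eigenvalues, since the objective is a maximization. Two things are assumptions of the sketch and not part of the claim: $B$ and $T$ are symmetrized, and a small ridge `reg` is added to $L_T$ because a graph Laplacian is singular. The function names and toy matrices are hypothetical.

```python
import numpy as np

def laplacian(M: np.ndarray) -> np.ndarray:
    """L = D - M, where D holds the row sums of M on its diagonal."""
    return np.diag(M.sum(axis=1)) - M

def project(W, B, S, alpha: float, r: int, reg: float = 1e-3):
    """Solve L_B a = lambda L_T a and return (eigenvalues, A) for the top r."""
    B = (B + B.T) / 2                          # assumption: symmetrize B
    T = W + alpha * S
    T = (T + T.T) / 2                          # assumption: symmetrize T
    L_B = laplacian(B)
    L_T = laplacian(T) + reg * np.eye(len(T))  # assumption: ridge keeps L_T invertible
    C = np.linalg.cholesky(L_T)                # L_T = C C^T
    C_inv = np.linalg.inv(C)
    M = C_inv @ L_B @ C_inv.T                  # standard symmetric problem M v = lambda v
    eigvals, V = np.linalg.eigh((M + M.T) / 2)
    order = np.argsort(eigvals)[::-1][:r]      # largest eigenvalues first
    A = C_inv.T @ V[:, order]                  # back-transform: a = C^{-T} v
    return eigvals[order], A

# Toy graphs on 6 samples: 0,1 defective; 2,3 defect-free; 4,5 unlabeled.
W = np.zeros((6, 6)); W[0, 1] = W[1, 0] = 1.0; W[2, 3] = W[3, 2] = 1.0
B = np.zeros((6, 6)); B[1, 2] = B[2, 1] = 5.0; B[0, 3] = B[3, 0] = 5.0
S = np.zeros((6, 6)); S[4, 5] = S[5, 4] = 1.0
eigvals, A = project(W, B, S, alpha=0.5, r=2)
y = A                                          # row i of A is the projected sample y_i
```

Row $i$ of `A` is read as the low-dimensional sample $y_i$, as in step C. The ridge value influences the result noticeably when $T$ has several disconnected components, which is the usual case here because same-class and unlabeled edges never link the three groups.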
5. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1, connecting edges are established according to the neighbor relationship between nodes using the ε-neighborhood method, that is: if nodes $i$, $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge is connected between nodes $i$ and $j$, where $\varepsilon$ is a preset value.
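The ε-neighborhood rule of claim 5 can be sketched in a few lines. The sketch assumes the "2" in the flattened expression is a square on the Euclidean norm, so it compares the squared distance with ε; the function name and sample values are hypothetical.

```python
def epsilon_neighbors(samples, eps):
    """Boolean adjacency: True where the squared Euclidean distance is below eps."""
    n = len(samples)
    adj = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))
            if sq_dist < eps:
                adj[i][j] = adj[j][i] = True   # the relation is symmetric
    return adj

adj_eps = epsilon_neighbors([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]], eps=2.0)
```

This rule yields a symmetric graph by construction, but a poorly chosen ε can leave nodes isolated or connect nearly everything.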
6. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1, connecting edges are established according to the neighbor relationship between nodes using the n-nearest-neighbor method, that is: if node $i$ is among the n nearest neighbors of node $j$, or node $j$ is among the n nearest neighbors of node $i$, then nodes $i$ and $j$ are connected by an edge.
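The nearest-neighbor rule of claim 6 can be sketched as below. To avoid a clash with $n$ as the number of samples, the sketch calls the neighbor count `k`; the "or" in the claim makes the resulting graph symmetric. The function name and data are hypothetical, and ties are broken by index order.

```python
def knn_adjacency(samples, k):
    """Connect i and j if either is among the other's k nearest neighbours."""
    n = len(samples)

    def sq_dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(samples[i], samples[j]))

    nearest = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i), key=lambda j: sq_dist(i, j))
        nearest.append(set(others[:k]))
    return [[(j in nearest[i]) or (i in nearest[j]) for j in range(n)]
            for i in range(n)]

adj_knn = knn_adjacency([[0.0], [1.0], [3.0], [10.0]], k=1)
```

Unlike the ε rule, every node gets at least `k` edges here, which keeps outlying samples such as the last one connected to the graph.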
7. The software defect prediction method according to claim 1, characterized in that, in sub-step (2) of Step 1, the distance weights between sample points are determined as follows:
for the first-class and third-class adjacency graphs, if nodes $i$, $j$ are connected by an edge, the weight is set to 1, and otherwise to 0;
for the second-class adjacency graph, if nodes $i$, $j$ are connected by an edge, the weight is set to $C_{a,b}$, and otherwise to 0, where $C_{a,b}$ is the cost-sensitive parameter.
8. The software defect prediction method according to claim 3 or 7, characterized in that the cost-sensitive parameter $C_{a,b}$ represents the cost of misclassifying a class-a sample as a class-b sample, and $C_{a,b}$ is a value set by experiment, where class a refers to defective samples and class b to defect-free samples, or class a refers to defect-free samples and class b to defective samples.
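The simplified weighting of claims 7 and 8 reduces to a lookup per connected pair. In the sketch below the cost table values are hypothetical, and indexing $C_{a,b}$ by the labels of nodes $i$ and $j$ in that order is an assumption, since the claims do not say which endpoint plays the role of class a.

```python
# Hypothetical costs: missing a defect (1 -> 0) is assumed dearer than a false alarm.
COST = {(1, 0): 10.0, (0, 1): 1.0}

def simple_edge_weight(label_i, label_j, cost=COST):
    """Graph name and weight for one connected pair under the 0/1 scheme.

    Labels: 1 = defective, 0 = defect-free, -1 = unlabeled.
    Returns (None, 0.0) for pairs that belong to none of the three graphs.
    """
    if label_i == -1 and label_j == -1:
        return "S", 1.0                            # third-class graph
    if -1 in (label_i, label_j):
        return None, 0.0                           # mixed pair: no edge assumed
    if label_i == label_j:
        return "W", 1.0                            # first-class graph
    return "B", cost[(label_i, label_j)]           # second-class graph, weight C_{a,b}

examples = [simple_edge_weight(a, b) for a, b in [(1, 1), (1, 0), (0, 1), (-1, -1), (-1, 0)]]
```

Compared with kernel-based weights, this scheme needs no width parameter, so $C_{a,b}$ is the only quantity left to tune.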
9. A software defect prediction system, characterized by comprising: a data preprocessing module, a dimensionality reduction module and a training and prediction module, wherein:
the data preprocessing module is configured to obtain the training sample data set and the test sample data set and to divide the samples in the sample set into labeled samples and unlabeled samples, the labeled samples being further divided into defective samples and defect-free samples;
the dimensionality reduction module is configured to perform dimensionality reduction on the sample set to obtain a sample data set projected into a low-dimensional space;
the training and prediction module is configured to use a naive Bayes classifier to train a prediction model on the reduced training sample data set and predict the classes of the reduced test sample data set, to obtain the software defect prediction result;
wherein the dimensionality reduction module further comprises:
an adjacency graph construction unit, configured to construct three classes of adjacency graphs, specifically comprising:
a first construction unit, configured to construct the first-class adjacency graph, specifically: all samples in the sample set are taken as the nodes of the graph, and a connecting edge is established between two nodes if they belong to the same class and are neighbors;
a second construction unit, configured to construct the second-class adjacency graph, specifically: all samples in the sample set are taken as the nodes of the graph, and a connecting edge is established between two nodes if they belong to different classes and are neighbors;
a third construction unit, configured to construct the third-class adjacency graph, specifically: all samples in the sample set are taken as the nodes of the graph, and a connecting edge is established between two nodes if both are unlabeled samples and are neighbors;
a distance weight calculation unit, configured to determine, for each adjacency graph, the distance weights between sample points according to the connections between nodes, wherein for the second-class adjacency graph cost-sensitive information is introduced when computing the distance weights;
a Laplacian eigenmaps unit, configured to follow the principle of the Laplacian eigenmaps algorithm: establish an objective function from the distance weights between sample points determined by the distance weight calculation unit and the distances between the sample points after mapping, convert the objective function into a generalized eigenvalue equation, solve the equation to obtain an eigenvector matrix, and thereby obtain the sample set projected into the low-dimensional space.
10. The software defect prediction system according to claim 9, characterized in that the distance weight calculation unit comprises:
a first calculation unit, configured to calculate the distance weights between sample points in the first-class adjacency graph, specifically: if nodes $i$, $j$ are connected by an edge, the weight $W_{ij}$ is given by [formula not reproduced in the source text]; otherwise $W_{ij} = 0$; $t$ is the heat kernel width;
a second calculation unit, configured to calculate the distance weights between sample points in the second-class adjacency graph, specifically: if nodes $i$, $j$ are connected by an edge, the weight $B_{ij}$ is given by [formula not reproduced in the source text]; otherwise $B_{ij} = 0$; where $C_{a,b}$ is the cost-sensitive parameter;
a third calculation unit, configured to calculate the distance weights between sample points in the third-class adjacency graph, specifically: if nodes $i$, $j$ are connected by an edge, the weight $S_{ij}$ is given by [formula not reproduced in the source text]; otherwise $S_{ij} = 0$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710212286.3A CN106991049B (en) | 2017-04-01 | 2017-04-01 | Software defect prediction method and prediction system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106991049A (en) | 2017-07-28 |
CN106991049B (en) | 2020-10-27 |
Family
ID=59414965
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710212286.3A Active CN106991049B (en) | 2017-04-01 | 2017-04-01 | Software defect prediction method and prediction system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106991049B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN106991049B (en) | 2020-10-27 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||