CN106991049A - Software defect prediction method and prediction system - Google Patents

Software defect prediction method and prediction system

Info

Publication number
CN106991049A
Authority
CN
China
Prior art keywords
sample
node
adjacent map
weight
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710212286.3A
Other languages
Chinese (zh)
Other versions
CN106991049B (en)
Inventor
史雪静
荆晓远
岳东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201710212286.3A
Publication of CN106991049A
Application granted
Publication of CN106991049B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/192 Recognition using electronic means using simultaneous comparisons or correlations of the image signals with a plurality of references
    • G06V30/194 References adjustable by an adaptive method, e.g. learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a software defect prediction method that processes labeled and unlabeled samples jointly, introducing semi-supervised learning into Laplacian eigenmaps (LE) to improve the LE method. In addition, to avoid mapping samples of different classes into a small low-dimensional neighborhood, and in particular to avoid mapping defective samples into the neighborhood of defect-free samples, cost-sensitive information is introduced when the LE algorithm computes the distances between sample points. This improves the mapping accuracy of LE and makes the extracted features more discriminative. The invention also proposes a software defect prediction system. The method was applied to the NASA datasets; the experiments verify its effectiveness and show some improvement in classification performance over the comparison methods.

Description

Software defect prediction method and prediction system
Technical field
The present invention relates to a software defect prediction method and prediction system and belongs to the field of software engineering.
Background
Software defect prediction comprises four stages: data preprocessing, feature extraction, training of the prediction model, and identification. Feature extraction is one of the most fundamental problems in software defect prediction: extracting effective features is the first prerequisite for completing identification.
Existing feature extraction methods can be divided into traditional dimensionality reduction methods and manifold learning dimensionality reduction methods. Traditional methods include principal component analysis (PCA) and multidimensional scaling (MDS). Manifold learning methods include isometric mapping (ISOMAP), Laplacian eigenmaps (LE), and locality preserving projection (LPP).
(1) Principal component analysis (PCA): the core idea is to map the original sample data linearly into a lower-dimensional space such that the features of the projected data are linearly uncorrelated. High-dimensional data are thereby mapped to low-dimensional data, which achieves dimensionality reduction. The method ultimately seeks the vectors $v$ that satisfy:
$S_t v = \lambda v$ (1)
where $S_t$ denotes the total scatter matrix and $\lambda$ is the eigenvalue corresponding to $v$.
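As an illustration of formula (1), the following minimal sketch builds the total scatter matrix of a small random dataset, solves the eigenvalue problem, and projects the data onto the leading eigenvectors. The function name `pca_project`, the use of NumPy, and the random test data are assumptions made for this example and do not come from the patent.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (n samples x d features) onto the top-k eigenvectors of S_t."""
    Xc = X - X.mean(axis=0)                  # center the data
    St = Xc.T @ Xc                           # total scatter matrix S_t (d x d)
    eigvals, eigvecs = np.linalg.eigh(St)    # S_t is symmetric; eigenvalues come back ascending
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    V = eigvecs[:, order]
    return Xc @ V, eigvals[order], V

# Illustrative data: 50 samples, 4 features with very different spreads
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.1])
Y, lam, V = pca_project(X, 2)
```

Because the columns of `V` are orthonormal eigenvectors of $S_t$, the projected features in `Y` are mutually uncorrelated, which is the property the paragraph above describes.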
(2) Multidimensional scaling (MDS): MDS mines hidden structure in the data by analyzing similarities between data points. Given the distances between the original samples, the algorithm reconstructs samples in a low-dimensional space so that the distances between samples in the low-dimensional space are as close as possible to their distances in the high-dimensional space. The discrepancy between the reconstructed low-dimensional distances and the original high-dimensional distances is expressed by an error function. Minimizing this error function gives the mapped matrix $Z$,
where $Z = \{z_i, l_i\}$, $i = 1, \ldots, n$, and $d_{i,j} = d(x_i, x_j) = (x_i - x_j)^T (x_i - x_j)$.
(3) Isometric mapping (ISOMAP): an improvement of MDS in which geodesic distance replaces the Euclidean distance used in MDS. This yields a globally optimal geometric structure and preserves the low-dimensional manifold structure. The geodesic distance is computed in two cases. If two samples are close together, i.e. they are neighbors, the Euclidean distance is used directly as an approximation of the geodesic distance. If two samples are far apart, i.e. they are not neighbors, the geodesic distance is represented by the shortest path on the neighborhood graph. Once the geodesic distances are obtained, MDS is applied to perform the isometric mapping and obtain the low-dimensional representation of the data.
ISOMAP can reflect the intrinsic properties of nonlinear data, but it does not consider the local relations among data samples, and during dimensionality reduction it may produce errors known as the "Elbow" phenomenon, mainly because of how the geodesic distance is measured.
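The two-case rule for geodesic distance can be sketched as follows. This is a minimal illustration, not the patent's implementation: the neighbor test (Euclidean distance below a hypothetical threshold `eps`) and the use of the Floyd-Warshall algorithm for shortest paths are assumptions chosen for brevity.

```python
import math

def geodesic_distances(points, eps):
    """Approximate geodesic distances: Euclidean for neighbors, shortest path otherwise."""
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d < eps:                      # neighbors: use the Euclidean distance directly
                dist[i][j] = dist[j][i] = d
    # Floyd-Warshall: non-neighbors get the shortest path length on the neighbor graph
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Five points along an L-shaped bend
pts = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
G = geodesic_distances(pts, eps=1.5)
```

For the two end points of the bend, the path along the neighbor graph is longer than the straight-line distance, which is the effect ISOMAP relies on to follow the shape of the data.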
(4) Laplacian eigenmaps (LE): LE constructs the neighbor relations among the data from a local perspective, so that local structural information in the high-dimensional space is preserved in the mapping. The LE dimensionality reduction objective is also intuitive to construct.
To keep samples that are neighbors in the high-dimensional space neighbors in the projected space, i.e. to preserve the neighborhood structure of the samples, let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space. The LE algorithm then solves a minimization problem over the mapped samples,
in which $W$ is the weight matrix. This problem can finally be converted into the following generalized eigenvalue problem:
$L f = \lambda D f$ (4)
where $D$ is a diagonal matrix whose elements are the column sums (or, equivalently, the row sums) of $W$, i.e. $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the Laplacian matrix, which is symmetric and positive semidefinite.
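The minimization formula is not reproduced in this text. In the standard formulation of Laplacian eigenmaps, which this description appears to follow (an assumption), the objective is $\sum_{i,j} \lVert y_i - y_j \rVert^2 W_{ij}$. For a single coordinate vector $f$ this sum equals $2 f^T L f$, which also explains why $L$ is positive semidefinite. The sketch below checks that identity on a tiny hand-made weight matrix; the helper names are illustrative.

```python
def laplacian(W):
    """Return D and L = D - W for a symmetric weight matrix W given as nested lists."""
    n = len(W)
    D = [[sum(W[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]
    L = [[D[i][j] - W[i][j] for j in range(n)] for i in range(n)]
    return D, L

def quad_form(M, f):
    """Compute the quadratic form f^T M f."""
    n = len(f)
    return sum(f[i] * M[i][j] * f[j] for i in range(n) for j in range(n))

# Illustrative symmetric weights on three nodes and an arbitrary 1-D embedding f
W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
f = [0.2, -1.0, 3.0]
D, L = laplacian(W)
pairwise = sum(W[i][j] * (f[i] - f[j]) ** 2 for i in range(3) for j in range(3))
```

The weighted pairwise sum `pairwise` and `2 * quad_form(L, f)` agree, so minimizing the pairwise objective amounts to minimizing a quadratic form in $L$.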
(5) Locality preserving projection (LPP): a linear approximation of the LE algorithm. LPP computes a transformation matrix $P$ that projects the high-dimensional input samples $X = [x_1, x_2, \ldots, x_n]$ into a low-dimensional subspace in which the local structure of the input samples is preserved. The transformation $P$ is obtained from an optimization problem
in which $y_i = P^T x_i$ and the weight matrix $S$ is constructed from the neighborhood graph.
This is finally converted into the following generalized eigenvalue problem:
$X L X^T P = \lambda X D X^T P$ (6)
where $D_{ii} = \sum_j S(i, j)$, $D$ is a diagonal matrix, and $L = D - S$.
The traditional dimensionality reduction methods have shortcomings. PCA is theoretically well founded and computationally efficient, and it reduces dimension well for datasets whose internal structure is linear, but when faced with linearly inseparable data it cannot reflect the nonlinear nature of the data. MDS likewise cannot handle sample data with nonlinear structure well. Among the manifold learning methods, ISOMAP can reflect the intrinsic properties of nonlinear data, but like MDS it is a global dimensionality reduction algorithm that does not consider the local relations among data samples, and it may produce the "Elbow" phenomenon during dimensionality reduction. LE and LPP can handle local sample information but do not take the class information of the samples into account.
Summary of the invention
The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a method and system for software defect prediction based on cost-sensitive semi-supervised Laplacian eigenmaps (CSSLE), which effectively improves the discriminative power of feature extraction.
To solve the above technical problem, the invention adopts the following technical solution:
The invention proposes a software defect prediction method comprising the following steps:
Step 1: perform dimensionality reduction on the training sample set to obtain the training sample dataset projected into the low-dimensional space, specifically:
(1) divide the samples in the sample set into labeled and unlabeled samples, and further divide the labeled samples into defective and defect-free samples; then build three types of adjacency graph, as follows:
For the first type of adjacency graph, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to the same class and are neighbors;
For the second type of adjacency graph, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to different classes and are neighbors;
For the third type of adjacency graph, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if both are unlabeled samples and they are neighbors;
(2) for each type of adjacency graph, determine the distance weights between sample points according to the connections between nodes; for the second type of adjacency graph, cost-sensitive information is introduced when computing the distance weights;
(3) following the principle of the Laplacian eigenmaps algorithm, build an objective function from the distance weights determined in step (2) and the distances between the mapped sample points, convert the objective function into a generalized eigenvalue equation, and solve the equation to obtain the eigenvector matrix and hence the sample set projected into the low-dimensional space;
Step 2: perform dimensionality reduction on the test sample set following the procedure of Step 1 to obtain the dimension-reduced test sample dataset;
Step 3: using a naive Bayes classifier, train the prediction model on the training sample dataset obtained in Step 1 and predict the classes of the test sample dataset obtained in Step 2, yielding the software defect prediction result.
Further, in sub-step (1) of Step 1 of the software defect prediction method of the invention:
The training sample set is $X = \{x_i, l\}$, where $x_i \in R^d$ denotes a training sample, $d$ is the dimension of the training samples, $i = 1, 2, \ldots, n$, $n$ is the number of samples, and $l \in \{0, -1, 1\}$ is the class label of a sample: $-1$ denotes an unlabeled sample, $0$ a defect-free sample, and $1$ a defective sample. Same-class samples means that the two nodes are both defective samples or both defect-free samples; different-class samples means that one node is a defective sample and the other is a defect-free sample.
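The label convention and the three graph types can be sketched as a small helper that decides which adjacency graph a neighboring pair of samples may join. The function name `pair_graph` and the constant names are hypothetical. Treating a mixed labeled/unlabeled pair as belonging to no graph is a reading of the definitions above, not something the text states explicitly.

```python
UNLABELED, CLEAN, DEFECTIVE = -1, 0, 1   # label values l used in the text

def pair_graph(li, lj):
    """Return which adjacency graph (1, 2 or 3) a neighboring pair may join, or None."""
    if li == UNLABELED and lj == UNLABELED:
        return 3                          # both samples unlabeled: third graph
    if li == UNLABELED or lj == UNLABELED:
        return None                       # assumption: mixed pairs get no edge in any graph
    return 1 if li == lj else 2           # same class: first graph; different class: second

# Example: a defective and a defect-free neighbor are linked in the second graph
example = pair_graph(DEFECTIVE, CLEAN)
```

Whether the pair is actually connected still depends on the neighbor test; this helper only covers the class condition.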
Further, in sub-step (2) of Step 1 of the software defect prediction method of the invention, the distance weights between sample points are determined as follows:
For the first type of adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight $W_{ij}$ is the heat-kernel weight, otherwise $W_{ij} = 0$; $t$ is the heat-kernel width;
For the second type of adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight $B_{ij}$ is the heat-kernel weight combined with the cost-sensitive parameter $C_{a,b}$, otherwise $B_{ij} = 0$;
For the third type of adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight $S_{ij}$ is the heat-kernel weight, otherwise $S_{ij} = 0$.
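The weight formulas themselves are not reproduced in this text. The sketch below assumes the heat-kernel form $e^{-\lVert x_i - x_j \rVert^2 / t}$ that is commonly used with Laplacian eigenmaps, and it further assumes that the cost $C_{a,b}$ enters the second graph as a multiplicative factor. Both choices, and the function names, are assumptions for illustration rather than the patent's stated definitions.

```python
import math

def heat_kernel(xi, xj, t):
    """Heat-kernel similarity exp(-||xi - xj||^2 / t) with kernel width t."""
    sq = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq / t)

def edge_weight(xi, xj, t, graph, cost=1.0):
    """Weight of an existing edge in graph 1 (W), graph 2 (B) or graph 3 (S).

    ASSUMPTION: for graph 2 the cost-sensitive parameter multiplies the heat kernel.
    Pairs without an edge are simply given weight 0 by the caller.
    """
    w = heat_kernel(xi, xj, t)
    return cost * w if graph == 2 else w

w_same = edge_weight((0.0, 0.0), (1.0, 0.0), t=1.0, graph=1)
w_diff = edge_weight((0.0, 0.0), (1.0, 0.0), t=1.0, graph=2, cost=3.0)
```

Under this assumption a larger cost makes the corresponding different-class edge weigh more in the second graph.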
Further, sub-step (3) of Step 1 of the software defect prediction method of the invention is as follows:
A. Build the objective function: let $y = [y_1, y_2, \ldots, y_n]$ be the samples projected into the low-dimensional space. A maximization problem is then solved
over the mapped sample points $y_i$, $y_j$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$, in which $\alpha$ denotes a tuning parameter;
B. Convert the objective function of step A into a generalized eigenvalue problem: $L_B a = \lambda L_T a$. Solving it yields the matrix $A = \{a_1, a_2, \ldots, a_r\}$, where $\lambda$ denotes an eigenvalue, $L_B = D_B - B$, and $L_T = D_T - T$. $D_B$ and $D_T$ are diagonal matrices whose diagonal elements are the row (or column) sums of $B$ and $T$ respectively, i.e. $(D_B)_{ii} = \sum_j B_{ij}$ and $(D_T)_{ii} = \sum_j T_{ij}$. $B$ is the weight matrix of the second type of adjacency graph, and $T = W + \alpha S$, where $W$ is the weight matrix of the first type of adjacency graph and $S$ is the weight matrix of the third type of adjacency graph;
C. Obtain the projected samples $y$ from the matrix $A$: $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$, $y_i = [f_1(i), f_2(i), \ldots, f_r(i)]^T$, where $r$ is the number of eigenvectors in $A$ and $f_j(i)$ denotes the $i$-th component of the vector $f_j$, $j = 1, 2, \ldots, r$.
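Steps B and C can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the patent's implementation. A graph Laplacian is singular (the constant vector lies in its null space), so the sketch adds a small ridge term `reg` to $L_T$ before solving. That ridge, the function names, and the toy weight matrices are all assumptions introduced here.

```python
import numpy as np

def laplacian(M):
    """Graph Laplacian D - M, where D holds the row sums of M on its diagonal."""
    return np.diag(M.sum(axis=1)) - M

def cssle_embed(W, B, S, alpha, r, reg=1e-3):
    """Solve L_B a = lambda L_T a; return the n x r matrix A (row i is y_i) and eigenvalues."""
    T = W + alpha * S
    LB = laplacian(B)
    LT = laplacian(T) + reg * np.eye(len(T))     # ASSUMPTION: ridge so that L_T is invertible
    C = np.linalg.cholesky(LT)                   # LT = C C^T with C lower triangular
    Cinv = np.linalg.inv(C)
    M = Cinv @ LB @ Cinv.T                       # equivalent symmetric standard problem
    vals, vecs = np.linalg.eigh((M + M.T) / 2)   # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:r]             # keep the r largest eigenvalues
    A = Cinv.T @ vecs[:, top]                    # generalized eigenvectors as columns
    return A, vals[top]

# Toy graphs: nodes 0,1 defective; 2,3 defect-free; 4,5 unlabeled
n = 6
W = np.zeros((n, n)); B = np.zeros((n, n)); S = np.zeros((n, n))
W[0, 1] = W[1, 0] = 1.0
W[2, 3] = W[3, 2] = 1.0
B[1, 2] = B[2, 1] = 2.0
B[0, 3] = B[3, 0] = 2.0
S[4, 5] = S[5, 4] = 1.0
A, lam = cssle_embed(W, B, S, alpha=0.5, r=2)
```

Each row of `A` is the low-dimensional representation $y_i$ of one sample. The size of `reg` affects the scale of the eigenvalues, so in practice it would need to be chosen with care.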
Further, in sub-step (1) of Step 1 of the software defect prediction method of the invention, edges are created according to the neighbor relations between nodes using the ε-neighborhood method: if nodes $i$ and $j$ satisfy $\lVert x_i - x_j \rVert^2 < \varepsilon$, an edge is created between them, where $\varepsilon$ is a preset value.
Further, in sub-step (1) of Step 1 of the software defect prediction method of the invention, edges are created according to the neighbor relations between nodes using the n-nearest-neighbor method: if node $i$ is among the $n$ nearest neighbors of node $j$, or node $j$ is among the $n$ nearest neighbors of node $i$, nodes $i$ and $j$ are connected by an edge.
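Both neighbor rules can be sketched in a few lines. The function names are illustrative, and the number of neighbors is called `k` in the code to avoid a clash with the sample count $n$; the text itself uses $n$ for both.

```python
import math

def eps_neighbors(X, eps):
    """Edges (i, j), i < j, whose squared distance ||x_i - x_j||^2 is below eps."""
    n = len(X)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if sum((a - b) ** 2 for a, b in zip(X[i], X[j])) < eps}

def knn_neighbors(X, k):
    """Edges (i, j), i < j, where i is among j's k nearest neighbors or vice versa."""
    n = len(X)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(X[i], X[j]))
        for j in others[:k]:
            edges.add((min(i, j), max(i, j)))   # the "or" rule makes the relation symmetric
    return edges

X = [(0.0,), (1.0,), (3.0,), (10.0,)]
eps_edges = eps_neighbors(X, 5.0)
knn_edges = knn_neighbors(X, 1)
```

In this example the isolated point at 10.0 gets no edge under the ε rule but is still linked to its nearest neighbor under the nearest-neighbor rule, which illustrates one practical difference between the two.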
Further, in sub-step (2) of Step 1 of the software defect prediction method of the invention, the distance weights between sample points are determined as follows:
For the first and third types of adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is set to 1; otherwise it is set to 0;
For the second type of adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is set to $C_{a,b}$; otherwise it is set to 0, where $C_{a,b}$ is the cost-sensitive parameter.
Further, in the software defect prediction method of the invention, the cost-sensitive parameter $C_{a,b}$ denotes the cost of misclassifying a class-$a$ sample as a class-$b$ sample and is set experimentally. When class $a$ refers to defective samples, class $b$ refers to defect-free samples; when class $a$ refers to defect-free samples, class $b$ refers to defective samples.
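The simple weighting rule can be sketched as a lookup. The cost values in the table below are invented for illustration, since the text only says that $C_{a,b}$ is set experimentally. The table does reflect the statement elsewhere in the description that misclassifying a defective sample as defect-free costs more than the reverse. The text does not say which ordering of $(a, b)$ applies to an undirected edge, so the caller passes it explicitly.

```python
# Hypothetical cost table C[(a, b)]: cost of misclassifying class a as class b.
# 1 = defective, 0 = defect-free. The numeric values are illustrative only.
C = {(1, 0): 5.0, (0, 1): 1.0}

def simple_weight(graph, connected, a=None, b=None):
    """Simple edge weight: 1 for graphs 1 and 3, C[(a, b)] for graph 2, 0 without an edge."""
    if not connected:
        return 0.0
    if graph in (1, 3):
        return 1.0
    if graph == 2:
        return C[(a, b)]
    raise ValueError("graph must be 1, 2 or 3")

w1 = simple_weight(1, True)
w2 = simple_weight(2, True, a=1, b=0)
```

Compared with heat-kernel weights, this variant has no kernel width to tune.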
The invention also proposes a software defect prediction system comprising a data preprocessing module, a dimensionality reduction module, and a training and prediction module:
The data preprocessing module obtains the training sample dataset and the test sample dataset and divides the samples in each sample set into labeled and unlabeled samples, further dividing the labeled samples into defective and defect-free samples;
The dimensionality reduction module performs dimensionality reduction on a sample set to obtain the sample dataset projected into the low-dimensional space;
The training and prediction module uses a naive Bayes classifier on the dimension-reduced training and test sample datasets to train the prediction model and predict the classes of the test sample dataset, yielding the software defect prediction result;
The dimensionality reduction module further comprises:
An adjacency graph construction unit for building the three types of adjacency graph, comprising:
A first construction unit for building the first type of adjacency graph: all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to the same class and are neighbors;
A second construction unit for building the second type of adjacency graph: all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to different classes and are neighbors;
A third construction unit for building the third type of adjacency graph: all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if both are unlabeled samples and they are neighbors;
A distance weight computation unit that, for each type of adjacency graph, determines the distance weights between sample points according to the connections between nodes; for the second type of adjacency graph, cost-sensitive information is introduced when computing the distance weights;
A Laplacian eigenmaps unit that, following the principle of the Laplacian eigenmaps algorithm, builds an objective function from the distance weights determined by the distance weight computation unit and the distances between the mapped sample points, converts the objective function into a generalized eigenvalue equation, and solves the equation to obtain the eigenvector matrix and hence the sample set projected into the low-dimensional space.
As a further refinement of the above software defect prediction system, the distance weight computation unit comprises:
A first computation unit for computing the distance weights between sample points in the first type of adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight $W_{ij}$ is the heat-kernel weight, otherwise $W_{ij} = 0$; $t$ is the heat-kernel width;
A second computation unit for computing the distance weights between sample points in the second type of adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight $B_{ij}$ is the heat-kernel weight combined with the cost-sensitive parameter $C_{a,b}$, otherwise $B_{ij} = 0$;
A third computation unit for computing the distance weights between sample points in the third type of adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight $S_{ij}$ is the heat-kernel weight, otherwise $S_{ij} = 0$.
By adopting the above technical solution, the invention achieves the following beneficial effects compared with the prior art:
The invention applies semi-supervised learning to Laplacian eigenmaps for feature extraction. It not only preserves the local neighborhood structure of the samples, reduces their dimension, and removes redundant features, but also exploits the class information in the samples by processing labeled and unlabeled sample data jointly, which improves the discriminative ability of the prediction model. In addition, to avoid mapping samples of different classes into a small low-dimensional neighborhood, and in particular to avoid mapping defective samples into the neighborhood of defect-free samples, cost-sensitive information is introduced when the LE algorithm computes the distances between sample points, which improves the mapping accuracy of LE. Experiments on the NASA datasets verify the effectiveness of the proposed method and show some improvement in classification performance over the comparison methods.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method of the invention.
Detailed description
The technical solution of the invention is explained below with reference to the accompanying drawing.
As shown in Fig. 1, the invention comprises the following steps:
1. Build the adjacency graphs from the training sample set $X$: the samples in $X$ are divided into defective samples, defect-free samples, and unlabeled samples, and three adjacency graphs a, b, and c are built.
For adjacency graph a, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to the same class and are neighbors;
For adjacency graph b, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to different classes and are neighbors;
For adjacency graph c, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if both are unlabeled samples and they are neighbors.
Two methods can be used to define whether two nodes are neighbors:
(a) ε-neighborhood: if nodes $i$ and $j$ satisfy $\lVert x_i - x_j \rVert^2 < \varepsilon$, an edge is created between them.
(b) n nearest neighbors: nodes $i$ and $j$ are connected by an edge if node $i$ is among the $n$ nearest neighbors of node $j$ or node $j$ is among the $n$ nearest neighbors of node $i$; this relation is symmetric.
2. Choose the weights: for graph a, if nodes $i$ and $j$ are connected by an edge, $W_{ij}$ is the heat-kernel weight, otherwise $W_{ij} = 0$; for graph b, the cost-sensitive parameter $C_{a,b}$ is added, so that if nodes $i$ and $j$ are connected by an edge, $B_{ij}$ is the heat-kernel weight combined with $C_{a,b}$, otherwise $B_{ij} = 0$; for graph c, if nodes $i$ and $j$ are connected by an edge, $S_{ij}$ is the heat-kernel weight, otherwise $S_{ij} = 0$. In these weights, $e$ is the base of the natural logarithm.
3. Build the objective function: let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space; a maximization problem is then solved.
4. The objective function of step 3 can be converted into a generalized eigenvalue problem: $L_B a = \lambda L_T a$. Solving it yields the matrix $A = \{a_1, a_2, \ldots, a_r\}$, whose vectors are the eigenvectors corresponding to the $r$ largest eigenvalues of the generalized eigenvalue equation.
5. The projected samples $y$ are obtained from $A$, with $y_i = [f_1(i), f_2(i), \ldots, f_r(i)]^T$, where $r$ is the number of eigenvectors in $A$ and $f_j(i)$ denotes the $i$-th component of the vector $f_j$, $j = 1, 2, \ldots, r$.
6. Following steps 1 to 5, the test sample set $Z$ is reduced in dimension, giving the dimension-reduced sample set $z$.
7. Using a naive Bayes (NB) classifier, train the prediction model on the dataset $y$ obtained in step 5 and predict the classes of the dataset $z$ obtained in step 6, giving the prediction result.
In steps 1 to 7: $X = \{x_i, l\}$, $i = 1, 2, \ldots, n$, where the training samples $x_i \in R^d$, $d$ is the dimension of the training samples, and $l \in \{0, -1, 1\}$ is the class label of a sample, with $-1$ denoting an unlabeled sample, $0$ a defect-free sample, and $1$ a defective sample; $\alpha$ denotes a tuning parameter; once the matrix $A$ is obtained, the projected matrix $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$.
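The classification in step 7 can be sketched with a small naive Bayes classifier. The text does not say which variant is used; the Gaussian form below, the function names, and the toy data are assumptions for illustration. Only labeled samples (labels 0 and 1) are passed to the fitting function.

```python
import math
from collections import defaultdict

def fit_gnb(X, y):
    """Fit per-class priors, feature means and variances (Gaussian naive Bayes)."""
    groups = defaultdict(list)
    for xi, yi in zip(X, y):
        groups[yi].append(xi)
    model = {}
    for c, rows in groups.items():
        n, d = len(rows), len(rows[0])
        mean = [sum(r[k] for r in rows) / n for k in range(d)]
        # Small constant keeps the variance strictly positive
        var = [sum((r[k] - mean[k]) ** 2 for r in rows) / n + 1e-9 for k in range(d)]
        model[c] = (n / len(X), mean, var)
    return model

def predict_gnb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_post(c):
        prior, mean, var = model[c]
        ll = sum(-0.5 * math.log(2 * math.pi * v) - (xk - m) ** 2 / (2 * v)
                 for xk, m, v in zip(x, mean, var))
        return math.log(prior) + ll
    return max(model, key=log_post)

# Toy low-dimensional training samples: label 0 = defect-free, 1 = defective
y_train = [(0.1, 0.2), (0.0, 0.3), (1.0, 1.1), (0.9, 1.2)]
labels = [0, 0, 1, 1]
model = fit_gnb(y_train, labels)
pred = [predict_gnb(model, z) for z in [(0.05, 0.25), (0.95, 1.0)]]
```

In the method described here, the inputs to such a classifier would be the dimension-reduced samples $y$ and $z$ rather than the raw software metrics.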
The principle of the invention is described in further detail below:
1. Build the adjacency graphs
Let the sample set be $X = \{x_i, l\}$, $i = 1, 2, \ldots, n$, where $x_i \in R^d$, $d$ is the dimension of the samples, and $l \in \{0, -1, 1\}$ is the class label of a sample, with $-1$ denoting an unlabeled sample, $0$ a defect-free sample, and $1$ a defective sample. Three adjacency graphs a, b, and c are built. In adjacency graph a, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to the same class and are neighbors. In adjacency graph b, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if they belong to different classes and are neighbors. In adjacency graph c, all samples in the sample set are the nodes of the graph, and an edge is created between two nodes if both are unlabeled samples and they are neighbors.
Two methods can be used to define whether two nodes are neighbors. (a) ε-neighborhood: if $i$ and $j$ satisfy $\lVert x_i - x_j \rVert^2 < \varepsilon$, an edge is created between nodes $i$ and $j$. (b) n nearest neighbors: nodes $i$ and $j$ are connected by an edge if node $i$ is among the $n$ nearest neighbors of node $j$ or node $j$ is among the $n$ nearest neighbors of node $i$; this relation is symmetric. The invention uses the n-nearest-neighbor method.
2. Choose the weights. Two methods can be used to choose the edge weights:
(a) Heat kernel: if nodes $i$ and $j$ are connected by an edge, the weight is given by the heat kernel with width $t$; otherwise it is 0.
(b) Simple definition: if nodes $i$ and $j$ are connected by an edge, the weight is 1; otherwise it is 0. This simple method avoids having to choose the heat-kernel width $t$.
In addition, software defect prediction involves two types of misclassification: a type I error classifies a defective sample as defect-free, and a type II error classifies a defect-free sample as defective. In software engineering practice, a type I error costs more than a type II error. The type I cost incurred when predicting the class of a software module is denoted $C_{1,0}$ and the type II cost is denoted $C_{0,1}$; from the above analysis, $C_{1,0} > C_{0,1}$. The invention makes full use of the cost information of the samples in order to avoid embedding different-class sample points that are far apart into a small neighborhood during dimensionality reduction. Because all nodes connected by an edge in the second type of adjacency graph are samples of different classes, the cost-sensitive factor $C_{a,b}$ is introduced when building the distance weights of that graph, to improve the mapping accuracy of the method. $C_{a,b}$ denotes the cost of misclassifying a class-$a$ sample as a class-$b$ sample and is set experimentally. Classes $a$ and $b$ each refer to defective or defect-free samples, with $a \neq b$: either $a$ is the defective class and $b$ the defect-free class, or $a$ is the defect-free class and $b$ the defective class.
The three weights $W_{ij}$, $B_{ij}$, and $S_{ij}$ are then defined accordingly for adjacency graphs a, b, and c.
3. Construct the objective function
To keep samples that are neighbors in the high-dimensional space neighbors in the projected space, i.e. to preserve the neighborhood structure of the samples, while bringing same-class samples as close together as possible and pushing different-class samples as far apart as possible, let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space; a maximization problem is then solved.
4. Obtain the projected samples $y$
Let $T_{ij} = W_{ij} + \alpha S_{ij}$. The objective function can then be rewritten as formula (11) and expanded, where $D_B$ and $D_T$ are diagonal matrices whose diagonal elements are the row (or column) sums of $B$ and $T$ respectively, i.e. $(D_B)_{ii} = \sum_j B_{ij}$ and $(D_T)_{ii} = \sum_j T_{ij}$, and $L_B = D_B - B$, $L_T = D_T - T$. Formula (11) is thereby converted into the objective function (14), which in turn can be converted into the following generalized eigenvalue problem:
$L_B a = \lambda L_T a$ (15)
Solving (15) yields the matrix $A = \{a_1, a_2, \ldots, a_r\}$, whose vectors are the eigenvectors corresponding to the $r$ largest eigenvalues of the generalized eigenvalue equation. The projected samples $y$ are obtained from $A$, with $y_i = [f_1(i), f_2(i), \ldots, f_r(i)]^T$, where $f_j(i)$ denotes the $i$-th component of the vector $f_j$.
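The intermediate formulas of this derivation are not reproduced in this text. The following is a hedged reconstruction, not the patent's own wording: it assumes the objective is the ratio of the weighted different-class scatter to the weighted same-class and unlabeled scatter, a form chosen because it is consistent with $T_{ij} = W_{ij} + \alpha S_{ij}$, with the Laplacians $L_B$ and $L_T$, and with the generalized eigenvalue problem (15). $Y$ denotes the matrix whose $i$-th row is $y_i^T$, and $B$ and $T$ are assumed symmetric.

```latex
\begin{aligned}
&\max_{y}\;
\frac{\sum_{i,j} \lVert y_i - y_j \rVert^{2}\, B_{ij}}
     {\sum_{i,j} \lVert y_i - y_j \rVert^{2}\, T_{ij}},
\qquad T_{ij} = W_{ij} + \alpha S_{ij} \\[4pt]
&\sum_{i,j} \lVert y_i - y_j \rVert^{2}\, B_{ij}
 = 2\,\mathrm{tr}\!\left( Y^{T} (D_B - B)\, Y \right)
 = 2\,\mathrm{tr}\!\left( Y^{T} L_B\, Y \right) \\[4pt]
&\sum_{i,j} \lVert y_i - y_j \rVert^{2}\, T_{ij}
 = 2\,\mathrm{tr}\!\left( Y^{T} L_T\, Y \right) \\[4pt]
&\max_{a}\; \frac{a^{T} L_B\, a}{a^{T} L_T\, a}
 \;\Longrightarrow\; L_B\, a = \lambda\, L_T\, a
\end{aligned}
```

For a single direction $a$, setting the gradient of the ratio to zero gives exactly $L_B a = \lambda L_T a$, with $\lambda$ equal to the value of the ratio. This is why the eigenvectors belonging to the largest eigenvalues are kept for a maximization problem.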
As with a kind of software defect forecasting system proposed by the present invention, including:Data preprocessing module, dimension-reduction treatment mould Block, training prediction module, wherein dimension-reduction treatment module are further specifically included:Adjacent map construction unit, distance weighting calculate single Member, laplacian eigenmaps unit, the software defect forecasting system that the embodiment of the present invention is provided is and aforementioned software defect Forecasting Methodology is corresponding, specific operation principle and using process referring to the related content in preceding method embodiment, herein Repeat no more.
The method of the present invention was tested on the NASA database, and the experimental results were compared with those of related feature extraction methods, namely PCA, LE, and LPP.
The NASA database contains 10 software project sets, each collected from a software system or sub-project of the U.S. National Aeronautics and Space Administration (NASA). In the experiments, 5 project sets from the database (CM1, MW1, PC1, PC3, and PC4) were selected for testing.
| Method | CM1 | MW1 | PC1 | PC3 | PC4 |
|---|---|---|---|---|---|
| PCA | 0.23 | 0.29 | 0.23 | 0.28 | 0.24 |
| LE | 0.33 | 0.36 | 0.32 | 0.34 | 0.44 |
| LPP | 0.38 | 0.43 | 0.29 | 0.40 | 0.46 |
| CSSLE | 0.52 | 0.52 | 0.49 | 0.48 | 0.45 |

Table 1. Average Pd of each method on the 5 projects of the NASA data set

| Method | CM1 | MW1 | PC1 | PC3 | PC4 |
|---|---|---|---|---|---|
| PCA | 0.03 | 0.07 | 0.03 | 0.14 | 0.02 |
| LE | 0.06 | 0.05 | 0.02 | 0.07 | 0.10 |
| LPP | 0.05 | 0.07 | 0.04 | 0.16 | 0.13 |
| CSSLE | 0.08 | 0.06 | 0.06 | 0.04 | 0.07 |

Table 2. Average Pf of each method on the 5 projects of the NASA data set

| Method | CM1 | MW1 | PC1 | PC3 | PC4 |
|---|---|---|---|---|---|
| PCA | 0.33 | 0.31 | 0.29 | 0.25 | 0.35 |
| LE | 0.39 | 0.41 | 0.42 | 0.37 | 0.42 |
| LPP | 0.44 | 0.43 | 0.34 | 0.32 | 0.41 |
| CSSLE | 0.51 | 0.52 | 0.47 | 0.54 | 0.48 |

Table 3. Average F-measure of each method on the 5 projects of the NASA data set
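For readers unfamiliar with the three indicators, the sketch below computes them from confusion-matrix counts using the definitions commonly used in the defect prediction literature: Pd as the recall on defective modules, Pf as the false-alarm rate on defect-free modules, and F-measure as the harmonic mean of precision and recall. These definitions are an assumption here, since this part of the text does not restate them, and the function name is hypothetical.

```python
def defect_metrics(tp, fp, tn, fn):
    """Return (Pd, Pf, F-measure) from confusion-matrix counts.

    tp: defective modules predicted defective
    fp: defect-free modules predicted defective
    tn: defect-free modules predicted defect-free
    fn: defective modules predicted defect-free
    """
    def ratio(num, den):
        # Guard against empty classes; 0.0 is a convention, not a rule.
        return num / den if den else 0.0

    pd_rate = ratio(tp, tp + fn)      # probability of detection (recall)
    pf_rate = ratio(fp, fp + tn)      # probability of false alarm
    precision = ratio(tp, tp + fp)
    f_measure = ratio(2 * precision * pd_rate, precision + pd_rate)
    return pd_rate, pf_rate, f_measure


# Hypothetical counts for one test run.
pd_rate, pf_rate, f_measure = defect_metrics(tp=30, fp=10, tn=50, fn=10)
```

Under these definitions a good predictor has high Pd and F-measure and low Pf.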
A careful study of the results in the three tables above leads to the following conclusions:
(1) On the Pd and F-measure indicators, the method proposed by the present invention generally scores higher than all comparison methods, and on the Pf indicator it generally scores lower than all comparison methods, which shows that the prediction performance of the proposed method is generally better than that of all comparison methods.
(2) The CSSLE method proposed by the present invention improves significantly over the LE method, which shows that introducing semi-supervised learning and cost sensitivity is effective: with them, the samples obtained after feature extraction are more discriminative, which improves classification performance.
The above are only some embodiments of the present invention, and the invention is not limited to the field of software engineering. It should be noted that those of ordinary skill in the art can make further improvements and modifications without departing from the principle of the invention. Besides software, the method is equally applicable to other high-dimensional samples, such as face recognition and palmprint images, and such applications should also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A software defect prediction method, characterized by comprising the following steps:
Step 1: performing dimensionality reduction on the training sample set to obtain a training sample data set projected into a low-dimensional space, specifically including:
(1) dividing the samples in the sample set into labeled samples and unlabeled samples, the labeled samples being further divided into defective samples and defect-free samples, and then building three classes of adjacency graphs, specifically:
for the first-class adjacency graph, taking all samples in the sample set as the nodes of the graph, and creating an edge between two nodes if they are same-class samples and neighbors;
for the second-class adjacency graph, taking all samples in the sample set as the nodes of the graph, and creating an edge between two nodes if they are different-class samples and neighbors;
for the third-class adjacency graph, taking all samples in the sample set as the nodes of the graph, and creating an edge between two nodes if they are both unlabeled samples and neighbors;
(2) for each adjacency graph, determining the distance weights between sample points according to the connections between nodes, where for the second-class adjacency graph cost-sensitive information is introduced when computing the distance weights;
(3) using the principle of the Laplacian eigenmaps algorithm, building an objective function from the distance weights between sample points determined in step (2) and the distances between the mapped sample points, converting the objective function into a generalized eigenvalue equation, solving the equation to obtain an eigenvector matrix, and from it obtaining the sample set projected into the low-dimensional space;
Step 2: performing dimensionality reduction on the sample set to be tested following the procedure of Step 1, to obtain the dimension-reduced test sample data set;
Step 3: using a naive Bayes classifier, training a prediction model on the training sample data set obtained in Step 1 and predicting the classes of the test sample data set obtained in Step 2, to obtain the software defect prediction result.
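The classification step at the end of claim 1 can be illustrated with a small Gaussian naive Bayes classifier applied to already dimension-reduced samples. The claim only says "naive Bayes"; the Gaussian likelihood, the variance floor, and the function names below are assumptions made for this sketch.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate per-class log prior, feature means and feature variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        dims = range(len(rows[0]))
        means = [sum(r[k] for r in rows) / len(rows) for k in dims]
        variances = [
            sum((r[k] - means[k]) ** 2 for r in rows) / len(rows) + var_floor
            for k in dims
        ]
        model[label] = (math.log(len(rows) / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, x):
    """Return the label with the highest log posterior for sample x."""
    def log_posterior(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
        return log_prior + log_likelihood

    return max(model, key=lambda label: log_posterior(model[label]))


# Hypothetical low-dimensional training samples: 0 = defect-free, 1 = defective.
X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]]
y_train = [0, 0, 1, 1]
model = fit_gaussian_nb(X_train, y_train)
```

Only labeled samples can be used to fit the classifier; the unlabeled samples contribute through the dimensionality reduction in Step 1.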
2. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1:
the training sample set is $X=\{x_i, l\}$, where $x_i$ denotes a training sample, $x_i\in R^d$, $d$ is the dimension of the training samples, $i=1,2,\ldots,n$, $n$ is the number of samples, and $l$ is the class label of a sample, $l\in\{0,-1,1\}$, where -1 denotes an unlabeled sample, 0 denotes a defect-free sample, and 1 denotes a defective sample; same-class samples means that the two nodes are both defective samples or both defect-free samples, and different-class samples means that one node is a defective sample and the other is a defect-free sample.
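The label convention of claim 2 determines which of the three adjacency graphs a pair of neighboring nodes can belong to. The sketch below encodes that rule; the constant and function names are hypothetical, and the handling of a pair with one labeled and one unlabeled sample (no edge in any graph) is a reading of the claims, which define edges only for the three listed cases.

```python
UNLABELED, DEFECT_FREE, DEFECTIVE = -1, 0, 1


def edge_kind(label_i, label_j):
    """Classify a neighboring pair by the graph it may appear in.

    Returns "same" (first-class graph), "different" (second-class graph),
    "unlabeled" (third-class graph), or None when one sample is labeled
    and the other is not.
    """
    if label_i == UNLABELED and label_j == UNLABELED:
        return "unlabeled"
    if UNLABELED in (label_i, label_j):
        return None
    return "same" if label_i == label_j else "different"
```

For example, `edge_kind(DEFECTIVE, DEFECT_FREE)` routes the pair to the second-class graph, where the cost-sensitive weight applies.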
3. The software defect prediction method according to claim 1, characterized in that determining the distance weights between sample points in sub-step (2) of Step 1 is specifically:
for the first-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight $W_{ij}$ is given by a heat kernel of width $t$; otherwise $W_{ij}=0$;
for the second-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight $B_{ij}$ is computed using the cost-sensitive parameter $C_{a,b}$; otherwise $B_{ij}=0$;
for the third-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, a weight $S_{ij}$ is assigned; otherwise $S_{ij}=0$.
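Claim 3 refers to a heat-kernel width $t$, but the weight formulas themselves are not legible in this text. As an assumption, the sketch below uses the heat-kernel weight that is standard in Laplacian eigenmaps, $\exp(-\|x_i-x_j\|^2/t)$, for a connected pair. How the cost-sensitive parameter $C_{a,b}$ enters the second-class weight is not shown, so it is left out here. The function name is hypothetical.

```python
import math


def heat_kernel_weight(x_i, x_j, t):
    """Assumed heat-kernel weight exp(-||x_i - x_j||^2 / t) for a
    connected pair of nodes; t > 0 is the heat-kernel width."""
    sq = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq / t)
```

With this form, identical samples get weight 1 and the weight decays toward 0 as the samples move apart; a larger `t` makes the decay slower.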
4. The software defect prediction method according to claim 3, characterized in that sub-step (3) of Step 1 is specifically as follows:
A. building the objective function: let $y=[y_1,y_2,\ldots,y_n]$ be the samples projected to the low-dimensional space; the following maximization problem then needs to be solved:
$$\max \frac{\sum_{ij}\|y_i-y_j\|^2 B_{ij}}{\sum_{ij}\|y_i-y_j\|^2 W_{ij}+\alpha\sum_{ij}\|y_i-y_j\|^2 S_{ij}};$$
where $\alpha$ denotes a regulation parameter, and $y_i$, $y_j$ denote the mapped sample points, $i=1,2,\ldots,n$, $j=1,2,\ldots,n$;
B. converting the objective function of step A into the generalized eigenvalue problem $L_B\,a=\lambda L_T\,a$, and solving it to obtain the matrix $A=\{a_1,a_2,\ldots,a_r\}$, where $\lambda$ denotes an eigenvalue; $L_B=D_B-B$ and $L_T=D_T-T$, where $D_B(i,i)=\sum_j B_{ij}$ and $D_T(i,i)=\sum_j T_{ij}$, that is, $D_B$ and $D_T$ are diagonal matrices whose diagonal elements are the sums of each row (or each column) of $B$ and $T$, respectively; $B$ is the weight matrix built from the second-class adjacency graph, $T=W+\alpha S$, $W$ is the weight matrix built from the first-class adjacency graph, and $S$ is the weight matrix built from the third-class adjacency graph;
C. obtaining the projected samples $y$ from the matrix $A$: $y$ is the matrix formed by the row vectors of $A$, that is, $y_i$ is the $i$-th row vector of $A$, $y_i=[f_1(i),f_2(i),\ldots,f_r(i)]$, where $r$ denotes the number of eigenvectors in $A$ and $f_j(i)$ denotes the $i$-th component of the vector $f_j$, $j=1,2,\ldots,r$.
5. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1, edges are created according to the neighbor relation between nodes using the ε-neighborhood method, that is: if nodes $i$ and $j$ satisfy $\|x_i-x_j\|^2<\varepsilon$, an edge is created between nodes $i$ and $j$, where $\varepsilon$ is a preset value.
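The ε-neighborhood rule of claim 5 can be sketched directly from its inequality. The function name and the toy points are hypothetical; edges are returned as index pairs `(i, j)` with `i < j`.

```python
def epsilon_edges(X, eps):
    """Connect i and j when the squared distance ||x_i - x_j||^2 < eps."""
    edges = set()
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            sq = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            if sq < eps:
                edges.add((i, j))
    return edges


# Toy points (hypothetical): the first two are close, the third is far away.
points = [[0.0, 0.0], [0.5, 0.0], [3.0, 0.0]]
close_pairs = epsilon_edges(points, eps=1.0)
```

Note that the claim compares the squared distance with ε, so ε is on the scale of squared distances.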
6. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1, edges are created according to the neighbor relation between nodes using the n-nearest-neighbor method, that is: if node $i$ is one of the $n$ nearest neighbors of node $j$, or node $j$ is one of the $n$ nearest neighbors of node $i$, nodes $i$ and $j$ are connected by an edge.
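The n-nearest-neighbor rule of claim 6 uses an "or" condition, so the resulting graph is symmetric even though the nearest-neighbor relation itself is not. A minimal sketch follows; the function name, the tie-breaking by index order, and the toy points are assumptions.

```python
def knn_edges(X, n_neighbors):
    """Connect i and j when either is among the other's n nearest neighbors."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    edges = set()
    for i in range(len(X)):
        # Sort the other nodes by distance to node i (ties keep index order).
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sq_dist(X[i], X[j]),
        )
        for j in others[:n_neighbors]:
            edges.add((min(i, j), max(i, j)))
    return edges


# Toy points on a line (hypothetical).
line_points = [[0.0], [1.0], [3.0], [10.0]]
nn_pairs = knn_edges(line_points, n_neighbors=1)
```

In the toy example the far point at 10.0 is nobody's nearest neighbor, yet it still gets an edge because its own nearest neighbor is the point at 3.0.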
7. The software defect prediction method according to claim 1, characterized in that determining the distance weights between sample points in sub-step (2) of Step 1 is specifically:
for the first-class and third-class adjacency graphs, if nodes $i$ and $j$ are connected by an edge, the weight is set to 1, and otherwise it is set to 0;
for the second-class adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is set to $C_{a,b}$, and otherwise it is set to 0, where $C_{a,b}$ is the cost-sensitive parameter.
8. The software defect prediction method according to claim 3 or 7, characterized in that the cost-sensitive parameter $C_{a,b}$ denotes the cost of misclassifying a class-a sample as a class-b sample, and $C_{a,b}$ is a value set by experiment, where either the class-a sample is a defective sample and the class-b sample is a defect-free sample, or the class-a sample is a defect-free sample and the class-b sample is a defective sample.
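The simple weighting scheme of claim 7, combined with the cost parameter described in claim 8, can be sketched as follows. The function name, the dictionary used to hold the costs, and the cost values are hypothetical; setting `B[i][j]` from the labels of `i` and `j` in that order is one possible reading of $C_{a,b}$.

```python
def simple_weights(labels, edges, cost):
    """Build W, B, S with 0/1 weights and cost-sensitive weights in B.

    labels: list with -1 (unlabeled), 0 (defect-free), 1 (defective)
    edges:  iterable of neighbor pairs (i, j)
    cost:   dict mapping (a, b) to the cost of misclassifying class a as b
    """
    n = len(labels)
    W = [[0.0] * n for _ in range(n)]
    B = [[0.0] * n for _ in range(n)]
    S = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        li, lj = labels[i], labels[j]
        if li == -1 and lj == -1:
            S[i][j] = S[j][i] = 1.0          # third-class graph
        elif li == -1 or lj == -1:
            continue                         # mixed pair: no edge in any graph
        elif li == lj:
            W[i][j] = W[j][i] = 1.0          # first-class graph
        else:
            B[i][j] = cost[(li, lj)]         # second-class graph
            B[j][i] = cost[(lj, li)]
    return W, B, S


# Hypothetical labels, neighbor pairs and costs.
labels = [1, 1, 0, -1, -1]
edges = {(0, 1), (0, 2), (3, 4), (2, 3)}
cost = {(1, 0): 5.0, (0, 1): 1.0}   # missing a defect costs more
W, B, S = simple_weights(labels, edges, cost)
```

If the two costs differ, `B` as built here is not symmetric. Averaging `B` with its transpose is one way to restore symmetry, though the source does not say which convention it uses.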
9. A software defect prediction system, characterized by comprising a data preprocessing module, a dimensionality reduction module, and a training and prediction module, wherein:
the data preprocessing module is configured to obtain the training sample data set and the test sample data set, and to divide the samples in the sample set into labeled samples and unlabeled samples, the labeled samples being further divided into defective samples and defect-free samples;
the dimensionality reduction module is configured to perform dimensionality reduction on the sample set to obtain the sample data set projected into a low-dimensional space;
the training and prediction module is configured to use a naive Bayes classifier to train a prediction model on the dimension-reduced training sample data set and to predict the classes of the dimension-reduced test sample data set, to obtain the software defect prediction result;
wherein the dimensionality reduction module further includes:
an adjacency graph construction unit, configured to build three classes of adjacency graphs, and specifically including:
a first construction unit, configured to build the first-class adjacency graph, specifically: taking all samples in the sample set as the nodes of the graph, and creating an edge between two nodes if they are same-class samples and neighbors;
a second construction unit, configured to build the second-class adjacency graph, specifically: taking all samples in the sample set as the nodes of the graph, and creating an edge between two nodes if they are different-class samples and neighbors;
a third construction unit, configured to build the third-class adjacency graph, specifically: taking all samples in the sample set as the nodes of the graph, and creating an edge between two nodes if they are both unlabeled samples and neighbors;
a distance weight calculation unit, configured to determine, for each adjacency graph, the distance weights between sample points according to the connections between nodes, where for the second-class adjacency graph cost-sensitive information is introduced when computing the distance weights;
a Laplacian eigenmaps unit, configured to use the principle of the Laplacian eigenmaps algorithm to build an objective function from the distance weights between sample points determined by the distance weight calculation unit and the distances between the mapped sample points, to convert the objective function into a generalized eigenvalue equation, to solve the equation to obtain an eigenvector matrix, and from it to obtain the sample set projected into the low-dimensional space.
10. The software defect prediction system according to claim 9, characterized in that the distance weight calculation unit includes:
a first calculation unit, configured to calculate the distance weights between sample points in the first-class adjacency graph, specifically: if nodes $i$ and $j$ are connected by an edge, the weight $W_{ij}$ is given by a heat kernel of width $t$; otherwise $W_{ij}=0$;
a second calculation unit, configured to calculate the distance weights between sample points in the second-class adjacency graph, specifically: if nodes $i$ and $j$ are connected by an edge, the weight $B_{ij}$ is computed using the cost-sensitive parameter $C_{a,b}$; otherwise $B_{ij}=0$;
a third calculation unit, configured to calculate the distance weights between sample points in the third-class adjacency graph, specifically: if nodes $i$ and $j$ are connected by an edge, a weight $S_{ij}$ is assigned; otherwise $S_{ij}=0$.
CN201710212286.3A 2017-04-01 2017-04-01 Software defect prediction method and prediction system Active CN106991049B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710212286.3A CN106991049B (en) 2017-04-01 2017-04-01 Software defect prediction method and prediction system

Publications (2)

Publication Number Publication Date
CN106991049A true CN106991049A (en) 2017-07-28
CN106991049B CN106991049B (en) 2020-10-27

Family

ID=59414965

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710212286.3A Active CN106991049B (en) 2017-04-01 2017-04-01 Software defect prediction method and prediction system

Country Status (1)

Country Link
CN (1) CN106991049B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant