CN106991049A

CN106991049A - Software defect prediction method and prediction system

Info

Publication number: CN106991049A
Application number: CN201710212286.3A
Authority: CN
Inventors: 史雪静; 荆晓远; 岳东
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2017-04-01
Filing date: 2017-04-01
Publication date: 2017-07-28
Anticipated expiration: 2037-04-01
Also published as: CN106991049B

Abstract

The invention discloses a software defect prediction method. The method processes labeled and unlabeled samples together by applying semi-supervised learning to Laplacian eigenmaps (LE), thereby improving the LE method. In addition, to prevent samples of different classes from being mapped into a small low-dimensional neighborhood, and in particular to prevent defective samples from being mapped into the neighborhood of defect-free samples, cost-sensitive information is introduced when the LE algorithm computes the distances between sample points, which improves the accuracy of the LE mapping. The method can thus effectively improve the discriminability of the extracted features. The invention also proposes a software defect prediction system. The method was applied to the NASA data sets; the experiments verify the effectiveness of the proposed method, which shows some improvement in classification performance over the comparison methods.

Description

Software defect prediction method and prediction system

Technical field

The present invention relates to a software defect prediction method and prediction system, and belongs to the field of software engineering.

Background art

Software defect prediction comprises four stages: data preprocessing, feature extraction, training of the prediction model, and identification. Feature extraction is one of the most fundamental problems in software defect prediction, and extracting effective features is the first priority for successful identification.

Existing feature extraction methods can be divided into traditional dimensionality reduction methods and manifold learning dimensionality reduction methods. Traditional dimensionality reduction methods include principal component analysis (PCA) and multidimensional scaling (MDS). Manifold learning dimensionality reduction methods include isometric mapping (ISOMAP), Laplacian eigenmaps (LE), locality preserving projections (LPP), and others.

(1) Principal component analysis (PCA): The core idea is to map the original sample data linearly into a lower-dimensional space such that the features of the projected data are linearly uncorrelated. After projection, the high-dimensional data are represented as low-dimensional data in the new space, which achieves dimensionality reduction. The goal is to find the vectors $v$ that satisfy:

$$S_t v = \lambda v \qquad (1)$$

where $S_t$ is the total scatter matrix and $\lambda$ is the eigenvalue corresponding to $v$.

(2) Multidimensional scaling (MDS): MDS mines the hidden structure in the data by analyzing the similarities between data points. Given the pairwise distances between the original samples, the aim of MDS is to reconstruct the samples in a low-dimensional space so that the distances between samples in the low-dimensional space are as close as possible to the distances between the original samples in the high-dimensional space. The discrepancy between the reconstructed low-dimensional distances and the original high-dimensional distances is expressed by an error function. Minimizing this error function yields the mapped matrix $Z$,

where $Z = \{z_i, l_i\}$, $i = 1, \ldots, n$, and $d_{i,j} = d(x_i, x_j) = (x_i - x_j)^T (x_i - x_j)$.

(3) Isometric mapping (ISOMAP): ISOMAP is an improvement on MDS that replaces the Euclidean distance used in MDS with the geodesic distance. This yields a globally optimal geometric structure and preserves the low-dimensional manifold structure. The geodesic distance is computed in two cases. If two samples are close to each other, i.e. they are neighbors, the Euclidean distance is used directly to approximate the geodesic distance. If two samples are far apart, i.e. they are not neighbors, the length of the shortest path on the neighborhood graph is used instead. Once the geodesic distances have been obtained, MDS is applied to perform the isometric mapping and obtain the low-dimensional representation of the data.
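The two-case geodesic distance described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular ISOMAP implementation: the function name `geodesic_distances`, the use of an ε threshold on Euclidean distance to define neighbors, and the Floyd-Warshall shortest-path step are all assumptions made for the example.

```python
import math

def geodesic_distances(points, eps):
    """Approximate geodesic distances on an epsilon-neighborhood graph."""
    n = len(points)
    # Case 1: neighbors keep their Euclidean distance; other pairs start unreachable.
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            if d < eps:
                dist[i][j] = dist[j][i] = d
    # Case 2: non-neighbors get the shortest path length on the graph (Floyd-Warshall).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

# Hypothetical data: four corners of a unit square; the diagonals are not neighbors.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
g = geodesic_distances(points, eps=1.2)
print(g[0][2])  # path 0 -> 1 -> 2 along the edges of the square
```

The resulting matrix would then be handed to MDS in place of the Euclidean distance matrix.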

ISOMAP can reflect the intrinsic properties of nonlinear data, but it does not take the local relationships between data samples into account, and during dimensionality reduction it may produce "Elbow" errors. The main cause of this shortcoming is the way the geodesic distance is measured.

(4) Laplacian eigenmaps (LE): LE constructs the neighbor relations between data points from a local perspective, so that local structural information in the high-dimensional space is preserved by the mapping. The dimensionality reduction objective of LE is also more intuitive and easier to understand.

To ensure that samples that are neighbors in the high-dimensional space remain neighbors in the projected space, i.e. to preserve the neighborhood structure of the samples, let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space. The LE algorithm then solves the following minimization problem:

$$\min \sum_{ij} \|y_i - y_j\|^2 W_{ij} \qquad (3)$$

where $W$ is the weight matrix. The above formula can finally be converted into a generalized eigenvalue problem:

$$L f = \lambda D f \qquad (4)$$

where $D$ is a diagonal matrix whose entries are the column sums (or, equivalently, the row sums) of $W$, i.e. $D_{ii} = \sum_j W_{ji}$, and $L = D - W$ is the Laplacian matrix, which is symmetric and positive semi-definite.
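As an illustration of the definitions of D and L above, the following sketch builds the Laplacian from a small weight matrix. The function name `graph_laplacian` and the example matrix are hypothetical; the weight matrix is assumed to be symmetric, as the text implies.

```python
def graph_laplacian(W):
    """Return L = D - W, where D_ii is the i-th column sum of W."""
    n = len(W)
    # For a symmetric W the column sum equals the row sum.
    degree = [sum(W[j][i] for j in range(n)) for i in range(n)]
    return [[(degree[i] if i == j else 0.0) - W[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical symmetric weight matrix for three samples.
W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
L = graph_laplacian(W)
for row in L:
    print(row)
```

Each row of L sums to zero, which is one quick sanity check when building such matrices by hand.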

(5) Locality preserving projections (LPP): LPP is the linearization of the LE algorithm. LPP computes a transformation matrix $P$ that projects the high-dimensional input samples $X = [x_1, x_2, \ldots, x_n]$ into a low-dimensional subspace in which the local structure of the input samples is preserved. The transformation $P$ is obtained from:

$$\min_P \sum_{ij} \|y_i - y_j\|^2 S_{ij} \qquad (5)$$

where $y_i = P^T x_i$ and the weight matrix $S$ is constructed from the neighborhood graph.

This is ultimately converted into a generalized eigenvalue problem:

$$X L X^T P = \lambda X D X^T P \qquad (6)$$

where $D_{ii} = \sum_j S(i, j)$, $D$ is a diagonal matrix, and $L = D - S$.

Traditional dimensionality reduction methods have several shortcomings. Principal component analysis (PCA) is theoretically well developed and computationally efficient, and it reduces dimensionality well for data sets whose internal structure is linear, but when faced with linearly inseparable data it cannot reflect the nonlinear nature of the data. Multidimensional scaling (MDS) likewise cannot handle sample data with nonlinear structure well. Among the manifold learning methods, isometric mapping (ISOMAP) can reflect the intrinsic properties of nonlinear data, but like MDS it is a global dimensionality reduction algorithm: it does not consider the local relationships between data samples, and it may produce "Elbow" errors during dimensionality reduction. Laplacian eigenmaps (LE) and locality preserving projections (LPP) can handle the local information of the samples, but they do not take the class information of the samples into account.

Summary of the invention

The technical problem to be solved by the invention is to provide, in view of the shortcomings of the prior art, a method and system for software defect prediction based on cost-sensitive semi-supervised Laplacian eigenmaps (CSSLE), which can effectively improve the discriminability of the extracted features.

To solve the above technical problem, the invention adopts the following technical solution:

The invention proposes a software defect prediction method comprising the following steps:

Step 1: Perform dimensionality reduction on the training sample set to obtain the training sample data set projected into the low-dimensional space. This specifically includes:

(1) Divide the samples in the sample set into labeled samples and unlabeled samples, with the labeled samples further divided into defective samples and defect-free samples. Then build three types of adjacency graph, as follows:

For the first adjacency graph, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are same-class samples and are neighbors;

For the second adjacency graph, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are different-class samples and are neighbors;

For the third adjacency graph, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are unlabeled samples and are neighbors;

(2) For each adjacency graph, determine the distance weights between sample points according to the connections between nodes; for the second adjacency graph, cost-sensitive information is introduced when computing the distance weights;

(3) Following the principle of the Laplacian eigenmaps algorithm, build an objective function from the distance weights determined in sub-step (2) and the distances between the mapped sample points, convert the objective function into a generalized eigenvalue equation, and solve the equation to obtain the eigenvector matrix and hence the sample set projected into the low-dimensional space;

Step 2: Perform dimensionality reduction on the test sample set following the procedure of Step 1 to obtain the reduced test sample data set;

Step 3: Using a naive Bayes classifier, train a prediction model on the training sample data set obtained in Step 1, predict the classes of the test sample data set obtained in Step 2, and obtain the software defect prediction result.
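The classification in Step 3 can be sketched with a small naive Bayes classifier. The source only names a naive Bayes classifier; the Gaussian likelihood, the variance floor `var_floor`, the function names, and the toy data below are assumptions made for this illustration, and only labeled samples are used for training.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(samples, labels, var_floor=1e-9):
    """Estimate the prior, per-feature means and variances for each class."""
    grouped = defaultdict(list)
    for x, label in zip(samples, labels):
        grouped[label].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small floor keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (n / len(samples), means, variances)
    return model

def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))

# Hypothetical low-dimensional training samples (0 = defect-free, 1 = defective).
train = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 1.0), (1.1, 0.9)]
labels = [0, 0, 0, 1, 1, 1]
model = fit_gaussian_nb(train, labels)
print(predict_gaussian_nb(model, (0.05, 0.1)), predict_gaussian_nb(model, (1.0, 1.0)))
```

In the method described here, `train` would be the reduced training set from Step 1 and the predicted samples would come from the reduced test set of Step 2.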

Further, in sub-step (1) of Step 1 of the software defect prediction method of the invention:

The training sample set is $X = \{x_i, l\}$, where $x_i$ denotes a training sample, $x_i \in R^d$, $d$ is the dimension of the training samples, $i = 1, 2, \ldots, n$, $n$ is the number of samples, and $l$ is the class label of a sample, $l \in \{0, -1, 1\}$, where $-1$ denotes an unlabeled sample, $0$ a defect-free sample and $1$ a defective sample. Same-class samples means that the two nodes are both defective samples or both defect-free samples; different-class samples means that one of the two nodes is a defective sample and the other is a defect-free sample.
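The label convention above determines which of the three adjacency graphs a neighboring pair of nodes belongs to. A minimal sketch follows; the function name `graph_type` and the string tags are hypothetical, and it assumes the literal reading that the third graph connects pairs in which both samples are unlabeled, so a labeled/unlabeled pair is assigned to no graph.

```python
UNLABELED, DEFECT_FREE, DEFECTIVE = -1, 0, 1

def graph_type(label_i, label_j):
    """Return the adjacency graph for a neighboring pair, or None if no graph applies."""
    if label_i == UNLABELED and label_j == UNLABELED:
        return "third"
    if UNLABELED in (label_i, label_j):
        # Assumption: mixed labeled/unlabeled pairs are not connected in any graph.
        return None
    return "first" if label_i == label_j else "second"

print(graph_type(DEFECTIVE, DEFECTIVE))    # same-class pair
print(graph_type(DEFECTIVE, DEFECT_FREE))  # different-class pair
print(graph_type(UNLABELED, UNLABELED))    # unlabeled pair
```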

Further, in sub-step (2) of Step 1 of the software defect prediction method of the invention, the distance weights between sample points are determined as follows:

For the first adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$, otherwise $W_{ij} = 0$; $t$ is the heat kernel width;

For the second adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is $B_{ij} = C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}$, otherwise $B_{ij} = 0$, where $C_{a,b}$ is the cost-sensitive parameter;

For the third adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is $S_{ij} = e^{-\|x_i - x_j\|^2 / t}$, otherwise $S_{ij} = 0$.
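The three weight definitions can be sketched as follows. This is an illustration under stated assumptions: the heat kernel is taken in the usual form exp(-||x_i - x_j||^2 / t), the cost-sensitive parameter is assumed to scale the second graph's weight multiplicatively, and the names `heat_kernel`, `edge_weight`, `kind` and `cost` are hypothetical.

```python
import math

def heat_kernel(x_i, x_j, t):
    """Heat kernel weight exp(-||x_i - x_j||^2 / t) with kernel width t."""
    sq = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq / t)

def edge_weight(kind, x_i, x_j, t, cost=1.0):
    """Weight of a connected pair; only the second graph applies the cost factor."""
    base = heat_kernel(x_i, x_j, t)
    return cost * base if kind == "second" else base

# Hypothetical samples and parameters.
x_a, x_b = (0.0, 0.0), (1.0, 0.0)
print(edge_weight("first", x_a, x_b, t=1.0))
print(edge_weight("second", x_a, x_b, t=1.0, cost=2.0))
```

Pairs that are not connected in a graph simply receive weight 0 in that graph's matrix.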

Further, sub-step (3) of Step 1 of the software defect prediction method of the invention is as follows:

A. Build the objective function: let $y = [y_1, y_2, \ldots, y_n]$ be the samples projected into the low-dimensional space. The following maximization problem is then solved:

$$\max \frac{\sum_{ij} \|y_i - y_j\|^2 B_{ij}}{\sum_{ij} \|y_i - y_j\|^2 W_{ij} + \alpha \sum_{ij} \|y_i - y_j\|^2 S_{ij}}$$

where $\alpha$ is a tuning parameter and $y_i$, $y_j$ are the mapped sample points, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, n$;

B. Convert the objective function of step A into a generalized eigenvalue problem: $L^B a = \lambda L^T a$. Solving this equation gives the matrix $A = \{a_1, a_2, \ldots, a_r\}$, where $\lambda$ denotes an eigenvalue, $L^B = D^B - B$ and $L^T = D^T - T$, with $D^B_{ii} = \sum_j B_{ji}$ and $D^T_{ii} = \sum_j T_{ji}$. That is, $D^B$ and $D^T$ are diagonal matrices whose diagonal entries are the row (or column) sums of $B$ and $T$ respectively. $B$ is the weight matrix of the second adjacency graph, and $T = W + \alpha S$, where $W$ is the weight matrix of the first adjacency graph and $S$ is the weight matrix of the third adjacency graph. The superscript in $L^T$ and $D^T$ refers to the matrix $T$, not to transposition;

C. Obtain the projected samples $y$ from the matrix $A$: $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$, $y_i = [f_1^{(i)}, f_2^{(i)}, \ldots, f_r^{(i)}]^T$, where $r$ is the number of eigenvectors in $A$ and $f_j^{(i)}$ denotes the $i$-th component of the vector $f_j$, $j = 1, 2, \ldots, r$.
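To make the objective in step A concrete, the following sketch evaluates the ratio for a given embedding. It is only an evaluation helper, not a solver; the function names and the toy matrices are hypothetical, and the denominator is assumed to be nonzero.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def objective(y, B, W, S, alpha):
    """Ratio of between-class scatter (B) to same-class plus unlabeled scatter (W + alpha*S)."""
    n = len(y)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    numerator = sum(sq_dist(y[i], y[j]) * B[i][j] for i, j in pairs)
    denominator = sum(sq_dist(y[i], y[j]) * (W[i][j] + alpha * S[i][j]) for i, j in pairs)
    return numerator / denominator

# Hypothetical graphs: samples 0 and 1 share a class, samples 1 and 2 differ in class.
W = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 0, 2], [0, 2, 0]]
S = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
good = [(0.0,), (0.1,), (1.0,)]  # same-class pair close, different-class pair far apart
bad = [(0.0,), (1.0,), (1.1,)]   # the opposite arrangement
print(objective(good, B, W, S, alpha=0.5), objective(bad, B, W, S, alpha=0.5))
```

An embedding that keeps same-class neighbors together and pushes different-class neighbors apart scores higher, which is the behavior the maximization is meant to reward.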

Further, in sub-step (1) of Step 1 of the software defect prediction method of the invention, edges are created according to the neighbor relations between nodes using the ε-neighborhood method: if nodes $i$ and $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge is created between nodes $i$ and $j$, where $\varepsilon$ is a preset value.

Further, in sub-step (1) of Step 1 of the software defect prediction method of the invention, edges are created according to the neighbor relations between nodes using the n-nearest-neighbor method: if node $i$ is among the n nearest neighbors of node $j$, or node $j$ is among the n nearest neighbors of node $i$, nodes $i$ and $j$ are connected by an edge.

For the first and third adjacency graphs, if nodes $i$ and $j$ are connected by an edge, the weight is set to 1, otherwise it is set to 0;

For the second adjacency graph, if nodes $i$ and $j$ are connected by an edge, the weight is set to $C_{a,b}$, otherwise it is set to 0, where $C_{a,b}$ is the cost-sensitive parameter.

Further, in the software defect prediction method of the invention, the cost-sensitive parameter $C_{a,b}$ denotes the cost of misclassifying a class-a sample as a class-b sample and is set experimentally. Either class a is the defective class and class b is the defect-free class, or class a is the defect-free class and class b is the defective class.

The invention also proposes a software defect prediction system, comprising a data preprocessing module, a dimensionality reduction module, and a training and prediction module.

The data preprocessing module obtains the training sample data set and the test sample data set and divides the samples in each sample set into labeled samples and unlabeled samples, with the labeled samples further divided into defective samples and defect-free samples;

The dimensionality reduction module performs dimensionality reduction on a sample set to obtain the sample data set projected into the low-dimensional space;

The training and prediction module uses a naive Bayes classifier to train a prediction model on the reduced training sample data set, predict the classes of the reduced test sample data set, and obtain the software defect prediction result;

The dimensionality reduction module further comprises:

An adjacency graph construction unit for building three types of adjacency graph, which specifically comprises:

A first construction unit for building the first adjacency graph: all samples in the sample set are taken as the nodes of the graph, and an edge is created between two nodes if they are same-class samples and are neighbors;

A second construction unit for building the second adjacency graph: all samples in the sample set are taken as the nodes of the graph, and an edge is created between two nodes if they are different-class samples and are neighbors;

A third construction unit for building the third adjacency graph: all samples in the sample set are taken as the nodes of the graph, and an edge is created between two nodes if they are unlabeled samples and are neighbors;

A distance weight calculation unit which, for each adjacency graph, determines the distance weights between sample points according to the connections between nodes; for the second adjacency graph, cost-sensitive information is introduced when computing the distance weights;

A Laplacian eigenmaps unit which, following the principle of the Laplacian eigenmaps algorithm, builds an objective function from the distance weights determined by the distance weight calculation unit and the distances between the mapped sample points, converts the objective function into a generalized eigenvalue equation, and solves the equation to obtain the eigenvector matrix and hence the sample set projected into the low-dimensional space.

As a further refinement of the above software defect prediction system, the distance weight calculation unit comprises:

A first calculation unit for computing the distance weights between sample points in the first adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight is $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$, otherwise $W_{ij} = 0$; $t$ is the heat kernel width;

A second calculation unit for computing the distance weights between sample points in the second adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight is $B_{ij} = C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}$, otherwise $B_{ij} = 0$, where $C_{a,b}$ is the cost-sensitive parameter;

A third calculation unit for computing the distance weights between sample points in the third adjacency graph: if nodes $i$ and $j$ are connected by an edge, the weight is $S_{ij} = e^{-\|x_i - x_j\|^2 / t}$, otherwise $S_{ij} = 0$.

By adopting the above technical solution, the invention achieves the following beneficial effects compared with the prior art:

In feature extraction, the invention applies semi-supervised learning to the Laplacian eigenmaps method. This not only preserves the local neighborhood structure of the samples, reduces their dimensionality and removes redundant features, but also exploits the class information in the samples by processing labeled and unlabeled sample data together, which improves the discriminative ability of the prediction model. In addition, to prevent samples of different classes from being mapped into a small low-dimensional neighborhood, and in particular to prevent defective samples from being mapped into the neighborhood of defect-free samples, cost-sensitive information is introduced when the LE algorithm computes the distances between sample points, which improves the accuracy of the LE mapping. Experiments on the NASA data sets verify the effectiveness of the proposed method, which shows some improvement in classification performance over the comparison methods.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the invention.

Detailed description of the embodiments

The technical solution of the invention is explained below with reference to the accompanying drawing.

As shown in Fig. 1, the invention comprises the following steps:

Step 1. Build the adjacency graphs from the training sample set $X$: the samples in $X$ are divided into defective samples, defect-free samples and unlabeled samples, and three adjacency graphs a, b and c are built.

For adjacency graph a, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are same-class samples and are neighbors;

For adjacency graph b, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are different-class samples and are neighbors;

For adjacency graph c, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are unlabeled samples and are neighbors.

Two methods can be used to define whether two nodes are neighbors:

(a) ε-neighborhood: if nodes $i$ and $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge is created between nodes $i$ and $j$.

(b) n nearest neighbors: nodes $i$ and $j$ are connected by an edge if node $i$ is among the n nearest neighbors of node $j$ or node $j$ is among the n nearest neighbors of node $i$; this relation is symmetric.

Step 2. Select the weights: for graph a, if nodes $i$ and $j$ are connected by an edge, then $W_{ij} = e^{-\|x_i - x_j\|^2 / t}$, otherwise $W_{ij} = 0$; for graph b, the cost-sensitive parameter $C_{a,b}$ is added, so that if nodes $i$ and $j$ are connected by an edge, then $B_{ij} = C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}$, otherwise $B_{ij} = 0$; for graph c, if nodes $i$ and $j$ are connected by an edge, then $S_{ij} = e^{-\|x_i - x_j\|^2 / t}$, otherwise $S_{ij} = 0$. Here $e$ is the base of the natural logarithm.

Step 3. Build the objective function: let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space. The following maximization problem is then solved:

$$\max \frac{\sum_{ij} \|y_i - y_j\|^2 B_{ij}}{\sum_{ij} \|y_i - y_j\|^2 W_{ij} + \alpha \sum_{ij} \|y_i - y_j\|^2 S_{ij}}$$

Step 4. The objective function of Step 3 can be converted into a generalized eigenvalue problem: $L^B a = \lambda L^T a$. Solving this equation gives the matrix $A = \{a_1, a_2, \ldots, a_r\}$, whose vectors are the eigenvectors corresponding to the $r$ largest eigenvalues of the generalized eigenvalue equation.

Step 5. The projected samples $y$ are obtained from $A$, with $y_i = [f_1^{(i)}, f_2^{(i)}, \ldots, f_r^{(i)}]^T$, where $r$ is the number of eigenvectors in $A$ and $f_j^{(i)}$ denotes the $i$-th component of the vector $f_j$, $j = 1, 2, \ldots, r$.

Step 6. Dimensionality reduction is applied to the test sample set $Z$ following Steps 1 to 5, giving the reduced sample set $z$.

Step 7. Using a naive Bayes (NB) classifier, a prediction model is trained on the data set $y$ obtained in Step 5 and used to predict the classes of the data set $z$ obtained in Step 6, which gives the prediction result.

In Steps 1 to 7: $X = \{x_i, l\}$, $i = 1, 2, \ldots, n$, where the training sample $x_i \in R^d$, $d$ is the dimension of the training samples, and $l$ is the class label of a sample, $l \in \{0, -1, 1\}$, where $-1$ denotes an unlabeled sample, $0$ a defect-free sample and $1$ a defective sample; $\alpha$ is a tuning parameter. Once the matrix $A$ has been obtained, the projected matrix $y$ is the matrix formed by the row vectors of $A$, i.e. $y_i$ is the $i$-th row vector of $A$.

The principle of the invention is explained in more detail below:

1. Build the adjacency graphs

Let the sample set be $X = \{x_i, l\}$, $i = 1, 2, \ldots, n$, where $x_i \in R^d$, $d$ is the dimension of the samples, and $l$ is the class label of a sample, $l \in \{0, -1, 1\}$, where $-1$ denotes an unlabeled sample, $0$ a defect-free sample and $1$ a defective sample. Three adjacency graphs a, b and c are built. In adjacency graph a, all samples in the sample set are taken as the nodes of the graph, and an edge is created between two nodes if they are same-class samples and are neighbors. In adjacency graph b, all samples in the sample set are taken as the nodes of the graph, and an edge is created between two nodes if they are different-class samples and are neighbors. In adjacency graph c, all samples in the sample set are taken as the nodes of the graph, and an edge is created between two nodes if they are unlabeled samples and are neighbors.

Two methods can be used to define whether two nodes are neighbors. (a) ε-neighborhood: if nodes $i$ and $j$ satisfy $\|x_i - x_j\|^2 < \varepsilon$, an edge is created between nodes $i$ and $j$. (b) n nearest neighbors: nodes $i$ and $j$ are connected by an edge if node $i$ is among the n nearest neighbors of node $j$ or node $j$ is among the n nearest neighbors of node $i$; this relation is symmetric. The invention uses the n-nearest-neighbor method.
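Both neighbor definitions can be sketched directly from the description above. The function names and the toy data are hypothetical; `k` stands for the number of nearest neighbors (called n in the text), and ε is compared against the squared distance, as in the inequality above.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def epsilon_neighbors(X, eps):
    """Edges (i, j), i < j, whose squared distance is below eps."""
    n = len(X)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if sq_dist(X[i], X[j]) < eps}

def knn_neighbors(X, k):
    """Edges (i, j), i < j, where i is among the k nearest of j or j among those of i."""
    n = len(X)
    edges = set()
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: sq_dist(X[i], X[j]))
        for j in others[:k]:
            # Storing the pair in sorted order makes the relation symmetric.
            edges.add((min(i, j), max(i, j)))
    return edges

# Hypothetical one-dimensional samples.
X = [(0.0,), (1.0,), (3.0,), (10.0,)]
print(sorted(epsilon_neighbors(X, eps=5.0)))
print(sorted(knn_neighbors(X, k=1)))
```

Note that the nearest-neighbor rule always connects the isolated sample at 10.0 to something, whereas the ε rule leaves it unconnected; this is one practical difference between the two definitions.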

2. Select the weights. Two methods can be used to select the edge weights:

(a) Heat kernel: if nodes $i$ and $j$ are connected by an edge, the weight is $e^{-\|x_i - x_j\|^2 / t}$, otherwise it is 0.

(b) Simple definition: if nodes $i$ and $j$ are connected by an edge, the weight is 1, otherwise it is 0. This simple method avoids having to choose the heat kernel width $t$.

In addition, two types of misclassification arise in software defect prediction: a Type I error classifies a defective sample as defect-free, and a Type II error classifies a defect-free sample as defective. In software engineering practice, the cost of a Type I error is greater than that of a Type II error. The cost of a Type I error made when predicting the class of a software module is denoted $C_{1,0}$, and the cost of a Type II error is denoted $C_{0,1}$. From the above analysis, $C_{1,0} > C_{0,1}$. The invention makes full use of the cost information of the samples in order to avoid embedding different-class sample points that are far apart into a small neighborhood during dimensionality reduction. Because all nodes connected by an edge in the second adjacency graph are different-class samples, the cost-sensitive factor $C_{a,b}$ is introduced when building the distance weights of the second adjacency graph, which improves the mapping accuracy of the method. Here $C_{a,b}$ denotes the cost of misclassifying a class-a sample as a class-b sample and is set experimentally. Classes a and b each refer to the defective or the defect-free class, with $a \neq b$: either class a is the defective class and class b is the defect-free class, or class a is the defect-free class and class b is the defective class.

The three weights are then defined as follows:

$$W_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t}, & \text{if nodes } i, j \text{ are connected in graph a} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

$$B_{ij} = \begin{cases} C_{a,b}\, e^{-\|x_i - x_j\|^2 / t}, & \text{if nodes } i, j \text{ are connected in graph b} \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

$$S_{ij} = \begin{cases} e^{-\|x_i - x_j\|^2 / t}, & \text{if nodes } i, j \text{ are connected in graph c} \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$

3. Construct the objective function

To ensure that samples that are neighbors in the high-dimensional space remain neighbors in the projected space, i.e. to preserve the neighborhood structure of the samples, while keeping same-class samples as close together as possible and different-class samples as far apart as possible, let $y = [y_1, y_2, \ldots, y_k]$ be the samples projected into the low-dimensional space. The following maximization problem is then solved:

$$\max \frac{\sum_{ij} \|y_i - y_j\|^2 B_{ij}}{\sum_{ij} \|y_i - y_j\|^2 W_{ij} + \alpha \sum_{ij} \|y_i - y_j\|^2 S_{ij}} \qquad (10)$$

4. Obtain the projected samples y

Let $T_{ij} = W_{ij} + \alpha S_{ij}$. The objective function can then be rewritten as:

$$\max \frac{\sum_{ij} \|y_i - y_j\|^2 B_{ij}}{\sum_{ij} \|y_i - y_j\|^2 T_{ij}} \qquad (11)$$

Formula (11) can be expanded in terms of the matrices $D^B$ and $D^T$, where $D^B_{ii} = \sum_j B_{ji}$ and $D^T_{ii} = \sum_j T_{ji}$. That is, $D^B$ and $D^T$ are diagonal matrices whose diagonal entries are the row (or column) sums of $B$ and $T$ respectively. With $L^B = D^B - B$ and $L^T = D^T - T$, formula (11) can therefore be rewritten as an objective function in terms of the Laplacian matrices $L^B$ and $L^T$ (formula (14)).

Formula (14) can in turn be converted into the following generalized eigenvalue problem:

$$L^B a = \lambda L^T a \qquad (15)$$
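The step from the ratio objective to the generalized eigenvalue problem (15) can be sketched for a single projection vector. The following is a standard argument given under stated assumptions, not a reproduction of the derivation in the source: $a \in \mathbb{R}^n$ holds one coordinate per sample, $B$ and $T = W + \alpha S$ are assumed symmetric, $J$ is a name introduced here for the ratio, and $\top$ denotes transposition so that it is not confused with the matrix $T$.

```latex
\begin{align*}
\sum_{i,j} (a_i - a_j)^2 B_{ij}
  &= \sum_{i,j} \left(a_i^2 + a_j^2 - 2 a_i a_j\right) B_{ij}
   = 2\, a^{\top} D^B a - 2\, a^{\top} B a
   = 2\, a^{\top} L^B a, \\
\sum_{i,j} (a_i - a_j)^2 T_{ij}
  &= 2\, a^{\top} L^T a, \\
J(a) &= \frac{a^{\top} L^B a}{a^{\top} L^T a}, \qquad
\nabla J(a) = \frac{2}{a^{\top} L^T a}\left(L^B a - J(a)\, L^T a\right) = 0
\;\Longrightarrow\; L^B a = \lambda L^T a, \quad \lambda = J(a).
\end{align*}
```

Because the eigenvalue equals the value of the ratio at a stationary point, maximizing the ratio corresponds to choosing the eigenvectors with the largest eigenvalues.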

Solving (15) gives the matrix $A = \{a_1, a_2, \ldots, a_r\}$, whose vectors are the eigenvectors corresponding to the $r$ largest eigenvalues of the generalized eigenvalue equation. The projected samples $y$ are obtained from $A$ as $y_i = [f_1^{(i)}, f_2^{(i)}, \ldots, f_r^{(i)}]^T$, where $f_j^{(i)}$ denotes the $i$-th component of the vector $f_j$.

Correspondingly, the invention also proposes a software defect prediction system comprising a data preprocessing module, a dimensionality reduction module, and a training and prediction module, where the dimensionality reduction module further comprises an adjacency graph construction unit, a distance weight calculation unit and a Laplacian eigenmaps unit. The software defect prediction system provided by the embodiment of the invention corresponds to the software defect prediction method described above; for its operating principle and usage, refer to the related content of the method embodiment, which is not repeated here.

The method of the invention was tested on the NASA data sets, and the experimental results were compared with those of related feature extraction methods such as PCA, LE and LPP.

The NASA database contains 10 software project data sets, each collected from a NASA software system or subproject. In the experiments, five project data sets from the database were selected for testing: CM1, MW1, PC1, PC3 and PC4.

Method	CM1	MW1	PC1	PC3	PC4
PCA	0.23	0.29	0.23	0.28	0.24
LE	0.33	0.36	0.32	0.34	0.44
LPP	0.38	0.43	0.29	0.40	0.46
CSSLE	0.52	0.52	0.49	0.48	0.45

Table 1. Average Pd of each method on the five projects of the NASA data set

Method	CM1	MW1	PC1	PC3	PC4
PCA	0.03	0.07	0.03	0.14	0.02
LE	0.06	0.05	0.02	0.07	0.10
LPP	0.05	0.07	0.04	0.16	0.13
CSSLE	0.08	0.06	0.06	0.04	0.07

Table 2. Average Pf of each method on the five projects of the NASA data set

Method	CM1	MW1	PC1	PC3	PC4
PCA	0.33	0.31	0.29	0.25	0.35
LE	0.39	0.41	0.42	0.37	0.42
LPP	0.44	0.43	0.34	0.32	0.41
CSSLE	0.51	0.52	0.47	0.54	0.48

Table 3. Average F-measure of each method on the five projects of the NASA data set
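The source does not define the three indicators reported in Tables 1 to 3. The sketch below uses the definitions commonly used in defect prediction work, which are assumed here: Pd is the probability of detection (recall on defective modules), Pf is the probability of false alarm, and F-measure is the harmonic mean of precision and recall. The function name and the confusion-matrix counts are hypothetical.

```python
def defect_metrics(tp, fn, fp, tn):
    """Return (Pd, Pf, F-measure) from confusion-matrix counts.

    tp: defective modules predicted defective
    fn: defective modules predicted defect-free
    fp: defect-free modules predicted defective
    tn: defect-free modules predicted defect-free
    """
    pd = tp / (tp + fn)          # probability of detection (recall)
    pf = fp / (fp + tn)          # probability of false alarm
    precision = tp / (tp + fp)
    f_measure = 2 * precision * pd / (precision + pd)
    return pd, pf, f_measure

# Hypothetical counts for one test run.
pd, pf, f_measure = defect_metrics(tp=8, fn=2, fp=4, tn=36)
print(round(pd, 2), round(pf, 2), round(f_measure, 2))
```

Under these definitions a good predictor has high Pd and F-measure and low Pf, which is how the tables are read in the conclusions that follow.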

A careful study of the results in the three tables above leads to the following conclusions:

(1) On the Pd and F-measure indicators, the proposed method generally scores higher than all comparison methods, and on the Pf indicator it generally scores lower than the comparison methods, which indicates that the prediction performance of the proposed method is generally better than that of the comparison methods.

(2) The proposed CSSLE method improves markedly on the LE method. This shows that introducing semi-supervised learning and cost sensitivity is effective: with semi-supervised learning and cost sensitivity, the samples obtained after feature extraction are more discriminative, which improves the classification results.

The above describes only some embodiments of the invention, and the invention is not limited to the field of software engineering. It should be noted that a person of ordinary skill in the art can make various improvements and modifications without departing from the principle of the invention. Beyond software, the method is equally applicable to other high-dimensional samples, such as face recognition data and palmprint images, and such applications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A software defect prediction method, characterized in that it comprises the following steps:

Step 1: Perform dimensionality reduction on the training sample set to obtain the training sample data set projected into the low-dimensional space, which specifically includes:

(1) Divide the samples in the sample set into labeled samples and unlabeled samples, with the labeled samples further divided into defective samples and defect-free samples, and then build three types of adjacency graph, as follows:

For the first adjacency graph, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are same-class samples and are neighbors;

For the second adjacency graph, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are different-class samples and are neighbors;

For the third adjacency graph, take all samples in the sample set as the nodes of the graph, and create an edge between two nodes if they are unlabeled samples and are neighbors;

(3) Following the principle of the Laplacian eigenmaps algorithm, build an objective function from the distance weights between sample points determined in sub-step (2) and the distances between the mapped sample points, convert the objective function into a generalized eigenvalue equation, and solve the equation to obtain the eigenvector matrix and hence the sample set projected into the low-dimensional space;

Step 2: Perform dimensionality reduction on the test sample set following the procedure of Step 1 to obtain the reduced test sample data set;

Step 3: Using a naive Bayes classifier, train a prediction model on the training sample data set obtained in Step 1, predict the classes of the test sample data set obtained in Step 2, and obtain the software defect prediction result.

2. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1:

The training sample set is X = {x_i, l}, where x_i denotes a training sample, x_i ∈ R^d, d is the dimension of the training samples, i = 1, 2, ..., n, n is the number of samples, and l is the class label of a sample, l ∈ {0, -1, 1}, where -1 denotes an unlabeled sample, 0 denotes a defect-free sample, and 1 denotes a defective sample; samples of the same class means that the two nodes are both defective samples or both defect-free samples, and samples of different classes means that one of the two nodes is a defective sample and the other is a defect-free sample.
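
The label convention of claim 2 can be illustrated with a short Python sketch. It only shows how a pair of labels could be mapped to one of the three adjacency graphs; the function name `pair_kind` and the constant names are hypothetical, and returning `None` for a pair made of one labeled and one unlabeled sample is an assumption, since the claim text does not say how such pairs are treated.

```python
# Label convention from claim 2: 0 = defect-free, 1 = defective, -1 = unlabeled.
DEFECT_FREE, DEFECTIVE, UNLABELED = 0, 1, -1


def pair_kind(label_i, label_j):
    """Return which adjacency graph a neighboring pair could belong to.

    "same"      -> first-class graph (both defective or both defect-free)
    "different" -> second-class graph (one defective, one defect-free)
    "unlabeled" -> third-class graph (both unlabeled)
    None        -> mixed labeled/unlabeled pair (assumption: no edge)
    """
    labeled = (DEFECT_FREE, DEFECTIVE)
    if label_i in labeled and label_j in labeled:
        return "same" if label_i == label_j else "different"
    if label_i == UNLABELED and label_j == UNLABELED:
        return "unlabeled"
    return None
```

The pair must additionally satisfy the neighbor condition of claim 5 or claim 6 before an edge is created.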

3. The software defect prediction method according to claim 1, characterized in that, in sub-step (2) of Step 1, the distance weights between sample points are determined as follows:

For the first-class adjacency graph, if nodes i and j are connected by an edge, the weight W_ij is given by [formula missing in source]; otherwise W_ij = 0; t is the heat-kernel width;

For the second-class adjacency graph, if nodes i and j are connected by an edge, the weight B_ij is given by [formula missing in source]; otherwise B_ij = 0; where C_a,b is the cost-sensitive parameter.
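
The weight formulas of claim 3 are not reproduced in this text, so the following Python sketch rests on assumptions rather than on the claimed expressions. It assumes the conventional Laplacian eigenmaps heat kernel, exp(-||x_i - x_j||² / t), for the first-class graph, and assumes that the second-class weight is the same kernel scaled by the cost-sensitive parameter C_a,b. The function names and the `cost` dictionary are hypothetical.

```python
import math


def sq_dist(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def heat_kernel_weight(x_i, x_j, t):
    """Assumed first-class weight: exp(-||x_i - x_j||^2 / t), t = heat-kernel width."""
    return math.exp(-sq_dist(x_i, x_j) / t)


def cost_sensitive_weight(x_i, x_j, label_i, label_j, cost, t):
    """Assumed second-class weight: the heat kernel scaled by C_{a,b}.

    `cost` maps (a, b) label pairs to the cost of misclassifying class a
    as class b, e.g. {(1, 0): 5.0, (0, 1): 1.0}.
    """
    return cost[(label_i, label_j)] * heat_kernel_weight(x_i, x_j, t)
```

Under these assumptions, an edge between a defective and a defect-free sample receives a larger weight when the corresponding misclassification cost is larger. If C_1,0 and C_0,1 differ, the resulting matrix B is not symmetric, and how the claimed method resolves this is not stated here.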

4. The software defect prediction method according to claim 3, characterized in that sub-step (3) of Step 1 is specifically as follows:

A. Establishing the objective function: assuming that y = [y₁, y₂, ..., y_n] denotes the samples projected into the low-dimensional space, the following maximization problem is to be solved:

\max \frac{\sum_{ij} \| y_{i} - y_{j} \|^{2} B_{ij}}{\sum_{ij} \| y_{i} - y_{j} \|^{2} W_{ij} + \alpha \sum_{ij} \| y_{i} - y_{j} \|^{2} S_{ij}};

B. Converting the objective function of step A into a generalized eigenvalue problem: L^B a = λ L^T a; solving this equation yields the matrix A = {a₁, a₂, ..., a_r}, where λ denotes an eigenvalue; L^B = D^B - B and L^T = D^T - T; D^B and D^T are diagonal matrices whose diagonal elements are the row (or column) sums of B and T respectively, i.e. D^B_ii = Σ_j B_ij and D^T_ii = Σ_j T_ij; B is the weight matrix of the second-class adjacency graph, T = W + αS, W is the weight matrix of the first-class adjacency graph, and S is the weight matrix of the third-class adjacency graph;

C. Obtaining the projected samples y from the matrix A: y is formed from the row vectors of A, i.e. y_i is the i-th row vector of A, whose j-th entry is the i-th component of the vector f_j, j = 1, 2, ..., r, where r denotes the number of eigenvectors in A.
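
Steps A to C of claim 4 can be sketched numerically as follows. This is a minimal illustration using NumPy, not the claimed implementation. It assumes W, B and S are symmetric with non-negative entries, and it adds a small ridge term `reg` to L^T because a graph Laplacian is singular; the ridge term and the function names `laplacian` and `embed` are assumptions of this sketch.

```python
import numpy as np


def laplacian(M):
    """Graph Laplacian L = D - M, where D is the diagonal matrix of row sums."""
    return np.diag(M.sum(axis=1)) - M


def embed(W, B, S, alpha, r, reg=1e-8):
    """Solve L^B a = lambda L^T a and return (A, eigenvalues).

    A has shape (n, r); its i-th row is the projected sample y_i.
    """
    T = W + alpha * S
    L_B, L_T = laplacian(B), laplacian(T)
    n = L_T.shape[0]
    # L^T is only positive semidefinite, so regularize before factorizing
    # (assumption of this sketch, not part of the claim).
    C = np.linalg.cholesky(L_T + reg * np.eye(n))
    C_inv = np.linalg.inv(C)
    # With a = C^{-T} v the generalized problem becomes the standard
    # symmetric problem (C^{-1} L^B C^{-T}) v = lambda v.
    M = C_inv @ L_B @ C_inv.T
    M = (M + M.T) / 2.0                 # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:r]    # maximization: keep the largest r
    A = C_inv.T @ vecs[:, top]          # back-transform to generalized eigenvectors
    return A, vals[top]
```

The largest eigenvalues are kept because the objective in step A is a maximization. If the graph defined by T is disconnected, L^T has additional null directions and the ratio can become very large, so a practical implementation would need to handle that case separately.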

5. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1, edges are established according to the neighbor relationship between nodes using the ε-neighborhood method, namely: if nodes i and j satisfy ||x_i - x_j||² < ε, an edge is connected between nodes i and j, where ε is a preset value.
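
The ε-neighborhood rule of claim 5 can be sketched in plain Python as below. The function name `epsilon_edges` and the representation of edges as index pairs are illustrative choices, not part of the claim.

```python
def epsilon_edges(X, eps):
    """Return the undirected edges (i, j), i < j, with ||x_i - x_j||^2 < eps."""
    edges = set()
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            if d2 < eps:  # strict inequality, as in the claim
                edges.add((i, j))
    return edges
```

Note that the comparison is on the squared distance, so ε plays the role of a squared radius.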

6. The software defect prediction method according to claim 1, characterized in that, in sub-step (1) of Step 1, edges are established according to the neighbor relationship between nodes using the n-nearest-neighbor method, namely: when node i is among the n nearest neighbors of node j, or node j is among the n nearest neighbors of node i, nodes i and j are connected by an edge.
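
The nearest-neighbor rule of claim 6 can be sketched in the same way. The parameter is named `k` here to avoid confusion with n, the number of samples in claim 2; the function name `knn_edges` and the tie-breaking by index order are assumptions of this sketch.

```python
def knn_edges(X, k):
    """Return undirected edges (i, j), i < j, where i is among the k nearest
    neighbors of j or j is among the k nearest neighbors of i."""

    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    edges = set()
    for i in range(len(X)):
        # Other nodes sorted by squared distance to node i (ties keep index order).
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: d2(X[i], X[j]))
        for j in others[:k]:
            # Adding the edge from either endpoint implements the "or" rule.
            edges.add((min(i, j), max(i, j)))
    return edges
```

Because of the "or" condition, a node can end up with more than k incident edges.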

7. The software defect prediction method according to claim 1, characterized in that, in sub-step (2) of Step 1, the distance weights between sample points are determined as follows:

For the first-class and third-class adjacency graphs, if nodes i and j are connected by an edge, the weight is set to 1; otherwise it is set to 0;

For the second-class adjacency graph, if nodes i and j are connected by an edge, the weight is set to C_a,b; otherwise it is set to 0, where C_a,b is the cost-sensitive parameter.

8. The software defect prediction method according to claim 3 or 7, characterized in that the cost-sensitive parameter C_a,b represents the cost of misclassifying a class-a sample as a class-b sample, and C_a,b is a value set experimentally, wherein, when the class-a sample is a defective sample, the class-b sample is a defect-free sample; or, when the class-a sample is a defect-free sample, the class-b sample is a defective sample.

9. A software defect prediction system, characterized by comprising: a data preprocessing module, a dimensionality reduction module, and a training and prediction module,

The data preprocessing module is configured to obtain a training sample data set and a test sample data set, and to divide the samples in the sample set into labeled samples and unlabeled samples, the labeled samples being further divided into defective samples and defect-free samples;

The training and prediction module is configured to use a naive Bayes classifier to train a prediction model on the dimension-reduced training sample data set and to predict the classes of the dimension-reduced test sample data set, thereby obtaining the software defect prediction result;

A first construction unit, configured to construct the first-class adjacency graph, specifically: all samples in the sample set are taken as nodes of the graph, and an edge is established between two nodes if they are samples of the same class and are neighbors;

A second construction unit, configured to construct the second-class adjacency graph, specifically: all samples in the sample set are taken as nodes of the graph, and an edge is established between two nodes if they are samples of different classes and are neighbors;

A third construction unit, configured to construct the third-class adjacency graph, specifically: all samples in the sample set are taken as nodes of the graph, and an edge is established between two nodes if they are unlabeled samples and are neighbors;

A distance weight calculation unit, configured to determine, for each adjacency graph, the distance weights between sample points according to the connections between nodes, wherein, for the second-class adjacency graph, cost-sensitive information is introduced when calculating the distance weights;

A Laplacian eigenmaps unit, configured to use the principle of the Laplacian eigenmaps algorithm to establish an objective function from the distance weights between sample points determined by the distance weight calculation unit and the distances between the sample points after mapping, to convert the objective function into a generalized eigenvalue equation, and to solve the equation to obtain an eigenvector matrix, from which the sample set projected into the low-dimensional space is obtained.

10. The software defect prediction system according to claim 9, characterized in that the distance weight calculation unit comprises:

A first calculation unit, configured to calculate the distance weights between sample points in the first-class adjacency graph, specifically: if nodes i and j are connected by an edge, the weight W_ij is given by [formula missing in source]; otherwise W_ij = 0; t is the heat-kernel width;

A second calculation unit, configured to calculate the distance weights between sample points in the second-class adjacency graph, specifically: if nodes i and j are connected by an edge, the weight B_ij is given by [formula missing in source]; otherwise B_ij = 0; where C_a,b is the cost-sensitive parameter;

A third calculation unit, configured to calculate the distance weights between sample points in the third-class adjacency graph, specifically: if nodes i and j are connected by an edge, the weight S_ij is given by [formula missing in source]; otherwise S_ij = 0.