CN103984756B

CN103984756B - Semi-supervised probabilistic latent semantic analysis based software change log classification method

Info

Publication number: CN103984756B
Application number: CN201410234156.6A
Authority: CN
Inventors: 张小洪; 鄢萌; 傅颖; 徐玲; 杨梦宁; 洪明坚; 葛永新; 杨丹
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2014-05-29
Filing date: 2014-05-29
Publication date: 2017-04-12
Anticipated expiration: 2034-05-29
Also published as: CN103984756A

Abstract

The invention provides a software change log classification method based on semi-supervised probabilistic latent semantic analysis. Combining a word dictionary determined from prior knowledge, the method classifies software change logs objectively according to the probabilistic dependencies between words, between words and change log categories, and between software change logs and change log categories. It thereby avoids classifying software change logs by weight values derived from word frequency characteristics, improves classification accuracy, and solves the prior-art problems of errors and low accuracy caused by manually set weight values.

Description

Software change log classification method based on semi-supervised probability latent semantic analysis

Technical Field

The invention belongs to the technical field of computer information technology and software engineering, and particularly relates to a software change log classification method based on semi-supervised probability latent semantic analysis.

Background

Currently, in the field of computers, it is common to record the operations that have been performed in a processing log, so that the operations already performed can later be reviewed from the log and the subsequent operation policy can be determined accordingly.

In the course of running, managing and maintaining computer software, the software may need to be repaired because of bugs, errors or defects; functions or features may be added to adapt it to new environments or new requirements; or the software may be re-edited or restructured (also called software refactoring) to improve its readability, reusability, maintainability and the like. These operations change the program code and, correspondingly, generate software change logs, so that in later management and maintenance the change history of the software can be learned from the software change logs, problems occurring in the software can be counted and located, and the quality indicators, life cycle, operational risk, etc. of the software product can be further analyzed. The log database of a piece of software may contain a large number of software change logs, and to perform software-related analysis on them, the software change logs must first be classified so that the type of change operation recorded in each log is known.

In the prior art, software change logs are classified by computer using a software change log classification method, which avoids the heavy workload, long time consumption and low efficiency of manual classification. Some research on software change log classification has already been carried out in the field. A common classification method for software change logs is as follows:

Step one: software change logs are extracted from the log database of the software.

Step two: stem extraction is performed on each software change log using an existing stem extraction algorithm to obtain each word contained in the stem of the software change log. The purpose of stem extraction is to obtain characteristic words that represent the main content described by the software change log; words that carry no actual content, such as "the", "on", "a", "which", and the like, are generally removed during stem extraction.

Step three: based on word frequency characteristics, each word is given a weight value for a category according to how frequently it appears in the software change logs of that category; the more frequently a word appears, the greater the weight value it is assigned in that category. Each word in the extracted stem of a software change log to be classified is then compared against these words, and if a matching word appears in the stem, the category of the software change log is judged according to the category weight value of that word.
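The prior-art procedure in steps one to three can be sketched as follows. This is only a minimal illustration: the category labels `z1` and `z2`, the sample logs, and the choice of relative frequency as the weight value are assumptions made for the example, not details taken from the prior art itself.

```python
from collections import Counter

def learn_weights(labeled_logs):
    """Weight of a word in a category = its relative frequency in that category's logs."""
    counts = {}
    for category, words in labeled_logs:
        counts.setdefault(category, Counter()).update(words)
    weights = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        weights[category] = {w: c / total for w, c in counter.items()}
    return weights

def classify_by_weight(words, weights):
    """Pick the category whose summed word weights are largest."""
    scores = {cat: sum(table.get(w, 0.0) for w in words)
              for cat, table in weights.items()}
    return max(scores, key=scores.get)

# Hypothetical stemmed training logs (category, words); not taken from the patent.
training = [
    ("z1", ["fix", "crash", "null"]),
    ("z1", ["fix", "error"]),
    ("z2", ["add", "support", "option"]),
]
weights = learn_weights(training)
result = classify_by_weight(["fix", "option"], weights)
print(result)
```

Because the decision depends only on the summed weights, a single heavily weighted word can decide the category, which is the weakness discussed next.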

However, in such a software change log classification method, the category weight value of a word is often set manually and empirically, and for synonyms and polysemous words an inappropriate weight value easily leads to misclassification. For example, suppose two synonymous words both occur with high frequency in two different categories, and one of them occurs in a software change log that actually belongs to category B; if the weight value of that word in category A is greater, the software change log is misclassified as category A. As another example, suppose a polysemous word appears in category A with very high probability and therefore has a large weight value there; when it appears in a software change log that actually belongs to category B, where it does not carry its usual meaning, its overly large weight value in category A causes the software change log to be misclassified as category A. Such misclassifications easily lead log managers to erroneous software change log analysis results. Therefore, overcoming the errors caused by manually set weight values and further improving the accuracy of software change log classification is the primary problem to be solved in software change log classification technology.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a software change log classification method based on semi-supervised probability latent semantic analysis, which combines a word dictionary determined by prior knowledge, objectively classifies software change logs according to the probability correlation between words, the probability correlation between words and change log categories and the probability correlation between the software change logs and the change log categories, avoids classifying the software change logs according to the weighted value of the word frequency characteristic, overcomes the error caused by artificially setting the weighted value, improves the accuracy of software change log classification, and solves the problems of low accuracy and error in software change log classification caused by artificially setting the weighted value in the prior art.

In order to achieve the purpose, the invention adopts the following technical means:

the software change log classification method based on the semi-supervised probability latent semantic analysis comprises the following steps:

A) according to the priori knowledge, the change log categories are divided, the key words corresponding to each change log category are determined, and the set of all the key words corresponding to each change log category is used as a word dictionary; a key word corresponding to each change log category in the word dictionary is one word in a word stem obtained by performing word stem extraction on the software change logs belonging to the corresponding change log categories according to prior knowledge; the change log categories are specifically divided into three categories, namely:

Change log category 1, z₁: software change logs generated by repairing software bugs, errors, or defects;

Change log category 2, z₂: software change logs generated by adding software functions or software features;

Change log category 3, z₃: software change logs generated by re-editing or refactoring the software;

B) acquiring a plurality of software change logs that belong to the three change log categories and whose change log categories are known as training samples, and taking the set of all training samples as a training database; counting the number n_k of training samples in the training database that belong to the k-th change log category z_k, k ∈ {1, 2, ..., K}, where K is the number of change log categories, that is, K = 3; and performing stem extraction on each training sample in the training database to obtain each word contained in the stem of each training sample;

C) establishing a probabilistic latent semantic analysis model among the key words in the word dictionary, the software change logs and the change log categories:

$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$$

wherein P(w_j|z_k) represents the probability relationship between the j-th key word w_j in the word dictionary and the k-th change log category z_k, k ∈ {1, 2, 3}; P(z_k|d_i) represents the probability relationship between the k-th change log category z_k and the i-th software change log d_i; and P(d_i) represents the probability of the i-th software change log d_i with respect to the number of words in the training database, i.e. P(d_i) = n_i / N_base, where n_i represents the number of words contained in the stem of the i-th software change log d_i and N_base represents the sum of the numbers of words contained in the stems of all training samples in the training database;

D) constructing a likelihood function L of the probabilistic latent semantic analysis model:

$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(w_j, d_i) \log P(d_i, w_j)$$

where i ∈ {1, 2, ..., M} and M represents the total number of software change logs; j ∈ {1, 2, ..., N} and N represents the total number of key words in the word dictionary; and n(w_j, d_i) represents the number of occurrences of the j-th key word w_j in the word dictionary in the software change log d_i;

E) taking each training sample in the training database in turn as a software change log d_i, substituting it into the likelihood function L constructed in step D, and using an expectation-maximization algorithm to solve for the probability relationship between each key word w_j in the word dictionary and each change log category z_k, and for the probability relationship between each change log category z_k and each training sample taken as a software change log d_i; the probability relationship between each key word w_j and each change log category z_k obtained when the expectation-maximization algorithm converges is denoted P_c(w_j|z_k), and the probability relationship between each change log category z_k and each training sample taken as a software change log d_i obtained when the algorithm converges is denoted P_c(z_k|d_i), j ∈ {1, 2, ..., N}, i ∈ {1, 2, ..., M}, k ∈ {1, 2, ..., K}; and calculating the sample center probability relationship of each change log category z_k, wherein the sample center probability relationship of the k-th change log category z_k is:

here, the total number M of software change logs is taken to be the total number of training samples in the training database;

F) acquiring a software change log of a change log category to be determined as a sample to be tested, and taking a set of all samples to be tested as a test database; respectively carrying out stem extraction processing on each sample to be tested in the test database to obtain each word contained in the stem of each sample to be tested;

G) taking each sample to be tested in the test database in turn as a software change log d_i, substituting it into the likelihood function L constructed in step D, and using the expectation-maximization algorithm to solve for the probability relationship between each change log category z_k and each sample to be tested taken as a software change log d_i;

H) according to the probability relationship between each change log category z_k and each sample to be tested obtained in step G, and the sample center probability relationship of each change log category z_k, calculating the similarity Sim(d_x,m, z_k) between each sample to be tested and the sample center probability of each change log category z_k, and thereby determining the change log category to which each sample to be tested belongs as the category of maximum similarity:

$$X_m = \arg\max_{k \in \{1, 2, \ldots, K\}} \mathrm{Sim}(d_{x,m}, z_k)$$

wherein X_m represents the change log category to which an arbitrary m-th sample to be tested d_x,m belongs; the similarity Sim(d_x,m, z_k) is:

wherein P_x(z_k|d_x,m) represents the probability relationship between the k-th change log category z_k and the m-th sample to be tested d_x,m obtained in step G;

thus, a category label is added to each software change log as a sample to be tested according to the determined change log category to which each sample to be tested belongs.
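The decision in step H, choosing the category of maximum similarity, can be sketched as below. The similarity function used here (one minus the absolute difference between P_x(z_k|d_x,m) and a scalar category center) is purely an assumption for illustration, since the similarity and sample center formulas are not reproduced in this text; the center values are likewise hypothetical.

```python
def similarity(p_zk_given_d, centre_k):
    """Assumed similarity: the closer to the category centre, the more similar."""
    return 1.0 - abs(p_zk_given_d - centre_k)

def assign_category(p_z_given_d, centres, sim=similarity):
    """Return the 0-based index k that maximises Sim(d_x,m, z_k)."""
    scores = [sim(p, c) for p, c in zip(p_z_given_d, centres)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical sample centre values for z_1, z_2, z_3.
centres = [0.9, 0.8, 0.85]
# Hypothetical P_x(z_k|d_x,m) for one sample to be tested.
label = assign_category([0.05, 0.75, 0.20], centres)
print(label)
```

Passing `sim` as a parameter keeps the argmax rule separate from the similarity measure, so a different measure can be substituted without touching the decision logic.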

In the software change log classification method based on semi-supervised probabilistic latent semantic analysis, step E specifically includes:

e1) taking each training sample in the training database in turn as a software change log d_i, i ∈ {1, 2, ..., M}, and substituting it into the likelihood function L constructed in step D, where the total number M of software change logs is the total number of training samples in the training database; assigning random initial values to the probability relationship P(z_k|d_i) between the k-th change log category z_k and each training sample taken as a software change log d_i; and setting the initial value of the probability relationship P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k to:

$$P(w_j \mid z_k) = \frac{n_{j,k}}{n_k}$$

wherein n_k represents the number of training samples in the training database that belong to the k-th change log category z_k, k ∈ {1, 2, 3}; and n_j,k represents the number of occurrences of the j-th key word w_j in the word dictionary in the training samples in the training database that belong to the k-th change log category z_k;

e2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability P(z_k|d_i,w_j) of each change log category z_k, k ∈ {1, 2, ..., K}, from the current values of the probability relationships P(w_j|z_k) and P(z_k|d_i):

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$

e3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probability P(z_k|d_i,w_j) obtained in step e2 to update the values of the probability relationships P(w_j|z_k) and P(z_k|d_i) for each key word w_j in the word dictionary, j ∈ {1, 2, ..., N}, each training sample in the training database taken as a software change log d_i, i ∈ {1, 2, ..., M}, and each change log category z_k, k ∈ {1, 2, ..., K}:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{\sum_{j'=1}^{N} \sum_{i=1}^{M} n(w_{j'}, d_i)\, P(z_k \mid d_i, w_{j'})}$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{n(d_i)}$$

wherein n(w_j, d_i) represents the number of occurrences of the j-th key word w_j in the word dictionary in the software change log d_i; and n(d_i) = Σ_j n(w_j, d_i) represents the total number of occurrences of the key words in the word dictionary in the software change log d_i;

e4) repeating steps e2 to e3 until the expectation-maximization algorithm converges; the probability relationship between each key word w_j in the word dictionary and each change log category z_k obtained when the algorithm converges is denoted P_c(w_j|z_k), and the probability relationship between each change log category z_k and each training sample taken as a software change log d_i obtained when the algorithm converges is denoted P_c(z_k|d_i), j ∈ {1, 2, ..., N}, i ∈ {1, 2, ..., M}, k ∈ {1, 2, ..., K}; and calculating the sample center probability relationship of each change log category z_k, wherein the sample center probability relationship of the k-th change log category z_k is:

In the software change log classification method based on semi-supervised probabilistic latent semantic analysis, step G specifically includes:

g1) taking each sample to be tested in the test database in turn as a software change log d_i, i ∈ {1, 2, ..., M}, and substituting it into the likelihood function L constructed in step D, where the total number M of software change logs is now the total number of samples to be tested in the test database; assigning random initial values to the probability relationship P(z_k|d_i) between the k-th change log category z_k and each sample to be tested taken as a software change log d_i; and setting the initial value of the probability relationship P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k to:

P(w_j|z_k) = P_c(w_j|z_k);

g2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability P(z_k|d_i,w_j) of each change log category z_k, k ∈ {1, 2, ..., K}, from the current values of the probability relationships P(w_j|z_k) and P(z_k|d_i):

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$

g3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probability P(z_k|d_i,w_j) obtained in step g2 to update the values of the probability relationships P(w_j|z_k) and P(z_k|d_i) for each key word w_j in the word dictionary, j ∈ {1, 2, ..., N}, each sample to be tested in the test database taken as a software change log d_i, i ∈ {1, 2, ..., M}, and each change log category z_k, k ∈ {1, 2, ..., K}:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{\sum_{j'=1}^{N} \sum_{i=1}^{M} n(w_{j'}, d_i)\, P(z_k \mid d_i, w_{j'})}$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{n(d_i)}$$

wherein n(w_j, d_i) represents the number of occurrences of the j-th key word w_j in the word dictionary in the software change log d_i; and n(d_i) = Σ_j n(w_j, d_i) represents the total number of occurrences of the key words in the word dictionary in the software change log d_i;

g4) repeating steps g2 to g3 until the expectation-maximization algorithm converges, thereby obtaining, from the converged solution, the probability relationship between each change log category z_k and each sample to be tested taken as a software change log d_i.

Compared with the prior art, the invention has the following beneficial effects:

1. the software change log classification method based on semi-supervised probability latent semantic analysis combines the word dictionary determined by the prior knowledge, objectively classifies the software change logs according to the probability correlation between words, the probability correlation between words and change log categories and the probability correlation between the software change logs and the change log categories, avoids classifying the software change logs according to the weight value of the word frequency characteristic, improves the classification accuracy, and effectively solves the problems of error and lower accuracy of software change log classification caused by artificially setting the weight value in the prior art.

2. In the software change log classification method based on semi-supervised probabilistic latent semantic analysis, when determining the probability correlation characteristics of the different key words from the training database obtained by prior knowledge, the expectation-maximization algorithm is used for solving, and the initial value of the probability relationship P(w_j|z_k) is set from the occurrences of the key word w_j in the training samples belonging to the k-th change log category z_k, as in step e1. Compared with setting the initial value of P(w_j|z_k) randomly, this initial value reflects the objective distribution of the key word w_j in the training samples of the training database that belong to the k-th change log category z_k, which helps improve the convergence speed of the expectation-maximization algorithm, and objectively embodies the probability correlation characteristics of the different key words in the word dictionary based on the training database and the probability correlation characteristics between the training samples in the training database and the change log categories.

3. In the software change log classification method based on semi-supervised probabilistic latent semantic analysis, when determining the probability correlation characteristics between each sample to be tested in the test database and each change log category, the expectation-maximization algorithm is used for solving, and the initial value of the probability relationship P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k is set to P(w_j|z_k) = P_c(w_j|z_k). This makes use of the probability correlation characteristics of the different key words in the word dictionary obtained from the training database, improves the convergence speed of the expectation-maximization algorithm, and allows the probability correlation characteristics between each sample to be tested and the change log categories to be determined on the basis of the actual situation of the training database.

4. In the software change log classification method based on semi-supervised probability latent semantic analysis, the similarity of the sample to be detected on each change log category is comprehensively considered in the process of confirming the change log category to which the sample to be detected belongs, and the change log category to which the sample to be detected belongs is determined according to the maximum similarity, so that the software change logs of the category to be determined are objectively classified, the classification of the software change logs is avoided only according to the weight value of a word given to a certain category, and the classification judgment of the software change logs is more comprehensive and accurate.

5. The software change log classification method based on semi-supervised probability latent semantic analysis has good feasibility and effectiveness in practical application.

Drawings

FIG. 1 is a flow chart of a software change log classification method based on semi-supervised probability latent semantic analysis according to the present invention.

FIG. 2 is a statistical result chart of classification accuracy in a validation experiment according to the present invention.

Detailed Description

In existing software change log classification methods, the weight value given to a word in a change log category is set manually according to word frequency characteristics, and software change logs are classified according to these weight values. When synonyms and polysemous words occur, misclassification arises easily, which reduces the classification accuracy of software change logs and affects their analysis by log managers. To address this problem, the invention provides a software change log classification method based on semi-supervised probabilistic latent semantic analysis. Combining a word dictionary determined from prior knowledge, the method classifies software change logs objectively according to the probability correlation between words, between words and change log categories, and between software change logs and change log categories. It avoids classifying software change logs by weight values of word frequency characteristics, overcomes the errors caused by manually set weight values, and thereby improves the classification accuracy of software change logs.

The invention relates to a software change log classification method based on semi-supervised probability latent semantic analysis, which has a specific flow as shown in figure 1 and comprises the following steps:

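A) According to prior knowledge, dividing the change log categories, determining the key words corresponding to each change log category, and taking the set of all key words corresponding to the change log categories as a word dictionary; each key word corresponding to a change log category in the word dictionary is a word in the stems obtained by performing stem extraction on software change logs known from prior knowledge to belong to that change log category; the change log categories are specifically divided into three categories, namely:

Change log category 1, z₁: software change logs generated by repairing software bugs, errors, or defects;

Change log category 2, z₂: software change logs generated by adding software functions or software features;

Change log category 3, z₃: software change logs generated by re-editing or refactoring the software.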

In this step, the key words corresponding to each change log category must be determined from prior knowledge, for example by manual identification, or by using key words that are already known and confirmed for the three change log categories described above. For example, following Swanson's modification classification system, a word dictionary as shown in Table 1 can be constructed based on the work of Mauczka et al.

TABLE 1

Of course, in different software change log description environments, or different software types for specific applications, the specific resulting word dictionary may also be different.
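A word dictionary of this kind, and the key word counts n(w_j, d_i) derived from it, might be represented as in the following sketch. The key words shown are hypothetical placeholders chosen for the example; they are not the contents of Table 1.

```python
# Hypothetical key words per change log category; NOT the contents of Table 1.
WORD_DICTIONARY = {
    "z1": ["fix", "bug", "error"],
    "z2": ["add", "new", "support"],
    "z3": ["refactor", "clean", "renam"],
}

def vocabulary(word_dictionary):
    """Flatten the dictionary into an ordered list w_1..w_N without duplicates."""
    vocab = []
    for words in word_dictionary.values():
        for w in words:
            if w not in vocab:
                vocab.append(w)
    return vocab

def count_matrix(stemmed_logs, vocab):
    """Return n[i][j] = occurrences of key word w_j in software change log d_i."""
    index = {w: j for j, w in enumerate(vocab)}
    n = [[0] * len(vocab) for _ in stemmed_logs]
    for i, words in enumerate(stemmed_logs):
        for w in words:
            if w in index:  # words outside the dictionary are ignored
                n[i][index[w]] += 1
    return n

vocab = vocabulary(WORD_DICTIONARY)
n = count_matrix([["fix", "fix", "typo"], ["add", "support"]], vocab)
```

Only words that appear in the dictionary contribute to the counts, which is how the prior knowledge restricts the model to category-relevant vocabulary.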

B) Acquiring a plurality of software change logs that belong to the three change log categories and whose change log categories are known as training samples, and taking the set of all training samples as a training database; counting the number n_k of training samples in the training database that belong to the k-th change log category z_k, k ∈ {1, 2, ..., K}, where K is the number of change log categories, i.e., K = 3; and performing stem extraction on each training sample in the training database to obtain each word contained in the stem of each training sample.

In this step, the software change log as the training sample needs to belong to one of the three change log categories, and the category of the change log to which the software change log belongs is known. These software change logs as training samples also need to be acquired and judged by prior knowledge, for example, by manual identification and judgment, or by prior identification means, so as to obtain the change log category to which the software change logs belong. Meanwhile, in the three change log categories, a plurality of training samples belonging to each change log category should be provided; of course, the more training samples are obtained for each change log category, the more accurate the classification effect of the method of the present invention is. The stem extraction processing for each training sample can be realized by using a stem extraction algorithm in the prior art, and words without actual content representation meanings, such as "the", "on", "a", "which", and the like, need to be removed. The stem extraction algorithm itself is not a technical contribution of the present invention, and therefore, redundant description is not provided.
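As a rough illustration of the stem extraction step, the sketch below lower-cases a log message, removes the stop words named in the text, and strips a few common suffixes. The suffix rules are a toy assumption and are far simpler than an established stemming algorithm, which would normally be used instead.

```python
import re

# Stop words named in the text; a real list would be longer.
STOP_WORDS = {"the", "on", "a", "which"}
# Toy suffix rules (an assumption for this sketch, not a real stemmer).
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_stems(log_message):
    """Lower-case, tokenize, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", log_message.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

stems = extract_stems("Fixed the crashes on loading a file")
print(stems)  # ['fix', 'crash', 'load', 'file']
```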

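C) Establishing a probabilistic latent semantic analysis model among the key words in the word dictionary, the software change logs and the change log categories:

$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$$

wherein P(w_j|z_k) represents the probability relationship between the j-th key word w_j in the word dictionary and the k-th change log category z_k, k ∈ {1, 2, 3}; P(z_k|d_i) represents the probability relationship between the k-th change log category z_k and the i-th software change log d_i; and P(d_i) represents the probability of the i-th software change log d_i with respect to the number of words in the training database, i.e. P(d_i) = n_i / N_base, where n_i represents the number of words contained in the stem of the i-th software change log d_i and N_base represents the sum of the numbers of words contained in the stems of all training samples in the training database.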

The method of the invention utilizes a probabilistic latent semantic analysis model to establish the probability correlation between words, between words and change log categories and between software change logs and change log categories.
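The model can be made concrete with a small numeric sketch. It assumes the standard probabilistic latent semantic analysis factorization, P(d_i, w_j) = P(d_i) Σ_k P(w_j|z_k) P(z_k|d_i), together with P(d_i) = n_i / N_base as described above; the probability tables below are made-up values for illustration only.

```python
def document_prior(word_counts):
    """P(d_i) = n_i / N_base, with n_i the number of words in the stem of log d_i."""
    n_base = sum(word_counts)
    return [n_i / n_base for n_i in word_counts]

def joint_probability(p_d, p_w_given_z, p_z_given_d, i, j):
    """P(d_i, w_j) = P(d_i) * sum_k P(w_j|z_k) * P(z_k|d_i)."""
    K = len(p_w_given_z)
    mixture = sum(p_w_given_z[k][j] * p_z_given_d[i][k] for k in range(K))
    return p_d[i] * mixture

# Made-up example: 2 logs, 2 key words, K = 3 categories.
p_d = document_prior([2, 6])                            # [0.25, 0.75]
p_w_given_z = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]      # rows: z_k, columns: w_j
p_z_given_d = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]        # rows: d_i, columns: z_k
p_00 = joint_probability(p_d, p_w_given_z, p_z_given_d, 0, 0)
```

Since each row of both tables sums to one, the joint probabilities over all (i, j) pairs also sum to one.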

D) Constructing a likelihood function L of the probabilistic latent semantic analysis model:

$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(w_j, d_i) \log P(d_i, w_j)$$

where i ∈ {1, 2, ..., M} and M represents the total number of software change logs; j ∈ {1, 2, ..., N} and N represents the total number of key words in the word dictionary; and n(w_j, d_i) represents the number of occurrences of the j-th key word w_j in the word dictionary in the software change log d_i.

The method solves the probability latent semantic analysis model by using the likelihood function L.
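A minimal sketch of evaluating such a likelihood is given below. It assumes the usual log-likelihood form for probabilistic latent semantic analysis, L = Σ_i Σ_j n(w_j, d_i) log P(d_i, w_j); the function and variable names are illustrative.

```python
import math

def log_likelihood(n, p_d, p_w_given_z, p_z_given_d):
    """L = sum_i sum_j n(w_j, d_i) * log P(d_i, w_j)."""
    K = len(p_w_given_z)
    total = 0.0
    for i, row in enumerate(n):
        for j, count in enumerate(row):
            if count == 0:
                continue  # zero counts contribute nothing to the sum
            mixture = sum(p_w_given_z[k][j] * p_z_given_d[i][k] for k in range(K))
            total += count * math.log(p_d[i] * mixture)
    return total

# Made-up example with 2 logs, 2 key words and 2 categories.
n = [[1, 0], [0, 2]]
value = log_likelihood(n, [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The expectation-maximization iterations described next never decrease this quantity, so it can also serve as a convergence check.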

E) Taking each training sample in the training database in turn as a software change log d_i, substituting it into the likelihood function L constructed in step D, and using an expectation-maximization algorithm to solve for the probability relationship between each key word w_j in the word dictionary and each change log category z_k, and for the probability relationship between each change log category z_k and each training sample taken as a software change log d_i. This step solves, for the training database, the probability correlations between key words and change log categories and between training samples and change log categories, and thereby determines the probability correlation characteristics of the different key words from the training database obtained by prior knowledge. The specific process of this step is as follows:

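e1) taking each training sample in the training database in turn as a software change log d_i, i ∈ {1, 2, ..., M}, and substituting it into the likelihood function L constructed in step D, where the total number M of software change logs is the total number of training samples in the training database; assigning random initial values to the probability relationship P(z_k|d_i) between the k-th change log category z_k and each training sample taken as a software change log d_i; and setting the initial value of the probability relationship P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k to:

$$P(w_j \mid z_k) = \frac{n_{j,k}}{n_k}$$

wherein n_k represents the number of training samples in the training database that belong to the k-th change log category z_k, k ∈ {1, 2, 3}; and n_j,k represents the number of occurrences of the j-th key word w_j in the word dictionary in the training samples in the training database that belong to the k-th change log category z_k;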

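e2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability P(z_k|d_i,w_j) of each change log category z_k, k ∈ {1, 2, ..., K}, from the current values of the probability relationships P(w_j|z_k) and P(z_k|d_i):

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$

e3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probability P(z_k|d_i,w_j) obtained in step e2 to update the values of the probability relationships P(w_j|z_k) and P(z_k|d_i) for each key word w_j in the word dictionary, j ∈ {1, 2, ..., N}, each training sample in the training database taken as a software change log d_i, i ∈ {1, 2, ..., M}, and each change log category z_k, k ∈ {1, 2, ..., K}:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{\sum_{j'=1}^{N} \sum_{i=1}^{M} n(w_{j'}, d_i)\, P(z_k \mid d_i, w_{j'})}$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{n(d_i)}$$

wherein n(w_j, d_i) represents the number of occurrences of the j-th key word w_j in the word dictionary in the software change log d_i; and n(d_i) = Σ_j n(w_j, d_i) represents the total number of occurrences of the key words in the word dictionary in the software change log d_i;

e4) repeating steps e2 to e3 until the expectation-maximization algorithm converges; the probability relationship between each key word w_j in the word dictionary and each change log category z_k obtained when the algorithm converges is denoted P_c(w_j|z_k), and the probability relationship between each change log category z_k and each training sample taken as a software change log d_i obtained when the algorithm converges is denoted P_c(z_k|d_i); and calculating the sample center probability relationship of each change log category z_k.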

In this step, the mature expectation-maximization algorithm is used for solving, and the initial value of the probability relationship P(w_j|z_k) is set from the occurrences of the key word w_j in the training samples belonging to the k-th change log category z_k, as in step e1. Compared with setting the initial value of P(w_j|z_k) randomly, this initial value reflects the objective distribution of the key word w_j in the training samples of the training database that belong to the k-th change log category z_k, which helps improve the convergence speed of the expectation-maximization algorithm. The probability relationship between each key word w_j in the word dictionary and each change log category z_k, and the probability relationship between each change log category z_k and each training sample taken as a software change log d_i, both obtained when the expectation-maximization algorithm converges, objectively embody the probability correlation characteristics of the different key words in the word dictionary based on the training database and the probability correlation characteristics between each training sample in the training database and each change log category; these characteristics serve as the basis for subsequently classifying the software change logs to be tested.
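The training procedure of step E can be sketched in plain Python as follows. Several details are assumptions made for the sketch rather than statements of the patented method: the E-step and M-step use the standard update rules of probabilistic latent semantic analysis; the initial P(w_j|z_k) is the per-category key word frequency normalized over key words; a small constant `eps` is added to avoid zero probabilities; and a fixed number of iterations stands in for a convergence test.

```python
import random

def normalise(values):
    """Scale non-negative values to sum to 1 (uniform if they are all zero)."""
    total = sum(values)
    if total <= 0:
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]

def train_plsa(n, labels, K, iterations=50, seed=0, eps=1e-9):
    """n[i][j]: count of key word j in training log i; labels[i] in 0..K-1."""
    rng = random.Random(seed)
    M, N = len(n), len(n[0])
    # Initial P(w_j|z_k) from class-wise key word counts (assumed normalised form).
    p_wz = [normalise([eps + sum(n[i][j] for i in range(M) if labels[i] == k)
                       for j in range(N)]) for k in range(K)]
    # Initial P(z_k|d_i): random values, as described for step e1.
    p_zd = [normalise([rng.random() + eps for _ in range(K)]) for _ in range(M)]
    for _ in range(iterations):
        # E-step: posterior P(z_k|d_i,w_j) for every (i, j) pair.
        post = [[normalise([p_wz[k][j] * p_zd[i][k] for k in range(K)])
                 for j in range(N)] for i in range(M)]
        # M-step: re-estimate both tables from the expected counts.
        p_wz = [normalise([eps + sum(n[i][j] * post[i][j][k] for i in range(M))
                           for j in range(N)]) for k in range(K)]
        p_zd = [normalise([eps + sum(n[i][j] * post[i][j][k] for j in range(N))
                           for k in range(K)]) for i in range(M)]
    return p_wz, p_zd

# Made-up counts for 4 training logs over 3 key words, with known categories.
n = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [0, 0, 4]]
labels = [0, 0, 1, 2]
p_c_wz, p_c_zd = train_plsa(n, labels, K=3)
```

The returned tables play the roles of P_c(w_j|z_k) and P_c(z_k|d_i); the label-informed initialization is what ties each latent category to one of the known change log categories.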

F) Acquiring a software change log of a change log category to be determined as a sample to be tested, and taking a set of all samples to be tested as a test database; and respectively carrying out stem extraction processing on each sample to be tested in the test database to obtain each word contained in the stem of each sample to be tested.

The number of software change logs in the test database that serve as samples to be tested is determined by how many software change logs actually need their change log category determined. The software change log classification method of the invention is applicable to a test database containing any number of software change logs whose change log categories are to be determined. The stem extraction for each sample to be tested can be realized with a prior-art stem extraction algorithm, and words that carry no actual content, such as "the", "on", "a", "which", and the like, likewise need to be removed.

G) Taking each sample to be tested in the test database in turn as a software change log d_i, substituting it into the likelihood function L constructed in step D, and using the expectation-maximization algorithm to solve for the probability relationship between each change log category z_k and each sample to be tested taken as a software change log d_i. In this step, the probability correlation characteristics between each sample to be tested in the test database and each change log category are determined from the probability correlation characteristics of the different key words in the word dictionary obtained from the training database. The specific process of this step is as follows:

g1) Each sample to be tested in the test database is taken as a software change log d_i, i ∈ {1, 2, …, M}, and substituted into the likelihood function L constructed in step D, where the total number M of software change logs is taken to be the total number of samples to be tested in the test database. The initial value of the probability relation P(z_k|d_i) between the k-th change log category z_k and a sample to be tested serving as a software change log d_i is chosen randomly, and the initial value of the probability relation P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k is set to:

P(w_j|z_k) = P_c(w_j|z_k);

g3) In the M-step of the expectation-maximization algorithm, the conditional distribution probability P(z_k|d_i, w_j) obtained in step g2 is used to update the values of the probability relations P(w_j|z_k) and P(z_k|d_i) for each key word w_j in the word dictionary, j ∈ {1, 2, …, N}, each sample to be tested serving as a software change log d_i in the test database, i ∈ {1, 2, …, M}, and each change log category z_k, k ∈ {1, 2, …, K}:

This step is also solved with the expectation-maximization algorithm. During the solution, the initial value of the probability relation P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k is set to P(w_j|z_k) = P_c(w_j|z_k). That is, the probabilistic correlation among the key words in the word dictionary learned from the training database is used to solve for the probability relation between each change log category z_k and each sample to be tested. On the one hand this helps improve the convergence speed of the expectation-maximization algorithm; on the other hand it grounds the probabilistic correlation between each sample to be tested and the change log categories in the actual content of the training database.
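The E-step and M-step updates used in this step can be sketched in pure Python as follows. This is a minimal sketch, not the implementation of the invention: the function name `plsa_em`, the fixed iteration count used in place of a convergence test, the random seed and the toy data are all assumptions. The argument `p_w_z` plays the role of the initial values P(w_j|z_k), for example the converged training values P_c(w_j|z_k).

```python
import random

def plsa_em(counts, p_w_z, iterations=50, seed=0):
    """Run E-step/M-step updates on a log-by-word count matrix.

    counts[i][j] : n(w_j, d_i), occurrences of dictionary word j in log i
    p_w_z[k][j]  : initial P(w_j | z_k)
    Returns (p_w_z, p_z_d) after a fixed number of iterations.
    """
    rng = random.Random(seed)
    M, N, K = len(counts), len(counts[0]), len(p_w_z)
    p_w_z = [row[:] for row in p_w_z]
    # Random initial P(z_k | d_i), normalized per log.
    p_z_d = []
    for _ in range(M):
        row = [rng.random() + 1e-3 for _ in range(K)]
        total = sum(row)
        p_z_d.append([value / total for value in row])

    for _ in range(iterations):
        # E-step: P(z_k | d_i, w_j) is proportional to P(w_j | z_k) P(z_k | d_i).
        post = [[[0.0] * K for _ in range(N)] for _ in range(M)]
        for i in range(M):
            for j in range(N):
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(K)]
                total = sum(joint)
                if total > 0:
                    post[i][j] = [value / total for value in joint]
        # M-step: re-estimate P(w_j | z_k), normalized over the dictionary.
        for k in range(K):
            num = [sum(counts[i][j] * post[i][j][k] for i in range(M))
                   for j in range(N)]
            denom = sum(num)
            if denom > 0:
                p_w_z[k] = [value / denom for value in num]
        # M-step: re-estimate P(z_k | d_i), dividing by n(d_i).
        for i in range(M):
            n_d = sum(counts[i])
            if n_d > 0:
                p_z_d[i] = [sum(counts[i][j] * post[i][j][k] for j in range(N)) / n_d
                            for k in range(K)]
    return p_w_z, p_z_d

# Hypothetical toy data: two logs, three dictionary words, two categories.
toy_counts = [[3, 0, 1], [0, 2, 0]]
toy_init = [[0.7, 0.1, 0.2], [0.1, 0.8, 0.1]]
toy_p_w_z, toy_p_z_d = plsa_em(toy_counts, toy_init, iterations=20)
```

Starting from informative values of P(w_j|z_k) rather than random ones means that the first E-step already assigns words to categories in a way that reflects the training data, which is the motivation given in the text for the improved convergence speed.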

It can be seen that, when confirming the change log category to which a sample to be tested belongs, the method of the invention jointly considers the probability relation between each change log category z_k and each sample to be tested obtained in step G and the sample center probability relation of each change log category z_k, computes the similarity of the sample to be tested to each change log category, and assigns the sample to the change log category with the maximum similarity. Software change logs of undetermined category are thus classified objectively, rather than solely according to a weight value assigned to a word within a certain category, so that the classification judgment is more comprehensive and accurate.

Compared with existing software change log classification methods, the software change log classification method based on semi-supervised probability latent semantic analysis of the invention incorporates a word dictionary determined from prior knowledge and classifies software change logs objectively according to the probabilistic correlation between words, between words and change log categories, and between software change logs and change log categories. It thereby avoids classifying software change logs according to weight values of word frequency characteristics, improves classification accuracy, and effectively solves the prior-art problems of errors and low accuracy caused by manually set weight values.

Verification experiment:

The effectiveness and accuracy of the classification method were also verified experimentally. Using Cohen's Kappa value as the measure of classification accuracy, the software change log classification method based on semi-supervised probability latent semantic analysis was compared with the "firstkey" classification method proposed by Hattori et al., which classifies log data by assigning weights to word frequency characteristics. In this experiment, the two classification methods were applied to five existing large open source projects, namely Bugzilla, Wireshark, Boost, Firebird and Python; the software change log data sets of the five open source projects are shown in Table 2.

TABLE 2

The software change log data of the five open source projects are classified into the following three categories by respectively adopting a 'firstkey' classification method and the classification method of the invention:

With Cohen's Kappa value as the measure of classification accuracy, the resulting classification accuracy statistics are shown in FIG. 2. As can be seen from FIG. 2, the classification method of the invention exceeds the "firstkey" classification method proposed by Hattori et al. in classification accuracy on the software change logs of every open source project, which shows that the classification accuracy of the software change log classification method based on semi-supervised probability latent semantic analysis is clearly superior to that of the existing software change log classification method. Meanwhile, the classification results on the software change logs of the five open source projects reach an average Cohen's Kappa value of 0.53. According to the measurement standard proposed by El Emam, an average Cohen's Kappa value above 0.50 indicates that the classification result agrees very well with the true result, so the software change log classification method based on semi-supervised probability latent semantic analysis has good feasibility and effectiveness in practical application.
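For readers unfamiliar with the measure, Cohen's Kappa can be computed from predicted and true category labels using its standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The Python sketch below is illustrative only: the function name `cohens_kappa` and the toy labels are assumptions and are not data from the experiment.

```python
from collections import Counter

def cohens_kappa(predicted, actual):
    """Cohen's kappa for two equal-length label lists.

    Undefined (division by zero) when chance agreement p_e equals 1,
    i.e. when both lists contain one and the same single label.
    """
    n = len(actual)
    # Observed agreement: fraction of positions where the labels match.
    p_o = sum(p == a for p, a in zip(predicted, actual)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    pred_counts, act_counts = Counter(predicted), Counter(actual)
    p_e = sum(pred_counts[c] * act_counts[c] for c in act_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels for six change logs over three categories (0, 1, 2).
predicted = [0, 0, 1, 1, 2, 2]
actual = [0, 0, 1, 2, 2, 2]
kappa = cohens_kappa(predicted, actual)
```

Here five of six labels agree (p_o = 5/6) and the chance agreement is p_e = 1/3, giving a kappa of 0.75.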

Finally, the above embodiments are intended only to illustrate the technical solutions of the invention and not to limit them. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the invention without departing from their spirit and scope, and all such modifications and substitutions should be covered by the claims of the invention.

Claims

1. The software change log classification method based on semi-supervised probability latent semantic analysis is characterized by comprising the following steps of:

B) acquiring a plurality of software change logs which belong to the three change log categories and whose change log categories are known as training samples, and taking the set of all training samples as a training database; counting the number n_k of training samples belonging to the k-th change log category z_k in the training database, k ∈ {1, 2, …, K}, where K is the number of change log categories, that is, K = 3; and performing stem extraction processing on each training sample in the training database to obtain each word contained in the stem of each training sample;

P (w_{j}, d_{i}) = P (d_{i}) Σ_{k = 1}^{K} [P (z_{k} | d_{i}) P (w_{j} | z_{k})];

L = Π_{i = 1}^{M} Π_{j = 1}^{N} P (d_{i}) Σ_{k = 1}^{K} [P (z_{k} | d_{i}) P {(w_{j} | z_{k})}^{n (w_{j}, d_{i})}];

where i ∈ {1, 2, …, M}, M represents the total number of software change logs, j ∈ {1, 2, …, N}, N represents the total number of key words in the word dictionary, and n(w_j, d_i) represents the number of occurrences of the j-th key word w_j of the word dictionary in the software change log d_i;

E) taking each training sample in the training database as a software change log d_i, substituting it into the likelihood function L constructed in step D, and using the expectation-maximization algorithm to solve for the probability relation between each key word w_j in the word dictionary and each change log category z_k and the probability relation between each change log category z_k and each training sample serving as a software change log d_i; denoting the probability relation between each key word w_j in the word dictionary and each change log category z_k obtained when the expectation-maximization algorithm converges as P_c(w_j|z_k), and denoting the probability relation between each change log category z_k and each training sample serving as a software change log d_i obtained on convergence as P_c(z_k|d_i), j ∈ {1, 2, …, N}, i ∈ {1, 2, …, M}, k ∈ {1, 2, …, K}; and calculating the sample center probability relation of each change log category z_k, wherein the sample center probability relation \overline{P}_c(z_k) of the k-th change log category z_k is:

\overline{P}_{c} (z_{k}) = \frac{Σ_{i = 1}^{M} P_{c} (z_{k} | d_{i})}{M};

wherein step E specifically comprises:

e1) taking each training sample in the training database as a software change log d_i, i ∈ {1, 2, …, M}, and substituting it into the likelihood function L constructed in step D, where the total number M of software change logs is taken to be the total number of training samples in the training database; choosing randomly the initial value of the probability relation P(z_k|d_i) between the k-th change log category z_k and a training sample serving as a software change log d_i, and setting the initial value of the probability relation P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k to:

P (w_{j} | z_{k}) = \frac{n_{j, k}}{n_{k}};

wherein n_k denotes the number of training samples belonging to the k-th change log category z_k in the training database, k ∈ {1, 2, 3}; n_{j,k} denotes the number of occurrences of the j-th key word w_j of the word dictionary in the training samples belonging to the k-th change log category z_k in the training database;

e2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability P(z_k|d_i, w_j) of each change log category z_k, k ∈ {1, 2, …, K}, from the current probability relations P(w_j|z_k) and P(z_k|d_i):

P (z_{k} | d_{i}, w_{j}) = \frac{P (w_{j} | z_{k}) P (z_{k} | d_{i})}{Σ_{k = 1}^{K} [P (w_{j} | z_{k}) P (z_{k} | d_{i})]};

e3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probability P(z_k|d_i, w_j) obtained in step e2 to update the values of the probability relations P(w_j|z_k) and P(z_k|d_i) for each key word w_j in the word dictionary, j ∈ {1, 2, …, N}, each training sample serving as a software change log d_i in the training database, i ∈ {1, 2, …, M}, and each change log category z_k, k ∈ {1, 2, …, K}:

P (w_{j} | z_{k}) = \frac{Σ_{i = 1}^{M} [n (w_{j}, d_{i}) P (z_{k} | d_{i}, w_{j})]}{Σ_{j = 1}^{N} Σ_{i = 1}^{M} [n (w_{j}, d_{i}) P (z_{k} | d_{i}, w_{j})]}; P (z_{k} | d_{i}) = \frac{Σ_{j = 1}^{N} [n (w_{j}, d_{i}) P (z_{k} | d_{i}, w_{j})]}{n (d_{i})};

e4) repeating steps e2 to e3 until the expectation-maximization algorithm converges; denoting the probability relation between each key word w_j in the word dictionary and each change log category z_k obtained on convergence as P_c(w_j|z_k), and denoting the probability relation between each change log category z_k and each training sample serving as a software change log d_i obtained on convergence as P_c(z_k|d_i), j ∈ {1, 2, …, N}, i ∈ {1, 2, …, M}, k ∈ {1, 2, …, K}; and calculating the sample center probability relation of each change log category z_k, wherein the sample center probability relation \overline{P}_c(z_k) of the k-th change log category z_k is:

\overline{P}_{c} (z_{k}) = \frac{Σ_{i = 1}^{M} P_{c} (z_{k} | d_{i})}{M};

G) taking each sample to be tested in the test database as a software change log d_i, substituting it into the likelihood function L constructed in step D, and using the expectation-maximization algorithm to solve for the probability relation between each change log category z_k and each sample to be tested serving as a software change log d_i;

X_{m} = \arg \max_{k} [Sim (d_{x, m}, z_{k})], k ∈ {1, 2, 3};

Sim (d_{x, m}, z_{k}) = \frac{Σ_{k = 1}^{K} [P_{x} (z_{k} | d_{x, m}) \cdot \overline{P}_{c} (z_{k})]}{\sqrt{Σ_{k = 1}^{K} P_{x} {(z_{k} | d_{x, m})}^{2} \cdot Σ_{k = 1}^{K} \overline{P}_{c} {(z_{k})}^{2}}};

2. The software change log classification method based on semi-supervised probability latent semantic analysis according to claim 1, wherein the step G specifically comprises:

g1) taking each sample to be tested in the test database as a software change log d_i, i ∈ {1, 2, …, M}, and substituting it into the likelihood function L constructed in step D, where the total number M of software change logs is taken to be the total number of samples to be tested in the test database; choosing randomly the initial value of the probability relation P(z_k|d_i) between the k-th change log category z_k and a sample to be tested serving as a software change log d_i, and setting the initial value of the probability relation P(w_j|z_k) between the j-th key word w_j in the word dictionary and the k-th change log category z_k to:

P(w_j|z_k) = P_c(w_j|z_k);

g2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability P(z_k|d_i, w_j) of each change log category z_k, k ∈ {1, 2, …, K}, from the current probability relations P(w_j|z_k) and P(z_k|d_i):

P (z_{k} | d_{i}, w_{j}) = \frac{P (w_{j} | z_{k}) P (z_{k} | d_{i})}{Σ_{k = 1}^{K} [P (w_{j} | z_{k}) P (z_{k} | d_{i})]};

g3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probability P(z_k|d_i, w_j) obtained in step g2 to update the values of the probability relations P(w_j|z_k) and P(z_k|d_i) for each key word w_j in the word dictionary, j ∈ {1, 2, …, N}, each sample to be tested serving as a software change log d_i in the test database, i ∈ {1, 2, …, M}, and each change log category z_k, k ∈ {1, 2, …, K}:

P (w_{j} | z_{k}) = \frac{Σ_{i = 1}^{M} [n (w_{j}, d_{i}) P (z_{k} | d_{i}, w_{j})]}{Σ_{j = 1}^{N} Σ_{i = 1}^{M} [n (w_{j}, d_{i}) P (z_{k} | d_{i}, w_{j})]}; P (z_{k} | d_{i}) = \frac{Σ_{j = 1}^{N} [n (w_{j}, d_{i}) P (z_{k} | d_{i}, w_{j})]}{n (d_{i})};