CN111782807B - Self-admitted technical debt detection and classification method based on multi-method ensemble learning - Google Patents

Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Info

Publication number
CN111782807B
CN111782807B · CN202010568813.6A · CN202010568813A
Authority
CN
China
Prior art keywords
annotation
self
feature
classifier
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010568813.6A
Other languages
Chinese (zh)
Other versions
CN111782807A (en)
Inventor
殷茗
徐悦然
田嘉毅
朱奎宇
马怀宇
张小港
薛禹坤
吴瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN202010568813.6A priority Critical patent/CN111782807B/en
Publication of CN111782807A publication Critical patent/CN111782807A/en
Application granted granted Critical
Publication of CN111782807B publication Critical patent/CN111782807B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning, which comprises the following five steps: preprocessing the feature words; selecting the top k most useful features to train the classifier; training the corresponding sub-classifiers with multinomial naive Bayes and linear logistic regression; integrating the sub-classifier predictions through a voting rule to obtain precision and recall, from which the F1 value is computed as the subsequent evaluation criterion; and finally clustering the features that occur frequently in the experiments and have high information gain values, thereby classifying the detected technical debt.

Description

Self-admitted technical debt detection and classification method based on multi-method ensemble learning
Technical Field
The invention belongs to the technical field of software development, and particularly relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning.
Background
The document "Huang Q, Shihab E, Xia X, et al. Identifying self-admitted technical debt in open source projects using text mining [J]. Empirical Software Engineering, 2017." discloses a method for automatically detecting self-admitted technical debt using an integrated classifier. The method uses source code comments from different software projects to analyze the comments to be detected. It preprocesses the source files, selects and screens features with feature selection, trains each sub-classifier with multinomial naive Bayes (Naive Bayes Multinomial), and finally lets the integrated classifier composed of the sub-classifiers predict, according to a voting rule, whether a sentence contains self-admitted technical debt. The method has been verified to perform better than the text-pattern-based method and the NLP classifier method. However, because a single classifier training method is used, the accuracy of the individual classifiers is low, making the final result inaccurate.
Self-admitted technical debt (SATD) is a term proposed to describe debt deliberately introduced during software development, typically so that a project can be developed more rapidly in the short term, at the cost of future maintenance. The focus of the present invention is to detect SATD accurately in order to help reduce the cost of maintaining it. Some methods to date rely on manual detection, and many advanced methods automatically identify SATD with a single natural-language detection model. The inefficiency of manual detection is obvious, while a single natural-language model yields a classifier with low performance and poor flexibility. Although previous detection work has produced reasonable results, the diversity of SATD across projects and characteristics such as semantic change pose great challenges for accurate detection. Therefore, to improve the accuracy and flexibility of SATD detection, the invention proposes a multi-method ensemble learning SATD detection method: 8 open-source projects are taken as the data set, the annotation texts are preprocessed, features are extracted with a feature selection method, the Naive Bayes Multinomial and Simple Logistic methods are then used to train the sub-classifiers, and finally the sub-classifiers are combined into an integrated classifier whose annotation classification labels are obtained by a voting rule, so that self-admitted technical debt is identified accurately. The experimental results are compared with three experimental baselines (pattern-based, single-method ensemble learning, and NLP); the results show that the proposed SATD detection method has high precision, markedly improves recall, achieves a better detection effect, and is clearly superior to prior detection methods.
Disclosure of Invention
Technical problem to be solved
The key technical difficulty of the self-admitted technical debt detection problem in software development is as follows: self-admitted technical debt lives at the source-code level, but it generally has to be analyzed by inspecting source code comments and decomposing sentences into feature words; moreover, when the classifier is trained with a single training model, the resulting error is large. The invention focuses on multi-method ensemble learning to detect self-admitted technical debt. First, the feature words are preprocessed: stop words and punctuation marks are removed, only effective words are kept for feature selection, invalid features are filtered out, and noise is reduced. Word similarity is also considered; for example, words sharing the same stem, such as happy, happiness and happier, are unified into their stem using the Porter stemming algorithm. The top k most useful features are then selected from the preprocessed features to train the classifier. Second, multinomial naive Bayes (Naive Bayes Multinomial) and linear logistic regression (Simple Logistic) are used to train the corresponding sub-classifiers, so that each predicted F1 value is as good as possible, improving prediction accuracy while predicting as much self-admitted technical debt as possible. Finally, the sub-classifier predictions are combined through a voting rule, which determines the final prediction of the integrated classifier in each round.
Technical solution
A self-admitted technical debt detection and classification method based on multi-method ensemble learning, characterized by comprising the following steps:
step 1: preprocessing the feature words
Processing the raw annotation data using heuristic rules:
(1) Deleting license description class notes with fixed formats automatically generated by a compiler;
(2) Merging the multiple lines of annotations into a sentence;
(3) Deleting the code present in the annotation statement;
(4) Deleting Javadoc that does not contain a reserved word, and retaining annotation statements that contain reserved words;
Step 2: selecting the first k most useful features to train the classifier
After text preprocessing of the source project annotations, the present invention uses the vector space model VSM to process the words that have been segmented into features; in this model each sentence annotation is represented by a word vector, the segmented word features can be regarded as dimensions, and each sentence annotation can be regarded as a data point in a high-dimensional space; the invention uses a HashMap as the mapping of the VSM model, where the string key is a segmented feature and the double-precision value is the word frequency, i.e. the number of times the feature occurs in the current annotation, normalized;
Information gain, a widely used feature selection method, is used to select useful features: let the annotation data set be represented as C = {(C_1, L_1), (C_2, L_2), ..., (C_N, L_N)}, where C_i represents the i-th annotation and L_i represents the class label of that annotation, i.e. whether self-admitted technical debt is present (t) or absent (t̄); let C_i = {w_1, w_2, ..., w_n}, where n represents the number of features in annotation C_i and w_i represents the i-th feature in the annotation sentence; for a feature w and an annotation C_i, there are 4 possible relationships between them:
· (w, t): note C i contains feature w and there is a self-bearing technical liability (i.e., t) in the sentence note
·Note C i contains feature w, but no self-bearing technical liability/>, is present in the sentence note
·Note C i does not contain feature w, but there is a self-bearing technical liability in this sentence note (i.e., t)
·Note C i contains no feature w and no self-bearing technical liability/>, is present in the sentence note
Based on the 4 possible relationships described above, the information gain of feature w and tag t is calculated as follows:

$IG(w,t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\,p(t')}$ (1)

where p(w', t') represents the probability that feature w' appears in an annotation with tag t', p(w') represents the probability that feature w' appears in an annotation, and p(t') represents the probability that an annotation has tag t';
after the information gain value of each feature is calculated with the information gain method, the features are sorted in descending order of information gain value; the higher the score, the more important the feature is for predicting classification labels; the invention selects the features whose information gain values lie in the top k% and discards the other features;
Step 3: training sub-classifiers using naive Bayes polynomials and linear Logistic regression
(1): Naive Bayes Multinomial
Six classifiers, namely No. 2, No. 3, No. 4, No. 5, No. 6 and No. 8, are set as the multinomial naive Bayes classifier NBM and trained with the NBM method; let the annotation be C_i = {w_1, w_2, ..., w_n} with classification label L_i; by the conditional independence assumption:

$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i)$ (3)

applying the Bayes theorem to equation (3) yields:

$p(L_i \mid C_i) = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)}$ (4)

the class label of the annotation is identified by formula (4);
(2): Simple Logistic
In the experiment, two classifiers, namely the No. 1 and No. 7 classifiers, are set as linear logistic regression (Simple Logistic) classifiers; let the annotation data set be denoted C = {(C_1, L_1), (C_2, L_2), ..., (C_N, L_N)}, where C_i represents the i-th annotation and L_i represents the classification label of the annotation, i.e. whether self-admitted technical debt is present; in addition, let C_i be denoted C_i = {w_1, w_2, ..., w_n}, where n represents the number of features in annotation C_i and w_i represents the i-th feature in the annotation; according to linear logistic regression:

$z = \theta_1 w_1 + \theta_2 w_2 + \ldots + \theta_n w_n + \theta_0 = \theta^T C_i$ (5)

z is substituted into the sigmoid function, which is expressed as follows:

$h_\theta(C_i) = \frac{1}{1 + e^{-z}}$ (6)

the annotations to be detected are divided into two classes according to the final result of the sigmoid function, and annotation statements with classification label value 1 are the annotation statements containing self-admitted technical debt;
Step 4: sub-classifier voting rules
A voting rule is adopted: the classification label predicted by the majority of sub-classifiers is taken as the final prediction result of the integrated classifier;
Step 5: clustering self-bearing technology debt classification
Using the features selected by information gain in the preceding steps, the invention re-screens and prunes the original data according to the frequency of feature occurrence, the position of feature occurrence, and developer characteristics, and finally classifies the feature words with a clustering method.
The k% described in step 2 is 10%.
Advantageous effects
The self-admitted technical debt detection and classification method based on multi-method ensemble learning solves the optimization problem of self-admitted technical debt detection in the software development process. The invention fully considers the detection characteristics of self-admitted technical debt: it applies text preprocessing, trains classifiers for different data sets with methods suited to each, and creatively proposes feature selection by evaluated information gain values followed by different training methods for different classifiers to improve classifier performance; the sub-classifiers are then integrated into an ensemble classifier, so that the final detection prediction is optimized, detection precision and detection range are improved, the classification indicators become balanced, and the detection metrics improve markedly. Finally, a clustering method is used to classify the features, and the clustering result is used to analyze the self-admitted technical debt category to which each feature belongs, achieving both detection and classification of self-admitted technical debt.
As a self-admitted technical debt detection and classification technique based on multi-method ensemble learning, the invention fully considers the attribute characteristics of self-admitted technical debt in the software development process, carefully analyzes these characteristics under the criterion of improving software quality and reducing hidden risk as much as possible, adopts information gain to quantify feature influence factors, and refines the classification detection process. In the feature training process, a self-admitted technical debt classification detection technique using multinomial naive Bayes and linear logistic regression is creatively proposed, and finally the features are classified with a clustering method, yielding both detection and classification of self-admitted technical debt. The experimental results are compared with four different self-admitted technical debt detection methods: the pattern-based method of Potdar and Shihab, single-method ensemble learning, the optimal sub-classifier, and the natural language processing (NLP) maximum entropy classifier method. Compared with the results of these four methods, precision, recall and F1 value are all improved to different degrees; on the more comprehensive F1 value the improvements over the other methods are 51.87%, 16.22%, 28.76% and 32.12% respectively. Finally, the detected features associated with self-admitted technical debt are classified to obtain the final result.
Drawings
FIG. 1 is a flow chart of the method of the present invention
FIG. 2 is a graph of entropy versus probability
FIG. 3 is the sigmoid function curve
Detailed Description
The invention will now be further described with reference to the examples and figures:
The invention relates to a self-admitted technical debt detection and classification method based on multi-method ensemble learning. The method mainly comprises five core steps: preprocessing the feature words; selecting the top k most useful features to train the classifier; using multinomial naive Bayes (Naive Bayes Multinomial) and linear logistic regression (Simple Logistic) to train the corresponding sub-classifiers; and integrating the sub-classifier predictions through a voting rule to obtain precision and recall, from which the F1 value (F1-score) is computed as the subsequent evaluation criterion. Finally, the features that occur frequently in the experiments and have high information gain values are clustered with a clustering method, thereby classifying the detected technical debt.
Step 1: preprocessing the feature words
The invention considers that some feature words, such as stop words and punctuation marks, are invalid, and that word similarity must also be considered; for example, happy, happiness and happier share the same stem, so such words are unified into their stem using the Porter stemming algorithm.
The invention uses heuristic rules to process the original annotation data:
(1) Deleting fixed-format license-description comments generated automatically by the compiler, such as comments of automatically generated functions (e.g. constructors) and automatically generated catch code blocks. Since comments before a class declaration also generally contain no self-admitted technical debt, they are deleted as well.
(2) Instead of using comment blocks directly, developers sometimes write one long comment across multiple lines. This style can cause one long comment to be mistaken for several comments, so the multiple comment lines in this case are merged into one sentence.
(3) In a software project much source code exists in the form of comments, either because the code is unused or because it served only for debugging. Such code inside comment statements typically contains no self-admitted technical debt, so it can be deleted.
(4) Javadoc comments typically carry no self-admitted technical debt, whereas the few Javadocs that do usually contain reserved words such as "todo", "fixme" or "XXX". The invention deletes Javadoc without reserved words and retains annotation statements that contain them.
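A minimal Python sketch of these four heuristics follows; the comment-extractor record format and the looks_like_code test are hypothetical illustrations, since the patent does not prescribe an implementation.

```python
import re

RESERVED_WORDS = ("todo", "fixme", "xxx")

def looks_like_code(line: str) -> bool:
    # Crude test for commented-out Java code: statement/brace endings
    # or typical control-flow openings.
    return bool(re.search(r"[;{}]\s*$", line.rstrip())) or \
        line.lstrip().startswith(("if (", "for (", "while ("))

def preprocess(raw_comments):
    """raw_comments: list of dicts such as
    {"lines": [...], "is_license": bool, "is_javadoc": bool},
    as produced by some comment extractor (hypothetical format)."""
    result = []
    for c in raw_comments:
        if c["is_license"]:                      # rule (1): drop license headers
            continue
        lines = [l for l in c["lines"] if not looks_like_code(l)]   # rule (3)
        if not lines:
            continue
        sentence = " ".join(l.strip("/* \t") for l in lines)        # rule (2): merge lines
        if c["is_javadoc"] and not any(w in sentence.lower() for w in RESERVED_WORDS):
            continue                             # rule (4): Javadoc needs a reserved word
        result.append(sentence)
    return result
```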
Step 2: selecting the first k most useful features to train the classifier
After text preprocessing of the source project annotations, the present invention uses a vector space model (VSM) to process the words that have been segmented into features. In this model, each sentence annotation is represented by a word vector, the segmented word features can be regarded as dimensions, and each sentence annotation can be regarded as a data point in a high-dimensional space. The invention uses a HashMap as the mapping of the VSM model, where the string key is a segmented feature and the double-precision value is the word frequency, i.e. the number of times the feature occurs in the current annotation, normalized.
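In Python the same mapping can be sketched with a dict in place of the HashMap; normalizing the counts by annotation length is an assumption here, since the text does not fix the normalization used:

```python
from collections import Counter

def to_word_vector(tokens):
    """One annotation (already tokenized) -> sparse word vector:
    feature string -> normalized term frequency, mirroring the
    HashMap<String, Double> mapping described above."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

# to_word_vector(["todo", "fix", "todo"]) -> {"todo": 2/3, "fix": 1/3}
```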
Reading the source project annotations shows that each source project still has a large number of features after text preprocessing; the ArgoUML project, for example, has 3661 features. With a space vector model, excessively high dimensionality degrades experimental performance and can even affect the final results. Simple manual analysis also shows that annotations containing self-admitted technical debt are few in number, i.e. the annotation classes are imbalanced, which adds further difficulty to detecting self-admitted technical debt.
To solve the above problems, the invention uses feature selection to extract the subset of features most useful for classification when performing self-admitted technical debt detection. Based on previous studies and practical experience, feature selection can significantly improve classification performance. The invention employs information gain, a widely used feature selection method, to select useful features.
Let the annotation data set be represented as C = {(C_1, L_1), (C_2, L_2), ..., (C_N, L_N)}, where C_i represents the i-th annotation and L_i represents the class label of that annotation, i.e. whether self-admitted technical debt is present (t) or absent (t̄). Let C_i = {w_1, w_2, ..., w_n}, where n represents the number of features in annotation C_i and w_i represents the i-th feature in the annotation sentence. For a feature w and an annotation C_i, there are 4 possible relationships between them:
· (w, t): note C i contains feature w and there is a self-bearing technical liability (i.e., t) in the sentence note
·Note C i contains feature w, but no self-bearing technical liability/>, is present in the sentence note
·Note C i does not contain feature w, but there is a self-bearing technical liability in this sentence note (i.e., t)
·Note C i contains no feature w and no self-bearing technical liability/>, is present in the sentence note
Based on the 4 possible relationships described above, the information gain of feature w and tag t is calculated as follows:

$IG(w,t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\,p(t')}$ (1)

where p(w', t') represents the probability that feature w' appears in an annotation with tag t', p(w') represents the probability that feature w' appears in an annotation, and p(t') represents the probability that an annotation has tag t'.
The information gain measures the amount of information that the presence of a feature in the current annotation under test provides for predicting the class label. After the information gain value of each feature is calculated, the features are sorted in descending order of information gain value. The higher the score, the more important the feature is for predicting classification labels. The invention selects the features whose information gain values lie in the top k% and discards the other features. In this way, the number of features is reduced both in the model building stage and in the prediction stage, which greatly improves the experimental efficiency of the invention. By default, the invention empirically selects the top 10% of the total number of features, which gives near-optimal experimental results.
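The selection step can be sketched as follows; information_gain implements formula (1) with base-2 logarithms, and the list-of-token-sets input format is an illustrative assumption:

```python
import math
from collections import Counter

def information_gain(annotations, labels, feature):
    """IG(w, t) of formula (1); annotations is a list of token sets,
    labels a parallel list with 1 = SATD, 0 = no SATD."""
    n = len(annotations)
    joint = Counter((feature in toks, lab == 1)
                    for toks, lab in zip(annotations, labels))
    ig = 0.0
    for (has_w, has_t), c in joint.items():
        p_wt = c / n
        p_w = sum(v for (w, _), v in joint.items() if w == has_w) / n
        p_t = sum(v for (_, t), v in joint.items() if t == has_t) / n
        ig += p_wt * math.log2(p_wt / (p_w * p_t))
    return ig

def select_top_k_percent(annotations, labels, k=10):
    vocab = set().union(*annotations)
    ranked = sorted(vocab, reverse=True,
                    key=lambda w: information_gain(annotations, labels, w))
    return ranked[: max(1, len(ranked) * k // 100)]
```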
Step 3: training sub-classifiers using naive Bayes polynomials and linear Logistic regression
(1): Naive Bayes Multinomial
The experiment sets the No. 2, No. 3, No. 4, No. 5, No. 6 and No. 8 classifiers as multinomial naive Bayes classifiers (Naive Bayes Multinomial, NBM), trained with the NBM method. NBM emphasizes the multinomial distribution on the basis of naive Bayes (NB); its principle is similar to NB and it belongs to the Bayesian family of methods. The main advantage of training a classifier with a Bayesian method is its short computation time and high training performance, since it assumes that the given label and the features are conditionally independent. Thus, let the annotation be C_i = {w_1, w_2, ..., w_n} with class label L_i; conditional independence gives:

$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i)$ (3)

applying the Bayes theorem to equation (3) yields:

$p(L_i \mid C_i) = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)}$ (4)
The class label of the annotation can be identified by equation (4). Note that NB only considers whether a feature exists in the current annotation under test; NBM is similar, but determines the class label from the number of times each feature appears in the annotation. Summaries and experiments show that NBM performs better than NB when certain specific features occur many times in the annotation set.
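A sketch of one NBM sub-classifier, using scikit-learn's MultinomialNB as a stand-in for the classifier the patent describes:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_nbm_subclassifier(comments, labels):
    """comments: preprocessed annotation strings of one source project;
    labels: 1 = SATD, 0 = no SATD. CountVectorizer supplies the
    per-annotation feature counts that NBM uses (eqs. (3)-(4))."""
    clf = make_pipeline(CountVectorizer(), MultinomialNB())
    clf.fit(comments, labels)
    return clf
```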
(2):Simple Logistic
The experiment sets two classifiers, namely the No. 1 and No. 7 classifiers, as linear logistic regression classifiers (Simple Logistic). The Simple Logistic method is based on simple logistic regression: the LogitBoost algorithm performs multiple iterations, each iteration optimizing the parameters of a basic weak classifier, finally producing a high-accuracy model. By default LogitBoost runs 10 iterations; if this works poorly, the optimal number of iterations can be obtained by K-fold cross-validation.
Let the annotation data set be denoted C = {(C_1, L_1), (C_2, L_2), ..., (C_N, L_N)}, where C_i represents the i-th annotation and L_i represents the classification label of the annotation, i.e. whether self-admitted technical debt is present. In addition, let C_i be denoted C_i = {w_1, w_2, ..., w_n}, where n represents the number of features in annotation C_i and w_i represents the i-th feature in the annotation. According to linear logistic regression:

$z = \theta_1 w_1 + \theta_2 w_2 + \ldots + \theta_n w_n + \theta_0 = \theta^T C_i$ (5)

z is substituted into the sigmoid function, which is expressed as follows:

$h_\theta(C_i) = \frac{1}{1 + e^{-z}}$ (6)

The annotations to be detected are divided into two classes according to the final result of the sigmoid function, and annotation statements with classification label value 1 are the annotation statements containing self-admitted technical debt.
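As an illustrative stand-in (scikit-learn has no direct equivalent of Weka's SimpleLogistic, i.e. LogitBoost over simple regression functions), a plain logistic regression sub-classifier might look like this:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_logistic_subclassifier(comments, labels):
    """Learns the parameters theta of eq. (5); predict_proba returns the
    sigmoid output of eq. (6), and label 1 marks a SATD annotation."""
    clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(comments, labels)
    return clf
```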
Step 4: sub-classifier voting rules
Because the training process is divided among several sub-classifiers that each predict a classification label, the final accuracy of the integrated classifier is clearly improved as long as the accuracy of each sub-classifier is ensured. In the invention, the classification label predicted by the majority of sub-classifiers is taken as the final prediction result of the integrated classifier.
Step 5: clustering self-bearing technology debt classification
Using the features selected by information gain in the preceding steps, the invention counts and processes the original data according to the frequency of feature occurrence, the position of feature occurrence, and developer characteristics, and finally classifies the feature words with a clustering method.
First, feature frequencies in the source code are counted and frequently occurring feature words are selected. Then, taking developer preference into account, some features that are cited but do not indicate technical debt are deleted. There are also some emotionally colored words that rarely appear in comments without self-admitted technical debt, and these need to be retained. Finally, some modal verbs are deleted. The selection criterion is to retain features with larger influence weight on the final classification detection and to delete features whose influence factors are found by analysis to be small.
The research framework of the invention is divided into two stages: a model construction stage and a model prediction stage. In the model construction stage, the source projects are first input as training data sets (the annotations in these source projects have known classification labels), and a sub-classifier is then built for each individual source project. In the prediction stage, all sub-classifiers are integrated to jointly predict whether the annotations in the target project contain self-admitted technical debt. To make the results as accurate as possible, only one project is selected as the target project for prediction at a time; the other n-1 projects are input into the model as source projects and used to train the sub-classifiers.
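This leave-one-project-out loop can be sketched as below; train_subclassifier and vote stand for the sub-classifier training routines of Step three and the voting rule of Step four (the vote helper is sketched after Table 1):

```python
def cross_project_predict(projects, train_subclassifier, vote):
    """projects: dict name -> (comments, labels). Each target project is
    predicted by sub-classifiers trained on the remaining n-1 projects
    (model construction stage), combined by majority vote (prediction stage)."""
    predictions = {}
    for target, (target_comments, _) in projects.items():
        subs = [train_subclassifier(comments, labels)
                for name, (comments, labels) in projects.items()
                if name != target]
        predictions[target] = [vote([s.predict([doc])[0] for s in subs])
                               for doc in target_comments]
    return predictions
```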
Step one: text preprocessing
Before feature values are selected, the original project annotations undergo text preprocessing. The reason is that the required features are core words, while the source project annotations also contain many punctuation marks, stop words, and the like. In addition, many words share the same stem; for the classification problem these can be reduced to one unified representation, which does not affect classification quality but improves efficiency. Text processing is divided into 3 steps:
(1) Tokenization: the source project annotation text is divided into words, phrases, symbols, or other meaningful elements. The experiment retains only tokens containing English letters; that is, all punctuation marks are deleted first, and word features with attached punctuation or digits also have those characters removed so that only the word remains, for example: "TODO:" is tokenized as "TODO". Finally, all word features are converted to lowercase.
(2) Stop-word removal: stop words are a class of words frequently used in comments that contribute little to detecting the self-admitted technical debt addressed by the invention, because they carry no word sense that helps identify it. Common stop words include "I", "should", "to", "the", and so on. Although many text-mining works provide a standard stop-word list, for the self-admitted technical debt detection problem some stop words are actually useful for classification. For example, one self-admitted technical debt comment is "TODO: should have an image of a wizard or some logo", where the phrase "should have" carries information useful for classification, yet the two words "should" and "have" are usually treated as stop words by default. The invention therefore creates a stop-word list for the self-admitted technical debt detection problem that includes only a small number of prepositions not useful for classification (e.g. "the", "to", "of", "is", etc.). Words of length no more than 2 or more than 20 are also treated as stop words.
(3) Stemming: stemming is the process of unifying words (sometimes derivatives) into their stem, root, or base form. For example, the words "stems" and "stemmed" are reduced to "stem". The invention uses the well-known Porter stemmer to implement stemming, thereby reducing redundant synonyms.
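A sketch of the 3-step pipeline; the small custom stop list is an illustrative assumption of the kind described in step (2), and NLTK's PorterStemmer is used for the Porter stemmer named in the text:

```python
import re
from nltk.stem import PorterStemmer

# Illustrative custom stop list; most short stop words ("to", "of", "is")
# are already caught by the 2/20-character length rule.
CUSTOM_STOP = {"the", "this", "that", "with", "for"}
stemmer = PorterStemmer()

def preprocess_text(comment):
    tokens = re.findall(r"[a-z]+", comment.lower())            # (1) tokenization
    tokens = [t for t in tokens
              if t not in CUSTOM_STOP and 2 < len(t) <= 20]    # (2) stop words
    return [stemmer.stem(t) for t in tokens]                   # (3) stemming

# preprocess_text("TODO: should have an image")
# -> ['todo', 'should', 'have', 'imag']
```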
Step two: eigenvalue selection
Not all preprocessed features can be used to train the classifier: with too many features the classifier becomes inefficient and noise increases.
Shannon first formalized the amount of information; useful features are selected here with information gain, a widely used feature selection method.
The information amount is defined using a logarithmic function, i.e. for event x, the probability of its occurrence is p (x), then the information amount corresponding to event x is defined:
$I(x) = -\log p(x)$ (7)
As can be seen from the formula (7), the magnitude of the information amount represents the magnitude of uncertainty of occurrence of the event, and the smaller the uncertainty of occurrence of the event is, the smaller the information amount is; conversely, the greater the uncertainty in the occurrence of an event, the greater the amount of information that is present.
Shannon further proposed the definition of information entropy. Information entropy measures the amount of information needed to eliminate uncertainty: it is the expected amount of information an event can produce, taken over all possible outcomes. Colloquially, information entropy measures the size of an amount of information. For a variable X = {x_1, x_2, ..., x_i, ..., x_n}, its information entropy is defined as:

$H(X) = -\sum_{i=1}^{n} p(x_i) \log p(x_i)$ (8)
where p(x_i) represents the probability that the random variable X takes the value x_i. The information entropy depends only on the distribution of the random variable X and is unrelated to the values of X themselves. The magnitude of the entropy indicates the magnitude of the uncertainty when the random variable takes a given value. The variation of information entropy with probability is shown in FIG. 2.
As shown in FIG. 2, when p(x_i) = 0 or 1, H = 0, i.e. the uncertainty is 0; when p(x_i) = 0.5, H reaches its maximum of 1, i.e. the uncertainty of the random variable X taking x_i is largest.
If a precondition is added to the occurrence of the event, the conditional entropy H(X|Y) is obtained. It represents the entropy of the random variable X averaged over all possible values of the random variable Y, namely:

$H(X \mid Y) = \sum_{j} p(y_j)\, H(X \mid Y = y_j)$ (9)

where

$H(X \mid Y = y_j) = -\sum_{i} p(x_i \mid y_j) \log p(x_i \mid y_j)$ (10)

substituting formula (10) into formula (9) gives the final form of the conditional entropy:

$H(X \mid Y) = -\sum_{j} p(y_j) \sum_{i} p(x_i \mid y_j) \log p(x_i \mid y_j)$ (11)
After knowing the information entropy and the conditional entropy, the information gain can be defined in both of them. The definition of the information gain is: the difference between the entropy of the information set to be classified and the conditional entropy of the information after a certain feature is selected is that:
$IG(X \mid Y) = H(X) - H(X \mid Y)$ (12)
The above formulas and related concepts make information gain easy to understand. For a feature set X = {x_1, x_2, ..., x_i, ..., x_n}, H(X) is fixed; under a condition Y, the smaller the remaining uncertainty of X, the larger the information gain, indicating that the feature performs better. Feature selection is performed by computing the information gain of the features; typically the top k features with the highest IG values are selected, or a threshold is set for screening.
The data in the data set are represented as features and annotations; analysis shows that four relationships are possible: the annotation contains the feature and self-admitted technical debt is present; the annotation contains the feature and no self-admitted technical debt is present; the annotation does not contain the feature but self-admitted technical debt is present; the annotation does not contain the feature and no self-admitted technical debt is present.
Based on the four possible relationships, the information gain of feature w and tag t is calculated as follows:

$IG(w,t) = \sum_{w' \in \{w, \bar{w}\}} \sum_{t' \in \{t, \bar{t}\}} p(w', t') \log \frac{p(w', t')}{p(w')\,p(t')}$ (13)

where p(w', t') represents the probability that feature w' appears in an annotation with tag t', p(w') represents the probability that feature w' appears in an annotation, and p(t') represents the probability that an annotation has tag t'.
After the information gain of each feature is calculated, the features are sorted in descending order of information gain, the top k% of features are selected, and the other features are discarded. In the experiments of the invention, 10% of the total number of features is selected empirically.
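A worked illustration with hypothetical counts (not from the patent's data set): suppose N = 100 annotations, 20 of which contain self-admitted technical debt (t), and a candidate feature w appears in 15 annotations, 12 of them with debt. Formula (13) then evaluates to:

```latex
% Hypothetical contingency counts: N = 100, 20 SATD annotations,
% feature w in 15 annotations (12 SATD, 3 not).
\[
p(w,t)=0.12,\quad p(w,\bar t)=0.03,\quad p(\bar w,t)=0.08,\quad
p(\bar w,\bar t)=0.77,\quad p(w)=0.15,\quad p(t)=0.20
\]
\begin{align*}
IG(w,t) &= 0.12\log_2\tfrac{0.12}{0.15\cdot 0.20}
         + 0.03\log_2\tfrac{0.03}{0.15\cdot 0.80} \\
        &\quad + 0.08\log_2\tfrac{0.08}{0.85\cdot 0.20}
         + 0.77\log_2\tfrac{0.77}{0.85\cdot 0.80} \\
        &\approx 0.240 - 0.060 - 0.087 + 0.138 = 0.231\ \text{bits}.
\end{align*}
```

Since the label entropy here is H(t) ≈ 0.722 bits, such a feature would rank near the top of the sorted list.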
Step three: training sub-classifier
Based on the selected feature values, multinomial naive Bayes (Naive Bayes Multinomial) and linear logistic regression (Simple Logistic) are used to train the corresponding sub-classifiers. Finally, the sub-classifiers are combined into a multi-classifier ensemble, which predicts the data under test.
Some sub-classifiers are trained with the naive Bayes method: for an annotation set C_i = {w_1, w_2, ..., w_n} with classification label L_i, labels and features are assumed mutually conditionally independent, and on this basis the Bayes theorem yields the classification label of the annotation.
The multinomial naive Bayes classifier (Naive Bayes Multinomial) is a specific instance of the naive Bayes classifier. The naive Bayes classifier emphasizes the conditional independence of events, while the multinomial naive Bayes classifier additionally assumes that the events follow a multinomial distribution; the two are similar in principle.
The naive Bayes classifier algorithm is a classification algorithm based on the Bayes rule, with the following precondition: it is assumed that, given Y, the events X = {x_1, x_2, ..., x_i, ..., x_n} are mutually independent. This assumption greatly simplifies the representation of P(X|Y) and the problems encountered when evaluating the data set. For the event X = {x_1, x_2, ..., x_i, ..., x_n}, the x_i are mutually independent given condition Y, which yields:

$P(X \mid Y) = P(x_1, \ldots, x_n \mid Y) = \prod_{i=1}^{n} P(x_i \mid Y)$ (14)

It is generally assumed that Y is an arbitrary discrete variable and that the event X = {x_1, x_2, ..., x_i, ..., x_n} consists of arbitrary discrete or real-valued variables. When training the classifier, the objective is to output, for each instance x_i to be classified, a probability distribution over the possible values of Y. According to the Bayes rule, the probability that Y takes its k-th possible value is:

$P(Y = y_k \mid x_1, \ldots, x_n) = \frac{P(Y = y_k)\, P(x_1, \ldots, x_n \mid Y = y_k)}{\sum_{j} P(Y = y_j)\, P(x_1, \ldots, x_n \mid Y = y_j)}$ (15)

Now, assuming the x_i are independent given condition Y, formula (15) can be rewritten as:

$P(Y = y_k \mid x_1, \ldots, x_n) = \frac{P(Y = y_k) \prod_{i} P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(x_i \mid Y = y_j)}$ (16)

Equation (16) is the basic equation of the multinomial naive Bayes classifier; P(Y = y_k | x_1, x_2, ..., x_n) is called the posterior probability.
Now, given a new instance X_new = {x_1, ..., x_n} from the data set together with the prior probability P(Y) and the conditional probabilities P(x_i | Y), the most probable value of Y (i.e. the class label) is found by the naive Bayes classification rule:

$Y \leftarrow \arg\max_{y_k} \frac{P(Y = y_k) \prod_{i} P(x_i \mid Y = y_k)}{\sum_{j} P(Y = y_j) \prod_{i} P(x_i \mid Y = y_j)}$ (17)

Since the denominator does not depend on y_k, formula (17) is generally simplified as:

$Y \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_{i} P(x_i \mid Y = y_k)$ (18)

That is, the y_k that maximizes the value of formula (18) is the final classification result.
The other sub-classifiers above are trained with linear logistic regression, which iterates on a simple logistic regression basis with the LogitBoost algorithm; the best number of iterations is obtained by K-fold cross-validation. In prediction, the annotation data set is put into linear logistic regression form and substituted into a sigmoid function, and the final sigmoid output gives the 0/1 label that predicts whether the annotation is a statement containing self-admitted technical debt.
First, for a given data set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, let (x_i, y_i) denote the i-th sample, where x_i = {x_i1, x_i2, ..., x_in}, i.e. each data point has n features, and the classification label is y_i ∈ {0, 1}. Assume the n features of x_i combine linearly, i.e.:

$z = \theta x_i + b = \theta_1 x_{i1} + \theta_2 x_{i2} + \ldots + \theta_n x_{in} + b$ (19)

For simplicity of representation, b in formula (19) is written as θ_0, giving:

$z = \theta_1 x_{i1} + \theta_2 x_{i2} + \ldots + \theta_n x_{in} + \theta_0 = \theta^T X$ (20)

The invention aims at a classification problem and wants the final function to display the classification result intuitively, so the sigmoid function is adopted, expressed as follows:

$g(z) = \frac{1}{1 + e^{-z}}$ (21)

Substituting formula (20) into formula (21) yields:

$h_\theta(x) = \frac{1}{1 + e^{-\theta^T X}}$ (22)
The function curve is shown in fig. 3:
As the graph shows, the sigmoid function ranges over (0, 1); the final result is classified as 1 when y > 0.5 and as 0 when y < 0.5, so the two classes are clearly separated. If we further let:

$P(y = 1 \mid x; \theta) = h_\theta(x)$ (23)

$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$ (24)

the loss function of logistic regression can be obtained:

$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\big(1 - h_\theta(x_i)\big) \right]$ (25)

The θ that minimizes formula (25) gives the parameters in formula (20), and thus the function representation of the final logistic regression classifier.
The Simple Logistic classifier adopted by the invention trains weak classifiers with logistic regression and uses the LogitBoost algorithm on top of them. Boosting-type algorithms such as those of Schapire and Singer, among the enhancement algorithms developed in recent years, were originally designed to combine several weak classifiers to improve classification performance; Freund and Schapire later proposed the more practical AdaBoost, which suffers from overfitting when processing noisy data. For this case, Friedman et al. proposed the LogitBoost algorithm to reduce the training error linearly.
The LogitBoost algorithm proceeds as follows:
Input: data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}, where x_i ∈ X and y_i ∈ Y = {-1, 1}; number of iterations T.
Initialize the weights $w_i = 1/n$, the ensemble function F(x) = 0, and the probabilities $p(x_i) = 1/2$.
Repeat for iterations t = 1, ..., T:
a. Compute the weights and working responses:

$w_i = p(x_i)\,[1 - p(x_i)]$ (26)

$z_i = \frac{y_i^* - p(x_i)}{w_i}$, where $y_i^* = (y_i + 1)/2$

b. Using w_i as weights, fit the weak classifier function f_t(x) to the working responses by weighted least squares; f(x_i) denotes the functional form of the weak classifier. The invention trains the weak classifier with a logistic regression function.
c. Update F(x) and p(x) for this iteration:

$F(x) \leftarrow F(x) + \tfrac{1}{2} f_t(x)$ (27)

$p(x) = \frac{e^{F(x)}}{e^{F(x)} + e^{-F(x)}}$ (28)

The final classifier is LF(x) = sign[F(x)].
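A compact NumPy sketch of this loop; linear weighted-least-squares weak learners stand in for the logistic regression functions the text prescribes, and the working responses are clipped for numerical stability:

```python
import numpy as np

def logitboost(X, y, T=10):
    """Minimal LogitBoost sketch following (26)-(28); y in {-1, +1}."""
    n = len(y)
    F = np.zeros(n)
    p = np.full(n, 0.5)
    y01 = (np.asarray(y) + 1) / 2                 # map {-1,+1} -> {0,1}
    Xb = np.hstack([np.asarray(X, float), np.ones((n, 1))])
    learners = []
    for _ in range(T):
        w = p * (1 - p)                           # eq. (26)
        z = np.clip((y01 - p) / w, -4, 4)         # working responses, clipped
        sw = np.sqrt(w)
        theta = np.linalg.lstsq(Xb * sw[:, None], z * sw, rcond=None)[0]
        F += 0.5 * (Xb @ theta)                   # eq. (27)
        p = 1 / (1 + np.exp(-2 * F))              # eq. (28), equivalent form
        learners.append(theta)
    return learners, np.sign(F)                   # LF(x) = sign[F(x)]
```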
Step four: sub-classifier voting rules
In the prediction stage, the classifiers trained on the source projects must predict the classification labels of the annotations in the target project. Because each project has its own annotation style, the feature distributions of the annotations differ, so the experiment builds an integrated classifier out of the individual sub-classifiers. Each sub-classifier is trained on the characteristics of a different source project with a method suited to its own data, and the sub-classifiers are independent, with no interference between their prediction processes. Thus, as long as the accuracy of each sub-classifier is ensured, the final accuracy of the integrated classifier is also clearly improved. The invention takes the classification label predicted by the majority of sub-classifiers as the prediction result of the final integrated classifier. The prediction process is thus like voting, with each sub-classifier "voting" to decide the final "winner" (i.e. the annotation's class label).
Table 1 gives the voting process by which the sub-classifiers predict a classification label. The columns correspond to the set of sub-classifiers and each sub-classifier's prediction; the last column is the final output of the integrated classifier. In the example there are 7 sub-classifiers in total, the data used to train them coming from 7 different source projects. Three sub-classifiers predict "no self-admitted technical debt" (Negative) and the other four predict "self-admitted technical debt" (Positive), so the final output of the integrated classifier is that the annotation contains self-admitted technical debt.
Table 1 Sub-classifier voting example
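The voting rule itself reduces to a few lines; a sketch whose string labels mirror Table 1:

```python
from collections import Counter

def vote(labels):
    """Majority vote over the sub-classifier predictions of Table 1;
    with an odd number of sub-classifiers no ties occur."""
    return Counter(labels).most_common(1)[0][0]

# vote(["Negative"] * 3 + ["Positive"] * 4)  ->  "Positive"
```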
Step five: clustering self-bearing technology debt classification
The features obtained by detection are re-analyzed according to their information gain values during the detection process. The analysis shows that different projects may share some common features representing self-admitted technical debt, such as "todo", "fixme", "workaround", "implement", "hack", etc. The frequency of these features may vary from project to project; for example, for temporary fixes some developers prefer the word "hack" while others prefer "workaround". Furthermore, these words may also appear in comments with no self-admitted technical debt. For example, when the word "implement" appears in such a comment, it means the developer has written code implementing some function (e.g. "IMPLEMENTS BACKSPACE FUNCTIONALITY") and the comment is merely indicative. In comments where self-admitted technical debt does exist, however, "implement" generally indicates that the developer needs to implement certain functions but has not finished (e.g. "Bunch of methods still not implemented").
When writing self-admitted technical debt comments, some developers prefer emotionally colored words (e.g. yuck, ugly, stupid, ill, etc.). Such words rarely appear in comments without self-admitted technical debt, but sometimes a developer wants a reminder to avoid writing low-quality code (e.g. "guard against something really stupid"), in which case they occasionally appear. In addition, developers like to use modal verbs, question words, and comparatives in self-admitted technical debt comments: modal verbs include "should", "need", "can", "would", etc.; question words include "what", "how", "where", etc.; and comparative words include "better", "most", etc.
Reading comments containing self-admitted technical debt shows that in many cases developers had to repair a program quickly or implement a function in a short time. That is, much self-admitted technical debt is created while the developer is under time pressure or emotional stress.
The invention processes and counts the original data set again, combines this with the 5 self-admitted technical debt types proposed by Maldonado and Shihab in their research, and finally classifies the counted feature values with a clustering method; the result is shown in Table 2:
TABLE 2 Exemplary feature classification of self-admitted technical debt
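The patent does not fix a concrete clustering algorithm, so the sketch below assumes k-means over simple per-feature descriptors (frequency, typical position, share of SATD annotations); five clusters echo the five SATD types of Maldonado and Shihab mentioned above:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_features(features, descriptors, n_clusters=5):
    """features: list of retained high-IG words; descriptors: array of shape
    (n_features, n_descriptors), e.g. [frequency, mean position, SATD rate]."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    groups = km.fit_predict(np.asarray(descriptors, dtype=float))
    return {w: int(g) for w, g in zip(features, groups)}
```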

Claims (2)

1. A self-admitted technical debt detection and classification method based on multi-method ensemble learning, characterized by comprising the following steps:
step 1: preprocessing the feature words
Processing the raw annotation data using heuristic rules:
(1) Deleting license description class notes with fixed formats automatically generated by a compiler;
(2) Merging the multiple lines of annotations into a sentence;
(3) Deleting the code present in the annotation statement;
(4) Deleting Javadoc that does not contain a reserved word, and retaining annotation statements that contain reserved words;
Step 2: selecting the first k most useful features to train the classifier
After text preprocessing of the source project annotations, processing the words that have been segmented into features with the vector space model VSM; in this model each sentence annotation is represented by a word vector, the segmented word features can be regarded as dimensions, and each sentence annotation can be regarded as a data point in a high-dimensional space; using a HashMap as the mapping of the VSM model, where the string key is a segmented feature and the double-precision value is the word frequency, i.e. the number of times the feature occurs in the current annotation, normalized;
Information gain, a widely used feature selection method, is used to select useful features: let the annotation data set be represented as C = {(C_1, L_1), (C_2, L_2), ..., (C_N, L_N)}, where C_i represents the i-th annotation and L_i represents the class label of that annotation, i.e. whether self-admitted technical debt is present (t) or absent (t̄); let C_i = {w_1, w_2, ..., w_n}, where n represents the number of features in annotation C_i and w_i represents the i-th feature in the annotation sentence; for a feature w and an annotation C_i, there are 4 possible relationships between them:
· (w, t): note C i contains feature w and there is a self-bearing technical liability (i.e., t) in the sentence note
·Note C i contains feature w, but no self-bearing technical liability/>, is present in the sentence note
·Note C i does not contain feature w, but there is self-bearing technical liability/>, in the sentence note
·Note C i contains no feature w and no self-bearing technical liability/>, is present in the sentence noteBased on the 4 possible relationships described above, the information gain for feature w and tag t is calculated as follows:
where p(w', t') represents the probability that feature w' appears in an annotation with tag t', p(w') represents the probability that feature w' appears in an annotation, and p(t') represents the probability that an annotation has tag t';
after the information gain value of each feature is calculated with the information gain method, the features are sorted in descending order of information gain value; the higher the score, the more important the feature is for predicting classification labels; the features whose information gain values lie in the top k% are selected and the other features are discarded;
Step 3: training sub-classifiers using naive Bayes polynomials and linear Logistic regression
(1): Naive Bayes Multinomial
Six classifiers, namely No. 2, No. 3, No. 4, No. 5, No. 6 and No. 8, are set as the multinomial naive Bayes classifier NBM and trained with the NBM method; let the annotation be C_i = {w_1, w_2, ..., w_n} with classification label L_i; by the conditional independence assumption:

$p(C_i \mid L_i) = p(w_1, w_2, \ldots, w_n \mid L_i) = \prod_{j=1}^{n} p(w_j \mid L_i)$ (3)

applying the Bayes theorem to equation (3) yields:

$p(L_i \mid C_i) = \frac{p(L_i) \prod_{j=1}^{n} p(w_j \mid L_i)}{p(C_i)}$ (4)

the class label of the annotation is identified by formula (4);
(2):Simple Logistic
setting two classifiers, namely the No. 1 and No. 7 classifiers, as linear logistic regression classifiers; let the annotation data set be denoted C = {(C_1, L_1), (C_2, L_2), ..., (C_N, L_N)}, where C_i represents the i-th annotation and L_i represents the classification label of the annotation, i.e. whether self-admitted technical debt is present; in addition, let C_i be denoted C_i = {w_1, w_2, ..., w_n}, where n represents the number of features in annotation C_i and w_i represents the i-th feature in the annotation; according to linear logistic regression:

$z = \theta_1 w_1 + \theta_2 w_2 + \ldots + \theta_n w_n + \theta_0 = \theta^T C_i$ (5)

z is substituted into the sigmoid function, which is expressed as follows:

$h_\theta(C_i) = \frac{1}{1 + e^{-z}}$ (6)
dividing the annotations to be detected into two classes according to the final result of the sigmoid function, annotation statements with classification label value 1 being the annotation statements containing self-admitted technical debt;
Step 4: sub-classifier voting rules
Taking the classification label result predicted by the majority of sub-classifiers as a final prediction result of the integrated classifier by adopting a voting rule;
Step 5: clustering self-bearing technology debt classification
Re-screening and pruning the original data according to the frequency of feature occurrence, the position of feature occurrence, and developer characteristics, using the features selected by information gain in the preceding steps, and finally classifying the feature words with a clustering method.
2. The self-admitted technical debt detection and classification method based on multi-method ensemble learning according to claim 1, wherein k% in step 2 is 10%.
CN202010568813.6A 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning Active CN111782807B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010568813.6A CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010568813.6A CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Publications (2)

Publication Number Publication Date
CN111782807A CN111782807A (en) 2020-10-16
CN111782807B true CN111782807B (en) 2024-05-24

Family

ID=72756715

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010568813.6A Active CN111782807B (en) 2020-06-19 2020-06-19 Self-admitted technical debt detection and classification method based on multi-method ensemble learning

Country Status (1)

Country Link
CN (1) CN111782807B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112748951B (en) * 2021-01-21 2022-04-22 杭州电子科技大学 XGboost-based self-acceptance technology debt multi-classification method
CN112800232B (en) * 2021-04-01 2021-08-06 南京视察者智能科技有限公司 Case automatic classification method based on big data
CN113407439B (en) * 2021-05-24 2024-02-27 西北工业大学 Detection method for software self-recognition type technical liabilities
CN113313184B (en) * 2021-06-07 2024-05-24 西北工业大学 Heterogeneous integrated self-bearing technology liability automatic detection method
CN113377422B (en) * 2021-06-09 2024-04-05 大连海事大学 Self-recognition technical liability method based on deep learning identification
US11971804B1 (en) 2021-06-15 2024-04-30 Allstate Insurance Company Methods and systems for an intelligent technical debt helper bot

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
CN107111842A (en) * 2014-12-16 2017-08-29 具珉秀 Asset management device and its operating method
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
WO2019217323A1 (en) * 2018-05-06 2019-11-14 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN111000553A (en) * 2019-12-30 2020-04-14 山东省计算中心(国家超级计算济南中心) Intelligent classification method for electrocardiogram data based on voting ensemble learning
CN111242191A (en) * 2020-01-06 2020-06-05 中国建设银行股份有限公司 Credit rating method and device based on multi-classifier integration
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8296247B2 (en) * 2007-03-23 2012-10-23 Three Palm Software Combination machine learning algorithms for computer-aided detection, review and diagnosis

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7814040B1 (en) * 2006-01-31 2010-10-12 The Research Foundation Of State University Of New York System and method for image annotation and multi-modal image retrieval using probabilistic semantic models
CN107111842A (en) * 2014-12-16 2017-08-29 具珉秀 Asset management device and its operating method
WO2019217323A1 (en) * 2018-05-06 2019-11-14 Strong Force TX Portfolio 2018, LLC Methods and systems for improving machines and systems that automate execution of distributed ledger and other transactions in spot and forward markets for energy, compute, storage and other resources
CN110069252A (en) * 2019-04-11 2019-07-30 浙江网新恒天软件有限公司 A kind of source code file multi-service label mechanized classification method
CN111000553A (en) * 2019-12-30 2020-04-14 山东省计算中心(国家超级计算济南中心) Intelligent classification method for electrocardiogram data based on voting ensemble learning
CN111242191A (en) * 2020-01-06 2020-06-05 中国建设银行股份有限公司 Credit rating method and device based on multi-classifier integration
CN111273911A (en) * 2020-01-14 2020-06-12 杭州电子科技大学 Software technology debt identification method based on bidirectional LSTM and attention mechanism

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A large scale empirical study on self-admitted technical debt; Gabriele Bavota et al.; 2016 IEEE/ACM 13th Working Conference; 2016-11-30; full text *
An exploratory study on self-admitted technical debt; Potdar A et al.; IEEE International Conference on Software Maintenance and Evolution; full text *
Research on a three-way decision naive Bayes incremental learning algorithm; 韩素青; 成慧雯; 王宝丽; Computer Engineering and Applications (Issue 18); full text *
Building a Bayes-based combined classifier using PCA and AdaBoost; 陈松峰; 范明; Computer Science; 2010-08-15 (Issue 08); full text *
Research on technical debt management in software integrated development environments; 刘亚珺 et al.; Computer Science; Vol. 44 (Issue 11); full text *

Also Published As

Publication number Publication date
CN111782807A (en) 2020-10-16

Similar Documents

Publication Publication Date Title
CN111782807B (en) Self-admitted technical debt detection and classification method based on multi-method ensemble learning
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
CN110209806B (en) Text classification method, text classification device and computer readable storage medium
CN113011533B (en) Text classification method, apparatus, computer device and storage medium
CN107193959B (en) Pure text-oriented enterprise entity classification method
CN107992597B (en) Text structuring method for power grid fault case
Weiss et al. Structured prediction cascades
US20210319180A1 (en) Systems and methods for deviation detection, information extraction and obligation deviation detection
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
CN112307741B (en) Insurance industry document intelligent analysis method and device
CN112966068A (en) Resume identification method and device based on webpage information
CN111930939A (en) Text detection method and device
Ababneh Investigating the relevance of Arabic text classification datasets based on supervised learning
Jiang et al. Impact of OCR quality on BERT embeddings in the domain classification of book excerpts
JP2005181928A (en) System and method for machine learning, and computer program
CN114861636A (en) Training method and device of text error correction model and text error correction method and device
Kaminska et al. Fuzzy rough nearest neighbour methods for detecting emotions, hate speech and irony
CN115221332A (en) Construction method and system of dangerous chemical accident event map
CN109977391B (en) Information extraction method and device for text data
CN117422074A (en) Method, device, equipment and medium for standardizing clinical information text
CN112667819A (en) Entity description reasoning knowledge base construction and reasoning evidence quantitative information acquisition method and device
Sheng et al. A paper quality and comment consistency detection model based on feature dimensionality reduction
CN115496630A (en) Patent writing quality checking method and system based on natural language algorithm
CN115934936A (en) Intelligent traffic text analysis method based on natural language processing
Heap et al. A joint human/machine process for coding events and conflict drivers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant