CN112231706A - Security vulnerability report data set construction method based on voting mechanism - Google Patents


Info

Publication number
CN112231706A
CN112231706A (application number CN202011074609.5A)
Authority
CN
China
Prior art keywords
sample
sample set
data
negative
positive
Prior art date
Legal status
Pending
Application number
CN202011074609.5A
Other languages
Chinese (zh)
Inventor
吴潇雪
郑炜
陈智通
栾文飞
慕德俊
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202011074609.5A
Publication of CN112231706A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Security & Cryptography (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an automatic data labeling method based on iterative voting classification. First, initial labeled samples are prepared: data from the authoritative CVE (Common Vulnerabilities and Exposures) database assists positive-sample labeling, and a small number of high-quality reports unrelated to security vulnerabilities are selected as negative samples. Second, three different classifiers are trained on the initial labeled samples; the three trained models each predict the target data set, data that all three classifiers consistently predict as negative are added to the labeled samples as negative samples, and the next iteration begins. Finally, the accuracy of the model's automatic labeling is verified. Experiments show that the method effectively improves the labeling accuracy of security vulnerability data sets, with F1-score reaching 0.91.

Description

Security vulnerability report data set construction method based on voting mechanism
Technical Field
The invention belongs to the field of software security assurance in software testing, and relates to a security vulnerability prediction method, a data marking method, a data set construction method and the like.
Background
Machine-learning-based security vulnerability report identification is receiving increasing attention from both academia and industry, and a high-quality labeled dataset is a prerequisite for applying machine learning models. Recently, Peters et al. (Peters, F., Tun, T., Yu, Y., Nuseibeh, B.: Text filtering and rating for security bug report prediction. IEEE Transactions on Software Engineering 45(6), 615-631 (2019)) proposed a noise-data filtering method named FARSEC for the mislabeling problem in vulnerability report detection datasets. The method comprises two main steps:
Step one: extract security-related words. Security-related keywords are extracted from security vulnerability reports using the TF-IDF method.
Step two: filter the noise data. The similarity between each non-security vulnerability report and the security-related vocabulary obtained in step one is calculated, and records with high similarity are filtered out.
However, the method's false-recognition rate on noise data is high, so many non-noise records are filtered out, causing substantial information loss; as a result, models trained on the filtered datasets detect security vulnerability reports with very low accuracy, on average below 50%.
Disclosure of Invention
Technical problem to be solved
In order to improve the accuracy of security vulnerability report detection in the prior art, the invention provides a security vulnerability report data set construction method based on a voting mechanism.
Technical scheme
A security vulnerability report data set construction method based on a voting mechanism is characterized by comprising the following steps:
Step 1: initial training sample labeling; this stage comprises two inputs and two outputs:
Input: CVE data and the unlabeled sample set B_all.
Output: the labeled initial training sample set B_l and the remaining unlabeled sample set B_u.
Step 1.1: positive-sample labeling based on CVE: records associated with CVE entries in the software product's defect reports are labeled "positive" samples, finally yielding the set B_pos of all labeled positive samples for the product; records labeled as positive samples are removed from the product's defect report set B_all, leaving the unlabeled sample set B_left.
Step 1.2: obtaining initial "negative" samples based on the Levenshtein distance: the Levenshtein distance between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos is calculated, and the first 50 records with the largest distances are extracted as initial "negative" samples, forming the initial labeled negative sample set B_neg; records labeled as negative samples are removed from B_left, and the remaining unlabeled samples form the target sample set B_u.
Step 1.3: the initial labeled positive sample set B_pos and the initial labeled negative sample set B_neg are combined to form the initial training sample set B_l.
Step 1.4: the initial training sample set B_l and the unlabeled target sample set B_u are output.
Step 2: iterative automatic voting classification: an iterative voting classification method is proposed, comprising 3 inputs and 3 outputs;
Input: the labeled training sample set B_l; the unlabeled target data set B_u; three classifiers;
Output: the sample set B_ppos predicted to be positive; the sample set B_pneg predicted to be negative; the uncertain sample set B_pu;
Step 2.1: model training: the three classifiers are each trained with the labeled training samples;
Step 2.2: voting-based automatic labeling of target data: the target data are predicted by the three models trained in step 2.1; data labeled negative by all three classifiers simultaneously are transferred from the target data into the training samples, expanding the number of negative samples in the training samples; whether an iteration-exit condition is met is checked: if so, proceed to step 2.3; otherwise, return to step 2.1;
Step 2.3: outputting the automatic labeling result for the target data: data predicted positive by all three classifiers simultaneously are extracted to form the positive sample set; the data added to the training samples during iterations form the negative sample set; the remaining data form the uncertain sample set; the three sets are output to specified files.
The technical scheme of the invention further provides that: the Levenshtein distance in step 1 is a kind of edit distance, namely the minimum number of editing operations required to transform one string into another; the Levenshtein distance lev_{a,b}(|a|, |b|) between two strings a and b is computed by formula (1):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1(a_i ≠ b_j) ), otherwise.    (1)

where 1(a_i ≠ b_j) is an indicator function equal to 0 when a_i = b_j and 1 otherwise; lev_{a,b}(i, j) is the distance between the first i characters of string a and the first j characters of string b; i and j are indices with step size 1; the Levenshtein distance is calculated between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos.
The technical scheme of the invention further provides that: the first 50 records in step 1.2 are selected as the initial "negative" samples.
The technical scheme of the invention further provides that: the 3 classifiers in step 2 are multinomial naive Bayes (MNB), logistic regression (LR), and a multilayer perceptron (MLP) neural network.
Advantageous effects
The invention provides an automatic data labeling method based on iterative voting classification. First, initial labeled samples are prepared: data from the internationally authoritative vulnerability database CVE (Common Vulnerabilities and Exposures) assists positive-sample labeling, and a small number (for example, 50) of high-quality reports unrelated to security vulnerabilities are selected from an existing report repository as negative samples. Second, three different classifiers are trained with the initial labeled samples; the three trained classification models each predict the target data set, the data on which all three classifiers' predictions agree as negative are added to the labeled samples as negative samples, and the next iteration begins. Finally, the accuracy of the model's automatic labeling is verified. Experiments show that the method effectively improves the labeling accuracy of security vulnerability data sets, with F1-score reaching 0.91.
The present invention constructs large-scale SBR (security bug report) prediction datasets using only a small set of initially labeled samples. On the one hand, the method improves noise-identification accuracy in existing data and, through automatic labeling, avoids the information-loss problem caused by data filtering. On the other hand, the method can be used to build new security vulnerability report detection datasets. With it, researchers can construct a large-scale, high-quality dataset for security vulnerability report detection from an open-source project with only a small amount of effort, promoting the early discovery and repair of security vulnerabilities.
Drawings
FIG. 1 is a framework of a data set construction method.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
Step 1: initial training sample labeling. This stage includes two inputs and two outputs:
Input: CVE data and the unlabeled sample set B_all.
Output: the labeled initial training sample set B_l and the remaining unlabeled sample set B_u.
Step 1.1: positive-sample labeling based on CVE. Since CVE is an internationally authoritative vulnerability database, any defect report associated with CVE data in a software product's defect tracking system is certainly a security-vulnerability-related defect report. Accordingly, if a defect report is associated with one or more CVE records (that is, the defect report has an associated CVE number, or the associated defect report number is explicitly noted in the details of a CVE record), the defect report is labeled a "positive" sample. Finally, the set B_pos of all labeled positive samples is obtained for the product. Records labeled as positive samples are removed from the product's defect report set B_all, leaving the unlabeled sample set B_left.
Step 1.2: initial "negative" samples are obtained based on the Levenshtein distance. The Levenshtein distance is a kind of edit distance, i.e., the minimum number of editing operations required to transform one string into another. The Levenshtein distance lev_{a,b}(|a|, |b|) between two strings a and b is computed by formula (1):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1(a_i ≠ b_j) ), otherwise.    (1)

where 1(a_i ≠ b_j) is an indicator function equal to 0 when a_i = b_j and 1 otherwise. lev_{a,b}(i, j) is the distance between the first i characters of string a and the first j characters of string b; i and j are indices with step size 1. By calculating the Levenshtein distance between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos, the first 50 records with the largest distances are extracted as the initial "negative" samples.
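Formula (1) translates directly into the standard dynamic-programming implementation of the Levenshtein distance; the sketch below is illustrative and not taken from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits transforming a into b."""
    m, n = len(a), len(b)
    # lev[i][j] = distance between the first i chars of a and first j chars of b
    lev = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        lev[i][0] = i                       # base case: min(i, j) == 0 -> max(i, j)
    for j in range(n + 1):
        lev[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1     # indicator 1(a_i != b_j)
            lev[i][j] = min(lev[i - 1][j] + 1,          # deletion
                            lev[i][j - 1] + 1,          # insertion
                            lev[i - 1][j - 1] + cost)   # substitution (or match)
    return lev[m][n]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).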
To improve calculation accuracy, the method first uses the NLTK toolkit to extract the top 100 keywords Key_pos from the labeled positive sample set B_pos, and extracts the top 100 keywords from each record of B_left to obtain the sequence {Key_1, Key_2, ..., Key_m}, where m is the number of defect reports in B_left. During keyword extraction with NLTK, stop words such as "a", "the" and "this", and other words that occur frequently but carry no practical meaning, are removed first. Then the Levenshtein distance between every element of {Key_1, Key_2, ..., Key_m} and Key_pos is calculated, yielding the distance sequence {Dis_1, Dis_2, ..., Dis_m}.
Finally, the elements of {Dis_1, Dis_2, ..., Dis_m} are sorted from largest to smallest and the first 50 are taken. The defect report records corresponding to these 50 elements are labeled "negative" samples and serve as the negative sample set B_neg in the initial training samples. Records labeled as negative samples are removed from B_left; the remaining unlabeled samples form the target sample set B_u.
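The step 1.2 selection procedure can be sketched as follows. This is a simplified stand-in for the described pipeline: plain frequency counts replace the NLTK keyword extraction, the stop-word list is truncated for brevity, and the keyword lists are compared as space-joined strings — all assumptions made for illustration only.

```python
from collections import Counter

STOP = {"a", "an", "the", "this", "is", "of", "to", "and", "in"}  # tiny stop list

def top_keywords(texts, n=100):
    """Most frequent non-stop words, joined into one string (simplified)."""
    counts = Counter(w for t in texts for w in t.lower().split() if w not in STOP)
    return " ".join(w for w, _ in counts.most_common(n))

def levenshtein(a, b):
    """Row-by-row DP Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pick_negatives(b_left, key_pos, k=50):
    """Return the k reports in b_left farthest (by Levenshtein) from Key_pos."""
    dists = [(levenshtein(top_keywords([r], 100), key_pos), r) for r in b_left]
    dists.sort(key=lambda p: -p[0])       # largest distance first
    return [r for _, r in dists[:k]]
```

Reports least similar to the positive-sample vocabulary are thereby chosen as high-confidence negatives.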
Step 1.3: the initial labeled positive sample set B_pos and the initial labeled negative sample set B_neg are combined to form the initial training sample set B_l.
Step 1.4: the initial training sample set B_l and the unlabeled target sample set B_u are output. The initial training sample set B_l is written by default to the file train.csv; the unlabeled target sample set B_u is written by default to the file target.
Step 2: iterative automatic voting classification. Based on the sparsity of security-vulnerability-related defect reports among all defect reports, an iterative automatic voting classification method is designed using 3 different text classifiers; it has 3 inputs and 3 outputs.
Input: the labeled training sample set B_l; the unlabeled target data set B_u; three classifiers.
Output: the sample set B_ppos predicted to be positive; the sample set B_pneg predicted to be negative; the uncertain sample set B_pu.
Step 2.1: classifier selection. The invention selects the 3 best-performing of the 5 classifiers used by Peters et al. (Peters, F., Tun, T., Yu, Y., Nuseibeh, B.: Text filtering and rating for security bug report prediction. IEEE Transactions on Software Engineering 45(6), 615-631 (2019)) for the voting algorithm: multinomial naive Bayes (MNB), logistic regression (LR), and the multilayer perceptron (MLP).
Step 2.2: model training. The three classifiers MNB, LR and MLP are trained separately with the labeled training sample set B_l. Because the description information ("Description") of a defect report is natural-language text, the data must be preprocessed before model training. The method first extracts text features from the defect report's "Description" field and converts them into a token matrix using the CountVectorizer provided by scikit-learn, removing stop words such as "a" and "the". Next, feature selection and dimensionality reduction are performed by combining scikit-learn's SelectFromModel() with LinearSVC(). Finally, the classifiers MNB, LR and MLP are each trained on the resulting reduced token matrix.
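A sketch of this preprocessing and training chain with scikit-learn, using the components named above (CountVectorizer, SelectFromModel with LinearSVC, and the MNB/LR/MLP classifiers). The corpus, hyperparameters, and helper names are illustrative assumptions, not the patent's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def build_features(descriptions, labels):
    """Token matrix with English stop words removed, then feature selection."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(descriptions)
    # Keep only features whose LinearSVC coefficient magnitude clears the
    # default threshold (dimensionality reduction).
    selector = SelectFromModel(LinearSVC().fit(X, labels), prefit=True)
    return vec, selector, selector.transform(X)

def train_classifiers(X, y):
    """Train the three voters: MNB, LR and MLP."""
    classifiers = [MultinomialNB(),
                   LogisticRegression(max_iter=1000),
                   MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)]
    for clf in classifiers:
        clf.fit(X, y)
    return classifiers
```

New records must pass through the same fitted vectorizer and selector before prediction, so both are returned alongside the reduced matrix.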
Step 2.3: voting-based automatic labeling of target data. First, the target data set B_u undergoes the same data-preprocessing steps using scikit-learn's CountVectorizer, SelectFromModel() and LinearSVC(). Next, the three classifiers MNB, LR and MLP trained in step 2.2 each predict the target data. Any data labeled negative by all three classifiers simultaneously are transferred from the target data set B_u into the training samples and marked as negative samples, expanding the number of negative samples in the training data set B_l. Whether an iteration-exit condition is met is then checked against two stopping criteria:
there are no three classifiers in the current iteration that predict the appearance of a negative sample at the same time. Because the goal of each iteration is to move the samples for which three classifiers predict negative simultaneously from the target dataset into the training set, the iteration stops when there are no samples for which three classifiers predict negative simultaneously in a loop.
(2) The amount of remaining data reaches a set threshold f. A minimum target-data-set size threshold is set for the iterative loop. According to statistics over a large number of open-source projects, security vulnerability reports account for roughly 5%-20% of all defect reports; the threshold is computed with the maximum of this range, as shown in formula (2):

f = 0.2 * len(B_u)    (formula 2)
If either of the two stopping criteria is met, the iteration exits and processing proceeds to step 2.4; otherwise, step 2.2 is entered again for the next iteration.
Step 2.4: output the automatic labeling result for the target data. Data predicted positive by all three classifiers simultaneously are extracted to form the predicted-positive sample set B_ppos; the data added to the training samples during iterations form the predicted-negative sample set B_pneg; the remaining data, on which the three classifiers cannot reach agreement, form the uncertain sample set B_pu.
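The iteration of steps 2.2 through 2.4 can be sketched as follows. Here `fit_fn` and `predict_fn` are hypothetical placeholders for the training and prediction routines described above, with labels 1 = positive (security-related) and 0 = negative.

```python
def iterative_vote(b_l, b_u, fit_fn, predict_fn):
    """b_l: list of (text, label) pairs; b_u: list of unlabeled texts.
    Returns (b_ppos, b_pneg, b_pu)."""
    f = 0.2 * len(b_u)                     # minimum target-set size, formula (2)
    b_pneg = []
    while True:
        clfs = fit_fn(b_l)                                 # retrain (step 2.2)
        votes = [predict_fn(c, b_u) for c in clfs]         # predict (step 2.3)
        unanimous_neg = [i for i, _ in enumerate(b_u)
                         if all(v[i] == 0 for v in votes)]
        if not unanimous_neg:              # stopping criterion (1)
            break
        for i in sorted(unanimous_neg, reverse=True):      # move into training set
            b_pneg.append(b_u[i])
            b_l.append((b_u.pop(i), 0))
        if len(b_u) <= f:                  # stopping criterion (2)
            break
    # Step 2.4: unanimous positives and leftover uncertain samples.
    clfs = fit_fn(b_l)
    votes = [predict_fn(c, b_u) for c in clfs]
    b_ppos = [x for i, x in enumerate(b_u) if all(v[i] == 1 for v in votes)]
    b_pu = [x for x in b_u if x not in b_ppos]
    return b_ppos, b_pneg, b_pu
```

The sketch mutates `b_l` in place so each retraining pass sees the enlarged negative set, mirroring the loop described in the text.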
Evaluation of results
Using 3 public datasets and several classical classification evaluation metrics, the iterative automatic voting classification result is compared with the result of each classifier used independently, comprehensively evaluating the effectiveness of the method.
The validity of the voting classification method designed by the invention is verified by comparing it against the three classification methods (MNB, LR and MLP) used independently.
Evaluation datasets. The performance of the automatic classification method is evaluated on the Derby, Chromium and OpenStack datasets. The distribution of the initial labeled samples and the test samples (i.e., the target data set) of the three datasets is shown in Table 1.
Table 1 distribution of performance evaluation experimental data sets
Evaluation metrics. To evaluate the performance of different classifiers effectively, multidimensional evaluation is performed using Recall, Precision, F1-score, Accuracy, and statistical analysis. These metrics are described below.
For a sample, there are four possible classification outcomes:
- TP (True Positive): a positive sample predicted as positive;
- FP (False Positive): a negative sample predicted as positive;
- TN (True Negative): a negative sample predicted as negative;
- FN (False Negative): a positive sample predicted as negative.
Based on these outcomes, the evaluation metrics and their calculation formulas are as follows:
Recall: the proportion of correctly predicted positive samples among all actual positive samples:

Recall = TP / (TP + FN)
Precision: the proportion of actual positive samples among all samples the model predicts as positive:

Precision = TP / (TP + FP)
F1-score: the harmonic mean of Precision and Recall, balancing the two; a higher F1 value means more accurate prediction:

F1 = 2 * Precision * Recall / (Precision + Recall)
accuracy: indicating the true proportion of the prediction, also referred to as the success rate.
Figure BDA0002716230690000092
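Written over the confusion-matrix counts, the four metrics above are:

```python
def metrics(tp, fp, tn, fn):
    """Recall, Precision, F1 and Accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, precision, f1, accuracy
```

For example, with TP=8, FP=2, TN=85, FN=5 this gives Recall 8/13, Precision 0.8, F1 = 16/23 and Accuracy 0.93.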
Statistical analysis method: to further assess whether the effect of this method is statistically more significant than other methods, a Wilcoxon signed rank test (Wilcoxon rank-sum test) was used to analyze the statistically significant differences between the different models. Cliff's delta was used to measure the amount of difference between two non-parametric variables by calculating the Effect size of different models of F1-score and Precision as follows:
Figure BDA0002716230690000093
where W is the Wilcoxon rank sum test, and m and n are the two variables that need to be compared to each other, respectively. The table for the importance and value of d is shown in table 2:
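Cliff's delta can be computed directly from its pairwise definition, d = (#{x_i > y_j} - #{x_i < y_j}) / (m * n), together with the level thresholds of Table 2; the helper names below are illustrative.

```python
def cliffs_delta(xs, ys):
    """Pairwise Cliff's delta between two samples."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def level(d):
    """Map |d| to the effect-size level of Table 2."""
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

When every value of one sample exceeds every value of the other, d is 1 and the effect size is "large".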
TABLE 2 Effect-size level and Cliff's delta range

Level        Cliff's delta range
Negligible   |d| < 0.147
Small        0.147 <= |d| < 0.33
Medium       0.33 <= |d| < 0.474
Large        0.474 <= |d|
(iii) Experimental results
(a) Automatic labeling results: the proposed method is applied to the three datasets shown in Table 1 to automatically label the target data sets. The three outputs for each project are shown in Table 3. In the predictions for Derby, OpenStack and Chromium, the predicted-negative samples B_pneg account for 81.40%, 99.39% and 98.72% of the target data respectively, while the predicted-positive samples B_ppos account for only 12.44%, 0.16% and 1.15%. This positive-to-negative ratio is consistent with the fact that security-vulnerability-related reports make up only a small share of real projects. The uncertain samples B_pu account for 6.16%, 0.45% and 0.13%.
TABLE 3 Statistics of iterative voting classification results on the target data sets

Data set    Bppos   Ratio (%)   Bpneg   Ratio (%)   Bpu   Ratio (%)
Derby       107     12.44       700     81.40       53    6.16
OpenStack   129     0.16        88071   99.39       396   0.45
Chromium    143     1.15        40956   98.72       52    0.13
(b) Performance comparison: the Precision, F1-score and Accuracy of the method are all superior to those of the three classification methods used independently.
TABLE 3 Performance evaluation results of the Classification Algorithm
To evaluate the statistical significance of the method, 30 experimental runs of the voting classification algorithm and of each of the three single classification algorithms were performed on the Derby, OpenStack and Chromium datasets; statistical analysis was then applied to the resulting data, with p-values computed by the Wilcoxon rank-sum test and effect-size levels by Cliff's delta. The results, shown in Table 4, indicate that the iterative voting classification method of the invention is significantly superior to the other three classifiers.
TABLE 4 comparison of different classification algorithms p-value and impact size rating

Claims (4)

1. A security vulnerability report data set construction method based on a voting mechanism, characterized by comprising the following steps:
Step 1: initial training sample labeling; this stage comprises two inputs and two outputs:
Input: CVE data and the unlabeled sample set B_all.
Output: the labeled initial training sample set B_l and the remaining unlabeled sample set B_u.
Step 1.1: positive-sample labeling based on CVE: records associated with CVE entries in the software product's defect reports are labeled "positive" samples, finally yielding the set B_pos of all labeled positive samples for the product; records labeled as positive samples are removed from the product's defect report set B_all, leaving the unlabeled sample set B_left.
Step 1.2: obtaining initial "negative" samples based on the Levenshtein distance: the Levenshtein distance between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos is calculated, and the first several records with the largest distances are extracted as initial "negative" samples, forming the initial labeled negative sample set B_neg; records labeled as negative samples are removed from B_left, and the remaining unlabeled samples form the target sample set B_u.
Step 1.3: the initial labeled positive sample set B_pos and the initial labeled negative sample set B_neg are combined to form the initial training sample set B_l.
Step 1.4: the initial training sample set B_l and the unlabeled target sample set B_u are output.
Step 2: iterative automatic voting classification: an iterative voting classification method is proposed, comprising 3 inputs and 3 outputs;
Input: the labeled training sample set B_l; the unlabeled target data set B_u; three classifiers;
Output: the sample set B_ppos predicted to be positive; the sample set B_pneg predicted to be negative; the uncertain sample set B_pu;
Step 2.1: model training: the three classifiers are each trained with the labeled training samples;
Step 2.2: voting-based automatic labeling of target data: the target data are predicted by the three models trained in step 2.1; data labeled negative by all three classifiers simultaneously are transferred from the target data into the training samples, expanding the number of negative samples in the training samples; whether an iteration-exit condition is met is checked: if so, proceed to step 2.3; otherwise, return to step 2.1;
Step 2.3: outputting the automatic labeling result for the target data: data predicted positive by all three classifiers simultaneously are extracted to form the positive sample set; the data added to the training samples during iterations form the negative sample set; the remaining data form the uncertain sample set; the three sets are output to specified files.
2. The security vulnerability report data set construction method based on a voting mechanism according to claim 1, wherein the Levenshtein distance in step 1 is a kind of edit distance, namely the minimum number of editing operations required to transform one string into another; the Levenshtein distance lev_{a,b}(|a|, |b|) between two strings a and b is computed by formula (1):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1(a_i ≠ b_j) ), otherwise.    (1)

where 1(a_i ≠ b_j) is an indicator function equal to 0 when a_i = b_j and 1 otherwise; lev_{a,b}(i, j) is the distance between the first i characters of string a and the first j characters of string b; i and j are indices with step size 1; the Levenshtein distance is calculated between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos.
3. The security vulnerability report data set construction method based on a voting mechanism according to claim 1, wherein the first 50 records in step 1.2 are selected as the initial "negative" samples.
4. The security vulnerability report data set construction method based on a voting mechanism according to claim 1, wherein the 3 classifiers in step 2 are multinomial naive Bayes (MNB), logistic regression (LR), and a multilayer perceptron (MLP) neural network.
CN202011074609.5A, filed 2020-10-09, priority date 2020-10-09: Security vulnerability report data set construction method based on voting mechanism (Pending, published as CN112231706A)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011074609.5A CN112231706A (en) 2020-10-09 2020-10-09 Security vulnerability report data set construction method based on voting mechanism


Publications (1)

Publication Number Publication Date
CN112231706A (en) 2021-01-15

Family

ID=74120177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011074609.5A Pending CN112231706A (en) 2020-10-09 2020-10-09 Security vulnerability report data set construction method based on voting mechanism

Country Status (1)

Country Link
CN (1) CN112231706A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019019860A1 (en) * 2017-07-24 2019-01-31 华为技术有限公司 Method and apparatus for training classification model
CN110532542A (en) * 2019-07-15 2019-12-03 西安交通大学 It is a kind of that recognition methods and system are write out falsely with the invoice for not marking study based on positive example

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WU XIAOXUE et al.: "CVE-assisted large-scale security bug report dataset construction method", Elsevier, 29 February 2020 (2020-02-29), pages 3-4 *
LI XIA; WANG LIANXI; JIANG SHENGYI: "Ensemble feature selection for imbalanced problems", Journal of Shandong University (Engineering Science), no. 03, 16 June 2011 (2011-06-16) *
ZHENG WEI; CHEN JUNZHENG; WU XIAOXUE; CHEN XIANG; XIA XIN: "An empirical study of deep-learning-based security bug report prediction methods", Journal of Software, no. 05, 15 May 2020 (2020-05-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343051A (en) * 2021-06-04 2021-09-03 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method
CN113343051B (en) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method

Similar Documents

Publication Publication Date Title
CN109034368B (en) DNN-based complex equipment multiple fault diagnosis method
CN110532542B (en) Invoice false invoice identification method and system based on positive case and unmarked learning
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
CN111882446B (en) Abnormal account detection method based on graph convolution network
CN110866819A (en) Automatic credit scoring card generation method based on meta-learning
US7292960B1 (en) Method for characterization, detection and prediction for target events
CN110232395B (en) Power system fault diagnosis method based on fault Chinese text
CN107861951A (en) Session subject identifying method in intelligent customer service
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN113420145A (en) Bidding text classification method and system based on semi-supervised learning
CN111177010B (en) Software defect severity identification method
CN114297393A (en) Software defect report classification method integrating multivariate text information and report intention
CN111832306A (en) Image diagnosis report named entity identification method based on multi-feature fusion
CN112231706A (en) Security vulnerability report data set construction method based on voting mechanism
Rofik et al. The Optimization of Credit Scoring Model Using Stacking Ensemble Learning and Oversampling Techniques
CN115098681A (en) Open service intention detection method based on supervised contrast learning
CN113505221B (en) Enterprise false propaganda risk identification method, equipment and storage medium
Navarro-Cerdan et al. Batch-adaptive rejection threshold estimation with application to OCR post-processing
CN113704073A (en) Method for detecting abnormal data of automobile maintenance record library
Wang et al. Shapelet classification algorithm based on efficient subsequence matching
Florbäck Anomaly detection in logged sensor data
AlSaif Large scale data mining for banking credit risk prediction
Turkoglu et al. Application of data mining in failure estimation of cold forging machines: An industrial research
CN113177831B (en) Financial early warning system constructed by application of public data and early warning method
CN116932487B (en) Quantized data analysis method and system based on data paragraph division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination