CN112231706A - Security vulnerability report data set construction method based on voting mechanism - Google Patents


Info

Publication number
CN112231706A
CN112231706A (application number CN202011074609.5A)
Authority
CN
China
Prior art keywords
sample
sample set
data
negative
positive
Prior art date
Legal status
Pending
Application number
CN202011074609.5A
Other languages
Chinese (zh)
Inventor
吴潇雪
郑炜
陈智通
栾文飞
慕德俊
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202011074609.5A
Publication of CN112231706A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Security & Cryptography (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to an automatic data labeling method based on iterative voting classification. First, initial labeled samples are prepared: data from the authoritative CVE (Common Vulnerabilities and Exposures) database assists positive-sample labeling, and a small number of high-quality reports unrelated to security vulnerabilities are selected as negative samples. Second, three different classifiers are trained on the initial labeled samples; the three trained models each predict the target data set, data that all three classifiers consistently predict as negative are added to the labeled samples as negative samples, and the next iteration begins. Finally, the accuracy of the model's automatic labeling is verified. Experiments show that the method effectively improves the labeling accuracy of security vulnerability data sets, with F1-score reaching 0.91.

Description

Security vulnerability report data set construction method based on voting mechanism
Technical Field
The invention belongs to the field of software security assurance in software testing, and relates to a security vulnerability prediction method, a data marking method, a data set construction method and the like.
Background
Machine-learning-based security vulnerability report identification is receiving increasing attention from both academia and industry, and a high-quality labeled dataset is a prerequisite for applying machine learning models. Recently, Peters et al. (Peters, F., Tun, T., Yu, Y., Nuseibeh, B.: Text filtering and rating for security bug report prediction. IEEE Transactions on Software Engineering 45(6), 615-631 (2019)) proposed a noise-data filtering method named FARSEC for the mislabeling problem in vulnerability report detection datasets. The method comprises two main steps:
Step one: extract security-related words. Security-related keywords are extracted from security vulnerability reports using the TF-IDF method.
Step two: filter the noise data. The similarity between each non-security vulnerability report and the security-related vocabulary obtained in step one is calculated, and records with high similarity are filtered out.
However, the method's false-recognition rate on noise data is high, so many non-noise records are filtered out, causing substantial information loss; as a result, models trained on the filtered datasets detect security vulnerability reports with very low accuracy, on average below 50%.
Disclosure of Invention
Technical problem to be solved
In order to improve the accuracy of security vulnerability report detection in the prior art, the invention provides a security vulnerability report data set construction method based on a voting mechanism.
Technical scheme
A security vulnerability report data set construction method based on a voting mechanism is characterized by comprising the following steps:
Step 1: initial training sample labeling; this stage comprises two inputs and two outputs:
Input: CVE data and the unlabeled sample set B_all.
Output: the labeled initial training sample set B_l and the remaining unlabeled sample set B_u.
Step 1.1: positive-sample labeling based on CVE: records associated with CVE entries in the software product's defect reports are labeled "positive" samples, finally yielding the set B_pos of all labeled positive samples for the product; records labeled as positive samples are removed from the product's defect report set B_all, leaving the unlabeled sample set B_left.
Step 1.2: obtaining initial "negative" samples based on the Levenshtein distance: the Levenshtein distance between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos is calculated, and the first 50 records with the largest distances are extracted as initial "negative" samples, forming the initial labeled negative sample set B_neg; records labeled as negative samples are removed from B_left, and the remaining unlabeled samples form the target sample set B_u.
Step 1.3: the initial labeled positive sample set B_pos and the initial labeled negative sample set B_neg are combined to form the initial training sample set B_l.
Step 1.4: the initial training sample set B_l and the unlabeled target sample set B_u are output.
Step 2: iterative automatic voting classification: an iterative voting classification method is proposed, comprising 3 inputs and 3 outputs;
Input: the labeled training sample set B_l; the unlabeled target data set B_u; three classifiers;
Output: the sample set B_ppos predicted to be positive; the sample set B_pneg predicted to be negative; the uncertain sample set B_pu;
Step 2.1: model training: the three classifiers are each trained with the labeled training samples;
Step 2.2: voting-based automatic labeling of target data: the target data are predicted by the three models trained in step 2.1; data labeled negative by all three classifiers simultaneously are transferred from the target data into the training samples, expanding the number of negative samples in the training samples; whether an iteration-exit condition is met is checked: if so, proceed to step 2.3; otherwise, return to step 2.1;
Step 2.3: outputting the automatic labeling result for the target data: data predicted positive by all three classifiers simultaneously are extracted to form the positive sample set; the data added to the training samples during iterations form the negative sample set; the remaining data form the uncertain sample set; the three sets are output to specified files.
The technical scheme of the invention further provides that: the Levenshtein distance in step 1 is a kind of edit distance, namely the minimum number of editing operations required to transform one string into another; the Levenshtein distance lev_{a,b}(|a|, |b|) between two strings a and b is computed by formula (1):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1(a_i ≠ b_j) ), otherwise.    (1)

where 1(a_i ≠ b_j) is an indicator function equal to 0 when a_i = b_j and 1 otherwise; lev_{a,b}(i, j) is the distance between the first i characters of string a and the first j characters of string b; i and j are indices with step size 1; the Levenshtein distance is calculated between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos.
The technical scheme of the invention further provides that: the first 50 records in step 1.2 are selected as the initial "negative" samples.
The technical scheme of the invention further provides that: the 3 classifiers in step 2 are multinomial naive Bayes (MNB), logistic regression (LR), and a multilayer perceptron (MLP) neural network.
Advantageous effects
The invention provides an automatic data labeling method based on iterative voting classification. First, initial labeled samples are prepared: data from the internationally authoritative vulnerability database CVE (Common Vulnerabilities and Exposures) assists positive-sample labeling, and a small number (for example, 50) of high-quality reports unrelated to security vulnerabilities are selected from an existing report repository as negative samples. Second, three different classifiers are trained with the initial labeled samples; the three trained classification models each predict the target data set, the data on which all three classifiers' predictions agree as negative are added to the labeled samples as negative samples, and the next iteration begins. Finally, the accuracy of the model's automatic labeling is verified. Experiments show that the method effectively improves the labeling accuracy of security vulnerability data sets, with F1-score reaching 0.91.
The present invention constructs large-scale SBR (security bug report) prediction datasets using only a small set of initially labeled samples. On the one hand, the method improves noise-identification accuracy in existing data and, through automatic labeling, avoids the information-loss problem caused by data filtering. On the other hand, the method can be used to build new security vulnerability report detection datasets. With it, researchers can construct a large-scale, high-quality dataset for security vulnerability report detection from an open-source project with only a small amount of effort, promoting the early discovery and repair of security vulnerabilities.
Drawings
FIG. 1 is a framework of a data set construction method.
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
Step 1: initial training sample labeling. This stage includes two inputs and two outputs:
Input: CVE data and the unlabeled sample set B_all.
Output: the labeled initial training sample set B_l and the remaining unlabeled sample set B_u.
Step 1.1: positive-sample labeling based on CVE. Since CVE is an internationally authoritative vulnerability database, any defect report associated with CVE data in a software product's defect tracking system is certainly a security-vulnerability-related defect report. Accordingly, if a defect report is associated with one or more CVE records (that is, the defect report has an associated CVE number, or the associated defect report number is explicitly noted in the details of a CVE record), the defect report is labeled a "positive" sample. Finally, the set B_pos of all labeled positive samples is obtained for the product. Records labeled as positive samples are removed from the product's defect report set B_all, leaving the unlabeled sample set B_left.
Step 1.2: initial "negative" samples are obtained based on the Levenshtein distance. The Levenshtein distance is a kind of edit distance, i.e., the minimum number of editing operations required to transform one string into another. The Levenshtein distance lev_{a,b}(|a|, |b|) between two strings a and b is computed by formula (1):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1(a_i ≠ b_j) ), otherwise.    (1)

where 1(a_i ≠ b_j) is an indicator function equal to 0 when a_i = b_j and 1 otherwise. lev_{a,b}(i, j) is the distance between the first i characters of string a and the first j characters of string b; i and j are indices with step size 1. By calculating the Levenshtein distance between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos, the first 50 records with the largest distances are extracted as the initial "negative" samples.
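Formula (1) translates directly into the standard dynamic-programming implementation of the Levenshtein distance; the sketch below is illustrative and not taken from the patent.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits transforming a into b."""
    m, n = len(a), len(b)
    # lev[i][j] = distance between the first i chars of a and first j chars of b
    lev = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        lev[i][0] = i                       # base case: min(i, j) == 0 -> max(i, j)
    for j in range(n + 1):
        lev[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1     # indicator 1(a_i != b_j)
            lev[i][j] = min(lev[i - 1][j] + 1,          # deletion
                            lev[i][j - 1] + 1,          # insertion
                            lev[i - 1][j - 1] + cost)   # substitution (or match)
    return lev[m][n]
```

For example, `levenshtein("kitten", "sitting")` is 3 (two substitutions and one insertion).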
To improve calculation accuracy, the method first uses the NLTK toolkit to extract the top 100 keywords Key_pos from the labeled positive sample set B_pos, and extracts the top 100 keywords from each record of B_left to obtain the sequence {Key_1, Key_2, ..., Key_m}, where m is the number of defect reports in B_left. During keyword extraction with NLTK, stop words such as "a", "the" and "this", and other words that occur frequently but carry no practical meaning, are removed first. Then the Levenshtein distance between every element of {Key_1, Key_2, ..., Key_m} and Key_pos is calculated, yielding the distance sequence {Dis_1, Dis_2, ..., Dis_m}.
Finally, the elements of {Dis_1, Dis_2, ..., Dis_m} are sorted from largest to smallest and the first 50 are taken. The defect report records corresponding to these 50 elements are labeled "negative" samples and serve as the negative sample set B_neg in the initial training samples. Records labeled as negative samples are removed from B_left; the remaining unlabeled samples form the target sample set B_u.
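The step 1.2 selection procedure can be sketched as follows. This is a simplified stand-in for the described pipeline: plain frequency counts replace the NLTK keyword extraction, the stop-word list is truncated for brevity, and the keyword lists are compared as space-joined strings — all assumptions made for illustration only.

```python
from collections import Counter

STOP = {"a", "an", "the", "this", "is", "of", "to", "and", "in"}  # tiny stop list

def top_keywords(texts, n=100):
    """Most frequent non-stop words, joined into one string (simplified)."""
    counts = Counter(w for t in texts for w in t.lower().split() if w not in STOP)
    return " ".join(w for w, _ in counts.most_common(n))

def levenshtein(a, b):
    """Row-by-row DP Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pick_negatives(b_left, key_pos, k=50):
    """Return the k reports in b_left farthest (by Levenshtein) from Key_pos."""
    dists = [(levenshtein(top_keywords([r], 100), key_pos), r) for r in b_left]
    dists.sort(key=lambda p: -p[0])       # largest distance first
    return [r for _, r in dists[:k]]
```

Reports least similar to the positive-sample vocabulary are thereby chosen as high-confidence negatives.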
Step 1.3: the initial labeled positive sample set B_pos and the initial labeled negative sample set B_neg are combined to form the initial training sample set B_l.
Step 1.4: the initial training sample set B_l and the unlabeled target sample set B_u are output. The initial training sample set B_l is written by default to the file train.csv; the unlabeled target sample set B_u is written by default to the file target.
Step 2: iterative automatic voting classification. Based on the sparsity of security-vulnerability-related defect reports among all defect reports, an iterative automatic voting classification method is designed using 3 different text classifiers; it has 3 inputs and 3 outputs.
Input: the labeled training sample set B_l; the unlabeled target data set B_u; three classifiers.
Output: the sample set B_ppos predicted to be positive; the sample set B_pneg predicted to be negative; the uncertain sample set B_pu.
Step 2.1: classifier selection. The invention selects the 3 best-performing of the 5 classifiers used by Peters et al. (Peters, F., Tun, T., Yu, Y., Nuseibeh, B.: Text filtering and rating for security bug report prediction. IEEE Transactions on Software Engineering 45(6), 615-631 (2019)) for the voting algorithm: multinomial naive Bayes (MNB), logistic regression (LR), and the multilayer perceptron (MLP).
Step 2.2: model training. The three classifiers MNB, LR and MLP are trained separately with the labeled training sample set B_l. Because the description information ("Description") of a defect report is natural-language text, the data must be preprocessed before model training. The method first extracts text features from the defect report's "Description" field and converts them into a token matrix using the CountVectorizer provided by scikit-learn, removing stop words such as "a" and "the". Next, feature selection and dimensionality reduction are performed by combining scikit-learn's SelectFromModel() with LinearSVC(). Finally, the classifiers MNB, LR and MLP are each trained on the resulting reduced token matrix.
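A sketch of this preprocessing and training chain with scikit-learn, using the components named above (CountVectorizer, SelectFromModel with LinearSVC, and the MNB/LR/MLP classifiers). The corpus, hyperparameters, and helper names are illustrative assumptions, not the patent's exact configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

def build_features(descriptions, labels):
    """Token matrix with English stop words removed, then feature selection."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(descriptions)
    # Keep only features whose LinearSVC coefficient magnitude clears the
    # default threshold (dimensionality reduction).
    selector = SelectFromModel(LinearSVC().fit(X, labels), prefit=True)
    return vec, selector, selector.transform(X)

def train_classifiers(X, y):
    """Train the three voters: MNB, LR and MLP."""
    classifiers = [MultinomialNB(),
                   LogisticRegression(max_iter=1000),
                   MLPClassifier(hidden_layer_sizes=(16,), max_iter=500)]
    for clf in classifiers:
        clf.fit(X, y)
    return classifiers
```

New records must pass through the same fitted vectorizer and selector before prediction, so both are returned alongside the reduced matrix.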
Step 2.3: voting-based automatic labeling of target data. First, the target data set B_u undergoes the same data-preprocessing steps using scikit-learn's CountVectorizer, SelectFromModel() and LinearSVC(). Next, the three classifiers MNB, LR and MLP trained in step 2.2 each predict the target data. Any data labeled negative by all three classifiers simultaneously are transferred from the target data set B_u into the training samples and marked as negative samples, expanding the number of negative samples in the training data set B_l. Whether an iteration-exit condition is met is then checked against two stopping criteria:
there are no three classifiers in the current iteration that predict the appearance of a negative sample at the same time. Because the goal of each iteration is to move the samples for which three classifiers predict negative simultaneously from the target dataset into the training set, the iteration stops when there are no samples for which three classifiers predict negative simultaneously in a loop.
(2) The amount of remaining data reaches a set threshold f. A minimum target-data-set size threshold is set for the iterative loop. According to statistics over a large number of open-source projects, security vulnerability reports account for roughly 5%-20% of all defect reports; the threshold is computed with the maximum of this range, as shown in formula (2):

f = 0.2 * len(B_u)    (formula 2)
If either of the two stopping criteria is met, the iteration exits and processing proceeds to step 2.4; otherwise, step 2.2 is entered again for the next iteration.
Step 2.4: output the automatic labeling result for the target data. Data predicted positive by all three classifiers simultaneously are extracted to form the predicted-positive sample set B_ppos; the data added to the training samples during iterations form the predicted-negative sample set B_pneg; the remaining data, on which the three classifiers cannot reach agreement, form the uncertain sample set B_pu.
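The iteration of steps 2.2 through 2.4 can be sketched as follows. Here `fit_fn` and `predict_fn` are hypothetical placeholders for the training and prediction routines described above, with labels 1 = positive (security-related) and 0 = negative.

```python
def iterative_vote(b_l, b_u, fit_fn, predict_fn):
    """b_l: list of (text, label) pairs; b_u: list of unlabeled texts.
    Returns (b_ppos, b_pneg, b_pu)."""
    f = 0.2 * len(b_u)                     # minimum target-set size, formula (2)
    b_pneg = []
    while True:
        clfs = fit_fn(b_l)                                 # retrain (step 2.2)
        votes = [predict_fn(c, b_u) for c in clfs]         # predict (step 2.3)
        unanimous_neg = [i for i, _ in enumerate(b_u)
                         if all(v[i] == 0 for v in votes)]
        if not unanimous_neg:              # stopping criterion (1)
            break
        for i in sorted(unanimous_neg, reverse=True):      # move into training set
            b_pneg.append(b_u[i])
            b_l.append((b_u.pop(i), 0))
        if len(b_u) <= f:                  # stopping criterion (2)
            break
    # Step 2.4: unanimous positives and leftover uncertain samples.
    clfs = fit_fn(b_l)
    votes = [predict_fn(c, b_u) for c in clfs]
    b_ppos = [x for i, x in enumerate(b_u) if all(v[i] == 1 for v in votes)]
    b_pu = [x for x in b_u if x not in b_ppos]
    return b_ppos, b_pneg, b_pu
```

The sketch mutates `b_l` in place so each retraining pass sees the enlarged negative set, mirroring the loop described in the text.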
Evaluation of results
Using 3 public datasets and several classical classification evaluation metrics, the iterative automatic voting classification result is compared with the result of each classifier used independently, comprehensively evaluating the effectiveness of the method.
The validity of the voting classification method designed by the invention is verified by comparing it against the three classification methods (MNB, LR and MLP) used independently.
Evaluation datasets. The performance of the automatic classification method is evaluated on the Derby, Chromium and OpenStack datasets. The distribution of the initial labeled samples and the test samples (i.e., the target data set) of the three datasets is shown in Table 1.
Table 1 distribution of performance evaluation experimental data sets
Evaluation metrics. To evaluate the performance of different classifiers effectively, multidimensional evaluation is performed using Recall, Precision, F1-score, Accuracy, and statistical analysis. These metrics are described below.
For a sample, there are four possible classification outcomes:
- TP (True Positive): a positive sample predicted as positive;
- FP (False Positive): a negative sample predicted as positive;
- TN (True Negative): a negative sample predicted as negative;
- FN (False Negative): a positive sample predicted as negative.
Based on these outcomes, the evaluation metrics and their calculation formulas are as follows:
Recall: the proportion of correctly predicted positive samples among all actual positive samples:

Recall = TP / (TP + FN)
Precision: the proportion of actual positive samples among all samples the model predicts as positive:

Precision = TP / (TP + FP)
F1-score: the harmonic mean of Precision and Recall, balancing the two; a higher F1 value means more accurate prediction:

F1 = 2 * Precision * Recall / (Precision + Recall)
accuracy: indicating the true proportion of the prediction, also referred to as the success rate.
Figure BDA0002716230690000092
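Written over the confusion-matrix counts, the four metrics above are:

```python
def metrics(tp, fp, tn, fn):
    """Recall, Precision, F1 and Accuracy from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return recall, precision, f1, accuracy
```

For example, with TP=8, FP=2, TN=85, FN=5 this gives Recall 8/13, Precision 0.8, F1 = 16/23 and Accuracy 0.93.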
Statistical analysis method: to further assess whether the effect of this method is statistically more significant than other methods, a Wilcoxon signed rank test (Wilcoxon rank-sum test) was used to analyze the statistically significant differences between the different models. Cliff's delta was used to measure the amount of difference between two non-parametric variables by calculating the Effect size of different models of F1-score and Precision as follows:
Figure BDA0002716230690000093
where W is the Wilcoxon rank sum test, and m and n are the two variables that need to be compared to each other, respectively. The table for the importance and value of d is shown in table 2:
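Cliff's delta can be computed directly from its pairwise definition, d = (#{x_i > y_j} - #{x_i < y_j}) / (m * n), together with the level thresholds of Table 2; the helper names below are illustrative.

```python
def cliffs_delta(xs, ys):
    """Pairwise Cliff's delta between two samples."""
    gt = sum(1 for x in xs for y in ys if x > y)
    lt = sum(1 for x in xs for y in ys if x < y)
    return (gt - lt) / (len(xs) * len(ys))

def level(d):
    """Map |d| to the effect-size level of Table 2."""
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

When every value of one sample exceeds every value of the other, d is 1 and the effect size is "large".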
TABLE 2 Effect-size level and Cliff's delta range

Level        Cliff's delta range
Negligible   |d| < 0.147
Small        0.147 <= |d| < 0.33
Medium       0.33 <= |d| < 0.474
Large        0.474 <= |d|
(iii) Experimental results
(a) Automatic labeling results: the proposed method is applied to the three datasets shown in Table 1 to automatically label the target data sets. The three outputs for each project are shown in Table 3. In the predictions for Derby, OpenStack and Chromium, the predicted-negative samples B_pneg account for 81.40%, 99.39% and 98.72% of the target data respectively, while the predicted-positive samples B_ppos account for only 12.44%, 0.16% and 1.15%. This positive-to-negative ratio is consistent with the fact that security-vulnerability-related reports make up only a small share of real projects. The uncertain samples B_pu account for 6.16%, 0.45% and 0.13%.
TABLE 3 Statistics of iterative voting classification results on the target data sets

Data set    Bppos   Ratio (%)   Bpneg   Ratio (%)   Bpu   Ratio (%)
Derby       107     12.44       700     81.40       53    6.16
OpenStack   129     0.16        88071   99.39       396   0.45
Chromium    143     1.15        40956   98.72       52    0.13
(b) Performance comparison: the Precision, F1-score and Accuracy of the method are all superior to those of the three classification methods used independently.
TABLE 3 Performance evaluation results of the Classification Algorithm
To evaluate the statistical significance of the method, 30 experimental runs of the voting classification algorithm and of each of the three single classification algorithms were performed on the Derby, OpenStack and Chromium datasets; statistical analysis was then applied to the resulting data, with p-values computed by the Wilcoxon rank-sum test and effect-size levels by Cliff's delta. The results, shown in Table 4, indicate that the iterative voting classification method of the invention is significantly superior to the other three classifiers.
TABLE 4 comparison of different classification algorithms p-value and impact size rating

Claims (4)

1. A security vulnerability report data set construction method based on a voting mechanism, characterized by comprising the following steps:
Step 1: initial training sample labeling; this stage comprises two inputs and two outputs:
Input: CVE data and the unlabeled sample set B_all.
Output: the labeled initial training sample set B_l and the remaining unlabeled sample set B_u.
Step 1.1: positive-sample labeling based on CVE: records associated with CVE entries in the software product's defect reports are labeled "positive" samples, finally yielding the set B_pos of all labeled positive samples for the product; records labeled as positive samples are removed from the product's defect report set B_all, leaving the unlabeled sample set B_left.
Step 1.2: obtaining initial "negative" samples based on the Levenshtein distance: the Levenshtein distance between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos is calculated, and the first several records with the largest distances are extracted as initial "negative" samples, forming the initial labeled negative sample set B_neg; records labeled as negative samples are removed from B_left, and the remaining unlabeled samples form the target sample set B_u.
Step 1.3: the initial labeled positive sample set B_pos and the initial labeled negative sample set B_neg are combined to form the initial training sample set B_l.
Step 1.4: the initial training sample set B_l and the unlabeled target sample set B_u are output.
Step 2: iterative automatic voting classification: an iterative voting classification method is proposed, comprising 3 inputs and 3 outputs;
Input: the labeled training sample set B_l; the unlabeled target data set B_u; three classifiers;
Output: the sample set B_ppos predicted to be positive; the sample set B_pneg predicted to be negative; the uncertain sample set B_pu;
Step 2.1: model training: the three classifiers are each trained with the labeled training samples;
Step 2.2: voting-based automatic labeling of target data: the target data are predicted by the three models trained in step 2.1; data labeled negative by all three classifiers simultaneously are transferred from the target data into the training samples, expanding the number of negative samples in the training samples; whether an iteration-exit condition is met is checked: if so, proceed to step 2.3; otherwise, return to step 2.1;
Step 2.3: outputting the automatic labeling result for the target data: data predicted positive by all three classifiers simultaneously are extracted to form the positive sample set; the data added to the training samples during iterations form the negative sample set; the remaining data form the uncertain sample set; the three sets are output to specified files.
2. The security vulnerability report data set construction method based on a voting mechanism according to claim 1, wherein the Levenshtein distance in step 1 is a kind of edit distance, namely the minimum number of editing operations required to transform one string into another; the Levenshtein distance lev_{a,b}(|a|, |b|) between two strings a and b is computed by formula (1):

lev_{a,b}(i, j) = max(i, j), if min(i, j) = 0;
lev_{a,b}(i, j) = min( lev_{a,b}(i-1, j) + 1,
                       lev_{a,b}(i, j-1) + 1,
                       lev_{a,b}(i-1, j-1) + 1(a_i ≠ b_j) ), otherwise.    (1)

where 1(a_i ≠ b_j) is an indicator function equal to 0 when a_i = b_j and 1 otherwise; lev_{a,b}(i, j) is the distance between the first i characters of string a and the first j characters of string b; i and j are indices with step size 1; the Levenshtein distance is calculated between each record in the remaining unlabeled defect reports B_left and the labeled positive samples B_pos.
3. The security vulnerability report data set construction method based on a voting mechanism according to claim 1, wherein the first 50 records in step 1.2 are selected as the initial "negative" samples.
4. The security vulnerability report data set construction method based on a voting mechanism according to claim 1, wherein the 3 classifiers in step 2 are multinomial naive Bayes (MNB), logistic regression (LR), and a multilayer perceptron (MLP) neural network.
CN202011074609.5A, filed 2020-10-09, priority date 2020-10-09: Security vulnerability report data set construction method based on voting mechanism (Pending, published as CN112231706A)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011074609.5A CN112231706A (en) 2020-10-09 2020-10-09 Security vulnerability report data set construction method based on voting mechanism


Publications (1)

Publication Number Publication Date
CN112231706A (en) 2021-01-15

Family

ID=74120177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011074609.5A Pending CN112231706A (en) 2020-10-09 2020-10-09 Security vulnerability report data set construction method based on voting mechanism

Country Status (1)

Country Link
CN (1) CN112231706A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019019860A1 (en) * 2017-07-24 2019-01-31 华为技术有限公司 Method and apparatus for training classification model
CN110532542A (en) * 2019-07-15 2019-12-03 西安交通大学 It is a kind of that recognition methods and system are write out falsely with the invoice for not marking study based on positive example

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WU XIAOXUE et al.: "CVE-assisted large-scale security bug report dataset construction method", Elsevier, 29 February 2020 (2020-02-29), pages 3-4 *
LI XIA; WANG LIANXI; JIANG SHENGYI: "Ensemble feature selection for imbalanced problems", Journal of Shandong University (Engineering Science), no. 03, 16 June 2011 (2011-06-16) *
ZHENG WEI; CHEN JUNZHENG; WU XIAOXUE; CHEN XIANG; XIA XIN: "An empirical study of deep-learning-based security bug report prediction methods", Journal of Software, no. 05, 15 May 2020 (2020-05-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343051A (en) * 2021-06-04 2021-09-03 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method
CN113343051B (en) * 2021-06-04 2024-04-16 全球能源互联网研究院有限公司 Abnormal SQL detection model construction method and detection method

Similar Documents

Publication Publication Date Title
CN109034368B (en) DNN-based complex equipment multiple fault diagnosis method
CN110532542B (en) Invoice false invoice identification method and system based on positive case and unmarked learning
CN109657947B (en) Enterprise industry classification-oriented anomaly detection method
CN111882446B (en) Abnormal account detection method based on graph convolution network
CN110866819A (en) Automatic credit scoring card generation method based on meta-learning
US7292960B1 (en) Method for characterization, detection and prediction for target events
CN110232395B (en) Power system fault diagnosis method based on fault Chinese text
CN107861951A (en) Session subject identifying method in intelligent customer service
CN113742733B (en) Method and device for extracting trigger words of reading and understanding vulnerability event and identifying vulnerability type
CN113420145A (en) Bidding text classification method and system based on semi-supervised learning
CN111177010B (en) Software defect severity identification method
CN114297393A (en) Software defect report classification method integrating multivariate text information and report intention
CN111832306A (en) Image diagnosis report named entity identification method based on multi-feature fusion
CN112231706A (en) Security vulnerability report data set construction method based on voting mechanism
Rofik et al. The Optimization of Credit Scoring Model Using Stacking Ensemble Learning and Oversampling Techniques
CN115098681A (en) Open service intention detection method based on supervised contrast learning
CN113505221B (en) Enterprise false propaganda risk identification method, equipment and storage medium
Navarro-Cerdan et al. Batch-adaptive rejection threshold estimation with application to OCR post-processing
CN113704073A (en) Method for detecting abnormal data of automobile maintenance record library
Wang et al. Shapelet classification algorithm based on efficient subsequence matching
Florbäck Anomaly detection in logged sensor data
AlSaif Large scale data mining for banking credit risk prediction
Turkoglu et al. Application of data mining in failure estimation of cold forging machines: An industrial research
CN113177831B (en) Financial early warning system constructed by application of public data and early warning method
CN116932487B (en) Quantized data analysis method and system based on data paragraph division

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination