CN114422269A

CN114422269A - Network security assessment method and system based on machine learning

Info

Publication number: CN114422269A
Application number: CN202210308554.2A
Authority: CN
Inventors: 胡维; 梁露露; 韩冰; 罗广超; 李季; 赵远杰; 陈幼雷; 陈晓峰; 李可
Original assignee: Beijing Yuanbao Technology Co ltd
Current assignee: Beijing Yuanbao Technology Co ltd
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-04-29

Abstract

The invention discloses a network security assessment method and system based on machine learning. An XGBoost model is trained on multi-dimensional network security parameter historical data and the corresponding security score labels, and the trained XGBoost model can be used directly for network security assessment without relying on experts. This solves the technical problems of the prior art, in which network security is assessed by expert evaluation: dependence on expert experience, high time cost, low efficiency, low reliability, and difficulty in meeting network security assessment requirements in a big data environment. The security score labels are produced through evaluation by a plurality of experts and serve as the data labels for training the XGBoost model. The model trained on them therefore simulates multi-expert scoring, which eliminates the one-sidedness and limitations of scoring by a single expert and improves the reliability of the labels.

Description

Network security assessment method and system based on machine learning

Technical Field

The invention relates to the technical field of network security, in particular to a network security assessment method and system based on machine learning.

Background

With the development of information technology, the openness, sharing and interconnection of the internet continue to expand, and network security problems are becoming increasingly severe.

Network security risk assessment is an important measure for protecting enterprise information security, and expert scoring is one of the existing network security risk assessment methods. However, the accuracy of expert scoring depends mainly on the experts' experience and on the breadth and depth of their knowledge, so the participating experts must have a high academic level and rich practical experience with the system being assessed. The method is therefore highly subjective, has high time cost, low efficiency and low reliability, and can hardly meet network security assessment requirements in a big data environment.

Disclosure of Invention

The invention provides a network security assessment method and system based on machine learning, to solve the technical problems of the prior art, in which network security assessment by expert evaluation depends on expert experience, is highly subjective, has high time cost, low efficiency and low reliability, and can hardly meet network security assessment requirements in a big data environment.

In view of this, the first aspect of the present invention provides a network security assessment method based on machine learning, including:

acquiring multi-dimensional network security parameter historical data and security score labels corresponding to each group of multi-dimensional network security parameter historical data;

training an XGBoost model by using the multi-dimensional network security parameter historical data and the corresponding security score labels to obtain a trained XGBoost model;

and inputting the network security parameter data to be analyzed as variables into the trained XGBoost model to perform network security risk scoring and obtain a network security scoring result.

Optionally, the multidimensional network security parameters include network security parameters, port security parameters, DNS security parameters, mail security parameters, patch vulnerability parameters, application security parameters, IP reputation parameters, asset exposure parameters, and data security parameters.

Optionally, the security score label corresponding to each group of multi-dimensional network security parameter historical data is obtained by experts evaluating the network security parameter historical data according to the degree of influence of the network security parameters.

Optionally, the obtaining of the multi-dimensional network security parameter historical data and the security score label corresponding to each group of multi-dimensional network security parameter historical data includes:

acquiring multi-dimensional network security parameter historical data;

carrying out data cleaning on the multi-dimensional network security parameter historical data;

extracting feature values from the cleaned multi-dimensional network security parameter historical data to form a feature vector composed of the multi-dimensional network security parameters;

and obtaining the security score label produced by experts evaluating the feature vector according to the degree of influence of the network security parameters.

Optionally, the method further comprises:

evaluating the trained XGBoost model using the F1-score as the evaluation index.

A second aspect of the invention provides a network security assessment system based on machine learning, comprising:

the data acquisition module is used for acquiring multi-dimensional network security parameter historical data and security score labels corresponding to each group of multi-dimensional network security parameter historical data;

a model training module, used for training an XGBoost model by using the multi-dimensional network security parameter historical data and the corresponding security score labels to obtain a trained XGBoost model;

and a network security scoring module, used for inputting the network security parameter data to be analyzed as variables into the trained XGBoost model to perform network security risk scoring and obtain a network security scoring result.

Optionally, the data obtaining module is specifically configured to:

acquiring multi-dimensional network security parameter historical data;

Optionally, the system further comprises:

a model evaluation module, used for evaluating the trained XGBoost model using the F1-score as the evaluation index.

According to the technical scheme, the network security evaluation method and system based on machine learning provided by the invention have the following advantages:

In the network security assessment method and system based on machine learning, the XGBoost model is trained on multi-dimensional network security parameter historical data and the corresponding security score labels, and the trained XGBoost model can be used directly for network security assessment without relying on experts. This solves the technical problems of the prior art, in which network security assessment by expert evaluation depends on expert experience, is highly subjective, has high time cost, low efficiency and low reliability, and can hardly meet network security assessment requirements in a big data environment.

Meanwhile, in the network security assessment method and system based on machine learning, the security score labels are produced through evaluation by a plurality of experts and serve as the data labels for training the XGBoost model. The model trained on them therefore simulates multi-expert scoring, which eliminates the one-sidedness and limitations of scoring by a single expert and improves the reliability of the labels.

Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other related drawings from them without creative effort.

Fig. 1 is a schematic flow chart of a network security assessment method based on machine learning according to the present invention;

Fig. 2 is a schematic framework diagram of a network security assessment method based on machine learning according to the present invention;

Fig. 3 is a schematic structural diagram of a network security assessment system based on machine learning according to the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

For ease of understanding, referring to Fig. 1 and Fig. 2, the present invention provides an embodiment of a network security assessment method based on machine learning, including:

Step 101: obtaining multi-dimensional network security parameter historical data and the security score label corresponding to each group of multi-dimensional network security parameter historical data.

It should be noted that the parameters affecting network security span multiple dimensions. To cover them comprehensively, the invention uses network security parameter data of 9 dimensions: network security parameters, port security parameters, DNS security parameters, mail security parameters, patch vulnerability parameters, application security parameters, IP reputation parameters, asset exposure parameters and data security parameters. For example, when the evaluation dimension is the network security dimension, the corresponding index factors may include: a revoked digital certificate is detected, the SSL/TLS protocol uses insecure cipher suites, and the like. When the evaluation dimension is the port security dimension, the corresponding index factors may include: an Elasticsearch service is detected, a Redis service is detected, and the like. When the evaluation dimension is the DNS security dimension, the corresponding index factors may include: an open DNS recursive resolution service is detected, a DNS zone transfer vulnerability is detected, and the like. When the evaluation dimension is the mail security dimension, the corresponding index factors may include: the SMTP service fails reverse DNS resolution, the SMTP service does not enable TLS, and the like. When the evaluation dimension is the patch vulnerability dimension, the corresponding index factors may include: SQL injection vulnerabilities, XSS vulnerabilities, and the like. When the evaluation dimension is the application security dimension, the corresponding index factors may include: the website does not enforce HTTPS, the website does not set Content-Security-Policy, and the like. When the evaluation dimension is the data security dimension, the corresponding index factors may include: SVN or Git information leakage is detected, suspected sensitive file leakage, and the like. When the evaluation dimension is the asset exposure dimension, the corresponding index factors may include: code management back-end exposure, web application component back-end exposure, and the like. When the evaluation dimension is the IP reputation dimension, the corresponding index factors may include: P2P network activity is detected, malware events are detected, and the like.

The parameters of each dimension can be divided into three levels, high, medium and low, according to how severely they affect network security, and each level contains a number of feature indexes. The target enterprise network is scanned using a scanning technique, and the number of problems found for each feature index under each dimension, i.e. the feature value, is obtained. After the multi-dimensional network security parameter historical data is acquired, it is cleaned. Experts can then score the cleaned historical data, i.e. the feature vector formed by the feature values of each group of multi-dimensional network security parameter historical data, according to the degree of influence of the network security parameters. For example, each group of multi-dimensional network security parameter historical data includes 102 feature indexes, of which the network security dimension has 30: 10 high-risk indexes, 8 medium-risk indexes and 12 low-risk indexes. According to the scanning result, with the superscript indexing the dimension (1 for network security, 2 for port security, and so on) and the subscripts $h$, $m$ and $l$ indicating the high, medium and low risk levels:

$x^{(1)}_{h,1}$ denotes the number of problems corresponding to the first high-risk index, $x^{(1)}_{h,2}$ the number corresponding to the second high-risk index, …, and $x^{(1)}_{h,10}$ the number corresponding to the tenth high-risk index;

$x^{(1)}_{m,1}$ denotes the number of problems corresponding to the first medium-risk index, …, and $x^{(1)}_{m,8}$ the number corresponding to the eighth medium-risk index;

$x^{(1)}_{l,1}$ denotes the number of problems corresponding to the first low-risk index, …, and $x^{(1)}_{l,12}$ the number corresponding to the twelfth low-risk index;

$x^{(2)}_{h,1}$ denotes the number of problems corresponding to the first high-risk index in the port security dimension, and so on for the other dimensions. The corresponding feature vector can finally be generated:

$$X = \left(x^{(1)}_{h,1}, \dots, x^{(1)}_{h,10},\; x^{(1)}_{m,1}, \dots, x^{(1)}_{m,8},\; x^{(1)}_{l,1}, \dots, x^{(1)}_{l,12},\; x^{(2)}_{h,1}, \dots\right)$$

together with the corresponding security score label $z$.
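The construction of one feature vector from scan results can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LAYOUT` table, the function name `build_feature_vector`, the port security counts, the rule that an unscanned index counts as 0 problems, and the example label value are all assumptions made for the sketch. Only the 10 / 8 / 12 split of the network security dimension comes from the text.

```python
# Hypothetical index layout: dimension -> number of indexes per risk level.
# Only the network security counts (10 / 8 / 12) come from the text;
# the port security counts are placeholders.
LAYOUT = {
    "network_security": {"high": 10, "medium": 8, "low": 12},
    "port_security": {"high": 2, "medium": 2, "low": 2},
}
LEVELS = ("high", "medium", "low")


def build_feature_vector(scan_counts, layout=LAYOUT):
    """Flatten per-index problem counts into one vector in a fixed order."""
    vector = []
    for dimension, sizes in layout.items():
        found = scan_counts.get(dimension, {})
        for level in LEVELS:
            counts = list(found.get(level, []))
            if len(counts) > sizes[level]:
                raise ValueError(f"too many {level} counts for {dimension}")
            # Assumed cleaning rule: an index with no scan result counts as 0 problems.
            counts += [0] * (sizes[level] - len(counts))
            for count in counts:
                if not isinstance(count, int) or count < 0:
                    raise ValueError(f"invalid problem count: {count!r}")
            vector.extend(counts)
    return vector


scan = {"network_security": {"high": [3, 0, 1]}, "port_security": {"low": [0, 5]}}
features = build_feature_vector(scan)
label_z = 72.5  # hypothetical expert score for this group of data
```

Keeping the index order fixed matters because the model treats each position of the vector as one feature; the same order has to be used at training time and at scoring time.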

Step 102: training the XGBoost model by using the multi-dimensional network security parameter historical data and the corresponding security score labels to obtain the trained XGBoost model.

It should be noted that the XGBoost model is trained using the multi-dimensional network security parameter historical data and the corresponding security score labels. The learning rate, the number of iteration rounds, the maximum depth of each tree (regression tree), the feature sampling per tree (one feature split point per tree), the sample sampling and the regularization coefficients are defined. Each iteration produces one regression tree and depends on the parameters of the previous tree, that is, the parameters of the current regression tree are those of the previous tree plus the newly trained residual. The squared loss function is

$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$

where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. The XGBoost objective function is:

$$Obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{t} \Omega(f_k)$$

where $n$ is the number of samples, $l(y_i, \hat{y}_i)$ is the loss of the $i$-th sample, and $\sum_{k=1}^{t} \Omega(f_k)$ is the regularization term, i.e. the sum of the complexities of all $t$ trees.
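As a small numeric sketch of this objective, the code below sums the squared loss over the samples and adds a complexity penalty per tree. The text does not spell out the complexity term, so the form used here, gamma times the number of leaves plus half of lambda times the sum of squared leaf weights, is an assumption taken from the standard XGBoost formulation. Each tree is represented only by its list of leaf weights, and the function names are illustrative.

```python
def squared_loss(y_true, y_pred):
    """Squared loss of one sample: (y - y_hat)^2."""
    return (y_true - y_pred) ** 2


def tree_complexity(leaf_weights, gamma=0.0, lam=1.0):
    """Assumed complexity term: gamma * T + 0.5 * lam * sum(w_j^2)."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(y_true, y_pred, trees, gamma=0.0, lam=1.0):
    """Total loss over the n samples plus the complexity of all t trees."""
    loss = sum(squared_loss(y, p) for y, p in zip(y_true, y_pred))
    regularization = sum(tree_complexity(t, gamma, lam) for t in trees)
    return loss + regularization


# Two samples and two trees, each tree given by its leaf weights.
value = objective([80, 60], [78, 63], trees=[[1.0, -1.0], [2.0]], gamma=0.5, lam=1.0)
```

In this example the loss part is 4 + 9 = 13 and the two trees contribute 2 and 2.5, so `value` is 17.5.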

Starting from a tree of depth 1, for each tree and for each current node beginning at the root node, all features are enumerated. The samples belonging to the current node are sorted by feature value (i.e. by all possible values of the feature; if the number of problems can be 0, 1 or 2, the samples are sorted in the order 0, 1, 2), and the optimal split point of the feature is determined by information gain. The split point is selected by greedily traversing each sorted feature. With the samples to the left of a candidate split point denoted $I_L$ and those to the right denoted $I_R$, the following gain is calculated:

$$L_{split} = \frac{1}{2}\left[\frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda}\right] - \gamma$$

where $L_{split}$ denotes the loss corresponding to the current split point, $I = I_L \cup I_R$ is the set of samples at the current node, the subscripts $L$ and $R$ denote the left and right subtrees, $g_i$ and $h_i$ are the first and second derivatives in the Taylor expansion of the loss function around the prediction of the previous trees, and $\lambda$ and $\gamma$ are regularization parameters that represent the complexity of the model.
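The greedy scan over one sorted feature can be sketched as below. This is an illustration under stated assumptions rather than the library's implementation: the derivatives are those of the squared loss (first derivative 2(ŷ − y), second derivative 2), candidate thresholds are placed midway between adjacent distinct values, and the function names are made up for the sketch.

```python
def gradients(y_true, y_pred):
    """First and second derivatives of the squared loss for each sample."""
    g = [2.0 * (p - y) for y, p in zip(y_true, y_pred)]
    h = [2.0] * len(y_true)
    return g, h


def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Gain of splitting a node into a left part and a right part."""
    def score(g_sum, h_sum):
        return g_sum * g_sum / (h_sum + lam)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def best_split(values, g, h, lam=1.0, gamma=0.0):
    """Scan the samples in sorted order; return (best_gain, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(g), sum(h)
    g_left = h_left = 0.0
    best = (0.0, None)  # threshold None means no split has positive gain
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += g[i]
        h_left += h[i]
        if values[i] == values[nxt]:
            continue  # cannot split between two equal feature values
        gain = split_gain(g_left, h_left, g_total - g_left, h_total - h_left, lam, gamma)
        if gain > best[0]:
            best = (gain, (values[i] + values[nxt]) / 2.0)
    return best


# One feature (a problem count) for four samples, with the current prediction at 70.
counts = [0, 0, 2, 2]
g, h = gradients([90, 90, 50, 50], [70, 70, 70, 70])
gain, threshold = best_split(counts, g, h)
```

Here the only useful split lies between the counts 0 and 2, so the scan returns a threshold of 1.0 with a gain of 1280.0.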

The feature with the largest gain is selected as the splitting feature, and the node is split at the optimal split point of that feature. The tree with the largest gain is selected as the model tree. The XGBoost model training process obtains a plurality of model trees through these iterations (the root node of each tree is the split point corresponding to the current feature, and each non-leaf node is also a split point). The learning process can be formalized as:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

where $t$ is the current training round, $\hat{y}_i^{(t-1)}$ is the function obtained in the previous round of training, $f_t$ is the new function to be trained, and initially $\hat{y}_i^{(0)} = 0$. The final learned model is the sum of the functions corresponding to each tree, i.e.

$$\hat{y}_i = \sum_{k=1}^{t} f_k(x_i)$$

When prediction is carried out, the features are input and the computed value $\hat{y}_i$ is the predicted score.
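The additive training loop and the final prediction can be sketched in plain Python as follows. This is a toy illustration under several assumptions, not the XGBoost library: every tree is a depth-1 stump, the loss is the squared loss, each leaf weight is −G/(H + λ), and each new tree is scaled by a learning rate `eta` before being added. The function names and the toy data are made up for the sketch.

```python
def fit_stump(X, g, h, lam=1.0):
    """Fit a depth-1 regression tree; returns a function mapping a row to a leaf weight."""
    rows = range(len(X))

    def stats(idx):
        g_sum = sum(g[i] for i in idx)
        h_sum = sum(h[i] for i in idx)
        return g_sum * g_sum / (h_sum + lam), -g_sum / (h_sum + lam)  # (score, weight)

    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [i for i in rows if X[i][j] <= thr]
            right = [i for i in rows if X[i][j] > thr]
            (s_left, w_left), (s_right, w_right) = stats(left), stats(right)
            if best is None or s_left + s_right > best[0]:
                best = (s_left + s_right, j, thr, w_left, w_right)
    if best is None:  # no feature has two distinct values: single leaf
        _, weight = stats(list(rows))
        return lambda row: weight
    _, j, thr, w_left, w_right = best
    return lambda row: w_left if row[j] <= thr else w_right


def train(X, y, rounds=200, eta=0.3, lam=1.0):
    """Start from a prediction of 0 and add one scaled tree per round."""
    trees = []
    pred = [0.0] * len(y)
    for _ in range(rounds):
        g = [2.0 * (p - t) for p, t in zip(pred, y)]  # squared-loss gradient
        h = [2.0] * len(y)                            # squared-loss Hessian
        f_t = fit_stump(X, g, h, lam)
        trees.append(f_t)
        pred = [p + eta * f_t(row) for p, row in zip(pred, X)]
    return trees


def predict(trees, row, eta=0.3):
    """The predicted score is the sum of the (scaled) outputs of all trees."""
    return sum(eta * f(row) for f in trees)


# Toy data: two problem-count features per sample, expert scores as labels.
X = [[0, 1], [0, 3], [4, 1], [4, 3]]
y = [90.0, 88.0, 40.0, 38.0]
trees = train(X, y)
score = predict(trees, [0, 1])
```

On this toy data the labels are an additive function of the two features, so the summed stumps approach the expert scores after enough rounds.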

Step 103: inputting the network security parameter data to be analyzed as variables into the trained XGBoost model to perform network security risk scoring and obtain a network security scoring result.

After the XGBoost model is trained, it can be used directly for network security assessment.

In the network security assessment method based on machine learning, the XGBoost model is trained on multi-dimensional network security parameter historical data and the corresponding security score labels, and the trained XGBoost model can be used directly for network security scoring. Machine learning training is introduced on the basis of expert scoring data to generate the model parameters adaptively, forming a network security assessment model, so that network security no longer needs to be assessed by experts.

In one embodiment, after the trained XGBoost model is obtained, it may also be evaluated using the F1-score as the evaluation index. The F1-score is expressed mathematically as:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

where $\mathrm{precision} = \frac{TP}{TP + FP}$ is the precision of the prediction and $\mathrm{recall} = \frac{TP}{TP + FN}$ is the recall.

Here TP is the number of samples of a class that are predicted correctly, FP is the number of samples of other classes that are wrongly predicted as this class, and FN is the number of samples of this class that are predicted as other classes.
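The F1-score for one class can be computed directly from the TP, FP and FN counts, as in the sketch below. Because the model outputs a numeric score while TP, FP and FN are defined per class, the sketch assumes the scores are first bucketed into grades. The `to_grade` helper and its cutoffs of 60 and 80 are hypothetical and do not come from the source.

```python
def to_grade(score, cutoffs=(60, 80)):
    """Hypothetical bucketing of a numeric score into a grade label."""
    low_cut, high_cut = cutoffs
    if score >= high_cut:
        return "high"
    if score >= low_cut:
        return "medium"
    return "low"


def f1_for_class(y_true, y_pred, positive):
    """F1-score of one class from its TP, FP and FN counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


expert_scores = [85, 90, 40, 70]
model_scores = [82, 55, 30, 88]
y_true = [to_grade(s) for s in expert_scores]  # high, high, low, medium
y_pred = [to_grade(s) for s in model_scores]   # high, low, low, high
f1_high = f1_for_class(y_true, y_pred, "high")
```

For the class "high" in this example, TP, FP and FN are each 1, so precision and recall are both 0.5 and `f1_high` is 0.5.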

The quality of the model can be evaluated by calculating the F1-score. If the F1-score does not meet the requirement, the model parameters are adjusted and the model is retrained. In this way, the evaluation accuracy of the XGBoost model can be ensured.
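The adjust-and-retrain loop can be sketched as a search over a few candidate parameter sets. The dictionary keys below are parameter names from the XGBoost library (`eta`, `max_depth`, `colsample_bytree`, `subsample`, `lambda`, `gamma`, and the `reg:squarederror` objective), chosen to mirror the learning rate, tree depth, feature sampling, sample sampling and regularization coefficients mentioned in the description. The concrete values, the candidate grid, the 0.9 threshold and the `train_and_score` callback are assumptions made for illustration only.

```python
from itertools import product

# Assumed starting values; none of these numbers come from the source.
BASE_PARAMS = {
    "objective": "reg:squarederror",  # squared loss
    "eta": 0.1,                       # learning rate
    "max_depth": 4,                   # maximum tree depth
    "colsample_bytree": 0.8,          # feature sampling per tree
    "subsample": 0.8,                 # sample sampling
    "lambda": 1.0,                    # L2 regularization coefficient
    "gamma": 0.0,                     # minimum gain required to split
}


def candidate_params(etas=(0.1, 0.3), depths=(3, 6)):
    """Yield parameter sets that vary the learning rate and the tree depth."""
    for eta, depth in product(etas, depths):
        yield {**BASE_PARAMS, "eta": eta, "max_depth": depth}


def tune(train_and_score, threshold=0.9):
    """Retrain with each candidate until the F1-score reaches the threshold.

    train_and_score is a caller-supplied function (hypothetical here) that
    trains a model with the given parameters and returns its F1-score.
    Returns (params, f1) on success, or (None, best_f1) if no candidate passes.
    """
    best_f1 = 0.0
    for params in candidate_params():
        f1 = train_and_score(params)
        if f1 >= threshold:
            return params, f1
        best_f1 = max(best_f1, f1)
    return None, best_f1


# Stand-in scorer so the sketch runs without a real training job.
def fake_scorer(params):
    return 0.95 if params["max_depth"] == 6 else 0.5


chosen, chosen_f1 = tune(fake_scorer)
```

With a real model, `train_and_score` would train on the labelled historical data and evaluate on held-out data; the number of iteration rounds would be passed to the training call separately from this dictionary.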

For ease of understanding, referring to Fig. 3, an embodiment of a network security assessment system based on machine learning according to the present invention includes:

The multidimensional network security parameters include network security parameters, port security parameters, DNS security parameters, mail security parameters, patch vulnerability parameters, application security parameters, IP reputation parameters, asset exposure parameters, and data security parameters.

The security score label corresponding to each group of multi-dimensional network security parameter historical data is obtained by experts evaluating the network security parameter historical data according to the degree of influence of the network security parameters.

The data acquisition module is specifically configured to:

acquiring multi-dimensional network security parameter historical data;

Further comprising:

In the network security assessment system based on machine learning, the XGBoost model is trained on multi-dimensional network security parameter historical data and the corresponding security score labels, and the trained XGBoost model can be used directly for network security risk scoring. Machine learning training is introduced on the basis of expert scoring data to generate the model parameters adaptively, forming a network security assessment model, so that network security risk no longer needs to be assessed by experts. This solves the technical problems of the prior art, in which network security risk assessment by expert scoring depends on expert experience, is highly subjective, has high time cost, low efficiency and low reliability, and can hardly meet network security assessment requirements in a big data environment.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A network security assessment method based on machine learning is characterized by comprising the following steps:

2. The machine-learning-based network security assessment method according to claim 1, wherein the multidimensional network security parameters comprise network security parameters, port security parameters, DNS security parameters, mail security parameters, patch vulnerability parameters, application security parameters, IP reputation parameters, asset exposure parameters and data security parameters.

3. The machine learning-based network security assessment method according to claim 1, wherein the security score labels corresponding to each set of multi-dimensional network security parameter historical data are obtained by an expert evaluating the network security parameter historical data according to the influence degree of the network security parameters.

4. The machine learning-based network security assessment method according to claim 3, wherein obtaining multi-dimensional network security parameter historical data and security score labels corresponding to each set of multi-dimensional network security parameter historical data comprises:

acquiring multi-dimensional network security parameter historical data;

5. The machine learning-based network security assessment method according to claim 1, further comprising:

evaluating the trained XGBoost model using the F1-score as the evaluation index.

6. A machine learning-based network security assessment system, comprising:

7. The machine-learning based network security assessment system according to claim 6, wherein the multidimensional network security parameters comprise network security parameters, port security parameters, DNS security parameters, mail security parameters, patch vulnerability parameters, application security parameters, IP reputation parameters, asset exposure parameters and data security parameters.

8. The machine learning-based network security assessment system according to claim 6, wherein the security score labels corresponding to each set of multi-dimensional network security parameter historical data are obtained by an expert evaluating the network security parameter historical data according to the influence degree of the network security parameters.

9. The machine-learning-based network security assessment system of claim 8, wherein the data acquisition module is specifically configured to:

acquiring multi-dimensional network security parameter historical data;

10. The machine-learning-based network security assessment system according to claim 6, further comprising: