CN110177112B - Network intrusion detection method based on double subspace sampling and confidence offset - Google Patents

Network intrusion detection method based on double subspace sampling and confidence offset

Info

Publication number
CN110177112B
CN110177112B CN201910490598.XA CN110177112A
Authority
CN
China
Prior art keywords
layer
sample
confidence
model
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910490598.XA
Other languages
Chinese (zh)
Other versions
CN110177112A (en)
Inventor
王喆
陈立龙
曹晨杰
李冬冬
杜文莉
杨海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology filed Critical East China University of Science and Technology
Priority to CN201910490598.XA priority Critical patent/CN110177112B/en
Publication of CN110177112A publication Critical patent/CN110177112A/en
Application granted granted Critical
Publication of CN110177112B publication Critical patent/CN110177112B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 - Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 - Network analysis or design
    • H04L41/145 - Network analysis or design involving simulating, designing, planning or modelling of a network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 - Network architectures or network communication protocols for network security
    • H04L63/14 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 - Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 - Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention provides a network intrusion detection method based on double subspace sampling and confidence offset. First, the base classifiers of each layer are preprocessed with down-sampling at both the sample level and the feature level; second, the confidence output of each layer is mixed with the original features by interpolation and fed into the next layer of the model as new features; the cascade model then perturbs the interpolated confidence layer by layer. In the testing step, the confidence perturbation is not applied. Compared with traditional ensemble methods for imbalanced classification, the method extends the deep forest to the imbalance problem and further addresses the threshold problem of imbalanced classification through the cascade structure; the system trains the samples with a model whose perturbation amplitude is selectable, which effectively improves the model's detection performance on imbalanced network intrusion; meanwhile, the layer-by-layer stacked ensemble model achieves better generalization performance during detection.

Description

Network intrusion detection method based on double subspace sampling and confidence offset
Technical Field
The invention relates to a detection and identification method for imbalanced network intrusion, and belongs to the field of network information security.
Background
With the rapid development of network technology and the gradual expansion of the Internet, network security problems have gradually come into public view, and research on network intrusion identification methods has become a popular field in recent years. The basic network attack types include Denial of Service (DoS), unauthorized remote host access (Remote-to-Local, R2L), unauthorized superuser access (User-to-Root, U2R), and probing/surveillance (Probing), and each of these attack types can further derive numerous sub-attack methods. It is therefore imperative to construct targeted detection schemes for these network attacks.
Existing common network attack detection methods include: 1) rule-based detection, which depends heavily on an existing rule database, cannot be updated in time for new attack techniques, and can therefore easily lead to heavy losses; 2) detection based on the statistical distribution of network traffic features, which relies too much on randomness and can be skillfully evaded by some attack techniques; 3) machine-learning-based intrusion detection, for example using support vector machines, random forests, or neural networks. Machine-learning-based methods can respond effectively and in a timely way to unknown network attacks. However, limited by different physical conditions and environmental constraints, the numbers of network intrusions of different types tend to be highly imbalanced, and traditional machine learning methods have difficulty handling this class-imbalanced type of network intrusion.
The imbalanced network intrusion detection problem can be effectively addressed by combining ensemble learning with data sampling. According to the integration strategy, these sampling-based ensemble methods can be further divided into bagging, boosting and hybrid strategies, and each of these areas already has a number of representative algorithms. On the other hand, random feature subspace algorithms have been proposed to avoid underestimating implicit important features and to filter out possible noisy features. Combining such algorithms with a bagging strategy and base classifiers such as the SVM has produced representative algorithms such as ABRS-SVM.
Professor Zhou Zhihua proposed the deep forest ensemble algorithm in 2017; it can compete with deep learning in classification performance while using fewer hyper-parameters and a lighter model, and it can also achieve good classification results on small data sets. The cascade forest follows a model-stacking design idea, which can effectively improve the generalization performance of the algorithm.
However, the above ensemble learning methods do not solve the threshold problem of imbalanced classification well, so their classification performance cannot reach an ideal level on data sets with high imbalance ratios such as network intrusion data. Moreover, the cascade forest, as a new ensemble model, has no optimization strategy for the imbalance problem. Therefore, a new cascade ensemble model is needed to extend the cascade forest to the imbalance problem and to effectively solve the threshold problem in existing imbalanced network intrusion detection.
Disclosure of Invention
Aiming at the problems that existing ensemble algorithms cannot effectively handle imbalanced network intrusion detection, that the ensemble scale cannot be well determined, and that modeling can only be done by experience, the invention generalizes the characteristics of the cascade forest and provides a network intrusion detection method based on double subspace sampling and confidence offset. The ensemble model effectively uses the model-stacking structure of the cascade forest to adjust the classification threshold of the imbalance problem layer by layer; a double down-sampling data preprocessing step is introduced so that the model can effectively handle imbalanced network intrusion detection; and a validation mechanism keeps the scale of the model well under control.
The technical solution adopted by the invention to solve the technical problem is as follows: first, the collected samples are converted, according to the specific problem description, into a vector representation that the system can process, and discrete features are one-hot encoded; second, the ensemble model optimizes the classification performance on the imbalance problem through a double down-sampling strategy at the sample level and the feature level; according to the output confidence of the previous layer's base classifiers, feature perturbation is applied to that confidence, which is then mixed with the original features as the input of the next layer's model; and a validation mechanism is added to the cascade model so that the number of layers can stop growing adaptively. In the testing process, the data are fed into the previously trained cascade model, no confidence offset is applied, and the output of the last layer is taken as the final result.
The technical solution adopted by the invention can be further refined. In the first stage of the training step, the base classifiers trained in each layer are the classical random forest and naive Bayes algorithms. More base classifiers could be added; considering the interpretability of the problem and the implementation difficulty of the method, only these two types are selected as base classifiers in the experiments. Meanwhile, in the testing and validation process, the average accuracy over the majority and minority classes is used as the evaluation index to objectively express the performance of the algorithm.
The beneficial effects of the invention are: (1) a cascade ensemble model is designed to extend the cascade forest to the imbalanced field; (2) the perturbation amplitude is controlled by adjusting the hyper-parameter η, so that the model can effectively solve the classification problem in imbalanced classification.
Drawings
FIG. 1 is an overall flow chart of the present invention.
Detailed Description
The invention will be further described with reference to the following figures and examples. The system designed by the invention is divided into the following modules.
Part 1: Data acquisition
In the data acquisition process, the real sample data are transformed to generate a data set represented by vectors, which facilitates the processing of the subsequent modules. In this step, the collected samples are divided into training samples and test samples, and the training samples are processed first. Each training sample is converted into a vector
x_i^c \in \mathbb{R}^d,
where i indicates that the sample is the i-th of all training samples and c indicates that the sample belongs to the c-th class. Each element of the vector corresponds to one attribute of the sample, and the dimension d of the vector is the number of attributes of the sample. To facilitate subsequent computation, all training samples are stacked into a training matrix X_0, in which each row is one sample; the subscript 0 denotes that X_0 is the initial input.
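For illustration only, a minimal Python sketch of this preprocessing step, assuming pandas and scikit-learn are available; the column names (protocol_type, service, flag, label) are hypothetical placeholders for the discrete features and the class label, not names fixed by the patent text:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def build_feature_matrix(df: pd.DataFrame, discrete_cols, label_col):
    """Convert raw records into a numeric matrix X0 and a label vector y.

    Discrete columns are one-hot encoded; continuous columns are kept as-is.
    """
    X = pd.get_dummies(df.drop(columns=[label_col]), columns=discrete_cols, dtype=float)
    y = df[label_col].to_numpy()
    return X.to_numpy(dtype=float), y

# Hypothetical usage: split the collected samples into equal training and test halves.
# df = pd.read_csv("kddcup_subset.csv")
# X0, y = build_feature_matrix(df, ["protocol_type", "service", "flag"], "label")
# X_train, X_test, y_train, y_test = train_test_split(
#     X0, y, test_size=0.5, stratify=y, random_state=0)
```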
Part 2: Training the classification model
In this module, the training sample matrix X_0 generated by the previous module is fed into the core algorithm of the invention for training. The method mainly comprises the following steps:
1) The base classifiers used in the ensemble model are random forest and naive Bayes. In the random forest, CART trees serve as sub-classifiers; every time a leaf node of a CART tree is split, k features are randomly selected from the d features to take part in the Gini-index evaluation, where k is generally taken as
k = \lfloor \sqrt{d} \rfloor.
The Gini index is computed as
\mathrm{Gini}(D) = 1 - \sum_{y} p_y^2,
\mathrm{Gini\_index}(D, f_i^k) = \sum_{v} \frac{|D^v|}{|D|} \, \mathrm{Gini}(D^v),
where f_i^k denotes the i-th feature of the k-dimensional feature subspace F_k, v denotes a value taken by feature f_i^k, D^v is the subset of samples for which f_i^k takes the value v, and p_y denotes the proportion of class-y samples. The lower the Gini index, the better the splitting quality of the feature. Naive Bayes can be viewed as the simplest Bayesian network classifier; it relies on a conditional independence assumption, and its decision function is
h_{nb}(x) = \arg\max_{y} P(y) \prod_{i=1}^{d} P(x_i \mid y),
where P(y) is the prior probability of class y and P(x_i \mid y) is the conditional probability of feature i given class y. Neither random forest nor naive Bayes can reasonably handle the imbalance problem, because both are optimized on a global basis.
2) Each layer's random forest or naive Bayes base classifiers are trained with a random down-sampling strategy at both the sample level and the feature level: suppose the training set X_F contains N samples in total, of which N_p belong to the minority class and N_n to the majority class. In the double random down-sampling strategy, a majority-class subset of size N'_n = N_p, equal to the minority class, is first drawn at random without replacement from the sample set, while all minority-class samples take part in training; then, for the feature space F, a different feature subspace F' (F' ⊆ F) is selected for training. This not only reduces the effect of the imbalanced sample ratio but also effectively filters out the negative influence of some undesirable features. The sampling is repeated S times at the sample level and E times at the feature level, where S and E are the numbers of sample-level and feature-level integrations respectively, δ is the feature sampling rate (|F'| = |F| × δ), and RUS denotes random under-sampling of the majority class to the size of the minority class.
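A minimal sketch of the double sample/feature subspace sampling described above, assuming binary labels with 0 for the majority class and 1 for the minority class, and reusing the hypothetical make_base_classifiers helper from the previous sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_undersample(X, y, majority=0, minority=1):
    """RUS: keep every minority sample, draw an equal-sized majority subset without replacement."""
    min_idx = np.flatnonzero(y == minority)
    maj_idx = np.flatnonzero(y == majority)
    maj_sub = rng.choice(maj_idx, size=min_idx.size, replace=False)
    idx = np.concatenate([min_idx, maj_sub])
    return X[idx], y[idx]

def dual_subspace_ensemble(X, y, S=5, E=5, delta=0.7):
    """Train S x E groups of base classifiers, each on a balanced sample subset and a random feature subspace."""
    d = X.shape[1]
    members = []
    for _ in range(S):                     # sample-level integrations
        Xs, ys = random_undersample(X, y)
        for _ in range(E):                 # feature-level integrations
            feats = rng.choice(d, size=max(1, int(d * delta)), replace=False)
            for clf in make_base_classifiers():
                clf.fit(Xs[:, feats], ys)
                members.append((clf, feats))
    return members
```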
3) According to the output confidence of the previous layer's base classifiers, feature perturbation is applied to that confidence, which is then mixed with the original features as the input of the next layer's model: the base classifiers used by the cascade interpolation ensemble model are random forest and naive Bayes, where the confidence of the random forest (RF) is computed as
V_{RF}(i, y') = \frac{1}{T} \sum_{t=1}^{T} p_t(y' \mid x_i),
which can be intuitively understood as the average, over the T trees, of the proportion of class-y' samples in the leaf nodes reached by sample x_i. The confidence of naive Bayes (NB) is computed as
V_{NB}(i, y') = P(y' \mid x_i),
the posterior probability of class y'. To prevent overfitting, the base classifiers inside each layer generate their confidences through 3-fold cross-validation. The resulting confidence vector V then undergoes the following confidence offset:
V′_l(i, y_majority) = V_l(i, y_majority) × η
V′_l(i, y_minority) = V_l(i, y_minority) / η,
where η is a hyper-parameter whose value generally lies in a neighborhood of 1; in the experiments its range is {0.85, 0.9, 0.95, 1, 1.05, 1.1, 1.15}. Clearly, the confidence offset has no effect when η = 1. From the above equations, the confidence of the majority class is multiplied by η and the confidence of the minority class is divided by η, so the bias between the majority and minority classes is dynamically adjusted layer by layer through the perturbation of the confidence. Finally, the perturbed feature V′_l is mixed with the original features and interpolated as the input of the next-layer model:
X_{l+1} = [X_0, V'_l] \in \mathbb{R}^{m \times (d + N_{class})},
where X_0 is the original feature matrix, l is the current layer number, m is the number of samples, d is the feature dimension, and the dimension of the interpolated confidence is N_class, i.e., the number of classes.
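A minimal sketch of the confidence offset and the feature augmentation, under two assumptions stated explicitly here: the "interpolation" is taken to be the concatenation form reconstructed above, and the confidences of a layer's base classifiers are averaged into one class-probability vector before concatenation:

```python
import numpy as np

def confidence_offset(V, eta, majority_col=0, minority_col=1):
    """Multiply the majority-class confidence by eta, divide the minority-class confidence by eta."""
    V_off = V.copy()
    V_off[:, majority_col] *= eta
    V_off[:, minority_col] /= eta
    return V_off

def augment_features(X0, confidences, eta):
    """Next-layer input: original features concatenated with the offset, averaged confidences."""
    offset = [confidence_offset(V, eta) for V in confidences]
    mean_conf = np.mean(offset, axis=0)    # average over the layer's base classifiers (assumption)
    return np.hstack([X0, mean_conf])      # shape: (m, d + n_classes)
```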
4) A validation mechanism is added to the cascade model so that the number of layers can stop growing adaptively, implemented as follows: the cascade interpolation model has at least 2 layers, and in the experiments the maximum number of layers does not exceed 5; after each layer finishes training, a validation pass is performed together with all preceding layers. Since the earlier training is done through cross-validation, this validation process is more convincing. The average accuracy (M-ACC) is used as the validation criterion:
\text{M-ACC} = \frac{TPR + TNR}{2},
where TPR is the accuracy on the minority class and TNR is the accuracy on the majority class. If the validated M-ACC drops, the number of layers stops growing.
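Putting the training steps together, a rough sketch of the cascade growth with the M-ACC stopping rule, reusing the hypothetical helpers from the previous sketches; scoring each layer on its out-of-fold training confidences is an assumed simplification of the validation pass:

```python
import numpy as np

def m_acc(y_true, y_pred, majority=0, minority=1):
    """Average of minority-class accuracy (TPR) and majority-class accuracy (TNR)."""
    tpr = np.mean(y_pred[y_true == minority] == minority)
    tnr = np.mean(y_pred[y_true == majority] == majority)
    return (tpr + tnr) / 2

def train_cascade(X0, y, eta, min_layers=2, max_layers=5):
    """Grow cascade layers while the validated M-ACC keeps improving."""
    X, layers, best = X0, [], -np.inf
    for l in range(max_layers):
        members = dual_subspace_ensemble(X, y)                           # training step 2
        confs = [cv_confidence(clf, X[:, f], y) for clf, f in members]   # 3-fold out-of-fold confidences
        layers.append(members)
        score = m_acc(y, np.argmax(np.mean(confs, axis=0), axis=1))      # validation (step 4)
        if l + 1 > min_layers and score <= best:
            layers.pop()                                                 # stop growing: M-ACC no longer improves
            break
        best = max(best, score)
        X = augment_features(X0, confs, eta)                             # offset + mix (step 3)
    return layers
```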
Part 3: Testing unknown data
This module first takes the other half of the samples, randomly split off in the first module, as the test samples and forms the test sample matrix; the training set and the test set are required to follow the same probability distribution. The model trained with the optimal hyper-parameter η and cascade depth l is then used in the testing process. It is important to note that no confidence offset is applied during testing: it is precisely the difference between the perturbed training set and the unperturbed test set that makes the model sensitive to different classification thresholds, so that the imbalanced classification problem can be handled better.
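A matching prediction sketch using the same hypothetical helpers; in line with the description above, no confidence offset is applied at test time, and the last layer's output gives the final decision:

```python
import numpy as np

def predict_cascade(layers, X0_test):
    """Run the trained cascade on test data; the confidences are used raw (no eta offset)."""
    X, conf = X0_test, None
    for members in layers:
        confs = [clf.predict_proba(X[:, feats]) for clf, feats in members]
        conf = np.mean(confs, axis=0)
        X = np.hstack([X0_test, conf])     # augment with the raw confidences
    return np.argmax(conf, axis=1)
```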
Design of experiments
1) Selection and introduction of the experimental data sets: KDD is short for Data Mining and Knowledge Discovery, and the KDD CUP is an annual competition organized by SIGKDD (Special Interest Group on Knowledge Discovery and Data Mining) of the ACM (Association for Computing Machinery). The KDD CUP 99 data set is a standard benchmark in the field of network intrusion detection and laid the foundation for network intrusion detection research based on computational intelligence. Different kinds of network attack data show an obvious imbalance in quantity, and this imbalance is a main factor affecting classification performance. The experiments select 5 imbalanced KDD Cup 99 data sets from the KEEL repository: 'land_vs_satan', 'guess_passwd_vs_satan', 'land_vs_portsweep', 'buffer_overflow_vs_back' and 'rootkit-imap_vs_back'. The discrete features in the data are all represented with one-hot encoding.
All data sets are evaluated with 5-fold cross-validation: each data set is shuffled and split into 5 equal parts; in each round 4 parts are used for training and 1 for testing, for 5 rounds in total, so that every sample is used once as test data.
2) Compared models: the system provided by the invention is named CILDC, and the variant built on random forests is denoted CILDC-RF. In addition, random forest (RF), the dual-subspace SVM (ABRS-SVM) and a cost-sensitive SVM (CS-SVM) are chosen for comparison.
3) Parameter selection: the perturbation coefficient η of CILDC ranges over {0.85, 0.9, 0.95, 1, 1.05, 1.1, 1.15}; the numbers of integrations of the two subspaces are both 5; the number of trees in the random forest is 50; the SVM uses an RBF kernel, with the slack coefficient C and kernel radius σ both taken from {0.01, 0.1, 1, 10, 100}; and the feature sampling rate is selected from {0.5, 0.7, 0.9}.
4) Performance measure: the experiments uniformly use the average accuracy of the majority and minority classes (M-ACC) as the evaluation criterion.
Results of the experiment
The M-ACC of every model was measured on each KEEL data set, together with each model's average M-ACC over all data sets.
From the results it can be found that the CILDC-RF of the invention obtains the best results on most data sets and outperforms the other comparison algorithms; the advantage is especially clear on the 'rootkit-imap_vs_back' data set. In addition, the variance of CILDC-RF is lower than that of the other algorithms, which shows that the classification effect of the algorithm on KDD network attack data is more stable.

Claims (4)

1. A network intrusion detection method based on double subspace sampling and confidence offset, characterized in that the method comprises the following specific steps:
1) preprocessing, first step: construct network attack features with a network data collection tool, and convert the collected sample set features into a data matrix suitable for subsequent processing;
2) preprocessing, second step: distinguish the continuous and discrete features in the original data, and one-hot encode all discrete features;
3) training, first step: train each layer's random forest or naive Bayes base classifiers with a random down-sampling strategy at both the sample level and the feature level, as follows: suppose the total number of training samples is N, of which N_p are minority-class samples and N_n are majority-class samples; for the i-th sample-level integration (S integrations are performed in total), the double random down-sampling strategy randomly selects, without replacement, a majority-class subset of size N'_n = N_p equal to the minority class, while all minority-class samples take part in training, yielding the integrated sample set X_i^F in the feature space F after the i-th sample sampling; E feature-sampling integrations are then performed: for the j-th feature-sampling integration, a different feature subspace F' is randomly selected from the feature space F, where F' ⊆ F and |F'| = |F| × δ, and a base classifier h_{i,j}(x) is trained on it; here S and E are the numbers of sample-level and feature-level integrations respectively, δ is the feature sampling rate, X_i^F is the sample set after the i-th sample sampling integration in the feature space F, h_{i,j}(x) is the base classifier obtained after the i-th sample sampling and the j-th feature sampling, and RUS denotes random under-sampling of the majority class to the size of the minority class;
4) training, second step: apply feature perturbation to the output confidence of the previous layer's base classifiers, and mix the result with the original features as the input of the next layer's model;
5) training, third step: add a validation mechanism to the cascade model so that the number of layers can stop growing adaptively;
6) testing: input the test data set into the obtained cascade model to finally obtain the detection and classification result of the network intrusion.
2. The network intrusion detection method based on double subspace sampling and confidence offset of claim 1, characterized in that: in the second training step, feature perturbation is applied according to the output confidence of the previous layer's base classifiers, and the result is mixed with the original features as the input of the next layer's model, as follows: the base classifiers used by the cascade interpolation ensemble model are random forest and naive Bayes, where the confidence of the random forest (RF) is computed as
V_{RF}(i, y') = \frac{1}{T} \sum_{t=1}^{T} p_t(y' \mid x_i),
which can be intuitively understood as the average, over the T trees, of the proportion of class-y' samples in the leaf node reached by sample i; the confidence of naive Bayes (NB) is computed as
V_{NB}(i, y') = P(y' \mid x_i),
the posterior probability of class y'; to prevent overfitting, the base classifiers inside each layer generate their confidences through 3-fold cross-validation, and the resulting confidence vector V undergoes the following confidence offset
V′_l(i, y_majority) = V_l(i, y_majority) × η
V′_l(i, y_minority) = V_l(i, y_minority) / η,
where l is the current layer number, V_l(i, y_majority) is the probability that sample i in layer l belongs to the majority class, V_l(i, y_minority) is the probability that sample i in layer l belongs to the minority class, and η is a hyper-parameter whose value generally lies in a neighborhood of 1; the confidence of the majority class is multiplied by η and the confidence of the minority class is divided by η, so the bias between the majority and minority classes is dynamically adjusted layer by layer through the perturbation of the confidence; finally, the perturbed feature V′_l is mixed with the original features and interpolated as the input of the next-layer model
X_{l+1} = [X_0, V'_l] \in \mathbb{R}^{m \times (d + N_{class})},
where X_0 is the original feature matrix, ℝ is the set of real numbers, l is the current layer number, m is the number of samples, d is the feature dimension, and the dimension of the interpolated confidence is N_class, i.e., the number of classes.
3. The network intrusion detection method based on double subspace sampling and confidence offset of claim 1, characterized in that: in the third training step, a validation mechanism is added to the cascade model so that the number of layers can stop growing adaptively, implemented as follows: the cascade interpolation model has at least 2 layers, and in the experiments the maximum number of layers does not exceed 5; after each layer finishes training, a validation pass is performed together with all preceding layers, and the average accuracy (M-ACC) is used as the validation criterion
\text{M-ACC} = \frac{TPR + TNR}{2},
where TPR is the accuracy on the minority class and TNR is the accuracy on the majority class; if the validated M-ACC drops, the number of layers stops growing.
4. The network intrusion detection method based on double subspace sampling and confidence offset of claim 1, characterized in that: in the testing stage, the test data set is input into the obtained cascade model, and no perturbation of the features is needed during the layer-by-layer interpolation; specifically, the training set and the test set must follow the same probability distribution, and the model trained with the optimal hyper-parameter η and cascade depth l is used in the testing process.
CN201910490598.XA 2019-06-05 2019-06-05 Network intrusion detection method based on double subspace sampling and confidence offset Active CN110177112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910490598.XA CN110177112B (en) 2019-06-05 2019-06-05 Network intrusion detection method based on double subspace sampling and confidence offset


Publications (2)

Publication Number Publication Date
CN110177112A CN110177112A (en) 2019-08-27
CN110177112B true CN110177112B (en) 2021-11-30

Family

ID=67697332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910490598.XA Active CN110177112B (en) 2019-06-05 2019-06-05 Network intrusion detection method based on double subspace sampling and confidence offset

Country Status (1)

Country Link
CN (1) CN110177112B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016597B (en) * 2020-08-12 2023-07-18 河海大学常州校区 Depth sampling method based on Bayesian unbalance measurement in machine learning
CN116226629B (en) * 2022-11-01 2024-03-22 内蒙古卫数数据科技有限公司 Multi-model feature selection method and system based on feature contribution
CN117240602B (en) * 2023-11-09 2024-01-19 北京中海通科技有限公司 Identity authentication platform safety protection method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108023876A (en) * 2017-11-20 2018-05-11 西安电子科技大学 Intrusion detection method and intruding detection system based on sustainability integrated study
CN108304884A (en) * 2018-02-23 2018-07-20 华东理工大学 A kind of cost-sensitive stacking integrated study frame of feature based inverse mapping
CN109347872A (en) * 2018-11-29 2019-02-15 电子科技大学 A kind of network inbreak detection method based on fuzziness and integrated study
CN109460872A (en) * 2018-11-14 2019-03-12 重庆邮电大学 One kind being lost unbalanced data prediction technique towards mobile communication subscriber
EP3336739B1 (en) * 2016-12-18 2020-02-26 Deutsche Telekom AG A method for classifying attack sources in cyber-attack sensor systems


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on classification algorithms for imbalanced anomalous data based on active learning; Wang Bo, Wang Huaibin; Netinfo Security (《信息网络安全》); 2017-10-30; p. 46 *

Also Published As

Publication number Publication date
CN110177112A (en) 2019-08-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant