CN110719279A

CN110719279A - Network anomaly detection system and method based on neural network

Info

Publication number: CN110719279A
Application number: CN201910953413.4A
Authority: CN
Inventors: 张钧桓; 任涛; 刘子瑜; 杨可舟; 丁匀泰
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2019-10-09
Filing date: 2019-10-09
Publication date: 2020-01-21

Abstract

The invention provides a neural-network-based network anomaly detection system and detection method. The detection system comprises an encoding processing module, a data normalization module, a feature selection module, an accuracy module and a receiver operating characteristic (ROC) curve drawing module. The detection method first applies one-hot encoding to the discrete features of the KDDCUP99 data set to convert them into numerical values, then scales the features with the Min-Max method and reduces their dimensionality. The reduced data are input into an MLPClassifier multilayer perceptron classifier to obtain a prediction result, which is finally input into the ROC curve drawing module to draw an ROC curve. The method adopts a multilayer perceptron neural network, prevents overfitting with L2 regularization, adjusts the hidden nodes, and trains and tunes the model repeatedly with cross-validation. A comparison with KNN and SVM verifies the advantage of the invention in running time and accuracy.

Description

Network anomaly detection system and method based on neural network

Technical Field

The technology relates to the field of neural networks, in particular to a network anomaly detection system and a network anomaly detection method based on a neural network.

Background

With the wide application of computer networks, detecting network attacks and protecting information security have become indispensable. The massive use of computer systems has created a large number of threats, and the spread of networks has enabled many types of attack, such as zero-day exploits. The development of computer networks has greatly exacerbated computer security problems, particularly given today's network environments and advanced computing devices. Because the Internet protocol suite was not designed with security in mind, network administrators must now deal with large-scale intrusions by malicious individuals and large botnets. According to the Symantec Internet Security Threat Report, more than 3 billion malware attacks were reported in 2010, and the number of denial-of-service attacks increased markedly in 2013. According to Verizon's 2014 Data Breach Investigations Report, hackers carried out 63,437 security incidents. The 2015 global information security survey likewise indicates that security incidents have increased year by year and that their scale has grown sharply in recent years, so the detection of network attacks is now a serious concern. In addition, as cracking tools become easier to use, the professional skill required for cybercrime decreases.

Anomaly detection is an important data analysis task that identifies anomalies, or anomalous data, in a given data set. It is an active area of data mining research because it involves discovering interesting and rare patterns in data. The ever-changing characteristics of network attacks require a flexible defense system, and as the technology matures, the ways of implementing a network intrusion detection system are becoming increasingly diverse.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a network anomaly detection system and a detection method based on a neural network. The invention carries out analysis and data processing based on KDDCup99 data set, realizes and optimizes network intrusion detection based on neural network, and compares the network intrusion detection with a network intrusion detection method based on SVM and KNN.

In order to achieve the technical purpose, the technical scheme of the invention is as follows:

A network anomaly detection system based on a neural network comprises a data preprocessing unit and a data unit for identifying network attack anomalies. The data preprocessing unit performs encoding, data normalization and feature selection on the large volume of unordered raw data in the KDDCup99 data set in order to remove redundancy and noise. The data unit for identifying network attack anomalies identifies the anomalous attack categories in the KDDCup99 data set, namely DOS, R2L, U2R and PROBING.

The data preprocessing unit comprises an encoding processing module, a data normalization module and a feature selection module. The encoding processing module encodes the non-numerical data in the large volume of unordered raw data in the KDDCup99 data set, converting it into numerical data. The normalization module then compresses all the numerical data into values in the interval [0, 1], yielding a KDDCup99 data set whose output range is [0, 1]. Finally, the feature selection module reduces the dimensionality of the normalized KDDCup99 data set and screens it to obtain the principal component data;

the coding processing module is used for converting non-numerical data in the KDDCup99 data set into numerical data;

the data normalization module is used for performing normalization processing on the numerical data by adopting a MinMax method to obtain a KDDCup99 data set with an output range of [0,1] interval;

the feature selection module is used for reducing the dimensionality of the normalized data and for reducing data redundancy and noise, so that the principal components of the processed data are independent of one another.

The data unit for identifying network attack anomalies comprises an accuracy module and a receiver operating characteristic (ROC) curve drawing module. The dimension-reduced KDDCup99 data set output by the data preprocessing unit is input into an MLPClassifier multilayer perceptron classifier, and a prediction result for each record in the KDDCup99 data set is obtained by repeatedly adjusting the hyper-parameters. The precision and recall of the neural-network-based network anomaly detection system are obtained from the prediction result. Finally, the prediction result is input into the ROC curve drawing module, an ROC curve is drawn according to the obtained precision and recall, and the ROC curve shows visually the difference between the predicted values and the true values of the data in the KDDCup99 data set;

the accuracy module is used for comparing the prediction result with the original data and calculating the precision and recall of the neural-network-based network anomaly detection system, thereby obtaining the detection result of the system;

the ROC curve drawing module is used for drawing an ROC curve for the detection result of the neural-network-based network anomaly detection system, so that the data in the KDDCup99 data set, and the difference between the predicted values and the true values obtained by the system, are displayed visually.

A detection method of a network anomaly detection system based on a neural network comprises the following steps:

Step 1: inputting the original data in the KDDCup99 data set into the encoding processing module, and performing one-hot encoding on all non-numerical data in the original data to convert them into numerical data;

Step 2: inputting the numerical data in the original data and the numerical data obtained by encoding into the data normalization module, and performing normalization with the MinMax method to obtain a KDDCup99 data set with an output range of [0, 1];

Step 3: inputting the normalized KDDCup99 data set into the feature selection module and performing dimension reduction on it: first calculating the variance values of the data in the normalized KDDCup99 data set under a specified dimension reduction threshold, then defining the data exceeding the dimension reduction threshold as principal component data, and finally adding a label feature to each item of principal component data, normal principal component data being marked 1 and abnormal principal component data being marked -1;

Step 4: inputting the dimension-reduced KDDCup99 data set into the MLPClassifier multilayer perceptron classifier and repeatedly adjusting the hyper-parameters to obtain a prediction result for each record in the KDDCup99 data set, the prediction result comprising normal data and abnormal data in the abnormal attack categories; obtaining the correct rate of the prediction result as the ratio of the number of normal data to the total number of original data in the KDDCup99 data set, and obtaining the error rate of the prediction result as the ratio of the number of abnormal data to the total number of original data in the KDDCup99 data set;

Step 5: inputting the correct rate and the error rate into the accuracy module to obtain the precision and recall of the neural-network-based network anomaly detection system;

Step 6: inputting the prediction result obtained by running the original data of the KDDCup99 data set through the neural-network-based network anomaly detection system into the receiver operating characteristic (ROC) curve drawing module, drawing an ROC curve according to the obtained precision and recall, and reading from the ROC curve the difference between the predicted values and the true values of the KDDCup99 data.

The MLPClassifier multilayer perceptron classifier in step 4 is designed based on a neural network algorithm, and is outlined as follows:

The MLPClassifier multilayer perceptron classifier based on the neural network algorithm is designed with five hidden layers. The input of the first hidden layer is the dimension-reduced KDDCup99 data set, and the input of each of the second to fourth hidden layers is the output of the previous hidden layer. Each hidden layer uses the ReLU activation function, and L2 regularization with a regularization parameter of 0.0001 is used to prevent overfitting. The output layer uses a linear function and takes the output of the last hidden layer as its input to obtain, for each record, the probability of each neural network state. The neural network states comprise a normal state and the four abnormal states DOS, R2L, U2R and PROBING. The state with the maximum probability is output as the prediction result of the classifier.
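The layer structure described above can be illustrated with a minimal pure-Python forward pass: ReLU hidden layers, a linear output layer, and selection of the state with the largest output. This is only a sketch. The function names, the layer widths and the toy weights are hypothetical and are not taken from the patent; a trained classifier would learn its own weights.

```python
STATES = ["normal", "DOS", "R2L", "U2R", "PROBING"]

def relu(vector):
    """Element-wise ReLU activation."""
    return [max(0.0, x) for x in vector]

def dense(vector, weights, bias):
    """Linear layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def predict_state(x, hidden_layers, output_layer, states=STATES):
    """Pass x through ReLU hidden layers and a linear output layer,
    then return the state whose output score is largest."""
    h = x
    for weights, bias in hidden_layers:
        h = relu(dense(h, weights, bias))
    weights, bias = output_layer
    scores = dense(h, weights, bias)
    best = max(range(len(scores)), key=scores.__getitem__)
    return states[best]

# Toy weights (hypothetical): five identity hidden layers of width 2.
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = [(identity, [0.0, 0.0]) for _ in range(5)]
output = ([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]],
          [0.0] * 5)
example_prediction = predict_state([1.0, 2.0], hidden, output)
```

Taking the largest linear output selects the same state as taking the largest probability after a softmax, so the sketch omits the normalization step.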

In step 5, the correct rate and the error rate are input into the accuracy module to obtain the precision and recall of the neural-network-based network anomaly detection system, specifically as follows:

The correct rate and the error rate are input into the accuracy module. The number of records predicted as normal is counted as m, and the number of those records that are normal in the original data is counted as n; calculating $n/m$ gives the precision of the neural-network-based network anomaly detection system. The number of normal samples in the original data is counted as s; calculating $n/s$ gives the recall of the neural-network-based network anomaly detection system.

The invention has the beneficial effects that:

Compared with the traditional KNN (K-nearest-neighbor) and SVM (support vector machine) algorithms, the neural-network-based network anomaly detection algorithm predicts anomalous network data with higher accuracy and a shorter running time, which greatly improves detection efficiency.

Drawings

Fig. 1 is a flowchart of a detection method of the neural network based network anomaly detection system in this embodiment.

Fig. 2 shows the types and numbers of the parts of the flag feature in the KDDCUP99 data set in this embodiment.

Fig. 3 shows the result of processing all the characteristics of a piece of data in the KDDCUP99 dataset in this embodiment.

Fig. 4 shows the ratio of the variance of each dimension to the total variance after the KDDCUP99 data set is projected without dimension reduction in this embodiment.

Fig. 5 shows the ratio of the variance of each dimension to the total variance after the KDDCUP99 data set is projected with the dimension reduction threshold set to 0.999 in this embodiment.

Fig. 6 is a ROC graph of the network anomaly detection system based on the neural network in the present embodiment.

Fig. 7 is a ROC graph based on KNN and SVM in the present embodiment.

Detailed Description

The technical solution of the invention is described in detail below with reference to the accompanying drawings and specific embodiments. First, the KDDCup99 data set is preprocessed, with the discrete features and the continuous features each being processed; PCA is then used to reduce the dimensionality of the data; a neural network then performs anomaly detection on the data; and the result is compared with KNN (the K-nearest-neighbor algorithm) and SVM (the support vector machine algorithm). Several common evaluation indexes are used, mainly accuracy, precision, recall, F-score and the ROC curve.

The neural-network-based network anomaly detection system comprises a data preprocessing unit and a data unit for identifying network attack anomalies. The data preprocessing unit performs encoding, data normalization and feature selection on the large volume of unordered raw data in the KDDCup99 data set in order to remove redundancy and noise. The data unit for identifying network attack anomalies identifies the anomalous attack categories in the KDDCup99 data set, which fall into four categories: DOS (denial-of-service attacks), R2L (unauthorized access from a remote machine), U2R (unauthorized access by an ordinary user to local superuser privileges) and PROBING (surveillance and other probing activities).

the feature selection module is used for reducing the dimensionality of the normalized data and for reducing data redundancy and noise, so that the principal components of the processed data are independent of one another.

The detection method of the neural-network-based network anomaly detection system, whose flowchart is shown in Fig. 1, comprises the following steps:

Step 1: The original data in the KDDCup99 data set are input into the encoding processing module, one-hot encoding is performed on all non-numerical data in the original data, and all non-numerical data are thereby converted into numerical data. In this embodiment, the discrete features of the KDDCup99 data set comprise 6 numerical features and 3 non-numerical features. The numerical discrete features take only the values 0 or 1 and are one-hot encoded directly, so that 12 features are obtained after encoding. The types and counts of the protocol feature in the KDDCUP99 data set are obtained by text processing of the file: there are 283602 records of the icmp type, 190065 of the tcp type and 20354 of the udp type.

The service feature has 66 types; for example, exec occurs 99 times, name 98 times, kshell 98 times, ctf 97 times, netstat 95 times, Z39-50 92 times and IRC 43 times. The flag feature has 11 types, some of which are shown with their counts in FIG. 2.

The non-numerical discrete features have 3, 66 and 11 different values, so that the non-numerical discrete features after processing become 80 features.
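The encoding step and the feature counts above can be checked with a short sketch. The helper name `one_hot` and the sample values are illustrative assumptions. The count of 32 continuous features is inferred from the figures in the text (41 original features, 9 of them discrete) rather than stated directly.

```python
def one_hot(values, categories):
    """Return one 0/1 indicator row per value, with one column per category."""
    index = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        rows.append(row)
    return rows

# Encode a few protocol values; the column order is icmp, tcp, udp.
protocols = ["tcp", "udp", "icmp", "tcp"]
encoded = one_hot(protocols, ["icmp", "tcp", "udp"])

# Feature-count arithmetic from the embodiment.
n_original = 41
n_discrete = 6 + 3                        # numerical + non-numerical discrete features
n_continuous = n_original - n_discrete    # inferred: 32
n_binary_encoded = 6 * 2                  # six 0/1 features, two columns each
n_symbolic_encoded = 3 + 66 + 11          # protocol + service + flag values
n_total = n_continuous + n_binary_encoded + n_symbolic_encoded
```

The total agrees with the 124 features that the embodiment reports after preprocessing.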

Step 2: The numerical data in the original data and the numerical data obtained by encoding are input into the data normalization module and normalized with preprocessing.MinMaxScaler from the sklearn library, yielding a KDDCup99 data set with an output range of [0, 1]. At this point the original 41 features of the data have become 124 features. The processed data are shown in Fig. 3.
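Min-max normalization maps each feature column to [0, 1] by subtracting the column minimum and dividing by the column range. The following pure-Python sketch shows the arithmetic only. The function names are hypothetical, and mapping a constant column to 0 is an assumption made here to avoid division by zero.

```python
def min_max_scale(column):
    """Scale one feature column to the interval [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: no spread to scale, so map every value to 0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def scale_columns(rows):
    """Apply min-max scaling to every column of a list of rows."""
    columns = list(zip(*rows))
    scaled = [min_max_scale(column) for column in columns]
    return [list(row) for row in zip(*scaled)]

scaled_rows = scale_columns([[0, 10], [5, 10], [10, 10]])
```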

First, principal component processing is applied to the data with projection only and no dimension reduction, and the variance distribution of each dimension after projection is observed, as shown in Fig. 4.

After SVD decomposition, the dimension reduction threshold n_components is specified. When n_components is set to 0.99, that is, when the retained principal components account for 99% of the variance, the principal component variances and the ratio of each to the total variance are obtained, and the data have 17 features after dimension reduction, which together account for more than 99%. When n_components is set to 0.9999, 72 features are obtained, which together account for more than 0.9999 of the total; the data containing these 72 features are the dimension-reduced data.

With n_components set to 0.9999 and 72 features retained, the ratio of each principal component variance to the total variance after dimension reduction is shown in Fig. 5. Reducing the dimensionality of the data with n_components = 0.9999 allows better classification.
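A fractional threshold such as 0.99 or 0.9999 can be read as "keep the fewest principal components whose variances together reach that share of the total variance". The sketch below shows that selection rule on made-up variances. The function name and the sample values are hypothetical, and the rule is an approximation of how a PCA library would choose the count, not the patent's own code.

```python
def components_for_threshold(variances, threshold):
    """Smallest k such that the k largest variances reach `threshold`
    as a fraction of the total variance."""
    ordered = sorted(variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= threshold:
            return k
    return len(ordered)

# Made-up per-component variances, for illustration only.
sample_variances = [5.0, 3.0, 1.5, 0.5]
k_90 = components_for_threshold(sample_variances, 0.90)
k_99 = components_for_threshold(sample_variances, 0.99)
```

Raising the threshold can only keep the component count the same or increase it, which matches the embodiment's move from 17 features at 0.99 to 72 features at 0.9999.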

The MLPClassifier multilayer perceptron classifier is designed based on a neural network algorithm and optimizes the loss function through LBFGS or stochastic gradient descent, specifically as follows:

In the experiment, a good model is obtained by adjusting a series of hyper-parameters of the MLPClassifier multilayer perceptron classifier. MLP supports regularization: alpha is a float parameter that sets the regularization term, namely the L2 penalty, and its default value is set to 0.001.

When solver is set to stochastic gradient descent, hidden_layer_sizes to 5, alpha to 1e-5, random_state to 1 and early_stopping to True, with all other parameters left at their default values, the results for the common model evaluation indexes are obtained.
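The hyper-parameter settings listed above might be expressed with scikit-learn roughly as follows. This is a sketch under assumptions, not the inventors' script. The value 5 for hidden_layer_sizes is read here as a single hidden layer of five units; five hidden layers would need a five-element tuple whose widths the text does not give. The toy data stand in for the preprocessed KDDCup99 features and are purely illustrative.

```python
from sklearn.neural_network import MLPClassifier

# Settings taken from the text; everything else stays at library defaults.
clf = MLPClassifier(
    solver="sgd",              # stochastic gradient descent
    hidden_layer_sizes=(5,),   # assumption: one hidden layer of five units
    alpha=1e-5,                # L2 regularization strength
    random_state=1,
    early_stopping=True,       # hold out part of the training data for validation
)

# Toy two-feature data (illustrative only): class 0 then class 1.
X = [[i / 100.0, 1.0 - i / 100.0] for i in range(100)]
y = [0 if i < 50 else 1 for i in range(100)]

clf.fit(X, y)
predictions = clf.predict(X)
```

On real data the fitted model would be scored on a held-out test split rather than on the training rows used here.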

Step 5: The correct rate and the error rate are input into the accuracy module to obtain the precision and recall of the neural-network-based network anomaly detection system, specifically as follows:

The correct rate and the error rate are input into the accuracy module. The number of records predicted as normal is counted as m, and the number of those records that are normal in the original data is counted as n; calculating $n/m$ gives the precision of the neural-network-based network anomaly detection system. The number of normal samples in the original data is counted as s; calculating $n/s$ gives the recall of the neural-network-based network anomaly detection system.

Step 6: The prediction result obtained by running the original data of the KDDCup99 data set through the neural-network-based network anomaly detection system is input into the receiver operating characteristic (ROC) curve drawing module, an ROC curve is drawn according to the obtained precision and recall, and the ROC curve shows visually the difference between the predicted values and the true values of the KDDCup99 data. In this embodiment, the detection result of the neural-network-based network anomaly detection system is: accuracy (Acc) 0.920152140154, micro-average precision 0.920152140154, macro-average precision 0.854771813775, micro-average recall 0.920152140154, macro-average recall 0.947451480342, F1 0.924726460069 and AUC (Area Under Curve) 0.979521219622. The ROC curve is shown in FIG. 6.

And then, carrying out anomaly detection on the data by using the KNN, the SVM and the neural network respectively, wherein a plurality of commonly used evaluation indexes are adopted, and the evaluation indexes mainly comprise accuracy, precision, recall rate, F-score and ROC-AUC curves.

Positive and negative are both class labels. TP (true positive) denotes a sample that is actually positive and predicted positive; FN (false negative) denotes a sample that is actually positive but predicted negative; TN (true negative) denotes a sample that is actually negative and predicted negative; FP (false positive) denotes a sample that is actually negative but predicted positive.

Precision is defined as:

$$P = \frac{TP}{TP + FP}$$

where P denotes the precision, TP the number of samples that are actually positive and predicted positive, and FP the number of samples that are predicted positive but actually negative;

Recall is defined as:

$$R = \frac{TP}{TP + FN}$$

where R denotes the recall, TP the number of samples that are actually positive and predicted positive, and FN the number of samples that are actually positive but predicted negative;

Because recall and precision tend to trade off against each other, the F₁ value (F₁ score) is introduced to combine them, defined as:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

where F₁ is a metric for the classification problem, TP denotes the number of samples that are actually positive and predicted positive, FN the number that are actually positive but predicted negative, and FP the number that are predicted positive but actually negative.
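The three metrics can be computed directly from the confusion counts. The sketch below uses the standard definitions of precision, recall and F₁; the function name and the sample counts are illustrative and do not come from the patent's experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true-positive, false-positive
    and false-negative counts, using 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts the precision is 0.8 and the recall is 0.5, and the harmonic-mean form of F₁ agrees with 2TP / (2TP + FP + FN) = 16/26.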

ROC stands for Receiver Operating Characteristic; the curve is commonly used to compare different binary classifiers. The P-R curve is introduced first. It takes recall as the horizontal axis and precision as the vertical axis and is constructed as follows: the learner produces a predicted score for every test sample, and the results are sorted in descending order, so that the samples most likely to be positive come first and those most likely to be negative come last. The samples are then treated as positive one at a time in that order, and the current recall and precision are calculated at each step.

The vertical axis of the ROC curve is the TPR, the true positive rate: the larger the TPR, the larger the share of actual positive samples that are predicted positive. The horizontal axis is the FPR, the false positive rate: the larger the FPR, the larger the share of actual negative samples that are predicted positive. The ideal is for the TPR to be as large as possible, even close to 1, while the FPR approaches 0; the curve then moves toward the point (0, 1) on the vertical axis and further away from the diagonal, and the closer it gets, the better the classifier.
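An ROC curve can be traced by sorting samples by score and lowering the decision threshold one sample at a time, recording an (FPR, TPR) point at each step; the AUC is then the area under those points. The sketch below assumes binary labels (1 for positive, 0 for negative), at least one sample of each class, and distinct scores. The function names and the sample data are hypothetical.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points from sweeping the threshold high to low."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Illustrative labels and scores: 8 of the 9 positive/negative pairs are ranked correctly.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
area = auc(curve)
```

A classifier that ranks every positive above every negative produces a curve through (0, 1) and an area of 1, which is the ideal case described above.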

Fig. 7 shows the anomaly detection results of the KNN algorithm and the SVM algorithm, for comparison with those of the detection method of the neural-network-based network anomaly detection system.

Network intrusion detection based on KNN:

Using KNeighborsClassifier from sklearn, the value of k, namely n_neighbors, is set to 5; that is, the classifier checks whether normal points or abnormal points are in the majority among the 5 points closest to the sample under test. The parameter p selects the distance metric: p = 2 is the Euclidean distance and p = 1 is the Manhattan distance. The Euclidean distance is chosen here, and the other parameters of the function are left at their default values. The results for the common evaluation indexes such as accuracy, precision, recall and F₁ are: accuracy (Acc) 0.921520787357, micro-average precision 0.921521787357, macro-average precision 0.856794414575, micro-average recall 0.921521787357, macro-average recall 0.944986709142 and F₁ 0.925806386238. The ROC curve is shown in FIG. 7 (a).
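The voting rule described above can be sketched in pure Python: find the k training points closest to the query under a Minkowski distance with exponent p, then return the majority label. The function names and the toy points are hypothetical; a library implementation would use an index structure instead of sorting every point.

```python
from collections import Counter

def minkowski(a, b, p=2):
    """Minkowski distance: p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn_predict(train_x, train_y, query, k=5, p=2):
    """Return the majority label among the k nearest training points."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: four "normal" points near the origin, three "anomaly" points far away.
train_x = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5]]
train_y = ["normal"] * 4 + ["anomaly"] * 3
near_origin = knn_predict(train_x, train_y, [0.5, 0.5], k=5, p=2)
near_cluster = knn_predict(train_x, train_y, [5.5, 5.5], k=5, p=2)
```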

Network intrusion detection based on SVM:

In the sklearn library the SVM has three main parameters. C is the penalty term, that is, the tolerance for errors. The kernel function can be linear, poly or RBF, of which RBF is the most commonly used. gamma is a parameter of the RBF function when it is selected as the kernel: the smaller gamma is, the more support vectors there are, and the larger gamma is, the fewer support vectors there are. gamma therefore determines the distribution of the data after mapping into the new space, and the number of support vectors affects the speed of the whole process. The experimental parameters are: {'C': 1.0, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0, 'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'auto', 'kernel': 'rbf', 'max_iter': 1, 'probability': False, 'random_state': None, 'shrinking': True, 'tol': 0.001, 'verbose': False}. The model is trained on the training data set and then used to predict the KDDCup99 data set, giving the following results for the common evaluation indexes: accuracy (Acc) 0.921521787357, micro-average precision 0.921521787375, macro-average precision 0.856794414575, micro-average recall 0.921521787357, macro-average recall 0.944986709142 and F₁ 0.925806386238. The ROC curve is shown in FIG. 7 (b).

Comparing Fig. 6 with Fig. 7 by the AUC of the ROC curve, the AUC is 0.9795 for the neural network, 0.9669 for the SVM and 0.9581 for KNN, so the neural-network-based network anomaly detection performs better. In terms of the time cost of network intrusion detection, KNN-based and SVM-based detection run longer than the neural network model. The SVM in particular is difficult to converge, and its long running time was clearly observed in the experiment; it is not suitable for processing very large data sets.

Claims

1. The network anomaly detection system based on the neural network is characterized by comprising a data preprocessing unit and a data unit for identifying network attack anomalies, wherein the data preprocessing unit is used for carrying out coding processing, data normalization and feature selection on numerous disordered original data in a KDDCup99 data set so as to remove redundancy and noise, the data unit for identifying the network attack anomalies is used for identifying anomaly attack categories including DOS, R2L, U2R and PROBING on the KDDCup99 data set.

2. The system of claim 1, wherein the data preprocessing unit comprises an encoding processing module, a data normalization module and a feature selection module; the encoding processing module encodes the non-numerical data in the large volume of unordered raw data in the KDDCup99 data set, converting the non-numerical data in the raw data into numerical data; the normalization module compresses all the numerical data into values in the interval [0, 1] to obtain a KDDCup99 data set with an output range of [0, 1]; and the feature selection module reduces the dimensionality of the normalized KDDCup99 data set and screens it to obtain principal component data;

3. The system as claimed in claim 1, wherein the data unit for identifying network attack anomalies comprises an accuracy module and a receiver operating characteristic (ROC) curve drawing module; the dimension-reduced KDDCup99 data set output by the data preprocessing unit is input into the MLPClassifier multilayer perceptron classifier; a prediction result for each record in the KDDCup99 data set is then obtained by repeatedly adjusting the hyper-parameters; the precision and recall of the neural-network-based network anomaly detection system are obtained from the prediction result; and finally the prediction result is input into the ROC curve drawing module, an ROC curve is drawn according to the obtained precision and recall, and the difference between the predicted values and the true values of the data in the KDDCup99 data set is obtained visually from the ROC curve;

4. The method for detecting a neural network-based network anomaly detection system according to any one of claims 1 to 3, comprising the steps of:

5. The method according to claim 4, wherein the MLPClassifier multilayer perceptron classifier in step 4 is designed based on a neural network algorithm, and is outlined as follows: the MLPClassifier multilayer perceptron classifier based on the neural network algorithm is designed with five hidden layers; the input of the first hidden layer is the dimension-reduced KDDCup99 data set, and the input of each of the second to fourth hidden layers is the output of the previous hidden layer; each hidden layer uses the ReLU activation function, and L2 regularization with a regularization parameter of 0.0001 is used to prevent overfitting; the output layer uses a linear function and takes the output of the last hidden layer as its input to obtain, for each record, the probability of each neural network state, the neural network states comprising a normal state and the four abnormal states DOS, R2L, U2R and PROBING; and the neural network state with the maximum probability is output as the prediction result of the classifier.

6. The method as claimed in claim 4, wherein the accuracy and the error rate in step 5 are input to an accuracy module to obtain the accuracy and the recall ratio of the neural network based network anomaly detection system, which is specifically expressed as:

inputting the correct rate and the error rate into an accuracy rate module, counting the number of the normal data as m, counting the number of the original data in the normal data as n, and calculating