CN112788038A

CN112788038A - Method for distinguishing DDoS attack and elephant flow based on PCA and random forest

Info

Publication number: CN112788038A
Application number: CN202110051338.XA
Authority: CN
Inventors: 缪祥华; 胡晓红; 袁梅宇
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2021-01-15
Filing date: 2021-01-15
Publication date: 2021-05-11

Abstract

The invention relates to a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest, and belongs to the technical field of network attack detection. First, a training set and a test set are selected from a DDoS data set, and an elephant flow data set is added to both the training set and the test set. Next, PCA (principal component analysis) is applied to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix. The low-dimensional feature matrix is then put into a random forest model for training to obtain a random forest classifier. Finally, the test set samples are input into the trained random forest classifier to obtain the classification result. When a DDoS attack occurs, the random forest is used to distinguish legitimate elephant flows from DDoS attack flows.

Description

Method for distinguishing DDoS attack and elephant flow based on PCA and random forest

Technical Field

The invention relates to a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest, and belongs to the technical field of network attack detection.

Background

Distributed Denial of Service (DDoS) attacks are a growing problem on the Internet. An attacker targets one or more servers (the victims) and uses multiple puppet hosts to launch the attack, thereby preventing normal use of the victims' services. There are many types of DDoS attack, but many legitimate flows have characteristics similar to those of DDoS flows, so many detection methods discard these legitimate flows as well. Elephant flows are one example: they generally carry large amounts of data and last a long time. They are often used for bulk data transmission and are common in certain networks, such as data center networks. Elephant flows contribute approximately 90% of the data bytes in the network but account for only about 1% of the total number of flows. An elephant flow can generate a large number of packets (over different time spans) and consume a large amount of server bandwidth, so it behaves similarly to a DDoS attack, yet it is entirely legitimate, normal traffic. Elephant flows and DDoS flows should therefore be distinguished, so that legitimate elephant flows are not blocked when a DDoS attack is being stopped.

Disclosure of Invention

To remedy the shortcomings of the prior art, the invention provides a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest.

Principal component analysis can reduce the dimensionality of the data space under study. That is, the p-dimensional X space is replaced with an m-dimensional Y space (m < p), and little information is lost when the low-dimensional Y space is used in place of the high-dimensional X space. Even if there is only one principal component Y_1 (i.e., m = 1), this Y_1 is still obtained using all p of the X variables. The invention preprocesses the data, extracts its features, analyzes the data flows by principal component analysis, and then puts the result into a random forest model for training, thereby distinguishing elephant flows from DDoS attack flows.
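
In standard PCA notation, which the text does not spell out, each principal component is a linear combination of all p original variables. The coefficient symbols a_{jk} are introduced here for illustration only and do not appear in the original:

```latex
Y_j = a_{j1} X_1 + a_{j2} X_2 + \cdots + a_{jp} X_p,
\qquad j = 1, \dots, m, \quad m < p
```

This is why, even when m = 1, the single retained component still depends on every one of the p X variables.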

A method for distinguishing DDoS attacks and elephant flows based on PCA and random forest comprises the following steps:

Step one: select a training set and a test set from the DDoS data set, and add the elephant flow data set to both the training set and the test set.

Step two: apply PCA (principal component analysis) to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix.

Step three: put the low-dimensional feature matrix into a random forest model for training to obtain a random forest classifier.

Step four: input the test set samples into the trained random forest classifier to obtain the classification result.
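
The four steps can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the patent's implementation: the data are synthetic stand-ins for DDoS and elephant flow records, and the feature count, the number of components (3) and the number of trees (50) are arbitrary choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in data (hypothetical): 8 flow features per sample.
# Label 0 = elephant flow, label 1 = DDoS attack flow.
n_per_class, n_features = 200, 8
elephant = rng.normal(loc=0.0, scale=1.0, size=(n_per_class, n_features))
ddos = rng.normal(loc=3.0, scale=1.0, size=(n_per_class, n_features))
X = np.vstack([elephant, ddos])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Step one: build training and test sets that both contain both classes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps two and three: PCA is fitted on the training set only, then the
# reduced features are used to train the random forest.
model = make_pipeline(
    PCA(n_components=3),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

# Step four: classify the test samples. The pipeline reuses the projection
# learned from the training set when transforming the test set.
y_pred = model.predict(X_test)
accuracy = float((y_pred == y_test).mean())
```

Wrapping both stages in one pipeline keeps the test set out of the PCA fit, so the test samples are projected with the components learned from the training data.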

Specifically, the process of step two is as follows:

(1) Calculate the mean of each sample feature, that is, the sample mean of each column.

(2) After the sample means are obtained, subtract the column's feature mean from each dimension to obtain a new feature matrix.

(3) After the new feature matrix is obtained, calculate its covariance matrix and from it obtain the low-dimensional feature matrix.

Specifically, the process of step three is as follows:

(1) Input the eigenvector matrix obtained in step two into the random forest model for training; after dimensionality reduction, each piece of data still has k features.

(2) If the k features of the data are all the same, label the data with the corresponding category; if the k features differ, proceed to step (3).

(3) Select a splitting criterion and divide the data, separating out the features that determine whether the data is a DDoS attack flow or an elephant flow, to obtain the random forest classifier.
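
Sub-steps (2) and (3) resemble the way a single decision tree inside a random forest is grown: a node whose data need no further division becomes a labeled leaf, and otherwise a splitting criterion divides the data. The sketch below illustrates that idea in plain Python. It is an interpretation, not the patent's procedure: the Gini impurity criterion, the stopping test on labels, and all function names are assumptions, and a real random forest would grow many such trees on bootstrap samples with random feature subsets.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold) that lowers weighted Gini, or None."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= t]
            right = [l for r, l in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(rows, labels):
    # All samples share one category: make a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # Otherwise choose a splitting criterion and divide the data.
    split = best_split(rows, labels)
    if split is None:  # no split improves purity: fall back to majority label
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [l for _, l in left]),
        "right": build_tree([r for r, _ in right], [l for _, l in right]),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical reduced data: k = 2 features per flow after PCA.
rows = [(0.1, 1.2), (0.3, 0.9), (2.5, 1.1), (2.9, 0.8)]
labels = ["elephant", "elephant", "ddos", "ddos"]
tree = build_tree(rows, labels)
```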

The eigenvector matrix is the set of the most salient traffic features. Using a large amount of data, the method extracts the features that differ markedly between DDoS attack flows and legitimate elephant flows, and then puts them into a random forest for training and classification, so that DDoS attack flows and legitimate elephant flows can be distinguished.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a flow chart of a method for processing data by PCA in the present invention.

Detailed Description

In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.

Example 1: As shown in FIG. 1 and FIG. 2, a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest comprises the four steps described above, from selecting the training and test sets through classifying the test samples.

Further, as shown in FIG. 2, in step two PCA is applied to the data in the training set to reduce its dimensionality and obtain the low-dimensional feature matrix. The specific process is as follows:

The first step: extract the sample features, namely the basic feature data of the traffic.

The second step: organize the sample features and generate a feature matrix from them. Suppose there are n samples (x, y, z and w denote sample features; four features are used here as an example, although in practice there are more than four).

Each column holds data of the same feature type, and each row holds the different features of the data at the same moment. X₁ denotes the first feature matrix generated (subscripts are used to denote the other feature matrices generated later).
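
The matrix itself is not reproduced in the text. From the description (n samples as rows, the four example features x, y, z, w as columns), X₁ presumably has the following shape; this layout is a reconstruction, not the original figure:

```latex
X_1 =
\begin{pmatrix}
x_1 & y_1 & z_1 & w_1 \\
x_2 & y_2 & z_2 & w_2 \\
\vdots & \vdots & \vdots & \vdots \\
x_n & y_n & z_n & w_n
\end{pmatrix}
```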

The third step: the mean of the samples for each column is calculated.

First, the average value of each column of samples needs to be calculated:

After this calculation, the sample mean of each column is obtained:
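
The formulas at this point did not survive in the text. A plausible reconstruction, assuming the ordinary arithmetic mean over the n samples and using the symbol σ that the fourth step uses for the column mean (dropping the sample index on σ is an assumption):

```latex
\sigma_x = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad
\sigma_y = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad
\sigma_z = \frac{1}{n}\sum_{i=1}^{n} z_i, \quad
\sigma_w = \frac{1}{n}\sum_{i=1}^{n} w_i
```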

The fourth step: subtract the column's feature sample mean from each dimension to obtain a new feature matrix X₂.

x_1i = x_i - σ_xi, y_1i = y_i - σ_yi, z_1i = z_i - σ_zi, w_1i = w_i - σ_wi

(The subscript "1" in x_1i, y_1i, z_1i and w_1i indicates that the column's feature mean has been subtracted from each dimension, giving the corresponding element of the new feature matrix.)

The fifth step: compute the covariance matrix of the feature matrix X₂:

(X^T denotes the transpose of the matrix.)
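
The covariance formula is missing from the text. For a mean-centered matrix X₂ with samples as rows, the usual sample covariance matrix is the following; the 1/(n-1) normalization is an assumption, and the original may equally have used 1/n:

```latex
C = \frac{1}{n-1}\, X_2^{T} X_2
```

With the four example features, X₂ is n × 4 and C is a symmetric 4 × 4 matrix.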

The sixth step: after the covariance matrix is obtained, compute its eigenvalues and eigenvectors, and sort the eigenvalues in descending order.

The seventh step: select the k largest eigenvalues, then take the k eigenvectors corresponding to them as column vectors to form an eigenvector matrix, which gives the low-dimensional feature matrix.
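
The seven steps above can be sketched with NumPy as follows. This is a minimal illustration, not the patent's implementation: the function name `pca_reduce`, the synthetic data, the 1/(n-1) normalization of the covariance matrix, and the final projection of the centered data onto the k eigenvectors are all assumptions.

```python
import numpy as np


def pca_reduce(X1, k):
    """Reduce an (n_samples, n_features) matrix to k dimensions via PCA."""
    X1 = np.asarray(X1, dtype=float)
    n = X1.shape[0]

    # Third step: sample mean of each column (feature).
    col_mean = X1.mean(axis=0)

    # Fourth step: subtract the column mean to get the centered matrix X2.
    X2 = X1 - col_mean

    # Fifth step: covariance matrix of X2 (normalization is an assumption).
    cov = X2.T @ X2 / (n - 1)

    # Sixth step: eigen-decomposition, sorted by descending eigenvalue.
    # eigh is used because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]

    # Seventh step: keep the k leading eigenvectors as columns.
    W = eigvecs[:, :k]

    # Project the centered data onto those eigenvectors.
    return X2 @ W, W, col_mean


# Hypothetical data: 100 samples, 4 features, one nearly redundant feature.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(100, 4))
X1[:, 3] = 2.0 * X1[:, 0] + 0.01 * rng.normal(size=100)

low_dim, W, col_mean = pca_reduce(X1, k=2)
```

Test samples would be reduced with the same `col_mean` and `W` learned from the training set, i.e. `(X_test - col_mean) @ W`, rather than by fitting PCA again.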

Further, in step three the low-dimensional feature matrix is put into a random forest model for training to obtain a random forest classifier. The specific process is as follows:

(1) Input the eigenvector matrix obtained in step two into the random forest model for training; after dimensionality reduction, each piece of data still has k features;

The eigenvector matrix is the set of the most salient traffic features. Using a large amount of data, the method extracts the features that differ markedly between DDoS attack flows and legitimate elephant flows, puts them into a random forest model for training, and finally feeds the test data set into the trained random forest classifier to classify it. When a DDoS attack occurs, the random forest is used to distinguish legitimate elephant flows from DDoS attack flows, and the method is simple and efficient.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims

1. A method for distinguishing DDoS attacks from elephant flows based on PCA and random forest, characterized in that the method comprises the following specific steps:

the first step: selecting a training set and a test set from the DDoS data set, and adding the elephant flow data set to both the training set and the test set;

the second step: applying PCA (principal component analysis) to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix;

the third step: putting the low-dimensional feature matrix into a random forest model for training to obtain a random forest classifier;

the fourth step: inputting the test set samples into the trained random forest classifier to obtain the classification result.

2. The method for distinguishing DDoS attacks from elephant flows based on PCA and random forest as claimed in claim 1, wherein in the second step PCA is applied to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix, and the specific process is as follows:

(1) calculating the mean of each sample feature, that is, the sample mean of each column;

(2) after the sample means are obtained, subtracting the column's feature mean from each dimension to obtain a new feature matrix;

3. The method for distinguishing DDoS attacks from elephant flows based on PCA and random forest as claimed in claim 2, wherein in the third step the low-dimensional feature matrix is put into a random forest model for training to obtain a random forest classifier, and the specific process is as follows: