CN112788038A - Method for distinguishing DDoS attack and elephant flow based on PCA and random forest

Publication number
CN112788038A
Authority
CN
China
Prior art keywords
random forest
data
pca
matrix
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110051338.XA
Other languages
Chinese (zh)
Inventor
缪祥华
胡晓红
袁梅宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-01-15
Filing date
2021-01-15
Publication date
2021-05-11

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2135 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1458 Denial of Service

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest, belonging to the technical field of network attack detection. First, a training set and a test set are selected from a DDoS data set, and an elephant flow data set is added to each of them. Next, PCA (principal component analysis) is applied to the training data to reduce its dimensionality and obtain a low-dimensional feature matrix. The low-dimensional feature matrix is then used to train a random forest model, yielding a random forest classifier. Finally, the test set samples are input into the trained random forest classifier to obtain the classification result. When a DDoS attack occurs, the random forest is used to distinguish legitimate elephant flows from DDoS attack flows.

Description

Method for distinguishing DDoS attack and elephant flow based on PCA and random forest
Technical Field
The invention relates to a method for distinguishing DDoS attacks and elephant flows based on PCA and random forest, belonging to the technical field of attack detection in networks.
Background
Distributed Denial of Service (DDoS) attacks are a growing problem on the Internet. An attacker targets certain servers (the victims) and uses multiple puppet hosts to launch the attack, preventing normal use of the victims' services. DDoS attacks come in many types, but many legitimate flows have characteristics similar to DDoS flows, so many detection methods discard those legitimate flows as well. Elephant flows are one example: they generally carry large amounts of data and last a long time. They are often used for bulk data transfer and are common in certain networks, such as data center networks. Elephant flows contribute approximately 90% of the data bytes in the network but account for only 1% of the total number of flows. An elephant flow can generate a large number of packets (over different time spans) and consume a large amount of server bandwidth, which makes it behave much like a DDoS attack. It is nevertheless a fully legitimate, normal flow. Elephant flows and DDoS flows should therefore be distinguished, so that legitimate elephant flows are not blocked when a DDoS attack is being blocked.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest.
Principal component analysis can reduce the dimensionality of the data space under study, that is, replace the p-dimensional X space with an m-dimensional Y space (m < p), such that little information is lost when the low-dimensional Y space stands in for the high-dimensional X space. Even if there is only one principal component Y_1 (i.e., m = 1), this Y_1 is still obtained using all p of the X variables. The invention preprocesses the data, extracts its features, analyzes the data flows with principal component analysis, and then feeds the result into a random forest model for training, thereby distinguishing elephant flows from DDoS attack flows.
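The remark that a single principal component still uses all p original variables can be made concrete by writing each component as a linear combination. The coefficient symbols a_{j1}, ..., a_{jp} are notation introduced here for illustration and do not appear in the original text.

```latex
Y_j = a_{j1} X_1 + a_{j2} X_2 + \cdots + a_{jp} X_p,
\qquad j = 1, \dots, m, \qquad m < p
```

With m = 1 only Y_1 is kept, yet its value still depends on every one of X_1, ..., X_p through the coefficients a_{11}, ..., a_{1p}.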
A method for distinguishing DDoS attacks from elephant flows based on PCA and random forest comprises the following steps:
Step one: select a training set and a test set from the DDoS data set, and add the elephant flow data set to the training set and the test set respectively;
Step two: apply PCA (principal component analysis) to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix;
Step three: feed the low-dimensional feature matrix into a random forest model for training to obtain a random forest classifier;
Step four: input the test set samples into the trained random forest classifier to obtain the classification result.
Specifically, step two proceeds as follows:
(1) Compute the mean of each sample feature field, i.e., the sample mean of each column.
(2) After the sample means are obtained, subtract the feature sample mean of the corresponding column from each dimension to obtain a new feature matrix.
(3) After the new feature matrix is obtained, compute its covariance matrix and from it obtain the low-dimensional feature matrix.
Specifically, step three proceeds as follows:
(1) Input the eigenvector matrix obtained in step two into the random forest model for training; after dimensionality reduction, each piece of data still has k features.
(2) If the k features of a piece of data are the same, label it with the corresponding category of the data; if they differ, go to step (3).
(3) Select a splitting criterion and split the data, distinguishing the features that determine whether the data is a DDoS attack flow or an elephant flow, to obtain the random forest classifier.
The eigenvector matrix is the set of the most salient traffic features. Using a large amount of data, the method extracts the salient feature differences between DDoS attack flows and legitimate elephant flows, and then feeds them into a random forest for training and classification, so that DDoS attack flows and legitimate elephant flows can be distinguished.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a flow chart of a method for processing data by PCA in the present invention.
Detailed Description
In order to more clearly understand the technical features, objects, and effects of the present invention, embodiments of the present invention will now be described with reference to the accompanying drawings.
Example 1: as shown in FIG. 1 and FIG. 2, a method for distinguishing DDoS attacks from elephant flows based on PCA and random forest includes the following steps:
Step one: select a training set and a test set from the DDoS data set, and add the elephant flow data set to the training set and the test set respectively;
Step two: apply PCA (principal component analysis) to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix;
Step three: feed the low-dimensional feature matrix into a random forest model for training to obtain a random forest classifier;
Step four: input the test set samples into the trained random forest classifier to obtain the classification result.
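The four steps can be sketched end to end with scikit-learn. This is a minimal illustration under stated assumptions, not the patent's implementation: the flows are synthetic Gaussian data, and the feature count (8), the number of retained components (k = 3) and the forest size (50 trees) are hypothetical values. The sketch also assumes that the projection fitted on the training set is applied to the test samples before classification, so that both have k features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-ins for the two data sets (hypothetical values).
rng = np.random.default_rng(0)
n, p = 300, 8                                   # flows per class, raw features
ddos = rng.normal(loc=2.0, scale=1.0, size=(n, p))
elephant = rng.normal(loc=-2.0, scale=1.0, size=(n, p))

# Step one: both training and test sets contain DDoS and elephant flows.
X = np.vstack([ddos, elephant])
y = np.array([1] * n + [0] * n)                 # 1 = DDoS, 0 = elephant flow
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Steps two and three: PCA is fitted on the training set only, then the
# random forest is trained on the k-dimensional projection.
model = make_pipeline(
    PCA(n_components=3),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

# Step four: test samples pass through the same projection and are classified.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Wrapping both stages in one pipeline keeps the test set out of the PCA fit, which avoids leaking test statistics into the projection.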
Further, as shown in FIG. 2, step two, in which PCA is applied to the data in the training set to reduce its dimensionality and obtain the low-dimensional feature matrix, proceeds as follows:
The first step: extract the sample features, namely the basic feature data of the traffic.
The second step: organize the sample features and generate a feature matrix from them. Suppose there are n samples (x, y, z and w denote sample features; four features are used here as an example, and in practice there are more than four):
$$X_1 = \begin{pmatrix} x_1 & y_1 & z_1 & w_1 \\ x_2 & y_2 & z_2 & w_2 \\ \vdots & \vdots & \vdots & \vdots \\ x_n & y_n & z_n & w_n \end{pmatrix}$$
Each column holds data of the same feature type, and each row holds the different features of the data at the same time. X_1 denotes the first generated feature matrix (the feature matrices generated below are distinguished by their subscripts).
The third step: compute the sample mean of each column:
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad \bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i, \quad \bar{w} = \frac{1}{n}\sum_{i=1}^{n} w_i$$
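The fourth step: subtract the feature sample mean of the corresponding column from each dimension to obtain a new feature matrix X_2:
$$x_{1i} = x_i - \bar{x}, \quad y_{1i} = y_i - \bar{y}, \quad z_{1i} = z_i - \bar{z}, \quad w_{1i} = w_i - \bar{w}$$
(The subscript "1" in x_{1i}, y_{1i}, z_{1i} and w_{1i} indicates that the mean of the column's feature has been subtracted, giving the corresponding element of the new feature matrix; \bar{x}, \bar{y}, \bar{z} and \bar{w} denote the column means.)
$$X_2 = \begin{pmatrix} x_{11} & y_{11} & z_{11} & w_{11} \\ x_{12} & y_{12} & z_{12} & w_{12} \\ \vdots & \vdots & \vdots & \vdots \\ x_{1n} & y_{1n} & z_{1n} & w_{1n} \end{pmatrix}$$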
The fifth step: compute the covariance matrix of the feature matrix X_2:
Figure BDA0002899203890000042
(X^T denotes the transpose of the matrix X.)
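For reference, the covariance matrix of a mean-centered feature matrix is commonly written as below. This is a sketch under an assumption: the normalization factor 1/(n-1) is the usual unbiased choice and is not confirmed by the source (1/n is also common), and the symbol C is introduced here for illustration.

```latex
C = \frac{1}{n-1} X_2^{T} X_2
```

With n samples and four features, X_2 is n × 4 and C is 4 × 4. The choice of factor changes only the scale of the eigenvalues, not the eigenvectors.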
The sixth step: after the covariance matrix is obtained, compute its eigenvalues and eigenvectors, and sort the eigenvalues in descending order.
The seventh step: select the k largest eigenvalues, then take the k eigenvectors corresponding to them as column vectors to form the eigenvector matrix, from which the low-dimensional feature matrix is obtained.
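The seven steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the patent's code: the function name `pca_reduce`, the synthetic data and the 1/(n-1) covariance factor are assumptions, and the final projection of the centered data onto the eigenvector matrix is the standard way to obtain the low-dimensional features, which the source does not spell out.

```python
import numpy as np

def pca_reduce(X1, k):
    """Reduce the n x p feature matrix X1 to n x k (hypothetical helper)."""
    X1 = np.asarray(X1, dtype=float)
    col_mean = X1.mean(axis=0)               # third step: mean of each column
    X2 = X1 - col_mean                       # fourth step: centered matrix X2
    n = X1.shape[0]
    cov = X2.T @ X2 / (n - 1)                # fifth step (1/(n-1) is an assumption)
    eigvals, eigvecs = np.linalg.eigh(cov)   # sixth step: eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort eigenvalues in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W = eigvecs[:, :k]                       # seventh step: top-k eigenvectors as columns
    Z = X2 @ W                               # assumed projection to k features
    return Z, W, col_mean, eigvals

# Synthetic example: 200 samples, 4 features with different spreads.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 0.5, 0.1])
Z, W, col_mean, eigvals = pca_reduce(X1, k=2)
print(Z.shape, eigvals)
```

New samples, such as test flows, would be reduced with the stored quantities as `(x - col_mean) @ W`, so that they land in the same k-dimensional space as the training data.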
Further, step three, in which the low-dimensional feature matrix is fed into a random forest model for training to obtain a random forest classifier, proceeds as follows:
(1) Input the eigenvector matrix obtained in step two into the random forest model for training; after dimensionality reduction, each piece of data still has k features.
(2) If the k features of a piece of data are the same, label it with the corresponding category of the data; if they differ, go to step (3).
(3) Select a splitting criterion and split the data, distinguishing the features that determine whether the data is a DDoS attack flow or an elephant flow, to obtain the random forest classifier.
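The training procedure in (1) to (3) can be illustrated with a small random forest written in plain Python. This is a sketch under stated assumptions rather than the patent's algorithm: step (2) is interpreted as "a node whose samples all share one class becomes a leaf", the splitting criterion in step (3) is assumed to be Gini impurity, and all function names, the synthetic two-feature data and the forest size are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, rng, n_try):
    if len(set(labels)) == 1:                 # (2) one class only: label the leaf
        return labels[0]
    feats = rng.sample(range(len(rows[0])), n_try)
    split = best_split(rows, labels, feats)   # (3) choose a splitting criterion
    if split is None:                         # no useful split: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], rng, n_try),
            build_tree([r for r, _ in right], [y for _, y in right], rng, n_try))

def predict_tree(node, row):
    while isinstance(node, tuple):            # internal nodes are tuples, leaves are labels
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node

def train_forest(rows, labels, n_trees=15, seed=0):
    rng = random.Random(seed)
    n_try = max(1, int(len(rows[0]) ** 0.5))  # features examined per split
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample
        forest.append(build_tree([rows[i] for i in idx],
                                 [labels[i] for i in idx], rng, n_try))
    return forest

def predict_forest(forest, row):
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]         # majority vote over the trees

# Synthetic k = 2 features per flow after dimensionality reduction.
data_rng = random.Random(1)
rows = [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(40)]
rows += [(data_rng.gauss(-3, 0.5), data_rng.gauss(-3, 0.5)) for _ in range(40)]
labels = ["ddos"] * 40 + ["elephant"] * 40
forest = train_forest(rows, labels)
print(predict_forest(forest, (3.0, 3.0)), predict_forest(forest, (-3.0, -3.0)))
```

Each tree sees a different bootstrap sample and a random subset of features at every split, and the forest classifies a flow by majority vote across its trees.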
The eigenvector matrix is the set of the most salient traffic features. Using a large amount of data, the method extracts the salient feature differences between DDoS attack flows and legitimate elephant flows, feeds them into a random forest model for training, and finally feeds the test data set into the trained random forest classifier to classify it. When a DDoS attack occurs, the random forest is used to distinguish legitimate elephant flows from DDoS attack flows; the method is simple and efficient.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (3)

1. A method for distinguishing DDoS attacks from elephant flows based on PCA and random forest, characterized in that the method comprises the following specific steps:
the first step: selecting a training set and a test set from the DDoS data set, and adding the elephant flow data set to the training set and the test set respectively;
the second step: applying PCA (principal component analysis) to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix;
the third step: feeding the low-dimensional feature matrix into a random forest model for training to obtain a random forest classifier;
the fourth step: inputting the test set samples into the trained random forest classifier to obtain a classification result.
2. The method for distinguishing DDoS attacks from elephant flows based on PCA and random forest as claimed in claim 1, wherein the second step, in which PCA is applied to the data in the training set to reduce its dimensionality and obtain a low-dimensional feature matrix, proceeds as follows:
(1) computing the mean of each sample feature field, i.e., the sample mean of each column;
(2) after the sample means are obtained, subtracting the feature sample mean of the corresponding column from each dimension to obtain a new feature matrix;
(3) after the new feature matrix is obtained, computing its covariance matrix and from it obtaining the low-dimensional feature matrix.
3. The method for distinguishing DDoS attacks from elephant flows based on PCA and random forest as claimed in claim 2, wherein the third step, in which the low-dimensional feature matrix is fed into a random forest model for training to obtain a random forest classifier, proceeds as follows:
(1) inputting the eigenvector matrix obtained in the second step into the random forest model for training, each piece of data still having k features after dimensionality reduction;
(2) if the k features of a piece of data are the same, labeling it with the corresponding category of the data, and if they differ, going to step (3);
(3) selecting a splitting criterion and splitting the data, distinguishing the features that determine whether the data is a DDoS attack flow or an elephant flow, to obtain the random forest classifier.
CN202110051338.XA 2021-01-15 2021-01-15 Method for distinguishing DDoS attack and elephant flow based on PCA and random forest Pending CN112788038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110051338.XA CN112788038A (en) 2021-01-15 2021-01-15 Method for distinguishing DDoS attack and elephant flow based on PCA and random forest

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110051338.XA CN112788038A (en) 2021-01-15 2021-01-15 Method for distinguishing DDoS attack and elephant flow based on PCA and random forest

Publications (1)

Publication Number Publication Date
CN112788038A true CN112788038A (en) 2021-05-11

Family

ID=75756725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110051338.XA Pending CN112788038A (en) 2021-01-15 2021-01-15 Method for distinguishing DDoS attack and elephant flow based on PCA and random forest

Country Status (1)

Country Link
CN (1) CN112788038A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422766A (en) * 2021-06-18 2021-09-21 北京理工大学 Network system security risk assessment method under DDoS attack
CN113645182A (en) * 2021-06-21 2021-11-12 上海电力大学 Random forest detection method for denial of service attack based on secondary feature screening
CN113746700A (en) * 2021-09-02 2021-12-03 中国人民解放军国防科技大学 Elephant flow rapid detection method and system based on probability sampling
CN114726653A (en) * 2022-05-24 2022-07-08 深圳市永达电子信息股份有限公司 Abnormal flow detection method and system based on distributed random forest

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107395590A (en) * 2017-07-19 2017-11-24 福州大学 A kind of intrusion detection method classified based on PCA and random forest
CN107872460A (en) * 2017-11-10 2018-04-03 重庆邮电大学 A kind of wireless sense network dos attack lightweight detection method based on random forest
CN108632279A (en) * 2018-05-08 2018-10-09 北京理工大学 A kind of multilayer method for detecting abnormality based on network flow
US20190253442A1 (en) * 2018-02-13 2019-08-15 Cisco Technology, Inc. Assessing detectability of malware related traffic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RAZAN ABDULHAMMED ET AL.: "Efficient Network Intrusion Detection Using PCA-Based Dimensionality Reduction of Features", 《2019 INTERNATIONAL SYMPOSIUM ON NETWORKS, COMPUTERS AND COMMUNICATIONS (ISNCC)》 *
S. REVATHI ET AL.: "Detecting Denial of Service Attack Using Principal Component Analysis with Random Forest Classifier", 《INTERNATIONAL JOURNAL OF COMPUTER SCIENCE & ENGINEERING TECHNOLOGY (IJCSET)》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113422766A (en) * 2021-06-18 2021-09-21 北京理工大学 Network system security risk assessment method under DDoS attack
CN113422766B (en) * 2021-06-18 2022-08-23 北京理工大学 Network system security risk assessment method under DDoS attack
CN113645182A (en) * 2021-06-21 2021-11-12 上海电力大学 Random forest detection method for denial of service attack based on secondary feature screening
CN113645182B (en) * 2021-06-21 2023-07-14 上海电力大学 Denial of service attack random forest detection method based on secondary feature screening
CN113746700A (en) * 2021-09-02 2021-12-03 中国人民解放军国防科技大学 Elephant flow rapid detection method and system based on probability sampling
CN113746700B (en) * 2021-09-02 2023-04-07 中国人民解放军国防科技大学 Elephant flow rapid detection method and system based on probability sampling
CN114726653A (en) * 2022-05-24 2022-07-08 深圳市永达电子信息股份有限公司 Abnormal flow detection method and system based on distributed random forest

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210511

RJ01 Rejection of invention patent application after publication