CN111970169B - Protocol flow identification method based on GRU network - Google Patents

Info

Publication number
CN111970169B
Authority
CN
China
Prior art keywords
layer
data
gru
gru network
protocol
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010820902.5A
Other languages
Chinese (zh)
Other versions
CN111970169A (en)
Inventor
余顺争
汪擎天
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University
Priority to CN202010820902.5A
Publication of CN111970169A
Application granted
Publication of CN111970169B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/18 Protocol analysers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a protocol traffic identification method based on a GRU network, comprising the following steps: preprocessing samples of different protocol traffic to obtain a training sample set that conforms to the GRU network input data format, and training a GRU network model with the training sample set; preprocessing unknown protocol traffic to obtain spatial position feature data with a time sequence, and inputting that data into the trained GRU network model; and identifying the preprocessed unknown protocol traffic with the trained GRU network model to obtain a prediction label. The invention performs packet feature extraction through data preprocessing, which overcomes the difficulty of manually extracting features; the construction and use of the GRU network model improve the accuracy of protocol identification; and because the information in the traffic interaction process is captured at two levels, spatial position features and time-sequence features, the protocol traffic identification effect is more pronounced.

Description

Protocol flow identification method based on GRU network
Technical Field
The invention relates to the field of computer network traffic analysis, in particular to a protocol traffic identification method based on a GRU network.
Background
Protocol traffic identification means extracting, through manual analysis or automated means, key features that can identify a network protocol from the network traffic carried over TCP/IP, and then accurately identifying the protocol to which the traffic belongs on the basis of those features. Protocol identification technology helps analyze the composition of network traffic and can provide data support for research fields such as network management and maintenance, network content auditing, and network security defense. However, given today's large-scale, diverse, high-volume network traffic, improving the accuracy of protocol identification remains a major challenge.
Protocol traffic identification methods mainly comprise methods based on preset rules, methods based on payload features, methods based on host behavior, and methods based on machine learning. Deep learning has advantages in classification, but existing protocol traffic identification methods still suffer from the difficulty of manually extracting features.
In the prior art, Chinese patent publication No. CN107682216A, published on 9 February 2018, discloses a network traffic protocol identification method based on deep learning. It exploits the similarity between network flow data and images to bypass the selection and extraction of traffic feature values: the network flow data is fed directly to a convolutional neural network, which is trained by supervised learning to obtain a network traffic protocol identification model. Although that scheme trains the convolutional neural network on samples of the protocol traffic to be identified, and can therefore automatically extract features useful for the classification task to some extent, it does not solve the problems of difficult manual feature extraction and low protocol identification accuracy. A protocol traffic identification method based on a GRU network is therefore needed.
Disclosure of Invention
To solve the problems of difficult manual feature extraction and low protocol identification accuracy in the prior art, the invention provides a protocol traffic identification method based on a GRU network.
The primary objective of the present invention is to solve the above technical problems. The technical solution of the present invention is as follows:
A protocol traffic identification method based on a GRU network comprises the following steps:
S1: preprocessing samples of different protocol traffic to obtain a training sample set that conforms to the GRU network input data format, and training a GRU network model with the training sample set;
S2: preprocessing unknown protocol traffic to obtain spatial position feature data with a time sequence, and inputting that data into the trained GRU network model;
S3: identifying the preprocessed unknown protocol traffic with the trained GRU network model to obtain a prediction label.
Preferably, the data preprocessing in steps S1 and S2 includes traffic segmentation, packet clustering, and session data transformation.
Preferably, the basic unit of the traffic segmentation is a session.
Preferably, the data packet clustering is performed by using a K-means algorithm.
Preferably, the session data conversion replaces the content of each data packet after traffic segmentation with a set of distances, using the following distance calculation formula:
[Formula image BDA0002634378850000021: distance calculation formula, not reproduced in this text]
where the Max Subsequence function is the algorithm that finds the longest common contiguous sequence between each data packet and each cluster center, and d(x, centroid) is the distance between each data packet and each cluster center.
Preferably, the GRU network model includes an input layer, a Masking layer, a first GRU layer, a second GRU layer, a fully connected layer, and an output layer; wherein:
the Masking layer is connected to the input layer and to the first GRU layer;
the second GRU layer is connected to the first GRU layer and to the fully connected layer;
the output layer is connected to the fully connected layer.
Preferably, the dimension of the feature values extracted by the first GRU layer and by the second GRU layer is set to 64.
Preferably, the fully connected layer uses the ReLU function as its activation function.
Preferably, the fully connected layer uses Dropout with a rate of 0.5.
Preferably, the output layer uses the Sigmoid function as its activation function.
Compared with the prior art, the technical solution of the invention has the following beneficial effects:
The invention performs packet feature extraction through data preprocessing, which overcomes the difficulty of manually extracting features; the construction and use of the GRU network model improve the accuracy of protocol identification; and because the information in the traffic interaction process is captured at two levels, the spatial position features of the data packets and the time-sequence features between data packets, the protocol traffic identification effect is more pronounced.
Drawings
FIG. 1 is a general flow chart of the present invention;
FIG. 2 is a flow chart of the GRU network model identifying unknown protocol traffic in the present invention;
FIG. 3 is a schematic structural diagram of the GRU network model of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a more particular description of the invention will be rendered by reference to the appended drawings. It should be noted that the embodiments and features of the embodiments of the present application may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those specifically described herein, and therefore the scope of the present invention is not limited by the specific embodiments disclosed below.
Example 1
As shown in FIG. 1, a protocol traffic identification method based on a GRU network includes the following steps:
S1: preprocessing samples of different protocol traffic to obtain a training sample set that conforms to the GRU network input data format, and training a GRU network model with the training sample set;
S2: preprocessing unknown protocol traffic to obtain spatial position feature data with a time sequence, and inputting that data into the trained GRU network model;
S3: identifying the preprocessed unknown protocol traffic with the trained GRU network model to obtain a prediction label.
In this scheme, the method is divided into two stages. The first is the training stage, in which the training sample set is used to train the GRU network model. The second is the identification stage, in which the trained GRU network model identifies the preprocessed unknown protocol traffic to obtain a prediction label.
As shown in fig. 2, specifically, the data preprocessing in steps S1 and S2 includes traffic segmentation, packet clustering, and session data conversion.
In the above scheme, data preprocessing is the basic step of unknown protocol traffic identification, wherein: traffic segmentation splits the unknown protocol traffic into data sets of the corresponding form according to a given criterion; data packet clustering clusters all data packets in the data set to obtain the cluster centers; session data conversion converts the content of every data packet into the distances between that packet and each cluster center; finally, the results are assembled according to the time order of the data packets, so that the unknown protocol traffic is converted into spatial position feature data with a time sequence that conforms to the input data format of the GRU network model.
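As a minimal sketch of the final assembly step, the following Python code pads per-session sequences of distance vectors to a common length so that they form one 3-D array of shape (sessions, time steps, cluster centers). The function name `pad_sessions`, the maximum length, and the padding value 0.0 are assumptions made for illustration; the patent does not specify them.

```python
import numpy as np

def pad_sessions(sessions, max_len, pad_value=0.0):
    """Stack variable-length sessions into one (n_sessions, max_len, k) array.

    sessions: list of arrays, each of shape (n_packets_i, k), with rows in
              packet time order and one column per cluster center.
    Sessions longer than max_len are truncated; shorter ones are padded
    at the end with pad_value (an assumed convention).
    """
    k = sessions[0].shape[1]
    out = np.full((len(sessions), max_len, k), pad_value, dtype=np.float32)
    for i, seq in enumerate(sessions):
        n = min(len(seq), max_len)
        out[i, :n] = seq[:n]
    return out

# Two toy sessions with 2 and 5 packets, 3 cluster centers each.
toy = [np.ones((2, 3)), 2 * np.ones((5, 3))]
batch = pad_sessions(toy, max_len=4)
```

Padding at the end keeps the real packets in their original time order, and the padded steps can later be ignored by a masking mechanism.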
Specifically, the basic unit of the traffic segmentation is a session.
In the above scheme, the session, which is currently the most widely studied traffic granularity, is adopted. A session consists of all packets that share the same five-tuple (source IP, source port, destination IP, destination port, transport layer protocol), where the source and destination in the five-tuple may be interchanged.
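The session definition above can be sketched in Python as follows. The `Packet` record and its field names are hypothetical and only serve the illustration; the key idea is that the two endpoints are sorted so that both directions of a conversation map to the same key.

```python
from collections import defaultdict
from typing import NamedTuple

class Packet(NamedTuple):
    # Hypothetical packet record; field names are illustrative only.
    timestamp: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str
    payload: bytes

def session_key(pkt):
    """Direction-independent five-tuple: the endpoints are sorted so that
    A->B and B->A packets produce the same key."""
    first, second = sorted(((pkt.src_ip, pkt.src_port), (pkt.dst_ip, pkt.dst_port)))
    return (first, second, pkt.proto)

def split_into_sessions(packets):
    """Group packets by bidirectional five-tuple, each group in time order."""
    sessions = defaultdict(list)
    for pkt in packets:
        sessions[session_key(pkt)].append(pkt)
    for pkts in sessions.values():
        pkts.sort(key=lambda p: p.timestamp)
    return dict(sessions)

pkts = [
    Packet(2.0, "10.0.0.2", 80, "10.0.0.1", 5000, "TCP", b"resp"),
    Packet(1.0, "10.0.0.1", 5000, "10.0.0.2", 80, "TCP", b"req"),
    Packet(3.0, "10.0.0.1", 5001, "10.0.0.3", 53, "UDP", b"dns"),
]
sessions = split_into_sessions(pkts)
```

Here the first two packets belong to one TCP session even though their source and destination are swapped, and the third forms a separate UDP session.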
Specifically, the data packet clustering is performed by using a K-means algorithm.
In this scheme, the K-means algorithm is easy to implement, refines its result iteratively, and can eliminate unreasonable groupings in the training sample set.
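A minimal NumPy sketch of K-means (Lloyd's algorithm) over packets is given below. The patent does not say how packets are turned into vectors, so `packets_to_matrix` (first bytes of the payload, zero-padded and scaled to [0, 1]) is purely an assumption, as are the function names, the Euclidean metric, and the number of clusters.

```python
import numpy as np

def packets_to_matrix(payloads, length=32):
    """Assumed vectorization: first `length` bytes, zero-padded, scaled to [0, 1]."""
    mat = np.zeros((len(payloads), length), dtype=np.float64)
    for i, p in enumerate(payloads):
        chunk = np.array(list(p[:length]), dtype=np.float64)
        mat[i, :len(chunk)] = chunk / 255.0
    return mat

def assign(x, centroids):
    """Index of the nearest centroid (Euclidean distance) for every row of x."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k, n_iter=50, seed=0):
    """Plain Lloyd iteration: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(x, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = x[labels == j]
            if len(members):              # leave empty clusters where they are
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(x, centroids)

payloads = [b"GET /a", b"GET /b", b"\x16\x03\x01\x02", b"\x16\x03\x01\x03", b""]
x = packets_to_matrix(payloads, length=8)
centroids, labels = kmeans(x, k=2)
```

The result depends on the random initialization, so in practice several restarts or a library implementation with a better seeding strategy may be preferable.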
Specifically, the session data conversion replaces the content of each data packet after traffic segmentation with a set of distances, using the following distance calculation formula:
[Formula image BDA0002634378850000041: distance calculation formula, not reproduced in this text]
where the Max Subsequence function is the algorithm that finds the longest common contiguous sequence between each data packet and each cluster center, and d(x, centroid) is the distance between each data packet and each cluster center.
In this scheme, the distance calculation formula is used to compute the distance between each data packet and each cluster center.
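Because the formula itself survives only as an image reference, the sketch below implements just the part the text names: the length of the longest common contiguous sequence between two byte strings, computed by dynamic programming. The function `illustrative_distance`, which maps that length to a value in [0, 1], is a stand-in chosen for this example and should not be read as the patent's formula; representing cluster centers as byte strings is likewise an assumption.

```python
def max_subsequence(a: bytes, b: bytes) -> int:
    """Length of the longest common contiguous run of bytes in a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the run ending at (i-1, j-1)
                best = max(best, cur[j])
        prev = cur
    return best

def illustrative_distance(x: bytes, centroid: bytes) -> float:
    """Stand-in normalization (NOT the patent's formula): 0 when the shorter
    string is fully contained in an equal-length match, 1 when nothing is shared."""
    longest = max(len(x), len(centroid))
    if longest == 0:
        return 0.0
    return 1.0 - max_subsequence(x, centroid) / longest

def distance_set(x: bytes, centroids) -> list:
    """Replace one packet by its distances to every cluster center."""
    return [illustrative_distance(x, c) for c in centroids]

vec = distance_set(b"GET /x", [b"GET /index", b"POST /a"])
```

Each packet thus becomes a vector with one entry per cluster center, which is the "distance set" the text refers to.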
As shown in FIG. 3, specifically, the GRU network model includes an input layer, a Masking layer, a first GRU layer, a second GRU layer, a fully connected layer, and an output layer; wherein:
the Masking layer is connected to the input layer and to the first GRU layer;
the second GRU layer is connected to the first GRU layer and to the fully connected layer;
the output layer is connected to the fully connected layer.
In this scheme, the Masking layer first skips the padded data in the training sample set. Next, two GRU (gated recurrent unit) layers are stacked: the return_sequences parameter of the first GRU layer is set to True, so the result of every time step is passed to the second GRU layer. During computation, the update gate and reset gate of the GRU allow the state information of the previous time step to be retained and passed on to the current time step, and control how much of it is carried over, so the time-sequence feature information in the session traffic can be fully extracted. The fully connected layer then has 256 neurons, which ensures the nonlinear expressive capacity and learning capacity of the GRU network model. Finally, the output layer outputs the identification result.
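To make the gate mechanism concrete, here is a small NumPy sketch of a GRU forward pass using the standard formulation (update gate z, reset gate r, candidate state). The weight initialization, the helper names, and the convention that an all-zero row marks padding are assumptions for illustration; frameworks differ in details such as bias layout and whether z weights the old or the new state.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_gru(input_dim, units, rng):
    """Small random weights for one GRU layer (illustrative initialization)."""
    def gate():
        return (rng.normal(scale=0.1, size=(input_dim, units)),  # input weights W
                rng.normal(scale=0.1, size=(units, units)),      # recurrent weights U
                np.zeros(units))                                 # bias b
    return {"z": gate(), "r": gate(), "h": gate()}

def gru_layer(x, mask, params):
    """x: (time, features); mask: (time,) booleans, False marks padded steps.
    Returns the hidden state at every step, shape (time, units)."""
    (Wz, Uz, bz), (Wr, Ur, br), (Wh, Uh, bh) = params["z"], params["r"], params["h"]
    h = np.zeros(bz.shape[0])
    states = []
    for x_t, keep in zip(x, mask):
        if keep:
            z = sigmoid(x_t @ Wz + h @ Uz + bz)             # update gate
            r = sigmoid(x_t @ Wr + h @ Ur + br)             # reset gate
            h_cand = np.tanh(x_t @ Wh + (r * h) @ Uh + bh)  # candidate state
            h = (1.0 - z) * h + z * h_cand                  # blend old and new
        states.append(h)                                    # padded step: carry h
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.random((6, 4))
x[4:] = 0.0                                   # last two steps are padding
mask = ~np.all(x == 0.0, axis=1)
h1 = gru_layer(x, mask, init_gru(4, 64, rng))        # every step, like return_sequences=True
h2 = gru_layer(h1, mask, init_gru(64, 64, rng))[-1]  # final state of the second layer
```

The first layer returns the whole sequence so the second layer can consume it step by step, while only the last state of the second layer would be handed to the fully connected layer.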
Specifically, the dimension of the feature values extracted by the first GRU layer and by the second GRU layer is set to 64.
In this scheme, this dimension allows the two GRU layers to find the most effective features while reducing dimensionality and avoiding redundancy.
Specifically, the fully connected layer uses the ReLU function as its activation function.
In this scheme, using the ReLU function as the activation function keeps time and space complexity low and also mitigates the vanishing gradient problem.
Specifically, the fully connected layer uses Dropout with a rate of 0.5.
In this scheme, the Dropout mechanism drops 50% of the features, which simplifies the structure considerably, helps prevent the neural network from overfitting, and avoids excessive training time.
Specifically, the output layer uses the Sigmoid function as its activation function.
In this scheme, the output layer has only one output node, and the output is the probability that the unknown protocol traffic belongs to a given protocol type. Using the Sigmoid function as the activation function keeps the output value between 0 and 1, which meets the requirement of binary classification.
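The layer stack described above could be written with the Keras API roughly as follows. This is a sketch under assumptions: the patent does not name a framework (the return_sequences parameter merely suggests Keras), and the number of cluster centers, the mask value, the placement of Dropout after the dense layer, the optimizer, and the loss are all choices made here for illustration.

```python
import numpy as np
import tensorflow as tf

def build_model(num_centers, pad_value=0.0):
    """Input -> Masking -> GRU(64) -> GRU(64) -> Dense(256, ReLU) -> Dropout(0.5) -> Dense(1, Sigmoid)."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None, num_centers)),       # variable-length sessions
        tf.keras.layers.Masking(mask_value=pad_value),   # skip padded time steps
        tf.keras.layers.GRU(64, return_sequences=True),  # pass every step onward
        tf.keras.layers.GRU(64),                         # keep only the last state
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of one protocol
    ])

model = build_model(num_centers=8)
# Optimizer and loss are assumptions; binary cross-entropy matches a single sigmoid output.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

dummy = np.random.rand(2, 5, 8).astype("float32")
probs = model(dummy, training=False).numpy()
```

With a single sigmoid node, one such model answers whether a session belongs to one protocol; distinguishing several protocols would require one model per protocol or a different output layer, which the text does not discuss.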
It should be understood that the above embodiments are merely examples given to illustrate the present invention clearly and are not intended to limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims (9)

1. A protocol traffic identification method based on a GRU network, characterized by comprising the following steps:
S1: preprocessing samples of different protocol traffic to obtain a training sample set that conforms to the GRU network input data format, and training a GRU network model with the training sample set;
S2: preprocessing unknown protocol traffic to obtain spatial position feature data with a time sequence, and inputting that data into the trained GRU network model;
S3: identifying the preprocessed unknown protocol traffic with the trained GRU network model to obtain a prediction label;
the data preprocessing in steps S1 and S2 comprises traffic segmentation, data packet clustering, and session data conversion; wherein: traffic segmentation splits the unknown protocol traffic into data sets of the corresponding form according to a given criterion; data packet clustering clusters all data packets in the data set to obtain the cluster centers; session data conversion converts the content of every data packet into the distances between that packet and each cluster center; and finally, the results are assembled according to the time order of the data packets, converting the unknown protocol traffic into spatial position feature data with a time sequence.
2. The method of claim 1, wherein the basic unit of the traffic segmentation is a session.
3. The method of claim 1, wherein the data packet clustering is performed using a K-means algorithm.
4. The method of claim 1, wherein the session data conversion replaces the content of each data packet after traffic segmentation with a set of distances, using the following distance calculation formula:
[Formula image FDA0003457793830000011: distance calculation formula, not reproduced in this text]
where the Max Subsequence function is the algorithm that finds the longest common contiguous sequence between each data packet and each cluster center, and d(x, centroid) is the distance between each data packet and each cluster center.
5. The method of claim 1, wherein the GRU network model includes an input layer, a Masking layer, a first GRU layer, a second GRU layer, a fully connected layer, and an output layer; wherein:
the Masking layer is connected to the input layer and to the first GRU layer;
the second GRU layer is connected to the first GRU layer and to the fully connected layer;
the output layer is connected to the fully connected layer.
6. The method of claim 5, wherein the dimension of the feature values extracted by the first GRU layer and by the second GRU layer is set to 64.
7. The method of claim 5, wherein the fully connected layer uses the ReLU function as its activation function.
8. The method of claim 5, wherein the fully connected layer uses Dropout with a rate of 0.5.
9. The method of claim 5, wherein the output layer uses the Sigmoid function as its activation function.
CN202010820902.5A 2020-08-14 2020-08-14 Protocol flow identification method based on GRU network Active CN111970169B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010820902.5A CN111970169B (en) 2020-08-14 2020-08-14 Protocol flow identification method based on GRU network

Publications (2)

Publication Number Publication Date
CN111970169A CN111970169A (en) 2020-11-20
CN111970169B true CN111970169B (en) 2022-03-08

Family

ID=73388920

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010820902.5A Active CN111970169B (en) 2020-08-14 2020-08-14 Protocol flow identification method based on GRU network

Country Status (1)

Country Link
CN (1) CN111970169B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112671757B (en) * 2020-12-22 2023-10-31 无锡江南计算技术研究所 Encryption flow protocol identification method and device based on automatic machine learning
CN112910881A (en) * 2021-01-28 2021-06-04 武汉市博畅软件开发有限公司 Data monitoring method and system based on communication protocol
CN115150165B (en) * 2022-06-30 2024-03-15 北京天融信网络安全技术有限公司 Flow identification method and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10923109B2 (en) * 2017-08-02 2021-02-16 [24]7.ai, Inc. Method and apparatus for training of conversational agents
CN109583656B (en) * 2018-12-06 2022-05-10 重庆邮电大学 Urban rail transit passenger flow prediction method based on A-LSTM
CN110011931B (en) * 2019-01-25 2020-10-16 中国科学院信息工程研究所 Encrypted flow type detection method and system
CN110287439A (en) * 2019-06-27 2019-09-27 电子科技大学 A kind of network behavior method for detecting abnormality based on LSTM
CN110751222A (en) * 2019-10-25 2020-02-04 中国科学技术大学 Online encrypted traffic classification method based on CNN and LSTM
CN111209933A (en) * 2019-12-25 2020-05-29 国网冀北电力有限公司信息通信分公司 Network traffic classification method and device based on neural network and attention mechanism
CN111209563B (en) * 2019-12-27 2022-04-08 北京邮电大学 Network intrusion detection method and system

Also Published As

Publication number Publication date
CN111970169A (en) 2020-11-20

Similar Documents

Publication Publication Date Title
CN111970169B (en) Protocol flow identification method based on GRU network
Liu et al. CNN and RNN based payload classification methods for attack detection
CN108874782B (en) A kind of more wheel dialogue management methods of level attention LSTM and knowledge mapping
Bansal et al. Zero-shot object detection
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN110532564B (en) On-line identification method for application layer protocol based on CNN and LSTM hybrid model
CN111144448A (en) Video barrage emotion analysis method based on multi-scale attention convolutional coding network
CN107368534B (en) Method for predicting social network user attributes
Zhu et al. Semi-supervised streaming learning with emerging new labels
CN111143553A (en) Method and system for identifying specific information of real-time text data stream
Wang et al. Time-variant graph classification
CN115292568B (en) Civil news event extraction method based on joint model
Basri et al. Bangla handwritten digit recognition using deep convolutional neural network
Liu et al. WBCaps: a capsule architecture-based classification model designed for white blood cells identification
CN111191033A (en) Open set classification method based on classification utility
Guo et al. Offline handwritten Tai Le character recognition using ensemble deep learning
CN109002808A (en) A kind of Human bodys' response method and system
CN116663019B (en) Source code vulnerability detection method, device and system
CN117633627A (en) Deep learning unknown network traffic classification method and system based on evidence uncertainty evaluation
CN115334179A (en) Unknown protocol reverse analysis method based on named entity recognition
Blanger et al. A face recognition library using convolutional neural networks
Tao et al. A lightweight convolutional neural network for license plate character recognition
Kale et al. Computer vision and information technology: advances and applications
CN116994104B (en) Zero sample identification method and system based on tensor fusion and contrast learning
Hu et al. A robust IoT device identification method with unknown traffic detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant