CN114448905B

CN114448905B - Encrypted traffic identification method, system, terminal and storage medium

Info

Publication number: CN114448905B
Application number: CN202011231169.XA
Authority: CN
Inventors: 叶可江; 林鹏; 胡奕绅; 须成忠
Original assignee: Shenzhen Institute of Advanced Technology of CAS
Current assignee: Shenzhen Institute of Advanced Technology of CAS
Priority date: 2020-11-06
Filing date: 2020-11-06
Publication date: 2024-04-19
Anticipated expiration: 2040-11-06
Also published as: CN114448905A

Abstract

The application relates to an encrypted traffic identification method, system, terminal and storage medium. The method comprises the following steps: acquiring network traffic data packets, learning the relevance among the byte contents of each network traffic data packet, and training a neural network model with data packet encoding capability; encoding the byte content of the network traffic data packets with the neural network model; learning the timing relationships among the encoded network traffic data packets to obtain a feature representation of each network traffic data packet, and learning the length information of each network traffic data packet; and fusing the feature representation and the length information of each network traffic data packet and performing Softmax classification to obtain the traffic identification result of each network traffic data packet. The application ensures that the neural network learns both the original byte information and the length information of the data packets, and achieves a better encrypted traffic identification effect while maintaining the information integrity of the data packets.

Description

Encrypted traffic identification method, system, terminal and storage medium

Technical Field

The application belongs to the technical field of traffic identification, and particularly relates to an encrypted traffic identification method, an encrypted traffic identification system, a terminal and a storage medium.

Background

Traffic identification, which aims to classify network traffic into appropriate categories, is a fundamental task in network management and cyberspace security. Traditional traffic identification is mainly port-based: the port number is matched against the list published by IANA (Internet Assigned Numbers Authority) to determine the type of traffic. This approach has become unreliable as more and more applications disguise themselves by using dynamically allocated ports or the ports of generic communication protocols. Meanwhile, with growing awareness of security and privacy, most application traffic is now encrypted with protocols such as IPsec, SSL/TLS and SSH, which makes traditional traffic classification methods ineffective.

In recent years, some researchers have combined traffic features of encrypted flows (such as packet message types, packet length sequences and statistical features) with machine learning methods, with some success. These methods fall into three groups:

1. Classification based on message type. The header of each SSL/TLS packet has a field identifying the message type of the packet, so a packet sequence can be abstracted into a sequence of message types, and the transition probabilities between message types differ from one traffic class to another. Message-type methods build a Markov model of the message types and learn a state transition matrix for each class. However, for computational reasons these methods can in practice only be trained with first-order or second-order Markov models, that is, on 2 or 3 time steps of data, so the temporal information they learn is very limited. Moreover, because there are very few message types, many different kinds of traffic produce similar message-type sequences; this overlap gives low discrimination between traffic classes, and classes with similar message types cannot be separated accurately. In addition, methods that rely on handshake information face the problem that not all SSL/TLS traffic contains it in real scenarios: when a recently lost session is resumed within a short time, the client and the server do not need to handshake again, and the network traffic contains no handshake packets.

2. Classification based on length sequence. Like the message-type approach, the length-sequence approach abstracts the network flow into a sequence, here a sequence of packet lengths, and then models the sequence with a Markov model or another machine learning method. The disadvantage is that representing a packet by its length alone is a very crude simplification that loses a lot of detail. When packet lengths are the same or close (e.g., after packet fragmentation at the IP layer), the length sequence loses its discriminative power.

3. Classification based on statistical features. The main idea is to extract flow-level statistical features of the network packets to represent a communication flow, and then classify the flow with other machine learning algorithms. These statistical features typically include the average packet size, average inter-packet interval, transmission rate and so on, and many open-source tools can extract them. The disadvantages are: (1) the features are highly abstract, so fine-grained operations (e.g., learning the association between two packets) are not possible; (2) extracting flow statistics requires setting a listening interval, such as 10 s or 15 s, which makes real-time traffic classification impossible.
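To make the third family of baselines concrete, the following is a minimal Python sketch of flow-level statistics computed over one listening window. The function name `flow_statistics`, the `(timestamp, size)` input format and the returned keys are hypothetical choices for illustration; they do not come from any particular open-source tool.

```python
from statistics import mean


def flow_statistics(packets):
    """Summarise one listening window of a flow.

    packets: list of (timestamp_seconds, size_bytes) tuples sorted by time.
    This input format is an assumption made for the sketch.
    """
    if len(packets) < 2:
        raise ValueError("need at least two packets to compute intervals")
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-arrival time between each pair of consecutive packets.
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    duration = times[-1] - times[0]
    return {
        "mean_size": mean(sizes),
        "mean_interval": mean(intervals),
        "bytes_per_second": sum(sizes) / duration if duration > 0 else float("inf"),
    }


# Three packets observed inside a single window.
window = [(0.0, 100), (0.5, 300), (2.0, 200)]
stats = flow_statistics(window)
```

Because the whole window must be collected before any of these values can be computed, the result is only available after the listening interval has elapsed, which is the real-time limitation noted in disadvantage (2).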

Disclosure of Invention

The application provides an encrypted traffic identification method, system, terminal and storage medium, which aim to solve, at least to some extent, at least one of the above technical problems in the prior art.

In order to solve the problems, the application provides the following technical scheme:

An encrypted traffic identification method comprising the steps of:

Acquiring network traffic data packets, learning the relevance among the byte contents of each network traffic data packet using a Transformer encoder, and training a neural network model with data packet encoding capability;

Encoding the byte content of the network traffic data packets with the neural network model;

Using a Transformer to learn the timing relationships among the encoded network traffic data packets to obtain a feature representation of each network traffic data packet, and using a bidirectional LSTM to learn the length information of each network traffic data packet;

Fusing the feature representation and the length information of each network traffic data packet and performing Softmax classification to obtain the traffic identification result of each network traffic data packet.

The technical scheme adopted by the embodiment of the application further comprises the following: the training of a neural network model with data packet encoding capability includes:

Constructing an encrypted traffic identification model comprising a pre-training layer, a data packet encoding layer, a timing layer, a supplementary layer and a classification layer, and training the neural network model at the pre-training layer of the encrypted traffic identification model;

Grouping all network traffic data packets by five-tuple (source IP, destination IP, source port, destination port and transport protocol), each group representing one bidirectional communication flow;

Extracting the byte content above the IP layer of each network traffic data packet, and converting the extracted byte content into a hexadecimal file;

Randomly masking the byte content of each hexadecimal file according to a set proportion, and adding a [PACKET] tag at the head of each file;

Learning the relevance among the byte contents of the network traffic data packets using a Transformer encoder, recovering the masked byte content, and training the neural network model with cross entropy as the loss function.
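The five-tuple grouping step can be sketched in Python as follows. The text states only that each group represents one bidirectional flow; ordering the two endpoints so that both directions share a key is one possible way to achieve that and is an assumption of this sketch, as are the dictionary field names and the helper names `flow_key` and `group_into_flows`.

```python
from collections import defaultdict


def flow_key(src_ip, dst_ip, src_port, dst_port, protocol):
    """Direction-independent key: A->B and B->A map to the same value."""
    endpoint_a = (src_ip, src_port)
    endpoint_b = (dst_ip, dst_port)
    first, second = sorted([endpoint_a, endpoint_b])
    return (first, second, protocol)


def group_into_flows(packets):
    """Group packet records (dicts with hypothetical field names) by flow key."""
    flows = defaultdict(list)
    for pkt in packets:
        key = flow_key(pkt["src_ip"], pkt["dst_ip"],
                       pkt["src_port"], pkt["dst_port"], pkt["protocol"])
        flows[key].append(pkt)
    return dict(flows)


packets = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.2", "src_port": 50000, "dst_port": 443, "protocol": "TCP"},
    {"src_ip": "10.0.0.2", "dst_ip": "10.0.0.1", "src_port": 443, "dst_port": 50000, "protocol": "TCP"},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.2", "src_port": 50001, "dst_port": 443, "protocol": "TCP"},
]
flows = group_into_flows(packets)
```

Here the first two records are the request and reply directions of one conversation and end up in the same group, while the third record starts a separate flow.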

The technical scheme adopted by the embodiment of the application further comprises the following: the encoding of the byte content of the network traffic data packets with the neural network model includes:

Extracting, at the data packet encoding layer of the encrypted traffic identification model, the byte content above the IP layer of each network traffic data packet, and converting the extracted byte content into a hexadecimal file;

Adding a [PACKET] tag at the head of each hexadecimal file, and truncating or padding the byte content of each hexadecimal file to a preset length;

Encoding the byte content of each hexadecimal file with the neural network model, and then using the vector corresponding to each [PACKET] tag as the vector representation of the corresponding network traffic data packet.
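The input preparation for the encoding layer might look like the Python sketch below. Several details are assumptions made for illustration: one token per byte, a padding token named `[PAD]`, and a preset length that counts the [PACKET] tag; the text does not specify any of these.

```python
PACKET_TOKEN = "[PACKET]"
PAD_TOKEN = "[PAD]"  # assumed name; the text does not name a padding symbol


def to_hex_tokens(payload: bytes):
    """Represent each byte as a two-character hexadecimal token."""
    return [f"{byte:02x}" for byte in payload]


def prepare_packet(payload: bytes, max_len: int):
    """Prefix the [PACKET] tag, then truncate or pad to exactly max_len tokens."""
    tokens = [PACKET_TOKEN] + to_hex_tokens(payload)
    tokens = tokens[:max_len]                          # truncate long packets
    tokens += [PAD_TOKEN] * (max_len - len(tokens))    # pad short packets
    return tokens


short = prepare_packet(b"\x16\x03\x01", max_len=6)
long = prepare_packet(bytes(range(10)), max_len=4)
```

After encoding, the vector at position 0 (the [PACKET] tag) would serve as the packet representation. Note that both `short` and `long` end up with the same token count, which is why the true packet length has to be carried separately.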

The technical scheme adopted by the embodiment of the application further comprises the following: the using of a Transformer to learn the timing relationships among the encoded network traffic data packets to obtain the feature representation of each network traffic data packet includes:

At the timing layer of the encrypted traffic identification model, processing the vector $e^i$ of each network traffic data packet with a Transformer encoder so that it fuses information from the other network traffic data packets, giving a new vector representation of each network traffic data packet: $v^i = \mathrm{Transformer}(e^i)$;

Concatenating the vector representations $v^i$ of the network traffic data packets to obtain the feature representation $h_1 = \mathrm{Concat}(v^1, v^2, \ldots, v^m) W_o$, where $W_o$ is a weight matrix of the neural network and Concat denotes vector concatenation.

The technical scheme adopted by the embodiment of the application further comprises the following: the learning of the length information of each network traffic data packet using a bidirectional LSTM includes:

At the supplementary layer of the encrypted traffic identification model, extracting the original length of each network traffic data packet and constructing the length sequence $L = \{l_1, l_2, \ldots, l_m\} = \{\mathrm{length}(p^1), \mathrm{length}(p^2), \ldots, \mathrm{length}(p^m)\}$, where $l_i$ is the length of the i-th data packet, $p^i$ is the i-th data packet, and length() extracts the length of a data packet;

Learning the length sequence $L$ with a bidirectional LSTM to obtain the length information $h_2 = \mathrm{Concat}(\overrightarrow{\mathrm{LSTM}}(l_1, l_2, \ldots, l_m), \overleftarrow{\mathrm{LSTM}}(l_1, l_2, \ldots, l_m))$, where $l_i$ is the length of the i-th data packet and Concat denotes vector concatenation.

The technical scheme adopted by the embodiment of the application further comprises the following: the fusing of the feature representation and the length information of each network traffic data packet and the Softmax classification include:

At the classification layer of the encrypted traffic identification model, applying a fully connected layer and Softmax classification to $h_1$ from the timing layer to obtain the predicted value $\gamma_1$, and calculating the cross-entropy loss $\mathrm{loss}_1$, where W and b are the parameters to be learned by the neural network;

Applying a fully connected layer and Softmax classification to $h_2$ from the supplementary layer to obtain the predicted value $\gamma_2$, and calculating the cross-entropy loss $\mathrm{loss}_2$;

Concatenating $h_1$ and $h_2$ to obtain $h_3 = \mathrm{Concat}(h_1, h_2)$;

Applying a fully connected layer and Softmax classification to $h_3$ to obtain the predicted value $\gamma_3$, and calculating the cross-entropy loss $\mathrm{loss}_3$;

Calculating the sum $\mathrm{loss}_1 + \mathrm{loss}_2 + \mathrm{loss}_3$, and updating the network parameters of the encrypted traffic identification model with a gradient descent algorithm according to the result.

Another technical scheme adopted by the embodiment of the application is an encrypted traffic identification system comprising:

A pre-training module, for acquiring network traffic data packets, learning the relevance among the byte contents of each network traffic data packet using a Transformer encoder, and training a neural network model with data packet encoding capability;

A data packet encoding module, for encoding the byte content of the network traffic data packets with the neural network model;

A feature learning module, for using a Transformer to learn the timing relationships among the encoded network traffic data packets and obtain the feature representation of each network traffic data packet;

A length learning module, for learning the length information of each network traffic data packet using a bidirectional LSTM;

A fusion and classification module, for fusing the feature representation and the length information of each network traffic data packet and performing Softmax classification to obtain the traffic identification result of each network traffic data packet.

A further technical scheme adopted by the embodiment of the application is a terminal comprising a processor and a memory coupled to the processor, wherein,

The memory stores program instructions for implementing the encrypted traffic identification method;

The processor is configured to execute the program instructions stored by the memory to control encrypted traffic identification.

A further technical scheme adopted by the embodiment of the application is a storage medium storing program instructions executable by a processor for performing the encrypted traffic identification method.

Compared with the prior art, the embodiments of the application have the following beneficial effects. The encrypted traffic identification method, system, terminal and storage medium use an unsupervised pre-training method in which part of the byte content of the data packets is randomly masked and then recovered by a Transformer; this learns the relevance among different data packets well and represents network traffic bytes better, which yields a better data packet encoding capability. A loss function value is computed separately for the feature representation, for the length information, and for the fusion of the two, and the network parameters are updated by gradient descent on the sum of the three loss values. This ensures that the neural network learns both the original byte information and the length information of the data packets, improves network performance, and achieves a better encrypted traffic identification effect while maintaining the information integrity of the data packets.

Drawings

FIG. 1 is a flow chart of an encrypted traffic identification method according to a first embodiment of the present application;

FIG. 2 is a flow chart of an encrypted traffic identification method according to a second embodiment of the present application;

FIG. 3 is a schematic diagram of an encrypted traffic identification model according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an encrypted traffic identification system according to an embodiment of the present application;

Fig. 5 is a schematic diagram of a terminal structure according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the present application.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

To address the shortcomings of the prior art, the encrypted traffic identification method of the embodiment of the application provides a packet-level, end-to-end encrypted traffic identification framework. The framework adopts a strategy of raw packet bytes plus length sequence, and uses unsupervised traffic pre-training and a self-attention mechanism to learn deep relationships among data packets, so as to preserve the information integrity of the data packets and achieve a better identification effect.

Specifically, please refer to fig. 1, which is a flowchart of an encrypted traffic identification method according to a first embodiment of the present application. The encrypted traffic identification method of the first embodiment of the present application includes the steps of:

S1: acquiring network traffic data packets, learning the relevance among the byte contents of each network traffic data packet using a Transformer encoder, and training a neural network model with data packet encoding capability;

S2: encoding the byte content of the network traffic data packets with the neural network model;

S3: using a Transformer to learn the timing relationships among the encoded network traffic data packets to obtain a feature representation of each network traffic data packet, and using a bidirectional LSTM to learn the length information of each network traffic data packet;

S4: fusing the feature representation and the length information of each network traffic data packet and performing Softmax classification to obtain the traffic identification result of each network traffic data packet.

Referring to fig. 2, a flow chart of an encrypted traffic identification method according to a second embodiment of the present application is shown. The encrypted traffic identification method according to the second embodiment of the present application includes the steps of:

S10: building a packet-level, end-to-end encrypted traffic identification model;

In this step, the structure of the encrypted traffic identification model is shown in fig. 3; it includes a pre-training layer, a data packet encoding layer, a timing layer, a supplementary layer and a classification layer.

S20: in the pre-training layer, acquiring a large number of unlabeled network traffic data packets, and learning the relevance among the byte contents of each network traffic data packet using a Transformer encoder, so as to train a neural network model with data packet encoding capability;

In this step, the training process of the pre-training layer specifically includes:

S21: grouping all network traffic data packets by five-tuple (source IP, destination IP, source port, destination port and transport protocol), each group representing one bidirectional communication flow;

S22: extracting the byte content above the IP layer of each network traffic data packet, and converting the extracted byte content into a hexadecimal file;

S23: randomly masking the byte content of each hexadecimal file according to a set proportion (15% in the embodiment of the application; the value can be adjusted according to actual operation), and adding a [PACKET] tag at the head of each file;

S24: learning the relevance among the byte contents of the network traffic data packets using a Transformer encoder, recovering the masked byte content, and training the neural network model with data packet encoding capability using cross entropy as the loss function.
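The masking in S23 can be sketched in Python as below. The mask symbol `[MASK]`, the rule of masking at least one token, and the helper name `mask_tokens` are assumptions of this sketch; the text specifies only the random masking proportion and the [PACKET] tag.

```python
import random

PACKET_TOKEN = "[PACKET]"
MASK_TOKEN = "[MASK]"  # assumed symbol; the text does not name the mask token


def mask_tokens(hex_tokens, ratio=0.15, rng=None):
    """Randomly mask a proportion of byte tokens and prefix the [PACKET] tag.

    Returns (model_input, targets), where targets maps each masked position
    in model_input to the original token the model should recover.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(len(hex_tokens) * ratio)) if hex_tokens else 0
    positions = rng.sample(range(len(hex_tokens)), n_mask)
    masked = list(hex_tokens)
    targets = {}
    for pos in positions:
        targets[pos + 1] = masked[pos]  # +1 because [PACKET] is prepended
        masked[pos] = MASK_TOKEN
    return [PACKET_TOKEN] + masked, targets


original = [f"{b:02x}" for b in range(20)]
model_input, targets = mask_tokens(original, ratio=0.15, rng=random.Random(0))
```

During pre-training, the encoder's output at each position listed in `targets` would be scored against the original token with cross entropy, in the manner of S24; that training loop is not shown here.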

In the above, the embodiment of the application trains the neural network model in the pre-training layer with an unsupervised pre-training method in which part of the byte content of the data packets is randomly masked and then recovered by a Transformer, so that network traffic bytes are represented better and a better data packet encoding capability is achieved.

S30: in the data packet encoding layer, encoding the byte content of each network traffic data packet with the neural network model trained in the pre-training layer, and sending the encoded network traffic data packets to the timing layer;

In this step, the neural network model encodes the network traffic data packets as follows:

S31: extracting the byte content above the IP layer of each network traffic data packet, and converting the extracted byte content into a hexadecimal file;

S32: adding a [PACKET] tag at the head of each hexadecimal file, and truncating or padding the byte content of each hexadecimal file to a preset length;

S33: encoding the byte content (including the [PACKET] tag) of each hexadecimal file with the neural network model, then using the vector corresponding to each [PACKET] tag as the vector representation of the corresponding network traffic data packet, and sending the vectors to the timing layer;

In the above, the embodiment of the application supplements the length information of each data packet in the data packet encoding layer, so as to prevent the loss of length information caused by truncation or padding and to preserve the integrity of the data packet information as far as possible.

S40: in the timing layer, using a Transformer to learn the timing relationships among the network traffic data packets, and obtaining the feature representation of each network traffic data packet;

In this step, the timing layer learns the timing relationships as follows:

S41: processing the vector $e^i$ of each network traffic data packet with a Transformer encoder so that it fuses information from the other network traffic data packets, giving a new vector representation of each network traffic data packet: $v^i = \mathrm{Transformer}(e^i)$;

S42: concatenating the vector representations $v^i$ of the network traffic data packets to obtain the feature representation $h_1 = \mathrm{Concat}(v^1, v^2, \ldots, v^m) W_o$, where $W_o$ is a weight matrix of the neural network and Concat denotes vector concatenation.
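Steps S41 and S42 can be illustrated with a heavily simplified Python sketch. It replaces the Transformer encoder with a single scaled dot-product self-attention step that uses identity query, key and value projections and omits multi-head attention, feed-forward sublayers, residual connections and layer normalisation; this is a stand-in chosen for brevity, not the model described in the text. The function names and the example weights are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Each output v^i is an attention-weighted mix of all inputs e^1..e^m."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vectors))
                    for j in range(d)])
    return out


def timing_layer(e, W_o):
    """h1 = Concat(v^1, ..., v^m) W_o, with W_o of shape (m*d, out_dim)."""
    v = self_attention(e)
    concat = [x for vec in v for x in vec]
    if len(W_o) != len(concat):
        raise ValueError("W_o must have one row per concatenated element")
    out_dim = len(W_o[0])
    return [sum(concat[r] * W_o[r][c] for r in range(len(concat)))
            for c in range(out_dim)]


e = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 packet vectors, d = 2
W_o = [[0.1, 0.0], [0.0, 0.1]] * 3         # illustrative (6, 2) weights
h1 = timing_layer(e, W_o)
```

The point of the sketch is the data flow: every $v^i$ depends on all packet vectors in the flow, and the concatenation followed by $W_o$ turns the m vectors into one fixed-size representation.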

S50: in the supplementary layer, the length information of all network flow data packets is taken as input, and the hidden characteristics of the length sequence of the data packets are learned by using a bidirectional LSTM (Long Short-Term Memory network);

in this step, the length information learning process of the network traffic packet specifically includes:

S51: extracting the original length of each network traffic data packet, and constructing the length sequence $L = \{l_1, l_2, \ldots, l_m\} = \{\mathrm{length}(p^1), \mathrm{length}(p^2), \ldots, \mathrm{length}(p^m)\}$, where $l_i$ is the length of the i-th data packet, $p^i$ is the i-th data packet, and length() extracts the length of a data packet;

S52: learning the length sequence $L$ with a bidirectional LSTM to obtain the length information $h_2 = \mathrm{Concat}(\overrightarrow{\mathrm{LSTM}}(l_1, l_2, \ldots, l_m), \overleftarrow{\mathrm{LSTM}}(l_1, l_2, \ldots, l_m))$, where $l_i$ is the length of the i-th data packet and Concat denotes vector concatenation.
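The computation in S52 can be sketched with a tiny pure-Python LSTM, shown below with untrained random weights. The class `TinyLSTM`, the hidden size of 4, the seed, and the division of lengths by 1500 to keep inputs small are all assumptions made for illustration; a practical implementation would use a deep learning framework and learned weights.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyLSTM:
    """Single-layer LSTM over scalar inputs (illustrative, untrained)."""

    def __init__(self, hidden_size, rng):
        self.h = hidden_size
        # One row per gate unit: [input weight, hidden weights..., bias].
        # Rows are ordered as input, forget, candidate and output gates.
        self.params = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size + 2)]
                       for _ in range(4 * hidden_size)]

    def step(self, x, h_prev, c_prev):
        pre = [row[0] * x
               + sum(w * hp for w, hp in zip(row[1:-1], h_prev))
               + row[-1]
               for row in self.params]
        n = self.h
        i = [sigmoid(p) for p in pre[0:n]]
        f = [sigmoid(p) for p in pre[n:2 * n]]
        g = [math.tanh(p) for p in pre[2 * n:3 * n]]
        o = [sigmoid(p) for p in pre[3 * n:4 * n]]
        c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(n)]
        h = [o[k] * math.tanh(c[k]) for k in range(n)]
        return h, c

    def run(self, xs):
        h, c = [0.0] * self.h, [0.0] * self.h
        for x in xs:
            h, c = self.step(x, h, c)
        return h  # final hidden state


def bilstm_length_features(lengths, hidden_size=4, seed=0, scale=1500.0):
    """h2 = Concat(forward LSTM state, backward LSTM state)."""
    rng = random.Random(seed)
    forward, backward = TinyLSTM(hidden_size, rng), TinyLSTM(hidden_size, rng)
    xs = [length / scale for length in lengths]
    return forward.run(xs) + backward.run(list(reversed(xs)))


h2 = bilstm_length_features([60, 1500, 52, 583])
```

The forward pass reads the lengths from the first packet to the last and the backward pass reads them in reverse, so `h2` has twice the hidden size.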

S60: taking the feature representation $h_1$ and the length information $h_2$ of the network traffic data packets as the input of the classification layer, fusing $h_1$ and $h_2$ and performing Softmax classification, and outputting the traffic identification result of each network traffic data packet;

In this step, the classification layer fuses the feature representation $h_1$ and the length information $h_2$ as follows:

S61: applying a fully connected layer and Softmax classification to $h_1$ from the timing layer to obtain the predicted value $\gamma_1$, and calculating the cross-entropy loss $\mathrm{loss}_1$, where W and b are the parameters to be learned by the neural network;

S62: applying a fully connected layer and Softmax classification to $h_2$ from the supplementary layer to obtain the predicted value $\gamma_2$, and calculating the cross-entropy loss $\mathrm{loss}_2$;

S63: concatenating $h_1$ and $h_2$ to obtain $h_3 = \mathrm{Concat}(h_1, h_2)$;

S64: applying a fully connected layer and Softmax classification to $h_3$ to obtain the predicted value $\gamma_3$, and calculating the cross-entropy loss $\mathrm{loss}_3$;

S65: calculating the sum of the three loss function values, $\mathrm{loss}_1 + \mathrm{loss}_2 + \mathrm{loss}_3$, and updating the network parameters with a gradient descent algorithm according to the result.
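Steps S61 to S65 can be sketched in Python as follows. The sketch assumes that each prediction takes the common form of a fully connected layer followed by Softmax, i.e. Softmax(W h + b), and that each of the three heads has its own W and b; the text does not spell out either point. The gradient descent update itself is omitted.

```python
import math


def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def linear(W, b, h):
    """Fully connected layer: one row of W and one bias per class."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def cross_entropy(probs, label):
    return -math.log(probs[label])


def joint_loss(h1, h2, heads, label):
    """heads: [(W1, b1), (W2, b2), (W3, b3)] for h1, h2 and h3 = Concat(h1, h2)."""
    h3 = h1 + h2  # list concatenation
    losses = []
    for (W, b), h in zip(heads, (h1, h2, h3)):
        gamma = softmax(linear(W, b, h))
        losses.append(cross_entropy(gamma, label))
    return sum(losses), losses


# Two classes, all-zero weights: every head predicts the uniform distribution.
h1, h2 = [0.2, -0.1], [0.5]
zero_heads = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),            # head on h1 (dim 2)
    ([[0.0], [0.0]], [0.0, 0.0]),                      # head on h2 (dim 1)
    ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0]),  # head on h3 (dim 3)
]
total, parts = joint_loss(h1, h2, zero_heads, label=1)
```

Summing the three terms means that the byte branch, the length branch and the fused branch each receive a direct training signal, rather than only the fused output.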

In the above, the embodiment of the application computes a loss function value for each of the three parts ($h_1$, $h_2$ and $h_3$) in the classification layer, and updates the network parameters by gradient descent on the sum of the three loss function values, so as to ensure that the neural network learns both the original byte information and the length information of the data packets and to improve network performance.

Based on the above, the encrypted traffic identification method of the embodiment of the application constructs an end-to-end encrypted traffic identification model. In the pre-training layer of the model, an unsupervised pre-training method in which part of the byte content of the data packets is randomly masked and then recovered by a Transformer learns the relevance among different data packets well, represents network traffic bytes better, and achieves a better data packet encoding capability. The length information of the data packets is supplemented in the data packet encoding layer, which prevents the loss of length information in the truncation or padding stage and preserves the integrity of the data packet information as far as possible. In the classification layer, a loss function value is computed separately for the feature representation from the timing layer, for the length information from the supplementary layer, and for the fusion of the two, and the network parameters are updated by gradient descent on the sum of the three loss function values, so that the neural network learns both the original byte information and the length information of the data packets and network performance improves. The application uses an end-to-end strategy, requires no additional operations such as feature engineering, and achieves a better encrypted traffic identification effect while maintaining the information integrity of the data packets.

Fig. 4 is a schematic structural diagram of an encrypted network traffic identification system according to an embodiment of the application. The encrypted network traffic identification system 40 of the embodiment of the present application includes:

Pre-training module 41: for acquiring network traffic data packets, learning the relevance among the byte contents of each network traffic data packet using a Transformer encoder, and training a neural network model with data packet encoding capability;

Data packet encoding module 42: for encoding the byte content of the network traffic data packets with the neural network model;

Feature learning module 43: for using a Transformer to learn the timing relationships among the encoded network traffic data packets and obtain the feature representation of each network traffic data packet;

Length learning module 44: for learning the length information of each network traffic data packet using a bidirectional LSTM;

Fusion and classification module 45: for fusing the feature representation and the length information of each network traffic data packet and performing Softmax classification to obtain the traffic identification result of each network traffic data packet.
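How modules 42 to 45 might be wired together at inference time is sketched below in Python. The class `EncryptedTrafficPipeline`, its field names and the stub callables are hypothetical and stand in for the trained components; module 41 is represented only implicitly, as whatever produced the encoder. Using `len(p)` of the payload as the packet length is also a simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class EncryptedTrafficPipeline:
    encode_packet: Callable[[bytes], List[float]]               # module 42
    learn_features: Callable[[List[List[float]]], List[float]]  # module 43 -> h1
    learn_lengths: Callable[[List[int]], List[float]]           # module 44 -> h2
    classify: Callable[[List[float]], int]                      # module 45

    def identify(self, payloads: Sequence[bytes]) -> int:
        e = [self.encode_packet(p) for p in payloads]
        h1 = self.learn_features(e)
        h2 = self.learn_lengths([len(p) for p in payloads])
        return self.classify(h1 + h2)  # fuse by concatenation, then classify


# Stub components, used only to show the data flow between modules.
pipeline = EncryptedTrafficPipeline(
    encode_packet=lambda p: [float(len(p))],
    learn_features=lambda e: [sum(v[0] for v in e)],
    learn_lengths=lambda lengths: [float(max(lengths))],
    classify=lambda h: int(h[0] > 5),
)
label = pipeline.identify([b"abc", b"defgh"])
```

Keeping each module behind a plain callable makes it possible to swap the stubs for trained models without changing the wiring.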

Fig. 5 is a schematic diagram of a terminal structure according to an embodiment of the application. The terminal 50 includes a processor 51, a memory 52 coupled to the processor 51.

The memory 52 stores program instructions for implementing the encrypted traffic identification method described above.

The processor 51 is operative to execute program instructions stored in the memory 52 to control encrypted traffic identification.

The processor 51 may also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor.

Fig. 6 is a schematic structural diagram of a storage medium according to an embodiment of the application. The storage medium of the embodiment of the application stores a program file 61 capable of implementing all the methods described above. The program file 61 may be stored in the storage medium in the form of a software product, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code, or a terminal device such as a computer, a server, a mobile phone or a tablet.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An encrypted traffic identification method, comprising:

Fusing the feature representation and the length information of each network traffic data packet and performing Softmax classification to obtain the traffic identification result of each network traffic data packet;

The training of a neural network model with data packet encoding capability includes:

2. The encrypted traffic identification method according to claim 1, wherein the training of a neural network model with data packet encoding capability is preceded by:

Constructing an encrypted traffic identification model comprising a pre-training layer, a data packet encoding layer, a timing layer, a supplementary layer and a classification layer, and training the neural network at the pre-training layer of the encrypted traffic identification model.

3. The encrypted traffic identification method according to claim 2, wherein the encoding of the byte content of the network traffic data packets with the neural network model comprises:

4. The encrypted traffic identification method according to claim 3, wherein the learning of the timing relationships among the encoded network traffic data packets using a Transformer includes:

5. The encrypted traffic identification method according to claim 4, wherein the learning of the length information of each network traffic data packet using the bidirectional LSTM comprises:

Learning the length sequence $L$ with a bidirectional LSTM to obtain the length information $h_2 = \mathrm{Concat}(\overrightarrow{\mathrm{LSTM}}(l_1, l_2, \ldots, l_m), \overleftarrow{\mathrm{LSTM}}(l_1, l_2, \ldots, l_m))$, where $l_i$ is the length of the i-th data packet and Concat denotes vector concatenation.

6. The encrypted traffic identification method according to claim 5, wherein the fusing of the feature representation and the length information of the respective network traffic data packets and the Softmax classification comprise:

At the classification layer of the encrypted traffic identification model, applying a fully connected layer and Softmax classification to $h_1$ from the timing layer to obtain the predicted value $\gamma_1$, and calculating the cross-entropy loss $\mathrm{loss}_1$, wherein the weights and biases of the fully connected layer are parameters to be learned by the neural network;

Applying a fully connected layer and Softmax classification to $h_2$ from the supplementary layer to obtain the predicted value $\gamma_2$, and calculating the cross-entropy loss $\mathrm{loss}_2$, wherein the weights and biases of the fully connected layer are parameters to be learned by the neural network;

Concatenating $h_1$ and $h_2$ to obtain $h_3 = \mathrm{Concat}(h_1, h_2)$;

Applying a fully connected layer and Softmax classification to $h_3$ to obtain the predicted value $\gamma_3$, and calculating the cross-entropy loss $\mathrm{loss}_3$, wherein the weights and biases of the fully connected layer are parameters to be learned by the neural network;

7. An encrypted traffic identification system using the encrypted traffic identification method according to claim 1, comprising:

8. A terminal comprising a processor, a memory coupled to the processor, wherein,

The memory stores program instructions for implementing the encrypted traffic identification method according to any one of claims 1 to 6;

9. A storage medium storing program instructions executable by a processor for performing the encrypted traffic identification method according to any one of claims 1 to 6.