CN115883263A - Encryption application protocol type identification method based on multi-scale load semantic mining - Google Patents


Info

Publication number
CN115883263A
Authority
CN
China
Prior art keywords
sequence
features
load
characteristic
application protocol
Prior art date
Legal status
Granted
Application number
CN202310189712.1A
Other languages
Chinese (zh)
Other versions
CN115883263B (en)
Inventor
吉庆兵
谈程
罗杰
潘炜
康璐
倪绿林
尹浩
Current Assignee
CETC 30 Research Institute
Original Assignee
CETC 30 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 30 Research Institute
Priority to CN202310189712.1A (granted as CN115883263B)
Publication of CN115883263A
Application granted
Publication of CN115883263B
Legal status: Active
Anticipated expiration

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/50 - Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides an encryption application protocol type identification method based on multi-scale load semantic mining, which comprises the following steps: step 1, extracting payload features from the original traffic and converting them into a decimal byte sequence; step 2, constructing a pyramid neural network based on load semantic mining blocks, and processing the decimal byte sequence to obtain an input feature sequence; step 3, the load semantic mining block constructs a sliding window on the input feature sequence, the window moves step by step to the end of the sequence, and the features extracted in the windows are concatenated to obtain the features of the input sequence; step 4, reducing the dimension of the features of the input sequence to form a new input sequence, repeating steps 3 to 4, and concatenating the features obtained each time to obtain multi-scale features; and step 5, completing the classification of the encrypted network application protocol type according to the multi-scale features. The invention can extract multi-scale features from encrypted network application protocol messages in complex scenarios, improving both the speed and the accuracy of encrypted traffic identification.

Description

Encryption application protocol type identification method based on multi-scale load semantic mining
Technical Field
The invention relates to the field of flow analysis, in particular to an encryption application protocol type identification method based on multi-scale load semantic mining.
Background
Traffic classification has an extremely wide range of applications and is the basis of network security and network management: from QoS provisioning at network service providers to security applications such as firewalls and intrusion detection systems, traffic classification is indispensable. At present, traffic classification mainly adopts methods based on port numbers, deep packet inspection, machine learning and the like, but these have certain defects:
(1) Traditional port number-based approaches have long failed because newer applications either use well-known port numbers to mask their traffic or do not use standard registered port numbers.
(2) Deep packet inspection relies on finding keywords in the packets, which fails in the face of encrypted traffic.
(3) Machine learning based methods of encrypted network traffic identification rely heavily on manually engineered features, which limits their applicability.
With the popularization of deep learning methods, researchers have studied their effects on traffic classification tasks and demonstrated high accuracy on early mobile application traffic data sets. However, with the continuous upgrading of encryption protocols, the explosive growth in the number of mobile applications and the changes in mobile application development patterns, shallow deep learning models can no longer meet the practical requirements of mobile application traffic identification in current complex scenarios. Although the recently proposed Transformer-based encrypted traffic identification methods perform well in feature learning, they attend mainly to global features during feature extraction and ignore the detail features hidden in high-resolution payload data, and these local features are in many cases the key to accurate classification.
Disclosure of Invention
In order to solve the problems that deep features in encrypted flow cannot be learned by a shallow neural network under the current complex scene and the existing deep neural network excessively focuses on global features to cause loss of detail features, the invention provides a new encryption network application protocol type identification method, which fully utilizes the global features and the local detail features of different scales in packet loads by extracting the features of different scales, thereby improving the identification precision.
The technical scheme adopted by the invention is as follows: the encryption application protocol type identification method based on multi-scale load semantic mining comprises the following steps:
step 1, preprocessing original flow of a mobile application encryption network, extracting load characteristics of a transmission layer load, and converting the load characteristics into a decimal byte sequence;
step 2, constructing a pyramid neural network based on a load semantic mining block, and acquiring a word embedding characteristic and a position coding characteristic of a decimal byte sequence, wherein an input characteristic sequence is obtained by adding the word embedding characteristic and the position coding characteristic;
step 3, the load semantic mining block constructs a sliding window on the input feature sequence, the sliding window moves in sequence until the tail end of the input sequence, the features in the sliding window during each movement are extracted, and the features extracted in all the sliding windows are spliced in sequence to obtain the features of the input sequence;
step 4, performing feature compression and dimension reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3 to 4 k times, and concatenating the features of the input sequence obtained each time to obtain the multi-scale features of the input sequence;
and step 5, completing the classification of the encrypted network application protocol type according to the multi-scale features.
Further, the preprocessing process in step 1 is as follows:
step 1.1, dividing the data packet into session flows according to quintuple;
step 1.2, cleaning the session stream, and removing the data packet retransmitted overtime, the data packet of the address resolution protocol and the data packet of the dynamic host configuration protocol;
step 1.3, extracting load characteristics of a transmission layer load in a data packet, and splicing the extracted load characteristics according to the arrival sequence of the data packet until the byte length after splicing reaches the set load characteristic length;
and 1.4, converting the extracted spliced load characteristics into a decimal byte sequence.
Further, in step 1.3, if the byte length after concatenating the payload features of all the packets in the session stream is still smaller than the set payload feature length, the sequence is padded with 0x00.
Further, in step 2, the byte features of the decimal byte sequence are mapped to a d-dimensional vector space to obtain the word embedding feature F1 ∈ R^(N×d), where R denotes the real numbers and N is the set payload feature length.
Further, in step 2, the position coding feature is calculated as:

PE(pos, 2i) = sin(pos / 10000^(2i/d))  (1)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (2)

F2 = [PE(1); PE(2); …; PE(N)]  (3)

where pos denotes the position at which the byte appears in the byte sequence; PE(pos, 2i) on the left of formula (1) denotes the position code of the even dimensions and PE(pos, 2i+1) on the left of formula (2) that of the odd dimensions; i is the dimension index, formula (1) covering the even dimensions 2i and formula (2) the odd dimensions 2i+1; d is the position-coding dimension; F2 ∈ R^(N×d) is the position coding feature; and PE(pos) in formula (3) denotes the position code of the byte at position pos in the byte sequence.
Further, the substep of step 3 comprises:
step 3.1, constructing a sliding window with the size of L bytes on the input sequence;
step 3.2, performing feature extraction on the data in the sliding window by adopting a multi-head attention mechanism to obtain a feature F4;
step 3.3, carrying out residual error connection and layer normalization processing on the input sequence F3 and the characteristic F4 to obtain a characteristic F5;
step 3.4, performing two-layer full-connection layer operation on the characteristic F5 to obtain a characteristic F6;
step 3.5, carrying out residual error connection and layer normalization processing on the characteristic F5 and the characteristic F6 to obtain a characteristic F7;
step 3.6, moving the sliding window backwards by L bytes, and repeating steps 3.2 to 3.5 in the new window until the sliding window reaches the end of the input sequence;
and 3.7, splicing the features F7 in all the sliding windows to obtain a feature F8 which is used as the feature of the input sequence.
Further, the substeps of step 3.2 are:
step 3.2.1, performing multi-head self-attention calculation on the data in the sliding window, and extracting the incidence relation of byte sequences in the window;
and step 3.2.2, repeating step 3.2.1 M times according to the set number of attention heads M, and concatenating and linearly transforming the M extracted results to obtain the feature F4 of the data in the sliding window.
Further, in step 4, a one-dimensional maximum pooling layer is used to complete feature compression and dimension reduction, and each pooling operation halves the dimension of the first dimension of the feature.
Further, the substep of step 5 comprises:
step 5.1, inputting the extracted multi-scale features into a fully connected layer and an activation function, the output dimension being equal to the number of traffic categories;
and 5.2, calculating the type of the encrypted network application protocol according to the output.
Further, in step 5.2, the specific calculation method of the category is:

category = argmax(Z)

where Z represents the output obtained by feeding the multi-scale features through the fully connected layer and the activation function.
Compared with the prior art, the beneficial effects of adopting the technical scheme are as follows:
1. the pyramid network constructed based on the load semantic mining block can extract multi-scale features in the message type of the encryption network application protocol in the current complex scene, fully extract global features and multi-scale local features, and further improve the accuracy of encryption flow identification.
2. When extracting local features, a sliding window is adopted and each self-attention computation is confined to the range covered by the window, which avoids introducing noise during local feature extraction, greatly reduces the model parameters, and increases the computation speed of the model.
3. Learning and classification are performed on the payload data above the transport layer of the network traffic, without relying on the IP address and port number information of the network traffic packet header, giving strong generalization ability; strong identification information such as the IP address and port number of the packet header lacks universality and may strongly interfere with the final identification result.
Drawings
Fig. 1 is a flowchart of an encryption application protocol type identification method based on multi-scale load semantic mining according to the present invention.
Fig. 2 is a schematic diagram of a pyramid network model structure according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an implementation of a sliding window according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of multi-scale feature extraction according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar modules or modules having the same or similar functionality throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application. On the contrary, the embodiments of the application include all changes, modifications and equivalents coming within the spirit and terms of the claims appended hereto.
Aiming at the problems that, in current complex scenarios, shallow neural networks cannot learn the deep-level features in encrypted traffic and existing deep neural networks focus excessively on global features and thus lose detail features, this embodiment provides an encryption application protocol type identification method in which a deep neural network based on load semantic mining extracts multi-scale features. Features of different scales are extracted so that both the global features of the packet payload and local detail features of different scales are fully utilized, improving identification accuracy. At the same time, local features are extracted with a sliding window, confining the self-attention computation to the window range, which reduces the model parameters and increases the computation speed of the model. The specific scheme is as follows:
as shown in fig. 1, the method for identifying the type of the encryption application protocol based on the multi-scale load semantic mining includes:
step 1, preprocessing original flow of a mobile application encryption network, extracting load characteristics of a transmission layer load, and converting the load characteristics into a decimal byte sequence;
step 2, building a pyramid neural network based on the load semantic mining block; acquiring a word embedding characteristic and a position coding characteristic of the decimal byte sequence, and adding the word embedding characteristic and the position coding characteristic to obtain an input characteristic sequence;
step 3, the load semantic mining block constructs a sliding window on the input feature sequence, the sliding window moves in sequence until the end of the input sequence, the features in the sliding window during each movement are extracted, and the features extracted in all the sliding windows are spliced in sequence to obtain the features of the input sequence;
step 4, performing feature compression and dimension reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3 to 4 k times, and concatenating the features of the input sequence obtained each time to obtain the multi-scale features of the input sequence;
and step 5, completing the classification of the encrypted network application protocol type according to the multi-scale features.
Since strong identification information such as the IP address and port number information of the network traffic packet header lacks universality and may strongly interfere with the identification result, this embodiment learns and classifies based on the payload data above the transport layer of the network traffic, without relying on the IP address and port number of the packet header.
Before parsing, the original flow needs to be preprocessed, specifically:
Step 1.1, dividing the received data packets into session flows according to the five-tuple (source IP, destination IP, source port, destination port, transport layer protocol), and identifying traffic in units of session flows.
Step 1.2, because the received data packets contain packets irrelevant to the traffic actually carrying the transmitted content, the session stream needs to be cleaned: packets retransmitted after timeout, Address Resolution Protocol (ARP) packets and Dynamic Host Configuration Protocol (DHCP) packets are removed. In this example, the cleaning is accomplished using the TShark tool from Wireshark.
Step 1.3, after the irrelevant packets are removed, the payload features of the transport layer payloads of the remaining packets are extracted and concatenated in packet arrival order until the extracted byte length reaches the set payload feature length N. It should be noted that, in this embodiment, if the concatenated byte length of the payload features of all packets in the session stream is smaller than N, the sequence is padded with 0x00.
Preferably, the present embodiment uses the rdpcap method of the Scapy tool to extract the load characteristics of the transport layer load.
Step 1.4, the extracted and concatenated binary payload features are converted into a decimal byte sequence, i.e. each byte is converted into the corresponding decimal number (0 to 255).
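Steps 1.3 and 1.4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the toy payloads and the small feature length `n` are all assumptions for demonstration.

```python
# Hypothetical sketch of steps 1.3-1.4: concatenate transport-layer payloads
# in packet-arrival order, truncate or zero-pad to the configured payload
# feature length N, and convert to a decimal byte sequence (0-255).
def build_byte_sequence(payloads, n=16):
    """payloads: list of bytes objects, one per packet, in arrival order."""
    buf = b"".join(payloads)[:n]          # splice until length N is reached
    buf = buf + b"\x00" * (n - len(buf))  # pad with 0x00 if still short
    return [b for b in buf]               # each byte as a decimal 0..255

seq = build_byte_sequence([b"\x16\x03\x01", b"\x02\x00"], n=8)
print(seq)  # [22, 3, 1, 2, 0, 0, 0, 0]
```

A real pipeline would obtain `payloads` from the cleaned session flow (e.g. via Scapy's rdpcap, as the embodiment suggests); here they are hard-coded bytes.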
After the decimal byte sequence representing the transmission layer characteristics is obtained, the analysis of the traffic type can be started, and in this embodiment, the features of different scales in the payload (decimal byte sequence) are extracted by using the constructed Pyramid-type neural network (Pyramid-Transformer).
Current Transformer-based encrypted traffic identification models (the Transformer being a deep learning architecture) mostly use the self-attention mechanism to extract global features and neglect the extraction of local features, although local features may be the key to fine-grained classification; moreover, local features appear at inconsistent scales, so interference may arise during their extraction.
As shown in fig. 2 and 4, step 2 of this embodiment constructs a pyramid-type neural network (Pyramid-Transformer) from a number of load semantic mining blocks (Pyramid Transformer blocks), with a one-dimensional max pooling layer between consecutive blocks to compress and reduce the dimension of the features during extraction. Every load semantic mining block has the same composition: multi-head attention computation, residual connection and layer normalization, two fully connected layers with an activation function, and a further residual connection and layer normalization, connected in sequence. Deep multi-scale features are extracted by stacking several load semantic mining blocks: after each block extracts its features, the feature dimension is compressed to 1/2 and the compressed features are fed into the next block without changing the window size. In this way features of ever larger scale are extracted, the feature dimension produced by each block shrinks step by step to form a pyramid shape, and the features are finally concatenated to obtain the final features.
The process of realizing feature extraction by the pyramid type neural network is specifically explained as follows:
in the pyramid type neural network, feature extraction is mainly completed through a load semantic mining block, and the input of the load semantic mining block is the combination of word embedding features and position coding features of a byte sequence, so that a decimal byte sequence needs to be processed firstly.
A word embedding operation is performed on the byte sequence (B1, B2, …, B_{N-1}, B_N in figs. 2 and 4), mapping the byte features to a d-dimensional vector space to obtain the word embedding feature F1 ∈ R^(N×d) as the subsequent input, where R denotes the real numbers.
The position coding feature F2 ∈ R^(N×d) of the byte sequence is calculated as:

PE(pos, 2i) = sin(pos / 10000^(2i/d))  (1)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))  (2)

F2 = [PE(1); PE(2); …; PE(N)]  (3)

where pos denotes the position at which the byte appears in the byte sequence; PE(pos, 2i) on the left of formula (1) denotes the position code of the even dimensions and PE(pos, 2i+1) on the left of formula (2) that of the odd dimensions; i is the dimension index, formula (1) covering the even dimensions 2i and formula (2) the odd dimensions 2i+1; d is the position-coding dimension; and PE(pos) in formula (3) denotes the position code of the byte at position pos in the byte sequence. Since the Transformer uses global information and cannot otherwise exploit the order information of the bytes, which is very important for feature learning, this embodiment computes the position coding feature.
The word embedding feature and the position coding feature are combined according to formula (4) to obtain the input feature F3 ∈ R^(N×d) of the load semantic mining block:

F3 = F1 + F2  (4)
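Formulas (1) through (4) can be sketched in NumPy as follows. This is a minimal illustration: the random embedding table stands in for the learned word embedding, and positions are counted from 1, which is an assumption since the patent does not state the starting index.

```python
# NumPy sketch of formulas (1)-(4): sinusoidal position coding F2 added to
# a word embedding F1 to form the block input F3.
import numpy as np

def position_encoding(n, d):
    pe = np.zeros((n, d))
    pos = np.arange(1, n + 1)[:, None]            # byte positions (1-based here)
    i = np.arange(0, d, 2)[None, :]               # even dimension indices 2i
    pe[:, 0::2] = np.sin(pos / 10000 ** (i / d))  # formula (1)
    pe[:, 1::2] = np.cos(pos / 10000 ** (i / d))  # formula (2)
    return pe                                     # F2, shape (N, d)

rng = np.random.default_rng(0)
n, d, vocab = 8, 16, 256
embed = rng.normal(size=(vocab, d))     # stand-in for the learned embedding table
byte_seq = [22, 3, 1, 2, 0, 0, 0, 0]    # decimal byte sequence from step 1
f1 = embed[byte_seq]                    # word embedding feature F1
f3 = f1 + position_encoding(n, d)       # input feature F3, formula (4)
print(f3.shape)  # (8, 16)
```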
After determining the input of the load semantic mining block, feature extraction can be performed through the load semantic mining block, which specifically includes:
Step 3.1: because some detail features exist only over a small number of adjacent bytes, direct feature extraction over the whole input sequence may interfere with the local detail features, and a sliding window is used to ensure that the high-resolution local detail features are not damaged. A sliding window of size L is therefore constructed on the input feature F3, and feature extraction is performed on the data inside the window, as shown in fig. 3.
Step 3.2: the data inside the sliding window is taken as F3' ∈ R^(L×d), and the multi-head attention mechanism is applied to F3' for feature extraction, obtaining the feature F4 ∈ R^(L×d). F4 contains the global dependencies of the bytes within the window; viewed from the whole byte sequence, what is obtained here is a local feature of the window.
The specific process comprises the following steps:
Step 3.2.1: multi-head self-attention is computed on F3', extracting the association relations of the byte sequence inside the window.
Using the weight matrices W^Q, W^K and W^V, the query, key and value matrices Q, K and V of the feature F3' are computed as shown in formulas (5), (6) and (7):

Q = F3' W^Q  (5)

K = F3' W^K  (6)

V = F3' W^V  (7)
The matrix operations on Q, K and V implement the self-attention mechanism (Attention), producing the output Z ∈ R^(L×d):

Z = Attention(Q, K, V) = softmax(Q K^T / √d_k) V  (8)

where d_k is the number of columns of the matrix K, i.e. the vector dimension, the same as that of Q, and K^T is the matrix transpose. The formula computes the inner products of the row vectors of Q and K and divides them by √d_k to prevent the inner products from becoming too large. After Q is multiplied by the transpose of K, the resulting matrix has L rows and L columns, where L is the window size; this matrix represents the strength of association between bytes. After Q K^T / √d_k is obtained, the softmax function (normalized exponential function) computes the self-attention coefficient of each byte with respect to the other bytes, normalizing each row of the matrix so that the sum of every row becomes 1.
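Formulas (5) through (8) can be sketched in NumPy as follows; the random weight matrices are stand-ins for the learned parameters W^Q, W^K and W^V, and the small sizes L and d are assumptions for illustration.

```python
# NumPy sketch of formulas (5)-(8): within one window of size L, compute
# Q, K, V by linear maps and apply scaled dot-product self-attention.
import numpy as np

def self_attention(x, wq, wk, wv):
    q, k, v = x @ wq, x @ wk, x @ wv            # formulas (5)-(7)
    dk = k.shape[-1]
    scores = q @ k.T / np.sqrt(dk)              # L x L association strengths
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)          # softmax: each row sums to 1
    return a @ v                                # output Z, formula (8)

rng = np.random.default_rng(1)
L, d = 4, 8                                     # window size, feature dimension
x = rng.normal(size=(L, d))                     # window data F3'
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
z = self_attention(x, wq, wk, wv)
print(z.shape)  # (4, 8)
```

With zero query and key weights the attention matrix becomes uniform and each output row is the mean of the value rows, which is a quick sanity check on the softmax normalization.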
Step 3.2.2: the number of attention heads M is set, and step 3.2.1 is repeated M times to obtain M outputs Z, which are concatenated and linearly transformed to obtain the feature F4 ∈ R^(L×d):

F4 = Concat(Z_1, Z_2, …, Z_M) W^O

where Z_1 represents the output of the first computation, Z_M represents the output of the M-th computation, and W^O represents the weight matrix of the linear transformation.
Step 3.3: residual connection and layer normalization are applied to F3' and F4 to obtain the feature F5 ∈ R^(L×d):

F5 = LayerNorm(F3' + F4)  (9)

where LayerNorm denotes the layer normalization operation.
Step 3.4: a forward propagation (Feed Forward) operation is applied to F5 to obtain the feature F6 ∈ R^(L×d):

F6 = FeedForward(F5) = Linear(ReLU(Linear(F5)))  (10)

where Linear denotes one fully connected layer operation; Feed Forward consists of two fully connected layers, the first using the ReLU activation function and the second using no activation function.
Step 3.5: residual connection and layer normalization are applied to F5 and F6 to obtain the feature F7 ∈ R^(L×d):

F7 = LayerNorm(F5 + F6)  (11)
Step 3.6: the sliding window is moved backwards by L bytes, and steps 3.2 to 3.5 are re-executed in the new window until the sliding window reaches the end of the input feature F3.
Step 3.7: the features F7 obtained in all the sliding windows are concatenated to obtain F8 ∈ R^(N×d):

F8 = Concat(F7^(1), F7^(2), …, F7^(N/L))  (12)

where F7^(1) ∈ R^(L×d) represents the feature of the first window and F7^(N/L) represents the feature obtained in the last window.
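Steps 3.1 through 3.7 reduce to a simple window loop, sketched below. The `block` callable is a placeholder standing in for the per-window attention and feed-forward computation of one load semantic mining block (an identity-like doubling here, purely for illustration); the function name is an assumption, not the patent's.

```python
# Sketch of steps 3.1-3.7: slide a non-overlapping window of L bytes over
# the input features, run a per-window block, and concatenate the window
# features into F8 (formula (12)).
import numpy as np

def sliding_window_features(f3, L, block):
    outs = []
    for start in range(0, f3.shape[0], L):      # window moves by L bytes
        window = f3[start:start + L]            # data inside the window, F3'
        outs.append(block(window))              # F7 for this window
    return np.concatenate(outs, axis=0)         # F8, shape (N, d)

f3 = np.arange(24, dtype=float).reshape(8, 3)   # N=8 bytes, d=3
f8 = sliding_window_features(f3, L=4, block=lambda w: w * 2.0)
print(f8.shape)  # (8, 3)
```

Confining `block` to each window is what keeps the self-attention cost at L×L per window instead of N×N over the whole sequence.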
In order to extract the multi-scale features of the byte sequence, in step 4 of this embodiment a one-dimensional max pooling layer is first applied to F8 for feature compression and dimension reduction, obtaining the feature F9 ∈ R^((N/2)×d):

F9 = MaxPool1d(F8)  (13)

where MaxPool1d denotes the one-dimensional max pooling operation; each pooling operation halves the first dimension of the feature, while the new feature carries richer semantic information.
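Formula (13) is equivalent to max pooling with kernel and stride 2 along the sequence dimension; a NumPy sketch (assuming, for simplicity, an even sequence length) is:

```python
# NumPy sketch of formula (13): one-dimensional max pooling with kernel and
# stride 2 along the sequence dimension, halving the first dimension of F8.
import numpy as np

def max_pool1d(f8):
    n, d = f8.shape                 # n assumed even here for simplicity
    return f8.reshape(n // 2, 2, d).max(axis=1)   # F9, shape (n/2, d)

f8 = np.array([[1., 5.], [2., 4.], [9., 0.], [3., 7.]])
f9 = max_pool1d(f8)
print(f9)  # [[2. 5.]
           #  [9. 7.]]
```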
The repetition count k is set as required, and steps 3 to 4 are repeated k times; except for the first execution of step 3, whose input is the feature F3, each subsequent execution of step 3 takes the feature F9 obtained in the preceding step 4 as its input.
The features F8 obtained in the k repetitions are concatenated to obtain the feature F10:

F10 = Concat(F8_1, F8_2, …, F8_k)  (14)
As shown in fig. 4, the repeated operations correspond to stacking the load semantic mining blocks of the pyramid network model multiple times, extracting deeper and higher-level semantic features layer by layer. In fig. 4 the feature dimensions are denoted by N and d, where N equals the length of the input byte sequence and d equals the dimension to which each byte is expanded by the word embedding operation. F8_1 denotes the feature obtained by the first repetition and F8_k the feature obtained by the k-th repetition. The feature F10 obtained at this point is the required multi-scale feature of the payload. After the multi-scale features are obtained, traffic classification can be carried out.
in this embodiment, the classification process specifically includes:
Step 5.1: the extracted multi-scale feature F10 is fed into a fully connected layer and an activation function (Softmax), the output dimension being equal to the number of traffic categories C:

Z = Softmax(F10 W)  (15)

where W denotes the weight matrix of the fully connected layer and Z ∈ R^C.
Step 5.2: the type of the encrypted network application protocol is computed from the output:

category = argmax(Z)
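Formula (15) and step 5.2 can be sketched as follows; the random weight matrix stands in for the learned classifier, the feature is flattened before the linear layer (an assumption, since the patent does not state how the matrix-shaped F10 is vectorized), and C = 5 categories is arbitrary.

```python
# Sketch of formula (15) and step 5.2: project the multi-scale feature
# through a fully connected layer, apply Softmax, take argmax as the class.
import numpy as np

def classify(f10, w):
    z = f10.flatten() @ w                        # fully connected layer
    z = np.exp(z - z.max())
    z /= z.sum()                                 # Softmax over C classes
    return z, int(np.argmax(z))                  # probabilities, step 5.2

rng = np.random.default_rng(2)
f10 = rng.normal(size=(6, 4))                    # concatenated multi-scale features
w = rng.normal(size=(24, 5))                     # C = 5 traffic categories
z, category = classify(f10, w)
print(z.shape, 0 <= category < 5)  # (5,) True
```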
in the embodiment, a deep neural network, namely a pyramid neural network is constructed, and the network stacks load semantic mining blocks, so that deep features in an encryption protocol message type in a current complex scene can be extracted, and the accuracy of flow identification is improved.
It should be noted that, in the description of the embodiments of the present invention, unless otherwise explicitly specified or limited, the terms "disposed" and "connected" should be interpreted broadly: for example, a connection may be fixed, detachable or integral, and may be direct or indirect through an intermediary. The specific meanings of the above terms in the present invention can be understood in specific cases by those of ordinary skill in the art. The drawings in the embodiments are used to describe the technical scheme of the embodiments of the invention clearly and completely; obviously, the described embodiments are only a part of the embodiments of the invention, not all of them. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. The encryption application protocol type identification method based on multi-scale load semantic mining is characterized by comprising the following steps:
step 1, preprocessing the original traffic of a mobile application encrypted network, extracting load features from the transport layer load, and converting the load features into a decimal byte sequence;
step 2, constructing a pyramid neural network based on load semantic mining blocks, and obtaining word embedding features and position coding features of the decimal byte sequence, wherein an input feature sequence is obtained by adding the word embedding features and the position coding features;
step 3, the load semantic mining block constructs a sliding window on the input feature sequence, the sliding window slides step by step until it reaches the end of the input sequence, the features within the sliding window are extracted at each move, and the features extracted in all the sliding windows are spliced in order to obtain the features of the input sequence;
step 4, performing feature compression and dimension reduction on the features of the input sequence to serve as a new input sequence, repeating steps 3-4 a total of k times, and splicing the features of the input sequence obtained at each repetition of step 3 to obtain multi-scale features of the input sequence;
and step 5, completing classification of the encrypted network application protocol type according to the multi-scale features.
2. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1, wherein the preprocessing in step 1 is as follows:
step 1.1, dividing the data packets into session flows according to their five-tuples;
step 1.2, cleaning the session flows by removing packets retransmitted after timeout, address resolution protocol packets, and dynamic host configuration protocol packets;
step 1.3, extracting load features from the transport layer load in the data packets, and splicing the extracted load features in order of packet arrival until the spliced byte length reaches the set load feature length;
and step 1.4, converting the spliced load features into a decimal byte sequence.
3. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 2, wherein in step 1.3, if the byte length of the spliced load features of all the data packets in the session flow is still smaller than the set load feature length, the sequence is padded with 0x00.
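As an illustration of the preprocessing in claims 2-3 (steps 1.3-1.4), the following sketch splices transport layer payloads in packet-arrival order, truncates to a set feature length, pads with 0x00 when the spliced data is too short, and converts the result to a decimal byte sequence. Function and variable names are assumptions for illustration, not taken from the patent.

```python
def to_byte_sequence(payloads, feature_len):
    """Splice payloads in arrival order, truncate/pad to feature_len,
    and return a decimal byte sequence (list of ints 0-255)."""
    buf = b"".join(payloads)[:feature_len]   # splice, then truncate
    buf = buf.ljust(feature_len, b"\x00")    # pad with 0x00 if too short
    return list(buf)                         # bytes iterate as decimal ints

seq = to_byte_sequence([b"\x16\x03\x01", b"\xff"], 6)
# -> [22, 3, 1, 255, 0, 0]
```

Note that iterating a Python `bytes` object already yields decimal integers, so no explicit hex-to-decimal conversion step is needed.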
4. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1 or 2, characterized in that in step 2, the byte features of the decimal byte sequence are mapped into a d-dimensional vector space to obtain word embedding features F1, with F1 ∈ R^(n×d), where R denotes the set of real numbers and n is the length of the byte sequence.
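A minimal sketch of the word embedding in claim 4: each byte value (0-255) is mapped to a d-dimensional vector, giving F1 of shape (n, d). The random lookup table here is a placeholder for the learned embedding matrix; all names are illustrative.

```python
import random

def word_embed(byte_seq, d, seed=0):
    """Map each byte value to a d-dimensional vector (stand-in for a
    learned embedding table), returning F1 as an n x d list of lists."""
    rng = random.Random(seed)
    table = [[rng.random() for _ in range(d)] for _ in range(256)]
    return [table[b] for b in byte_seq]

F1 = word_embed([22, 3, 1, 255], d=8)
# F1 has n = 4 rows (one per byte), each of dimension d = 8
```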
5. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 4, wherein in step 2, the position coding features are calculated as follows:

PE(pos, 2i) = sin(pos / 10000^(2i/d))    (1)

PE(pos, 2i+1) = cos(pos / 10000^(2i/d))    (2)

F2 = [PE(1), PE(2), ..., PE(n)]    (3)

where pos denotes the position at which a byte appears in the byte sequence; the left side of formula (1), PE(pos, 2i), denotes the position coding of the even dimensions, and the left side of formula (2), PE(pos, 2i+1), denotes the position coding of the odd dimensions; i is the position coding dimension subscript, so that 2i indexes the even dimensions and 2i+1 the odd dimensions; d is the position coding dimension; F2 is the position coding feature; and PE(pos) in formula (3) denotes the position code of each byte in the byte sequence.
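The position coding of claim 5 can be sketched as follows, assuming the standard sinusoidal formulation that formulas (1)-(2) describe: even dimensions use sine, odd dimensions use cosine, both with frequency term 10000^(2i/d).

```python
import math

def position_encoding(n, d):
    """Sinusoidal position codes: one d-dimensional row per byte position,
    sin on even dimensions, cos on odd dimensions (formulas (1)-(2))."""
    pe = []
    for pos in range(n):
        row = []
        for j in range(d):
            # (j - j % 2) equals 2i for both dimension indices 2i and 2i+1
            angle = pos / (10000 ** ((j - j % 2) / d))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe  # F2: the position code of each byte, as in formula (3)

F2 = position_encoding(4, 8)
```

Because the codes depend only on position and dimension, F2 can be precomputed once for the set load feature length and added elementwise to the word embedding features F1.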
6. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1, wherein the substeps of step 3 comprise:
step 3.1, constructing a sliding window of size L bytes on the input feature sequence;
step 3.2, performing feature extraction on the data in the sliding window by a multi-head attention mechanism to obtain features F4;
step 3.3, performing residual connection and layer normalization on the input sequence F3 and the features F4 to obtain features F5;
step 3.4, passing the features F5 through two fully connected layers to obtain features F6;
step 3.5, performing residual connection and layer normalization on the features F5 and the features F6 to obtain features F7;
step 3.6, moving the sliding window backwards by L bytes, and repeating steps 3.2 to 3.6 until the sliding window reaches the end of the input sequence;
and step 3.7, splicing the features F7 from all the sliding windows to obtain features F8, which serve as the features of the input sequence.
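The window-and-splice skeleton of claim 6 can be sketched as below. The `transform` argument is a placeholder standing in for steps 3.2-3.5 (multi-head attention, residual connections, layer normalization, and the two fully connected layers); the toy transform used in the example is purely illustrative.

```python
def semantic_mining_block(seq, L, transform):
    """Slide a window of L items over seq in steps of L (step 3.6),
    transform each window (steps 3.2-3.5), and splice the per-window
    outputs in order (step 3.7) to obtain the sequence features F8."""
    out = []
    for start in range(0, len(seq), L):
        window = seq[start:start + L]
        out.extend(transform(window))  # F7 for this window
    return out                         # F8: spliced features

# toy transform: replace each item by its window's maximum
f8 = semantic_mining_block([3, 1, 2, 5, 4], 2, lambda w: [max(w)] * len(w))
# -> [3, 3, 5, 5, 4]
```

Because the window advances by its own length L, the windows are non-overlapping and attention cost grows linearly in sequence length rather than quadratically, which is the practical motivation for windowed attention designs.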
7. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 6, wherein the substeps of step 3.2 are:
step 3.2.1, performing multi-head self-attention computation on the data in the sliding window to extract the association relations of the byte sequence within the window;
and step 3.2.2, repeating step 3.2.1 M times according to the set number of attention heads M, and splicing and linearly transforming the results extracted each time to obtain the features F4 of the data in the sliding window.
8. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1, characterized in that in step 4, a one-dimensional max pooling layer is used for feature compression and dimension reduction, and each pooling operation halves the first dimension of the features.
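The dimension reduction of claim 8 amounts to one-dimensional max pooling with kernel size 2 and stride 2, which halves the first dimension at each application; a minimal sketch over a flat sequence:

```python
def max_pool_1d(seq):
    """1-D max pooling, kernel 2, stride 2: halves the first dimension."""
    return [max(seq[i], seq[i + 1]) for i in range(0, len(seq) - 1, 2)]

pooled = max_pool_1d([1, 3, 2, 5, 0, 4])
# -> [3, 5, 4]: length 6 halved to 3
```

Applied after each load semantic mining block, successive halvings are what give the stacked network its pyramid shape and its progressively coarser feature scales.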
9. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 1, wherein the substeps of step 5 comprise:
step 5.1, inputting the extracted multi-scale features into a fully connected layer and an activation function, the output dimension being consistent with the number of traffic categories;
and step 5.2, computing the encrypted network application protocol type from the output.
10. The encryption application protocol type identification method based on multi-scale load semantic mining according to claim 9, wherein in step 5.2, the category is calculated as:

Category = argmax(Softmax(Z))

where Category denotes the predicted class, and Z denotes the output obtained by feeding the multi-scale features into the fully connected layer and activation function.
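As a sketch of the class computation in claim 10, assuming the usual argmax over a Softmax of the network output Z (the patent's original formula image is not reproduced in this text):

```python
import math

def predict_class(z):
    """Return the index of the largest Softmax probability of logits z."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    probs = [e / total for e in exps]                     # Softmax(Z)
    return max(range(len(probs)), key=probs.__getitem__)  # argmax

label = predict_class([0.1, 2.3, -0.7, 1.1])
# -> 1, the index of the largest logit
```

Since Softmax is monotonic, the argmax of the probabilities equals the argmax of the raw logits; the Softmax is still useful when calibrated class probabilities are needed alongside the label.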
CN202310189712.1A 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining Active CN115883263B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310189712.1A CN115883263B (en) 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining

Publications (2)

Publication Number Publication Date
CN115883263A true CN115883263A (en) 2023-03-31
CN115883263B CN115883263B (en) 2023-05-09

Family

ID=85761794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310189712.1A Active CN115883263B (en) 2023-03-02 2023-03-02 Encryption application protocol type identification method based on multi-scale load semantic mining

Country Status (1)

Country Link
CN (1) CN115883263B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3111612A1 (en) * 2014-02-28 2017-01-04 British Telecommunications Public Limited Company Profiling for malicious encrypted network traffic identification
CN104052749A (en) * 2014-06-23 2014-09-17 中国科学技术大学 Method for identifying link-layer protocol data types
CN104506484A (en) * 2014-11-11 2015-04-08 中国电子科技集团公司第三十研究所 Proprietary protocol analysis and identification method
US20180115567A1 (en) * 2015-03-17 2018-04-26 British Telecommunications Public Limited Company Learned profiles for malicious encrypted network traffic identification
CN105430021A (en) * 2015-12-31 2016-03-23 中国人民解放军国防科学技术大学 Encrypted traffic identification method based on load adjacent probability model
CN110532564A (en) * 2019-08-30 2019-12-03 中国人民解放军陆军工程大学 Application layer protocol online identification method based on CNN and LSTM mixed model
CN111211948A (en) * 2020-01-15 2020-05-29 太原理工大学 Shodan flow identification method based on load characteristics and statistical characteristics
WO2022041394A1 (en) * 2020-08-28 2022-03-03 南京邮电大学 Method and apparatus for identifying network encrypted traffic
CN112163594A (en) * 2020-08-28 2021-01-01 南京邮电大学 Network encryption traffic identification method and device
WO2022094926A1 (en) * 2020-11-06 2022-05-12 中国科学院深圳先进技术研究院 Encrypted traffic identification method, and system, terminal and storage medium
CN112511555A (en) * 2020-12-15 2021-03-16 中国电子科技集团公司第三十研究所 Private encryption protocol message classification method based on sparse representation and convolutional neural network
CN113949653A (en) * 2021-10-18 2022-01-18 中铁二院工程集团有限责任公司 Encryption protocol identification method and system based on deep learning
CN114358118A (en) * 2021-11-29 2022-04-15 南京邮电大学 Multi-task encrypted network traffic classification method based on cross-modal feature fusion
CN115348215A (en) * 2022-07-25 2022-11-15 南京信息工程大学 Encrypted network flow classification method based on space-time attention mechanism
CN115277888A (en) * 2022-09-26 2022-11-01 中国电子科技集团公司第三十研究所 Method and system for analyzing message type of mobile application encryption protocol
CN115348198A (en) * 2022-10-19 2022-11-15 中国电子科技集团公司第三十研究所 Unknown encryption protocol identification and classification method, device and medium based on feature retrieval

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JINHAI ZHANG: "Research on Key Technology of VPN Protocol Recognition" *
刘帅: "基于机器学习的加密流量识别研究与实现" *

Also Published As

Publication number Publication date
CN115883263B (en) 2023-05-09

Similar Documents

Publication Publication Date Title
CN109818930B (en) Communication text data transmission method based on TCP protocol
CN104918046B (en) A kind of local description compression method and device
CN112511555A (en) Private encryption protocol message classification method based on sparse representation and convolutional neural network
CN112702235B (en) Method for automatically and reversely analyzing unknown protocol
CN103955539B (en) Method and device for obtaining control field demarcation point in binary protocol data
WO2020207410A1 (en) Data compression method, electronic device, and storage medium
CN115473850B (en) AI-based real-time data filtering method, system and storage medium
CN112887291A (en) I2P traffic identification method and system based on deep learning
CN115277888B (en) Method and system for analyzing message type of mobile application encryption protocol
CN113037646A (en) Train communication network flow identification method based on deep learning
CN111130942B (en) Application flow identification method based on message size analysis
CN116975733A (en) Traffic classification system, model training method, device, and storage medium
CN108563795B (en) Pairs method for accelerating matching of regular expressions of compressed flow
CN110796182A (en) Bill classification method and system for small amount of samples
CN112383488B (en) Content identification method suitable for encrypted and non-encrypted data streams
CN113128626A (en) Multimedia stream fine classification method based on one-dimensional convolutional neural network model
CN115883263A (en) Encryption application protocol type identification method based on multi-scale load semantic mining
CN108573069B (en) Twins method for accelerating matching of regular expressions of compressed flow
CN114553790A (en) Multi-mode feature-based small sample learning Internet of things traffic classification method and system
CN113852605B (en) Protocol format automatic inference method and system based on relation reasoning
CN105938562B (en) A kind of automated network employing fingerprint extracting method and system
CN101262493B (en) Method for accelerating inter-network data transmission via stream buffer
CN114519390A (en) QUIC flow classification method based on multi-mode deep learning
US7657559B2 (en) Method to exchange objects between object-oriented and non-object-oriented environments
CN114048799A (en) Zero-day traffic classification method based on statistical information and payload coding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant