CN112153045A

CN112153045A - Method and system for identifying encrypted field of private protocol

Info

Publication number: CN112153045A
Application number: CN202011014521.4A
Authority: CN
Inventors: 李青; 鞠永慧; 赵唱; 何鑫泰; 李光松
Original assignee: Information Engineering University of PLA Strategic Support Force
Current assignee: Information Engineering University of PLA Strategic Support Force
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2020-12-29
Anticipated expiration: 2040-09-24
Also published as: CN112153045B

Abstract

The invention provides a method and a system for identifying an encrypted field of a private protocol. The method comprises: acquiring a plurality of pieces of data to be detected; performing data preprocessing and grouping on all the data to be detected to obtain p groups of preprocessed data sets; for each preprocessed data set, extracting the bytes at the same position in each piece of preprocessed data to construct reconstructed data, thereby obtaining a reconstructed data set; inputting all the reconstructed data sets into an identification model for plaintext/ciphertext identification to obtain a plaintext/ciphertext result distribution matrix; determining an encryption probability sequence according to the plaintext/ciphertext result distribution matrix; determining a plurality of encrypted field distribution modes by using the encryption probability sequence and a first filtering rule; and, for each piece of data to be detected, calculating a statistic of the data under each encrypted field distribution mode and determining the matching encrypted field distribution mode of the data in combination with a second filtering rule. In this way, the encrypted traffic of the private protocol is identified and its encrypted field distribution mode is determined, which improves the accuracy and effectiveness of encrypted traffic identification.

Description

Method and system for identifying encrypted field of private protocol

Technical Field

The invention relates to the technical field of data processing, in particular to a method and a system for identifying an encrypted field of a private protocol.

Background

With the development of the Internet, ensuring network security has become one of the most important aspects of network management. Attackers often hide network attacks behind encryption protocols, so encrypted traffic may contain malicious traffic carrying worms, Trojans and the like. Encrypted traffic and unencrypted traffic therefore need to be distinguished within all traffic so that the encrypted traffic can be analyzed further.

At present, encrypted traffic is identified by payload randomness detection, that is, randomness tests such as frequency tests, runs tests and information entropy calculation are applied to the data payload. However, payload randomness detection can only identify encrypted traffic for protocol data whose data specification is known. For private protocol data, whose data specification is unknown, payload randomness detection cannot accurately identify encrypted traffic; in other words, its identification accuracy is low and its effect is poor.
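The information entropy calculation mentioned above can be illustrated with a minimal sketch. The function name `byte_entropy` and the sample inputs are assumptions made for illustration; the source does not specify how the entropy is computed or which threshold separates encrypted from unencrypted data.

```python
import math
from collections import Counter


def byte_entropy(payload: bytes) -> float:
    """Shannon entropy of the byte-value distribution, in bits per byte.

    The result lies between 0.0 (a single repeated byte value) and
    8.0 (all 256 byte values equally frequent).
    """
    if not payload:
        return 0.0
    total = len(payload)
    counts = Counter(payload)  # occurrences of each byte value
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical inputs: a repetitive text payload and a uniform byte sequence
# standing in for ciphertext-like data.
text_like = b"GET /index.html HTTP/1.1\r\n" * 10
uniform_like = bytes(range(256))
print(byte_entropy(text_like) < byte_entropy(uniform_like))
```

Such a test looks at a payload as a whole, which is why it needs to know where the payload begins and ends, that is, the data specification of the protocol.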

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and a system for identifying an encrypted field of a private protocol, so as to solve the problems of low identification accuracy and poor effect of payload randomness detection.

In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:

the first aspect of the embodiments of the present invention discloses a method for identifying an encrypted field of a private protocol, where the method includes:

acquiring a plurality of pieces of data to be detected, wherein the type of each piece of data to be detected is discrete sequence message data;

performing data preprocessing and grouping processing on all the data to be detected to obtain p groups of preprocessed data sets, wherein each group of preprocessed data sets comprises q preprocessed data, and p and q are positive integers;

for each group of the preprocessed data sets, extracting bytes at the same position in each preprocessed data of the preprocessed data sets to construct a piece of reconstructed data, and obtaining a reconstructed data set containing a plurality of pieces of reconstructed data;

inputting all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix, wherein the identification model is obtained by training a neural network model on plaintext sample data and ciphertext sample data;

determining, according to the plaintext/ciphertext result distribution matrix, an encryption probability sequence containing the encryption probability of each byte in all the reconstructed data sets;

determining a plurality of encryption field distribution modes by using the encryption probability sequence and a preset first filtering rule;

and calculating, for each piece of data to be detected, a statistic of the data under each encrypted field distribution mode, and determining, in combination with a preset second filtering rule, the encrypted field distribution mode whose statistic meets a preset matching requirement as the matching encrypted field distribution mode of the data to be detected.

Preferably, the determining, by using the encryption probability sequence and a preset first filtering rule, a plurality of encryption field distribution patterns includes:

determining a derivative of each encryption probability in the encryption probability sequence by using a preset derivative function;

combining the bytes corresponding to the encryption probability with the derivative larger than the derivative threshold value to obtain a plurality of initial encryption field distribution modes;

and filtering all the initial encryption field distribution modes by using a preset first filtering rule to obtain a plurality of final encryption field distribution modes.

Preferably, the calculating, for each piece of the to-be-detected data, statistics corresponding to the to-be-detected data in each encryption field distribution mode, and determining, in combination with a preset second filtering rule, that the encryption field distribution mode corresponding to the statistics meeting a preset matching requirement is a matching encryption field distribution mode of the to-be-detected data includes:

for each piece of data to be detected, determining an unencrypted field and an encrypted field of the data to be detected in each encrypted field distribution mode, and calculating a variance difference value between the unencrypted field and the encrypted field in each encrypted field distribution mode;

and determining the encryption field distribution mode corresponding to the largest variance difference value as a matched encryption field distribution mode of the data to be detected by combining a preset second filtering rule for each piece of data to be detected.

Preferably, the performing data preprocessing and grouping processing on all the data to be detected to obtain p groups of preprocessed data sets includes:

determining the data to be detected with the longest data length in all the data to be detected;

filling all the data to be detected to enable the data length of all the data to be detected to be consistent with the data length of the data to be detected with the longest data length;

normalizing all the data to be detected after filling processing to obtain a plurality of pieces of preprocessed data;

and randomly and uniformly dividing the plurality of pieces of preprocessing data to obtain p groups of preprocessing data sets.

Preferably, the inputting all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix includes:

inputting all the reconstructed data sets into the preset identification model and repeating the plaintext/ciphertext identification k times to obtain the plaintext/ciphertext result distribution matrix, wherein the plaintext/ciphertext result distribution matrix comprises the plaintext/ciphertext identification result sub-matrix obtained from each round of identification, and k is a positive integer;

correspondingly, the determining, according to the plaintext/ciphertext result distribution matrix, an encryption probability sequence containing the encryption probability of each byte in all the reconstructed data sets includes:

for each plaintext/ciphertext identification result sub-matrix, calculating the average value of each column of the sub-matrix to obtain an initial encryption probability of each byte in all the reconstructed data sets;

and averaging all the initial encryption probabilities of each byte in all the reconstructed data sets to obtain the encryption probability sequence, which consists of the final encryption probability of each byte in all the reconstructed data sets.

The second aspect of the embodiments of the present invention discloses a system for identifying an encrypted field of a private protocol, where the system includes:

an acquisition unit, configured to acquire a plurality of pieces of data to be detected, wherein the type of each piece of data to be detected is discrete sequence message data;

the processing unit is used for carrying out data preprocessing and grouping processing on all the data to be detected to obtain p groups of preprocessed data sets, each group of preprocessed data sets comprises q preprocessed data, and p and q are positive integers;

the constructing unit is used for extracting bytes at the same position in each piece of the preprocessed data of each group of the preprocessed data set to construct a piece of reconstructed data so as to obtain a reconstructed data set containing a plurality of pieces of the reconstructed data;

an identification unit, configured to input all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix, wherein the identification model is obtained by training a neural network model on plaintext sample data and ciphertext sample data;

a first determining unit, configured to determine, according to the plaintext/ciphertext result distribution matrix, an encryption probability sequence containing the encryption probability of each byte in all the reconstructed data sets;

the second determining unit is used for determining a plurality of encryption field distribution modes by utilizing the encryption probability sequence and a preset first filtering rule;

and the matching unit is used for calculating the statistic corresponding to each encrypted field distribution mode of the to-be-detected data according to each piece of to-be-detected data, and determining the encrypted field distribution mode corresponding to the statistic meeting the preset matching requirement as the matched encrypted field distribution mode of the to-be-detected data by combining a preset second filtering rule.

Preferably, the second determination unit includes:

the determining module is used for determining a derivative of each encryption probability in the encryption probability sequence by using a preset derivative function;

the combination module is used for combining the bytes corresponding to the encryption probability with the derivative larger than the derivative threshold value to obtain a plurality of initial encryption field distribution modes;

and the filtering module is used for filtering all the initial encryption field distribution modes by using a preset first filtering rule to obtain a plurality of final encryption field distribution modes.

Preferably, the matching unit includes:

the processing module is used for determining an unencrypted field and an encrypted field of the data to be tested in each encrypted field distribution mode and calculating a variance difference value between the unencrypted field and the encrypted field in each encrypted field distribution mode aiming at each piece of data to be tested;

and the matching module is used for determining the encryption field distribution mode corresponding to the largest variance difference value as the matching encryption field distribution mode of the data to be detected by combining a preset second filtering rule aiming at each piece of data to be detected.

Preferably, the processing unit includes:

the determining module is used for determining the data to be detected with the longest data length in all the data to be detected;

the processing module is used for filling all the data to be detected to enable the data length of all the data to be detected to be consistent with the data length of the data to be detected with the longest data length;

the normalization module is used for normalizing all the data to be detected after the filling processing to obtain a plurality of pieces of preprocessing data;

and the dividing module is used for randomly and uniformly dividing the plurality of pieces of preprocessing data to obtain p groups of preprocessing data sets.

Preferably, the identification unit is specifically configured to: input all the reconstructed data sets into the preset identification model and repeat the plaintext/ciphertext identification k times to obtain the plaintext/ciphertext result distribution matrix, wherein the plaintext/ciphertext result distribution matrix comprises the plaintext/ciphertext identification result sub-matrix obtained from each round of identification, and k is a positive integer;

correspondingly, the first determining unit is specifically configured to: for each plaintext/ciphertext identification result sub-matrix, calculate the average value of each column of the sub-matrix to obtain an initial encryption probability of each byte in all the reconstructed data sets; and average all the initial encryption probabilities of each byte in all the reconstructed data sets to obtain the encryption probability sequence, which consists of the final encryption probability of each byte in all the reconstructed data sets.

In the method and system for identifying an encrypted field of a private protocol provided by the embodiments of the present invention, the method is: acquiring a plurality of pieces of data to be detected; performing data preprocessing and grouping on all the data to be detected to obtain p groups of preprocessed data sets; for each group of preprocessed data sets, extracting the bytes at the same position in each piece of preprocessed data of the set to construct a piece of reconstructed data, obtaining a reconstructed data set containing a plurality of pieces of reconstructed data; inputting all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix; determining, according to the plaintext/ciphertext result distribution matrix, an encryption probability sequence containing the encryption probability of each byte in all the reconstructed data sets; determining a plurality of encrypted field distribution modes by using the encryption probability sequence and a preset first filtering rule; and, for each piece of data to be detected, calculating a statistic of the data under each encrypted field distribution mode and determining, in combination with a preset second filtering rule, the encrypted field distribution mode whose statistic meets a preset matching requirement as the matching encrypted field distribution mode of the data to be detected. In this way, the encrypted traffic of the private protocol is identified and its encrypted field distribution mode is determined, which improves the accuracy and effectiveness of encrypted traffic identification.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a flowchart of a method for identifying an encrypted field of a private protocol according to an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of each group of preprocessed data sets according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an identification model according to an embodiment of the present invention;

Fig. 4 is a flowchart of determining a plurality of encrypted field distribution modes according to an embodiment of the present invention;

Fig. 5 is a flowchart of determining the matching encrypted field distribution mode of data to be detected according to an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of an encrypted field according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of the training effect of a convolutional neural network model according to an embodiment of the present invention;

Fig. 8 is a block diagram of a system for identifying an encrypted field of a private protocol according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

As can be seen from the background, encrypted traffic is currently identified by payload randomness detection. For private protocol data whose data specification is unknown, however, this approach cannot accurately identify encrypted traffic, so its identification accuracy is low and its effect is poor.

Therefore, an embodiment of the present invention provides a method for identifying an encrypted field of a private protocol. Data preprocessing and grouping are performed on all the data to be detected to obtain p groups of preprocessed data sets, and for each preprocessed data set the bytes at the same position in each piece of preprocessed data are extracted to construct reconstructed data, yielding a reconstructed data set. All the reconstructed data sets are input into an identification model for plaintext/ciphertext identification to obtain a plaintext/ciphertext result distribution matrix, and an encryption probability sequence is determined according to that matrix. A plurality of encrypted field distribution modes are determined using the encryption probability sequence and a first filtering rule. For each piece of data to be detected, a statistic of the data under each encrypted field distribution mode is calculated, and the matching encrypted field distribution mode of the data is determined in combination with a second filtering rule. The encrypted traffic of the private protocol is thereby identified and its encrypted field distribution mode determined, which improves the accuracy and effectiveness of encrypted traffic identification.

It should be noted that private protocol data is characteristically scattered and short, so it is generally referred to as discrete sequence message data.

Referring to fig. 1, a flowchart of an identification method for an encrypted field of a private protocol according to an embodiment of the present invention is shown, where the identification method includes:

Step S101: Acquire a plurality of pieces of data to be detected.

It should be noted that the type of each piece of data to be detected is discrete sequence message data.

It is understood that the plurality of pieces of data to be detected constitute a data set B to be detected, which has the form of equation (1).

$$B = \{b_1, b_2, \ldots, b_m\} \tag{1}$$

In equation (1), $b_i$ is the byte vector obtained by converting the i-th piece of data to be detected into decimal values byte by byte, i.e. $b_i = (b_{i1}, b_{i2}, \ldots, b_{in_i})$, where $1 \le j \le n_i$, $n_i$ is the byte length of the i-th piece of data to be detected, and m is the total number of pieces of data to be detected.

Step S102: Perform data preprocessing and grouping on all the data to be detected to obtain p groups of preprocessed data sets.

It should be noted that each set of preprocessed data includes q pieces of preprocessed data, and p and q are positive integers.

It should be further noted that due to the particularity of the discrete sequence message data, there are usually a large amount of data with equal length or limited length variation in the data of the same protocol, for example, the length of the data of the Automatic Identification System (AIS) is usually 168 bits.

That is, the data lengths of the pieces of data to be detected may differ, so all the data to be detected need to be preprocessed to make their data lengths identical.

In the specific implementation of step S102, the piece of data to be detected with the longest data length is determined, and all the data to be detected are padded (for example, with zeros) so that every piece has the same data length as the longest piece.

For example, assuming that the longest piece of data to be detected is 168 bits long and another piece is 166 bits long, the 166-bit piece is padded with zeros to 168 bits. In this way, the data length of every piece of data to be detected becomes consistent with that of the longest piece.

To better explain the padding of all the data to be detected, the description continues with reference to equation (1) above.

Assuming that the longest piece of data to be detected has data length n, padding all the data to be detected yields a padded data set to be detected in which every piece has length n, as in equation (2).

All the padded data to be detected (now of equal data length) are normalized to obtain a plurality of pieces of preprocessed data, and the pieces of preprocessed data are randomly and evenly divided to obtain p groups of preprocessed data sets. For example, the pieces of preprocessed data are shuffled and then divided into p groups of preprocessed data sets, each containing q pieces of preprocessed data.
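The preprocessing of step S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function name `preprocess_and_group`, scaling each byte by 255 as the normalization, the fixed random seed, and dropping leftover pieces when the number of pieces is not a multiple of p are all choices made here, since the source does not specify them.

```python
import random


def preprocess_and_group(messages, p, seed=0, pad_byte=0):
    """Pad, normalize, shuffle and split messages into p groups of q rows.

    messages: list of bytes objects (the data to be detected).
    Returns a list of p groups; each group is a list of q rows, and each
    row is a list of n floats in [0, 1], n being the longest message length.
    """
    if p <= 0 or not messages:
        raise ValueError("need at least one message and a positive p")
    n = max(len(m) for m in messages)  # longest data length
    # Pad every message on the right up to length n (zero padding by default).
    padded = [bytes(m) + bytes([pad_byte]) * (n - len(m)) for m in messages]
    # Assumption: normalization maps each byte value 0..255 to [0, 1].
    normalized = [[b / 255.0 for b in m] for m in padded]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(normalized)
    q = len(normalized) // p
    if q == 0:
        raise ValueError("fewer messages than groups")
    # Assumption: pieces left over after forming p groups of q are dropped.
    return [normalized[g * q:(g + 1) * q] for g in range(p)]


groups = preprocess_and_group([b"\x01\x02", b"\xff", b"\x03\x04\x05", b"\x00"], p=2)
print(len(groups), len(groups[0]), len(groups[0][0]))
```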

Step S103: For each group of preprocessed data sets, extract the bytes at the same position in each piece of preprocessed data of the set to construct a piece of reconstructed data, obtaining a reconstructed data set containing a plurality of pieces of reconstructed data.

In the specific implementation of step S103, for each group of preprocessed data sets, the bytes at the same position (the same byte offset) in each piece of preprocessed data of the set are taken as one piece of reconstructed data, which yields the reconstructed data set corresponding to that group. For example, for all the preprocessed data in one group, the first byte of each piece of preprocessed data is extracted to construct one piece of reconstructed data, the second byte of each piece is extracted to construct another, and so on for every byte position.

To better explain the construction of the reconstructed data, it is illustrated with reference to equation (2) above and the schematic structural diagram of each group of preprocessed data sets shown in fig. 2.

In fig. 2, the preprocessed data in each of the p groups of preprocessed data sets are aligned. For the group 1 preprocessed data set, the first byte of each piece of preprocessed data in the group is extracted to construct a piece of reconstructed data $b'_{11}$, the second byte of each piece is extracted to construct a piece of reconstructed data $b'_{12}$, and so on, extracting the bytes at the same position column by column to construct the reconstructed data. The group 2 to group p preprocessed data sets are processed in the same way, which is not repeated here.

It is understood that each set of preprocessed data sets corresponds to a reconstructed data set including n pieces of reconstructed data with length q, and a total set B' of reconstructed data formed by all the reconstructed data sets is as shown in formula (3).

$$B' = \{\{b'_{11}, b'_{12}, \ldots, b'_{1n}\}, \ldots, \{b'_{p1}, b'_{p2}, \ldots, b'_{pn}\}\} \tag{3}$$
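The column-wise extraction of step S103 amounts to transposing each group. A minimal sketch, assuming each group is a list of q equal-length rows (the name `reconstruct` is hypothetical):

```python
def reconstruct(groups):
    """Build the reconstructed data sets from grouped preprocessed data.

    groups: list of p groups, each a list of q rows of equal length n.
    Returns a list of p reconstructed data sets; the j-th entry of the
    i-th set collects the j-th byte of every row in group i, so each
    set holds n pieces of reconstructed data of length q.
    """
    return [[list(column) for column in zip(*group)] for group in groups]


# Two groups (p = 2), each with q = 2 rows of n = 3 bytes.
groups = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
print(reconstruct(groups))
# [[[1, 4], [2, 5], [3, 6]], [[7, 10], [8, 11], [9, 12]]]
```

Each inner list corresponds to one piece of reconstructed data, so the nested result mirrors the structure of the total set B' in equation (3).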

Step S104: Input all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix.

It is understood that when traffic data is processed byte by byte, each byte of the traffic data is analogous to a pixel of an image. If the traffic data is converted into a two-dimensional matrix, it can be processed as image data using deep learning techniques. Therefore, in the embodiment of the present invention, a neural network model (such as a convolutional neural network model) is used to process the reconstructed data, and the neural network model is trained in advance on plaintext sample data and ciphertext sample data to obtain the identification model.

The structure of the identification model is shown in the schematic structural diagram of fig. 3; it should be noted that fig. 3 is for illustration only. As shown in fig. 3, the identification model includes 2 convolutional layers, 2 pooling layers, a normalization layer with ReLU, 2 fully connected layers, and a Softmax layer. The neural network specification of the identification model is shown in table 1.

Table 1:

It should be noted that, to ensure the plaintext/ciphertext identification performance and the generalization of the identification model, the plaintext sample data and ciphertext sample data are constructed from a sufficient number of samples covering multiple protocol types (such as character-based and binary protocols); the samples of the various protocol types need to be kept balanced, and no noise is added. It should be further noted that the ciphertext sample data consist of encrypted data generated with various encryption algorithms together with encrypted data selected from an actual network.

The sample data (plaintext sample data and ciphertext sample data) are converted into two-dimensional pixel matrices and normalized, and the normalized sample data are input into the neural network model, which is trained until convergence to obtain the identification model (equivalent to a classifier). The identification model then performs plaintext/ciphertext identification on its input data, classifying the data as encrypted or unencrypted.
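The conversion of a byte sequence into a normalized two-dimensional pixel matrix can be sketched as below. The matrix width, the zero padding of the last row and the division by 255 are assumptions made for illustration; the source does not give the matrix dimensions or the normalization formula, and `to_pixel_matrix` is a hypothetical name.

```python
def to_pixel_matrix(sample: bytes, width: int = 8):
    """Reshape a byte sequence into rows of `width` values scaled to [0, 1].

    The sequence is zero-padded on the right so that its length is a
    multiple of `width` (assumption), then each byte is treated as one
    grayscale pixel and divided by 255 (assumption).
    """
    if width <= 0:
        raise ValueError("width must be positive")
    padded_len = -(-len(sample) // width) * width  # round up to a multiple of width
    padded = sample + bytes(padded_len - len(sample))
    return [
        [b / 255.0 for b in padded[start:start + width]]
        for start in range(0, padded_len, width)
    ]


matrix = to_pixel_matrix(bytes([0, 255, 51]), width=2)
print(matrix)  # two rows of two pixels; the last pixel is padding
```

A matrix of this kind is what a convolutional model would receive as a single-channel image.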

In the specific implementation of step S104, all the reconstructed data sets are input into the preset identification model and the plaintext/ciphertext identification is repeated k times, that is, the identification model performs plaintext/ciphertext identification on all the reconstructed data sets k times to obtain the corresponding plaintext/ciphertext result distribution matrix.

It should be noted that the plaintext/ciphertext result distribution matrix includes the plaintext/ciphertext identification result sub-matrix obtained from each round of identification, that is, the identification model produces one sub-matrix per round; k is a positive integer, and the plaintext/ciphertext result distribution matrix has the form of equation (4).

$$R = \{r^1, r^2, \ldots, r^k\} \tag{4}$$

The plaintext/ciphertext identification result sub-matrix $r^l$ obtained from the l-th round of plaintext/ciphertext identification is as in equation (5).

$$r^l = \begin{pmatrix} r^l_{11} & r^l_{12} & \cdots & r^l_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ r^l_{p1} & r^l_{p2} & \cdots & r^l_{pn} \end{pmatrix} \tag{5}$$

It is understood that, in equation (5), $r^l_{ij}$ is the l-th plaintext/ciphertext identification result for the reconstructed data $b'_{ij}$ composed of the j-th byte of the i-th preprocessed data set, where the value 0 represents plaintext data (unencrypted data) and the value 1 represents ciphertext data (encrypted data).

It should be noted that the start and end positions of the encrypted fields differ between data of different protocol types, so the encrypted fields of data of different protocol types have overlapping sections and interleaved sections, mainly including overlaps of plaintext data with plaintext data, plaintext data with ciphertext data, ciphertext data with ciphertext data, plaintext data with padding data, ciphertext data with padding data, and the like.

Since the overlapping section contains the most encrypted data, its encryption probability is the highest and tends to 1. The interleaved section is a mixture of encrypted and unencrypted data; its encryption probability is lower than that of the overlapping section, is generally distributed between 0 and 1, and varies with the proportion of encrypted data. The unencrypted section consists of plaintext data, so its encryption probability is low, usually 0.

Step S105: determining, according to the plaintext/ciphertext result distribution matrix, an encryption probability sequence containing the encryption probability of each byte in all the reconstructed data sets.

In the specific implementation of step S105, for each identification result sub-matrix, the average value of each column of the sub-matrix is calculated to obtain the initial encryption probability of each byte in all the reconstructed data sets. Combining formula (4) and formula (5), the initial encryption probability of each byte for the sub-matrix $r^l$ of the $l$-th identification is as in formula (6).

$$p^l_j = \frac{1}{p}\sum_{i=1}^{p} r^l_{ij}, \quad j = 1, 2, \ldots, n \tag{6}$$

In formula (6), $r^l_{ij}$ is the entry in row $i$ and column $j$ of $r^l$, $p$ is the number of preprocessed data sets, and $p^l_j$ is the initial encryption probability of the $j$-th byte in the $l$-th identification.

It should be noted that, for each byte in all the reconstructed data sets, one initial encryption probability can be calculated from each identification result sub-matrix.

After the initial encryption probability of each byte has been determined for every identification result sub-matrix, all the initial encryption probabilities of each byte are averaged to obtain the final encryption probability of that byte, and the final encryption probabilities of all bytes form the encryption probability sequence. That is, each byte in the reconstructed data sets has k initial encryption probabilities (one per identification run), and their average is the final encryption probability of the byte.
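The two averaging stages described above can be sketched as follows. This is a minimal illustration, assuming the result distribution matrix is held as a list of k sub-matrices, each a list of p rows of n values in {0, 1}; the function name `encryption_probability_sequence` and the toy input are hypothetical and are not taken from the embodiment.

```python
def encryption_probability_sequence(R):
    """R: list of k sub-matrices; each sub-matrix has p rows (data sets)
    and n columns (byte positions) holding 0 (plaintext) or 1 (ciphertext)."""
    per_run = []
    for r in R:
        # Column mean of one sub-matrix: initial encryption probability per byte.
        per_run.append([sum(col) / len(col) for col in zip(*r)])
    # Mean over the k runs: final encryption probability per byte.
    return [sum(vals) / len(vals) for vals in zip(*per_run)]


# Toy input (assumed): k = 2 runs, p = 2 data sets, n = 3 byte positions.
R = [
    [[0, 1, 1],
     [0, 1, 0]],
    [[0, 1, 1],
     [0, 1, 1]],
]
P = encryption_probability_sequence(R)
print(P)  # [0.0, 1.0, 0.75]
```

In the toy input the third byte position is judged encrypted in three of four results, so its final encryption probability is 0.75, the kind of intermediate value expected for an interleaving section.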

In conjunction with the above, the sequence of encryption probabilities is as shown in equation (7).

$$P = \{p_1, p_2, \ldots, p_n\} \tag{7}$$

Wherein, $p_j$ is the final encryption probability of the $j$-th byte, that is, the average of the k initial encryption probabilities of that byte.

Step S106: determining a plurality of encrypted field distribution patterns by using the encryption probability sequence and a preset first filtering rule.

In the specific implementation of step S106, the encryption probability corresponding to each byte in the encryption probability sequence is differentiated to obtain the derivative of the encryption probability of each byte.

A plurality of encrypted field distribution patterns is then determined from the bytes (byte offsets) whose derivatives of the encryption probability meet the preset requirement, in combination with the first filtering rule. The set of all encrypted field distribution patterns is called the encrypted field set.

The specific form of the encryption field set is as shown in formula (8).

Wherein, each encrypted field distribution pattern consists of the start point byte and the end point byte of the encrypted field, both expressed as byte offsets. As can be seen from the foregoing, each piece of preprocessed data has length n, and therefore the start point byte and the end point byte are greater than or equal to -1 and less than or equal to n.

Step S107: for each piece of data to be tested, calculating the statistic corresponding to the data under each encrypted field distribution pattern and, in combination with a preset second filtering rule, determining the encrypted field distribution pattern whose statistic meets the preset matching requirement as the matched encrypted field distribution pattern of the data to be tested.

For each piece of data to be tested, after the distribution modes of the various encrypted fields are determined, the distribution mode of the encrypted fields matched with the data to be tested is determined from the distribution modes of the various encrypted fields, namely the distribution mode of the matched encrypted fields of the data to be tested is determined. As can be seen from the content of step S106, the starting point byte and the ending point byte of the encrypted field are known in each group of encrypted field distribution patterns, that is, after the matching encrypted field distribution pattern of the data to be tested is determined, the specific content of the encrypted field and the non-encrypted field corresponding to the data to be tested can be determined.

In the process of implementing step S107 specifically, for each piece of data to be measured, the statistics corresponding to each encryption field distribution mode of the data to be measured is calculated, that is, each encryption field distribution mode corresponds to one calculated statistic, and the encryption field distribution mode corresponding to the statistics meeting the preset matching requirement is determined as the matching encryption field distribution mode of the data to be measured according to the second filtering rule.

By the method, the distribution mode of the matched encrypted field corresponding to each piece of data to be detected is determined, and the specific content of the encrypted field and the non-encrypted field corresponding to each piece of data to be detected can be determined.

The distribution mode of the matched encrypted fields of all the data to be tested is expressed in a form shown in a formula (9).

field_pre＝{f₁,f₂,...,f_m} (9)

Wherein, in formula (9), $f_i$ represents the encrypted field of the $i$-th piece of data to be tested $b_i$ and consists of the start point byte and the end point byte of the encrypted field of $b_i$, both expressed as byte offsets.

In the embodiment of the invention, data preprocessing and grouping are performed on all the data to be tested to obtain p groups of preprocessed data sets and, for each preprocessed data set, the bytes at the same position in the preprocessed data are extracted to construct reconstructed data, giving a reconstructed data set. All the reconstructed data sets are input into the identification model for plaintext/ciphertext identification to obtain the plaintext/ciphertext result distribution matrix, and the encryption probability sequence is determined from that matrix. A plurality of encrypted field distribution patterns is determined using the encryption probability sequence and the first filtering rule. For each piece of data to be tested, the statistic of the data under each encrypted field distribution pattern is calculated, and the matched encrypted field distribution pattern of the data is determined in combination with the second filtering rule. In this way the encrypted traffic of the private protocol is identified and its encrypted field distribution pattern is determined, which improves the accuracy and effect of encrypted traffic identification.

The process of determining the plurality of encrypted field distribution patterns in step S106 of fig. 1 is shown in fig. 4, a flowchart for determining the plurality of encrypted field distribution patterns provided by an embodiment of the present invention, which includes:

Step S401: determining the derivative of each encryption probability in the encryption probability sequence by using a preset derivative function.

In the specific implementation of step S401, the preset derivative function is applied to the encryption probability sequence of formula (7) to obtain the derivative of each encryption probability in the sequence.

The derivative function is shown in formula (10).

$$p'(x_k) = \frac{-p(x_{k+2}) + 4p(x_{k+1}) - 3p(x_k)}{2h} + O(h^2) \tag{10}$$

In formula (10), $O(h^2)$ is the error term. It can be understood that, when formula (10) is used to calculate the derivative of each encryption probability in the encryption probability sequence, the derivatives of the last two encryption probabilities cannot be calculated, because the formula requires the two following values. These two derivatives are therefore supplemented by copy padding. That is, formula (10) yields only the derivatives of the 1st to (n-2)-th encryption probabilities of the encryption probability sequence $P = \{p_1, p_2, \ldots, p_n\}$; the derivative of the (n-2)-th encryption probability is copied as the derivatives of the (n-1)-th and n-th encryption probabilities, thereby obtaining the derivative of $P$, namely $P' = \{p'_1, p'_2, \ldots, p'_n\}$.
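The derivative calculation with copy padding can be sketched as below. This is a minimal illustration rather than the embodiment's implementation: the function name `derive_with_copy_padding` is hypothetical, and the step `h = 1.0` is an assumption that adjacent byte offsets are one unit apart, which matches the division by 2 used in the worked example later in the description.

```python
def derive_with_copy_padding(P, h=1.0):
    """Three-point forward difference for the first n-2 entries of P,
    with the last computable derivative copied to the final two positions."""
    n = len(P)
    if n < 3:
        raise ValueError("need at least three encryption probabilities")
    # Derivatives of the 1st to (n-2)-th encryption probabilities.
    d = [(-P[k + 2] + 4 * P[k + 1] - 3 * P[k]) / (2 * h) for k in range(n - 2)]
    # Copy padding: reuse the (n-2)-th derivative for positions n-1 and n.
    d += [d[-1], d[-1]]
    return d


# Assumed toy sequence: a non-encrypted section followed by an encrypted one.
P = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
D = derive_with_copy_padding(P)
print(D)  # [0.0, -0.5, 1.5, 0.0, 0.0, 0.0]
```

The derivative stays at 0 inside each section and departs from 0 only around the transition between the third and fourth byte, which is the behaviour the jump point search relies on.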

It should be noted that, in order to better explain how to construct the formula (10), the following explanation is provided.

Assume that the encryption probability sequence $P = \{p_1, p_2, \ldots, p_n\}$ as a whole obeys the function $y = p(x)$, where $x$ takes n different values, i.e. $x = x_1, x_2, \ldots, x_n$ with $x_1 < x_2 < \cdots < x_n$. Expanding the function $y = p(x)$ at $x_0$ by Taylor's formula gives formula (11).

$$p(x) = p(x_0) + p'(x_0)(x - x_0) + \frac{p''(x_0)}{2!}(x - x_0)^2 + \cdots + \frac{p^{(n)}(x_0)}{n!}(x - x_0)^n + R_n(x) \tag{11}$$

$R_n(x)$ in formula (11) is an infinitesimal of higher order than $(x - x_0)^n$. Letting $h = x - x_0$, formula (11) is expressed as formula (12).

$$p(x) = p(x_0) + p'(x_0)h + O(h^2) \tag{12}$$

In formula (12), $O(h^2)$ collects the terms of order $h^2$ and higher, which are negligible for small $h$. Formula (12) can therefore be used to approximately solve the derivative of $p(x)$ at $x_0$, as in formula (13).

Similarly, the derivative of $P$ at $x_k$ can be obtained according to formula (13), as in formula (14).

Formula (15) is further derived from formula (14).

To increase the accuracy of the derivative of $P$ at $x_k$, formula (15) is substituted into formula (11) and the second-order term is retained, which yields the derivative function shown in formula (10).
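The chain of steps leading to formula (10) can be traced as follows. This is a sketch under two assumptions that are not stated explicitly in the text: the points are equally spaced, $x_{k+1} - x_k = h$, and formula (15) is the forward-difference estimate of the second derivative obtained by applying the first-order difference twice.

```latex
\begin{aligned}
p(x_{k+1}) &= p(x_k) + p'(x_k)\,h + \frac{h^2}{2}\,p''(x_k) + O(h^3), \\
p''(x_k) &\approx \frac{p'(x_{k+1}) - p'(x_k)}{h}
          \approx \frac{p(x_{k+2}) - 2p(x_{k+1}) + p(x_k)}{h^2}, \\
p'(x_k) &\approx \frac{p(x_{k+1}) - p(x_k)}{h}
          - \frac{h}{2}\cdot\frac{p(x_{k+2}) - 2p(x_{k+1}) + p(x_k)}{h^2}
          = \frac{-p(x_{k+2}) + 4p(x_{k+1}) - 3p(x_k)}{2h}.
\end{aligned}
```

With $h = 1$ the last line reduces to $(-p(x_{k+2}) + 4p(x_{k+1}) - 3p(x_k))/2$, the form applied to the 40-byte example later in the description.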

Step S402: combining the bytes corresponding to the encryption probabilities whose derivatives are greater than the derivative threshold to obtain a plurality of initial encrypted field distribution patterns.

It should be noted that, for bytes inside the aforementioned overlapping, interleaving and non-encrypted sections, the encryption probability is constant or changes little, so the derivative within each section is close to 0. At the start and end positions of these sections the encryption probability changes abruptly and the derivative steps accordingly. The byte (expressed as a byte offset) at which the derivative steps is called a jump point; that is, a byte whose encryption probability has a derivative greater than the derivative threshold is a jump point.

In the specific implementation of step S402, the bytes corresponding to the encryption probabilities whose derivatives are greater than the derivative threshold are determined from the differentiated encryption probability sequence, and the set of all such bytes is as in formula (16).

In formula (16), $z_1$ to $z_t$ denote the bytes whose encryption probabilities have derivatives greater than the derivative threshold. The derivative threshold can be set according to the actual situation; for example, it may be set to the average of all derivatives in the differentiated encryption probability sequence.
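Jump point extraction can be sketched as below. The function name `find_jump_points` and the use of 0-based byte offsets are assumptions made for illustration; the default threshold follows the example given in the text (the average of all derivatives), and the comparison is applied to the derivative itself, as the text describes.

```python
def find_jump_points(derivatives, threshold=None):
    """Return the byte offsets whose derivative exceeds the threshold."""
    if threshold is None:
        # Example setting from the text: mean of all derivatives.
        threshold = sum(derivatives) / len(derivatives)
    return [i for i, d in enumerate(derivatives) if d > threshold]


# Assumed derivative sequence with a single pronounced step at offset 2.
derivs = [0.0, -0.5, 1.5, 0.0, 0.0, 0.0]
jumps = find_jump_points(derivs)
print(jumps)  # [2]
```

A stricter or looser threshold can be passed explicitly, for example `find_jump_points(derivs, threshold=0.5)`.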

It will be appreciated that for the determined jump points, each jump point may be a start point byte or an end point byte of the encrypted field, and thus all possible initial encrypted field distribution patterns may be obtained by the determined combination of jump points, the set of all initial encrypted field distribution patterns being as in equation (17).

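$$\mathit{field} = \{\langle z_1, z_2\rangle, \langle z_1, z_3\rangle, \ldots, \langle z_1, z_t\rangle, \langle z_2, z_3\rangle, \langle z_2, z_4\rangle, \ldots, \langle z_{t-1}, z_t\rangle\} \tag{17}$$

In formula (17), $\langle z_1, z_2\rangle$ to $\langle z_{t-1}, z_t\rangle$ are the possible initial encrypted field distribution patterns.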
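The enumeration in formula (17) is the set of all ordered pairs of jump points with the start before the end. A minimal sketch, assuming the jump points are plain integers and using the hypothetical name `initial_patterns`:

```python
from itertools import combinations


def initial_patterns(jump_points):
    """All <start, end> pairs with start < end, in the order of formula (17)."""
    return list(combinations(sorted(jump_points), 2))


# Assumed jump points, for illustration only.
patterns = initial_patterns([3, 14, 29])
print(patterns)  # [(3, 14), (3, 29), (14, 29)]
```

For t jump points this produces t(t-1)/2 candidate patterns, which is why the later filtering rules are needed to keep the candidate set small.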

Step S403: filtering all the initial encrypted field distribution patterns by using a preset first filtering rule to obtain a plurality of final encrypted field distribution patterns.

It is understood that all the initial encryption field distribution patterns determined above include all possible encryption field distribution patterns, wherein there may be initial encryption field distribution patterns that do not conform to the protocol design logic, and therefore all the initial encryption field distribution patterns need to be filtered.

In the process of implementing step S403 specifically, a first filtering rule is preset, for example: in the normal case, the header of the encryption protocol is control information that is not encrypted, so the starting point of the encrypted field does not start from the first byte of data, and a first filtering rule is set according to this feature, the first filtering rule being as in equation (18).

It should be noted that the above first filtering rule is only an example; a corresponding first filtering rule may be set according to the specific situation, which is not limited herein.
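The example first filtering rule (an encrypted field does not start at the first byte of the data) can be sketched as follows. The function name `apply_first_filter` is hypothetical, and treating the first byte as offset 0 is an assumption chosen to agree with the algorithm step that removes a jump point equal to 0.

```python
def apply_first_filter(patterns):
    """Drop candidate patterns whose encrypted field starts at the first byte."""
    return [(start, end) for (start, end) in patterns if start != 0]


# Assumed candidate patterns as (start, end) byte offsets.
candidates = [(0, 14), (0, 29), (14, 29)]
kept = apply_first_filter(candidates)
print(kept)  # [(14, 29)]
```

Other protocol knowledge, such as a known minimum header length, could be expressed in the same way by changing the condition.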

All the initial encrypted field distribution patterns are filtered using the preset first filtering rule to obtain a plurality of final encrypted field distribution patterns.

It will be appreciated that a matching encrypted field distribution pattern for each piece of data under test is determined from the determined plurality of final encrypted field distribution patterns.

To better explain how the plurality of final encrypted field distribution patterns is obtained, the contents of the above formulas are combined and illustrated in the form of an algorithm (run, for example, with python3.6 and tensorflow1.12.0), as follows:

Input: the encryption probability sequence $P = \{p_1, p_2, \ldots, p_n\}$ and the derivative threshold.

Initialization: the jump point set z and the encrypted field distribution pattern set field_sus are initialized, i.e. z = {null} and field_sus = {null}.

The logic for generating the set of final encrypted field distribution patterns (see formula (8)) is as follows.

Determining the jump points: for i = 1, 2, …, n-2, if the derivative of the i-th encryption probability is greater than the derivative threshold, then z = z + i.

Filtering according to the first filtering rule: if $z_1 = 0$, then z = z - $z_1$.

Generating the set of final encrypted field distribution patterns: for i = 1, 2, …, len(z)-1 and j = 2, 3, …, len(z), field_sus = field_sus + $f_{(i-1)(t-1)+j-1}$, where $f_{(i-1)(t-1)+j-1} = \langle z_i, z_j\rangle$.

Here, len(z) represents the number of elements in the set z.

After the above operations, the set of final encrypted field distribution patterns is output.

In the embodiment of the invention, the derivative of each encryption probability in the encryption probability sequence is calculated according to the derivative function, and a plurality of final encryption field distribution modes are determined according to the calculated derivative and the first filtering rule. And calculating the statistic corresponding to the data to be detected in each final encryption field distribution mode aiming at each piece of data to be detected, and determining the final encryption field distribution mode corresponding to the statistic meeting the preset matching requirement as the matching encryption field distribution mode of the data to be detected by combining with a preset second filtering rule. Therefore, the encrypted flow of the private protocol is identified, the encrypted field distribution mode of the encrypted flow is determined, and the identification accuracy and the identification effect of the encrypted flow are improved.

The process of determining the matched encrypted field distribution pattern of the data to be tested in step S107 of fig. 1 is shown in fig. 5, a flowchart for determining the matched encrypted field distribution pattern of the data to be tested provided by an embodiment of the present invention, which includes:

Step S501: for each piece of data to be tested, determining the non-encrypted field and the encrypted field of the data under each encrypted field distribution pattern, and calculating the variance difference between the non-encrypted field and the encrypted field under each encrypted field distribution pattern.

The inventors have found through research that: the data has stronger randomness after being encrypted, and the value of the encrypted data in the interval of 0-255 on each byte is equal in probability, namely the encrypted data obeys the discrete uniform distribution on [0-255], so that the byte value oscillation of the encrypted data observed in the transverse direction is obvious. The non-encrypted data is generally a header or a check part, has specific format and semantic information, is not strong in randomness, and is not obvious in oscillation on a byte value. There is a large difference in the byte variance of encrypted and non-encrypted fields.

Therefore, the statistical quantity in step S107 in fig. 1 of the embodiment of the present invention is the variance difference, and the statistical quantity can be set as other types of statistical quantities, which is not limited in this respect.

As can be seen from the above description shown in fig. 1 and fig. 4 of the embodiments of the present invention, each encrypted field distribution mode includes the start point byte and the end point byte of the corresponding encrypted field, so that the data to be tested has the corresponding encrypted field and the corresponding unencrypted field in each encrypted field distribution mode.

In the process of implementing step S501, for each piece of data to be measured, an unencrypted field (also referred to as a suspected unencrypted field) and an encrypted field (also referred to as a suspected encrypted field) of the data to be measured in each encrypted field distribution mode are determined, and a variance difference between the unencrypted field and the encrypted field of the data to be measured in each encrypted field distribution mode is calculated.

For each piece of data to be tested, calculating variance difference values between non-encrypted fields and encrypted fields corresponding to the data to be tested in each encrypted field distribution mode to obtain a plurality of variance difference values corresponding to the data to be tested, wherein each encrypted field distribution mode corresponds to one variance difference value.

For example, suppose that one encrypted field distribution pattern has been determined for a piece of data to be tested. The pattern divides the data into a non-encrypted field and an encrypted field. The variance of the non-encrypted field under this pattern is calculated, the variance of the encrypted field under this pattern is calculated, and the variance difference between the two is then calculated for this pattern. The variance differences of the data to be tested under the other encrypted field distribution patterns are calculated by analogy.
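The variance difference for a single pattern can be sketched as below. Several details are assumptions made for illustration: the names `variance` and `variance_difference`, the half-open 0-based interval `[start, end)` for the encrypted field, the use of population variance over byte values, and the sign convention (encrypted minus non-encrypted).

```python
def variance(values):
    """Population variance of a sequence of byte values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def variance_difference(data, start, end):
    """Variance of the suspected encrypted field minus that of the rest."""
    encrypted = data[start:end]
    plain = data[:start] + data[end:]
    if not encrypted or not plain:
        return 0.0  # one side is empty, so there is nothing to compare
    return variance(encrypted) - variance(plain)


# Assumed sample: 8 constant header bytes followed by 8 widely varying bytes.
sample = bytes([1] * 8) + bytes([0, 255, 10, 200, 40, 180, 90, 130])
true_diff = variance_difference(sample, 8, 16)
shifted_diff = variance_difference(sample, 4, 12)
print(true_diff > shifted_diff)  # True
```

The pattern that aligns with the real boundary separates the low-variance header from the high-variance payload most cleanly, so it produces the larger difference.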

Step S502: for each piece of data to be tested, determining, in combination with a preset second filtering rule, the encrypted field distribution pattern corresponding to the maximum variance difference as the matched encrypted field distribution pattern of the data to be tested.

It should be noted that, in the process of determining the matching encrypted field distribution pattern corresponding to each piece of data to be measured from all the encrypted field distribution patterns, the determination needs to be performed in combination with a preset second filtering rule.

The second filtering rule may be set according to actual conditions, and may include a plurality of sub-filtering rules, such as: the second filtering rule includes a first filtering sub-rule and a second filtering sub-rule, and the specific contents of the first filtering sub-rule and the second filtering sub-rule are as follows.

The first filtering sub-rule is that in the same message data, the encrypted field is usually continuous and uninterrupted, and there are no multiple different encrypted fields, that is, for any piece of data to be tested, the encrypted field of the data to be tested only contains one of all the encrypted field distribution modes, and the first filtering sub-rule is as shown in equation (19).

Second filter sub-rule: the trailer of the message data is typically not an encrypted field and the second filtering sub-rule is as in equation (20).

In the process of implementing step S502 specifically, for each piece of data to be tested, the encrypted field distribution pattern corresponding to the largest variance difference is determined as the matched encrypted field distribution pattern of the data to be tested, in combination with the second filtering rule. That is, for each piece of data to be measured, the encryption field distribution pattern that maximizes the variance difference corresponding to the data to be measured is selected as the matching encryption field distribution pattern f_pred。

Through the method, the distribution mode of the matched encrypted field of each piece of data to be tested is determined.

The specific form of the distribution pattern of the matched encrypted field of the data to be tested is shown as formula (21).

To better explain how the matched encrypted field distribution pattern of each piece of data to be tested is determined, the contents of the above formulas are combined and illustrated in the form of an algorithm, as follows:

Input: all the data to be tested and all the final encrypted field distribution patterns, i.e. the elements of the set field_sus.

Initialization: the matched encrypted field distribution patterns of all the data to be tested are initialized, i.e. field_pre = {null}.

For i = 1, 2, …, m, the variance difference set is initialized as ΔV = {null}.

For j = 1, 2, …, v, the variance difference $\Delta v_j$ is calculated: if the exclusion condition holds, $\Delta v_j = 0$; otherwise $\Delta v_j$ is obtained from the variances of the two fields, where Var(x) represents the variance of the vector x. Then ΔV = ΔV + ($\Delta v_j$).

According to the first filtering sub-rule and the second filtering sub-rule, the index of the maximum difference is taken: $p_i$ = index(max(ΔV)).

Finally, the matched encrypted field distribution pattern corresponding to each piece of data to be tested is output, where $p_i$ is the index of the corresponding element in field_sus.
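The selection step can be sketched as below. Because the concrete form of the second filtering rule depends on the protocol, it is passed in here as a caller-supplied predicate `is_allowed`, which is a hypothetical parameter; the names `variance` and `match_pattern`, the half-open 0-based intervals, and the choice to give excluded patterns a difference of 0 are likewise assumptions for illustration.

```python
def variance(values):
    """Population variance of a sequence of byte values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def match_pattern(data, patterns, is_allowed=lambda data, pattern: True):
    """Return the pattern with the largest variance difference for `data`."""
    diffs = []
    for start, end in patterns:
        encrypted = data[start:end]
        plain = data[:start] + data[end:]
        if not encrypted or not plain or not is_allowed(data, (start, end)):
            diffs.append(0.0)  # excluded patterns contribute a zero difference
        else:
            diffs.append(variance(encrypted) - variance(plain))
    best = max(range(len(diffs)), key=diffs.__getitem__)  # index(max(dV))
    return patterns[best]


# Assumed sample and candidate patterns, for illustration only.
sample = bytes([1] * 8) + bytes([0, 255, 10, 200, 40, 180, 90, 130])
candidates = [(4, 12), (8, 16), (2, 6)]
matched = match_pattern(sample, candidates)
print(matched)  # (8, 16)
```

A stricter rule can be supplied by the caller, for example `match_pattern(sample, candidates, is_allowed=lambda d, p: p[1] < len(d))` to reject fields that reach the end of the message.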

In the embodiment of the invention, for each data to be detected, the variance difference value between the non-encrypted field and the encrypted field of the data to be detected in each encrypted field distribution mode is calculated, and the encrypted field distribution mode corresponding to the maximum variance difference value is the matched encrypted field distribution mode of the data to be detected, so that the encrypted flow of the private protocol is identified, the encrypted field distribution mode of the encrypted flow is determined, and the identification accuracy and the identification effect of the encrypted flow are improved.

To better explain the contents shown in fig. 1, fig. 4 and fig. 5 of the above embodiments of the present invention, the following example illustrates how the matched encrypted field distribution pattern of the data to be tested is determined. It should be noted that the following contents are for illustration only.

Data set processing: text data, Hypertext Transfer Protocol (HTTP) data, Secure Shell (SSH) protocol data, Transport Layer Security (TLS) protocol data, real data from the Aircraft Communications Addressing and Reporting System (ACARS) and real data from the ship Automatic Identification System (AIS) are selected. The specific content of each type of data is as follows:

Text data: a large amount of text data is used as unencrypted data, and the text data is encrypted with 5 encryption algorithms, namely the Data Encryption Standard (DES), the Triple Data Encryption Algorithm (3DES), the Advanced Encryption Standard (AES), Blowfish and RC4, to obtain encrypted data.

HTTP data, SSH data, TLS data: these are screened from the public data set MACCDC2012. The HTTP data is encrypted using a plurality of encryption algorithms to obtain encrypted data. The entire data part of the SSH data is encrypted, and the TCP header content of the SSH data is retained during screening.

ACARS: the real ACARS data is divided into ACARS uplink message data and ACARS downlink message data, and the payload part of the real ACARS data is encrypted using a plurality of encryption algorithms.

AIS: the AIS message 1 and AIS message 4 data in the AIS are encrypted using a variety of encryption algorithms.

The structure of the encrypted field obtained after the data set processing is shown in the schematic diagram of the structure of the encrypted field shown in fig. 6, the neural network model is trained by using the text data, the HTTP data and the AIS message 4 data to obtain the identification model, and the data corresponding to the ACARS, the TLS, the AIS and the SSH are used as the data to be measured.

It should be noted that fig. 6 shows only the encryption fields corresponding to part of the data, and the encryption fields of other data are not described in detail in this embodiment of the present invention.

In order to verify the effect of determining the distribution mode of the matched encrypted field of the data to be tested in the embodiment of the invention, the following three evaluation indexes are set.

It can be understood that, when evaluating the effect of determining the matched encrypted field distribution pattern of the data to be tested (whose real encrypted field is known), the agreement between the matched encrypted field distribution pattern and the real encrypted field is considered. The effect is therefore evaluated through byte precision, byte recall and the F1 value, which are obtained by comparing the real encrypted fields of all the data to be tested with the matched encrypted field distribution patterns.

Byte precision: the proportion of really encrypted bytes among the bytes of the encrypted field given by the matched encrypted field distribution pattern of the data to be tested, as in formula (22).

Byte recall: the proportion of correctly identified bytes among all bytes of the real encrypted field of the data to be tested, as in formula (23).

F1 value: a combination of byte precision and byte recall, as in formula (24).
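The three indexes can be sketched from their verbal definitions as below. This is an illustration under stated assumptions rather than a transcription of formulas (22) to (24): fields are half-open 0-based `(start, end)` intervals, byte counts are pooled over all data to be tested before the ratios are taken, F1 is the usual harmonic mean, and the name `byte_metrics` is hypothetical.

```python
def byte_metrics(true_fields, predicted_fields):
    """Byte precision, byte recall and F1 over paired (start, end) fields."""
    hit = predicted_total = true_total = 0
    for (ts, te), (ps, pe) in zip(true_fields, predicted_fields):
        true_bytes = set(range(ts, te))
        predicted_bytes = set(range(ps, pe))
        hit += len(true_bytes & predicted_bytes)  # correctly identified bytes
        predicted_total += len(predicted_bytes)
        true_total += len(true_bytes)
    precision = hit / predicted_total if predicted_total else 0.0
    recall = hit / true_total if true_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Assumed fields: the second prediction starts 5 bytes too early.
precision, recall, f1 = byte_metrics([(14, 29), (25, 40)], [(14, 29), (20, 40)])
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.857 1.0 0.923
```

In this example every real encrypted byte is found, so recall is 1, while the five extra bytes in the second prediction lower precision to 30/35.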

The process of determining the distribution pattern of the matching encrypted field of the data to be tested using the above is as follows.

Training a neural network model to obtain a recognition model: 40000 pieces of non-encrypted text data, 20000 pieces of HTTP data and 20000 pieces of non-encrypted AIS message 4 data are selected as plaintext sample data, 40000 pieces of encrypted text data, 20000 pieces of HTTP data and 20000 pieces of encrypted AIS message 4 data are selected as ciphertext sample data, and the plaintext sample data and the ciphertext sample data constitute a sample data set.

From the sample data set, 90% of the data was selected as training data and 10% of the data was selected as test data. And truncating all data according to the data length of 64 bytes, dividing the truncated data by 255 for normalization, and converting each piece of data into a two-dimensional matrix with the specification of 8 x 8 after normalization.
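The per-record preprocessing can be sketched as follows. The function name `to_matrix` is hypothetical, and zero padding of records shorter than 64 bytes is an assumption; the text only states that the data is truncated to 64 bytes, normalised by 255 and reshaped to 8 x 8.

```python
def to_matrix(record, length=64, side=8):
    """Truncate to `length` bytes, scale to [0, 1], reshape to side x side."""
    clipped = list(record[:length])
    clipped += [0] * (length - len(clipped))  # assumed zero padding
    scaled = [b / 255 for b in clipped]
    return [scaled[r * side:(r + 1) * side] for r in range(side)]


# Assumed input: a 100-byte record, of which only the first 64 bytes are kept.
matrix = to_matrix(bytes(range(100)))
print(len(matrix), len(matrix[0]))  # 8 8
```

Each resulting 8 x 8 matrix is one input sample for the convolutional neural network described below.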

The learning rate of the convolutional neural network model is set to 0.001, the batch size to 64, the input size to 8 x 8, and the number of training epochs to 1000.

And constructing a convolutional neural network model, wherein a first convolutional layer of the convolutional neural network model comprises 16 convolutional kernels with the size of 5 x 5, the first pooling layer of the convolutional neural network model has the size of 2 x 2, a second convolutional layer of the convolutional neural network model comprises 24 convolutional kernels with the size of 5 x 5, the second pooling layer of the convolutional neural network model has the size of 2 x 2, the number of the neurons of the first fully-connected layer of the convolutional neural network model is 24, and the number of the neurons of the second fully-connected layer of the convolutional neural network model is 2.

The preprocessed sample data set is input into the constructed convolutional neural network model for training to obtain the identification model. The training effect is shown in the schematic training effect diagram of the convolutional neural network model in fig. 7: after more than 20 training epochs, the training accuracy is above 0.99 and the training loss is below 0.03.

Identifying the data to be tested using the identification model: 9000 pieces of ACARS uplink message data with a data length of 29 bytes, 9000 pieces of ACARS downlink message data with a data length of 40 bytes (the longest length) and 9000 pieces of AIS message 1 data with a data length of 21 bytes are selected as the data to be tested.

The selected ACARS uplink message data and AIS message 1 data are respectively filled with all-zero values of 11 bytes and 19 bytes, and the ACARS downlink message data, the processed ACARS uplink message data and the processed AIS message 1 data (namely, preprocessed data) are randomly scrambled and grouped into 421 groups of preprocessed data sets, wherein each group of preprocessed data sets comprises 64 preprocessed data.

For each group of preprocessed data sets, the bytes at the same position in each piece of preprocessed data of the set are extracted to construct a piece of reconstructed data, and the reconstructed data is converted into an 8 x 8 two-dimensional matrix. Plaintext/ciphertext identification is performed on all the reconstructed data using the identification model obtained by the above training to obtain the plaintext/ciphertext result distribution matrix $R_{421 \times 40}$.
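The padding, grouping and reconstruction steps of this example can be sketched as below. The function names `pad`, `group` and `reconstruct`, the fixed shuffle seed, the placeholder payload contents and the choice to drop the incomplete last group are assumptions for illustration; the record counts and lengths are those given in the example.

```python
import random


def pad(record, n):
    """Right-pad a record with zero bytes up to length n."""
    return record + bytes(n - len(record))


def group(records, size=64, seed=0):
    """Shuffle the records and split them into full groups of `size`."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[g * size:(g + 1) * size] for g in range(len(shuffled) // size)]


def reconstruct(group_records):
    """Byte position j of every record in a group forms reconstructed datum j."""
    return [bytes(column) for column in zip(*group_records)]


# Placeholder payloads with the lengths used in the example (29, 40, 21 bytes).
records = [bytes([7]) * 29] * 9000 + [bytes([8]) * 40] * 9000 + [bytes([9]) * 21] * 9000
padded = [pad(r, 40) for r in records]
groups = group(padded)
rebuilt = reconstruct(groups[0])
print(len(groups), len(rebuilt), len(rebuilt[0]))  # 421 40 64
```

The 27000 records yield 421 full groups of 64, and each group yields 40 reconstructed pieces of 64 bytes, which is consistent with the 421 x 40 shape of the result distribution matrix.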

Column-wise statistics are performed on $R_{421 \times 40}$ to obtain the corresponding encryption probability sequence $P = \{p_1, p_2, \ldots, p_{40}\}$.

It can be understood that the data to be measured obtained by mixing the ACARS uplink message data, the ACARS downlink message data and the AIS message 1 data is selected, and zero filling and supplementing are performed on the data with short length. And performing clear and secret identification on the reconstructed data by using the identification model to obtain a corresponding clear and secret result distribution matrix, and performing correlation processing on the clear and secret result distribution matrix to obtain an encryption probability sequence.

It should be noted that the encryption probability curve can reflect the overlapping and interleaving conditions of the encrypted fields after the data of different protocol types are mixed, the encryption probability of the overlapping portion is the highest, the encryption probability of the interleaving portion is inferior to that of the overlapping portion, and the encryption probability of the non-secret portion is the lowest.

Procedure for determining the final encrypted field distribution pattern: $P = \{p_1, p_2, \ldots, p_{40}\}$ is padded to obtain $P = \{p_1, p_2, \ldots, p_{40}, p_{40}, p_{40}\}$. The derivative function $f'(x_k) = \frac{-f(x_{k+2}) + 4f(x_{k+1}) - 3f(x_k)}{2}$ is applied to the padded sequence to obtain the derivative of the encryption probability, $P' = \{p'_1, p'_2, \ldots, p'_{40}\}$.
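The derivative step can be sketched in Python as below. The function name `forward_derivative` is hypothetical; the formula is the three-point forward difference given above with unit spacing, applied after repeating the last probability twice.

```python
def forward_derivative(p):
    """Three-point forward difference of an encryption probability sequence.

    The sequence is padded by repeating its last value twice, so that the
    formula (-f(x_{k+2}) + 4 f(x_{k+1}) - 3 f(x_k)) / 2 is defined for
    every original position k.
    """
    if not p:
        return []
    padded = list(p) + [p[-1], p[-1]]
    return [
        (-padded[k + 2] + 4 * padded[k + 1] - 3 * padded[k]) / 2
        for k in range(len(p))
    ]
```

For a step-like sequence such as `[0.0, 0.0, 1.0, 1.0]`, the largest derivative appears at the position immediately before the rise, which is the kind of position the jump-point extraction looks for.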

Determining derivative threshold

According to the derivative threshold, the jump points (nodes whose derivative is larger than the derivative threshold) are extracted from the derivative of the encryption probability, giving $z = \{z_1, \ldots, z_k\}$.
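The extraction of jump points can be sketched as follows. The function name `jump_points` is hypothetical, the threshold is passed in as a parameter because its formula is not reproduced here, and reporting positions as 1-based byte indices is an assumption.

```python
def jump_points(derivative, threshold):
    """Return the 1-based positions whose derivative exceeds the threshold.

    Assumption: positions are reported as 1-based byte indices.
    """
    return [k + 1 for k, d in enumerate(derivative) if d > threshold]
```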

Determining a final encrypted field distribution pattern using the respective jumping points

The determined final encrypted field distribution patterns (referred to as suspected encrypted fields in Table 2, of which only a portion is shown) are listed in Table 2.

Table 2:

It can be understood that Table 2 includes at least some of the actual encrypted fields of the ACARS uplink message data and the ACARS downlink message data, for example the encrypted fields (14-29) and (25-40).

Using the final encrypted field distribution patterns in Table 2, together with the procedure shown in fig. 5 of the embodiment of the present invention for determining the matching encrypted field distribution pattern of the data to be detected, the matching encrypted field distribution pattern of each piece of data to be detected is determined; the results are shown in Table 3. It should be noted that each sample in Table 3 denotes a piece of data to be detected, and the start and end positions of the identified encrypted field denote its matching encrypted field distribution pattern.

Table 3:

The matching encrypted field distribution pattern determined for each piece of data to be detected in Table 3 is evaluated using the evaluation indexes given in formulas (22) to (24) to obtain the corresponding evaluation result.

Corresponding to the method for identifying an encrypted field of a private protocol provided in the foregoing embodiment of the present invention, and referring to fig. 8, an embodiment of the present invention further provides a structural block diagram of a system for identifying an encrypted field of a private protocol, where the identification system includes: an acquisition unit 801, a processing unit 802, a constructing unit 803, an identification unit 804, a first determining unit 805, a second determining unit 806, and a matching unit 807.

The acquisition unit 801 is configured to acquire multiple pieces of data to be detected, where each type of data to be detected is discrete sequence packet data.

The processing unit 802 is configured to perform data preprocessing and grouping processing on all data to be detected to obtain p groups of preprocessed data sets, where each group of preprocessed data sets includes q pieces of preprocessed data, and p and q are positive integers.

The constructing unit 803 is configured to, for each group of preprocessed data sets, extract bytes at the same position in each piece of preprocessed data of the preprocessed data set to construct a piece of reconstructed data, and obtain a reconstructed data set including multiple pieces of reconstructed data.

The identification unit 804 is configured to input all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix, where the identification model is obtained by training a neural network model on plaintext sample data and ciphertext sample data.

The first determining unit 805 is configured to determine, according to the plaintext/ciphertext result distribution matrix, an encryption probability sequence containing the encryption probability of each byte in all the reconstructed data sets.

The second determining unit 806 is configured to determine multiple encrypted field distribution patterns using the encryption probability sequence and a preset first filtering rule.

The matching unit 807 is configured to calculate, for each piece of data to be detected, statistics corresponding to the data to be detected in each encrypted field distribution mode, and determine, in combination with a preset second filtering rule, that the encrypted field distribution mode corresponding to the statistics meeting a preset matching requirement is a matched encrypted field distribution mode of the data to be detected.

Preferably, in a specific implementation, the identification unit 804 is specifically configured to: input all the reconstructed data sets into the preset identification model and repeat the plaintext/ciphertext identification k times to obtain the plaintext/ciphertext result distribution matrix, where the matrix comprises the plaintext/ciphertext identification result sub-matrix obtained from each round of identification, and k is a positive integer.

Correspondingly, the first determining unit 805 is specifically configured to: for each plaintext/ciphertext identification result sub-matrix, calculate the mean of each column of the sub-matrix to obtain an initial encryption probability for each byte in all the reconstructed data sets; and average all the initial encryption probabilities of each byte to obtain the encryption probability sequence consisting of the final encryption probability of each byte in all the reconstructed data sets.
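The two-stage averaging can be sketched in Python as below. The function names are hypothetical, and the sketch assumes each sub-matrix entry is 1 when a byte position of a group is identified as ciphertext and 0 when it is identified as plaintext; the text does not specify this encoding.

```python
def column_means(matrix):
    """Mean of each column of one identification result sub-matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    return [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]


def final_encryption_probability(sub_matrices):
    """Average the per-run column means over all k identification runs.

    Assumption: entries are 1 (ciphertext) or 0 (plaintext).
    """
    per_run = [column_means(m) for m in sub_matrices]  # initial probabilities
    k = len(per_run)
    n_cols = len(per_run[0])
    return [sum(run[j] for run in per_run) / k for j in range(n_cols)]
```

With k = 1 the result reduces to the plain column-wise mean of a single result matrix.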

Preferably, in conjunction with what is shown in fig. 8, the second determining unit 806 includes a determining module, a combining module and a filtering module, which operate as follows:

The determining module is configured to determine the derivative of each encryption probability in the encryption probability sequence using a preset derivative function.

The combining module is configured to combine the bytes corresponding to the encryption probabilities whose derivatives are larger than the derivative threshold to obtain multiple initial encrypted field distribution patterns.

The filtering module is configured to filter all the initial encrypted field distribution patterns using a preset first filtering rule to obtain multiple final encrypted field distribution patterns.
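One possible reading of the combining and filtering modules is sketched below, purely for illustration. The function name `candidate_patterns` is hypothetical. Pairing every two jump points into a (start, end) field is an assumption about how the bytes are combined, and the minimum-length condition is only a stand-in for the first filtering rule, whose actual content is not reproduced here.

```python
from itertools import combinations


def candidate_patterns(jump_points, min_length=1):
    """Pair jump points into (start, end) candidate encrypted fields.

    Assumptions: every ordered pair of distinct jump points forms an
    initial pattern, and the filter keeps patterns of at least
    min_length bytes (a placeholder for the first filtering rule).
    """
    points = sorted(set(jump_points))
    initial = combinations(points, 2)  # all start < end pairs
    return [(s, e) for s, e in initial if e - s + 1 >= min_length]
```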

Preferably, in conjunction with what is shown in fig. 8, the matching unit 807 includes a processing module and a matching module, which operate as follows:

The processing module is configured to, for each piece of data to be detected, determine the non-encrypted field and the encrypted field of the data under each encrypted field distribution pattern, and calculate the variance difference between the non-encrypted field and the encrypted field under each encrypted field distribution pattern.

The matching module is configured to, for each piece of data to be detected and in combination with a preset second filtering rule, determine the encrypted field distribution pattern corresponding to the maximum variance difference as the matching encrypted field distribution pattern of the data to be detected.
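The variance-difference matching can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the variance is taken over raw byte values, the difference is computed as encrypted-field variance minus non-encrypted-field variance, and patterns are 1-based inclusive (start, end) pairs. All three are assumptions, and the second filtering rule is omitted because its content is not reproduced here.

```python
from statistics import pvariance


def variance_difference(message: bytes, pattern):
    """Variance of the encrypted field minus that of the remaining bytes.

    pattern is a 1-based inclusive (start, end) pair (assumed convention).
    Returns None when either part is empty.
    """
    start, end = pattern
    encrypted = list(message[start - 1:end])
    plain = list(message[:start - 1]) + list(message[end:])
    if not encrypted or not plain:
        return None
    return pvariance(encrypted) - pvariance(plain)


def best_pattern(message: bytes, patterns):
    """Pick the pattern with the largest variance difference, if any."""
    scored = [(variance_difference(message, p), p) for p in patterns]
    scored = [(score, p) for score, p in scored if score is not None]
    if not scored:
        return None
    return max(scored, key=lambda item: item[0])[1]
```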

Preferably, in conjunction with what is shown in fig. 8, the processing unit 802 includes a determining module, a processing module, a normalizing module and a dividing module, which operate as follows:

The determining module is configured to determine, among all the data to be detected, the data to be detected with the longest data length.

The processing module is configured to pad all the data to be detected so that the data length of every piece of data to be detected equals that of the longest piece.

The normalizing module is configured to normalize all the padded data to be detected to obtain multiple pieces of preprocessed data.
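The determining, padding and normalizing modules can be sketched together as below. The function name `preprocess` is hypothetical; zero padding at the end of the message and normalization by dividing each byte by 255 are assumptions, since the text does not state the padding position or the normalization formula.

```python
def preprocess(messages):
    """Pad every message to the longest length, then scale bytes to [0, 1].

    Assumptions: zero bytes are appended at the end, and normalization
    divides each byte value by 255.
    """
    if not messages:
        return []
    max_len = max(len(m) for m in messages)  # longest data to be detected
    padded = [m + bytes(max_len - len(m)) for m in messages]
    return [[b / 255 for b in m] for m in padded]
```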

In summary, the embodiment of the present invention provides a method for identifying an encrypted field of a private protocol. Data preprocessing and grouping are performed on all data to be detected to obtain p groups of preprocessed data sets. For each preprocessed data set, bytes at the same position in the preprocessed data are extracted to construct reconstructed data, giving a reconstructed data set. All the reconstructed data sets are input into an identification model for plaintext/ciphertext identification to obtain a plaintext/ciphertext result distribution matrix, from which an encryption probability sequence is determined. Multiple encrypted field distribution patterns are determined using the encryption probability sequence and the first filtering rule. For each piece of data to be detected, the statistics of the data under each encrypted field distribution pattern are calculated, and the matching encrypted field distribution pattern is determined in combination with the second filtering rule. In this way the encrypted traffic of the private protocol is identified and its encrypted field distribution pattern is determined, which improves the accuracy and effectiveness of encrypted traffic identification.

The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described relatively briefly; for related points, reference may be made to the corresponding descriptions of the method embodiments. The system embodiments described above are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for identifying encrypted fields of a private protocol, the method comprising:

2. The method according to claim 1, wherein the determining a plurality of encryption field distribution patterns by using the encryption probability sequence and a preset first filtering rule comprises:

3. The method according to claim 1, wherein the calculating, for each piece of the data to be tested, statistics corresponding to each of the encrypted field distribution patterns of the data to be tested, and determining, in combination with a preset second filtering rule, the encrypted field distribution pattern corresponding to the statistics meeting a preset matching requirement as a matching encrypted field distribution pattern of the data to be tested, includes:

4. The method according to claim 1, wherein the performing data preprocessing and grouping on all the data to be tested to obtain p groups of preprocessed data sets comprises:

5. The method according to claim 1, wherein the inputting all the reconstructed data sets into a preset identification model for plaintext/ciphertext identification to obtain a corresponding plaintext/ciphertext result distribution matrix comprises:

6. A system for identifying encrypted fields of a private protocol, the system comprising:

7. The system according to claim 6, wherein the second determination unit comprises:

8. The system of claim 6, wherein the matching unit comprises:

9. The system of claim 6, wherein the processing unit comprises:

10. The system according to claim 6, wherein the identification unit is specifically configured to: inputting all the reconstruction data sets into a preset identification model, and repeatedly performing the plaintext and ciphertext identification k times to obtain a plaintext and ciphertext result distribution matrix, wherein the plaintext and ciphertext result distribution matrix comprises plaintext and ciphertext identification result sub-matrices obtained by performing plaintext and ciphertext identification each time, and k is a positive integer;