CN117354207A - Reverse analysis method and device for unknown industrial control protocol - Google Patents

Info

Publication number
CN117354207A
Authority
CN
China
Prior art keywords
message
graph
gcn
mess
industrial control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311243871.1A
Other languages
Chinese (zh)
Inventor
姚羽
杨道青
张尼
杨巍
杨利成
胡耀
吴云峰
林小李
单垚
冉子用
韩庆敏
王乾亦
刘福意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
6th Research Institute of China Electronics Corp
Original Assignee
6th Research Institute of China Electronics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 6th Research Institute of China Electronics Corp filed Critical 6th Research Institute of China Electronics Corp
Priority to CN202311243871.1A priority Critical patent/CN117354207A/en
Publication of CN117354207A publication Critical patent/CN117354207A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/18 Protocol analysers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of protocol reverse engineering and provides a method and a device for reverse analysis of an unknown industrial control protocol. A message fragment set is formed from the network messages; the messages and the message fragment set form a data set, from which a heterogeneous message graph is constructed; the heterogeneous message graph is input to a message graph feature extraction neural network model, which is trained to cluster the messages; the data set to be clustered is input to the trained model and divided into class clusters, and within each message class cluster the Needleman-Wunsch algorithm is used to infer the syntax format and mark field boundaries. The method reduces the time complexity of message clustering, jointly optimizes feature extraction and clustering, and provides the model with a finer-grained, more reasonable analysis unit.

Description

Reverse analysis method and device for unknown industrial control protocol
Technical Field
The invention relates to the field of protocol reverse engineering, in particular to a method and a device for reverse analysis of an unknown industrial control protocol.
Background
With the advancement of the "Industry 4.0" strategy, Industrial Control Systems (ICSs) play an increasingly important role in national critical infrastructure and the national economy, and ensuring their safe operation is of great importance. ICSs are composed of various control components that communicate using specific Industrial Control Protocols (ICPs) (e.g., Modbus, S7, DNP3). Because there is no universal international standard and equipment manufacturers protect their own interests, ICSs contain a large number of proprietary ICPs whose specifications are unpublished and whose security mechanisms (such as authentication and encryption) are neglected. Security operation and maintenance personnel therefore cannot parse these protocols, which makes the security problems of ICSs more prominent.
To obtain the specifications of unpublished ICPs, Protocol Reverse Engineering (PRE) has received widespread attention. PRE infers protocol specifications from network messages or from the programs that parse them, yielding knowledge about a protocol's syntax and behavior. However, unlike text protocols, industrial control protocols are binary protocols: their content is disordered, has no delimiters, and is not human-readable, so reverse analysis methods designed for text protocols are difficult to apply effectively to industrial control protocols. Given industrial control protocols whose content is abstract and whose specifications are unpublished, constructing an accurate and effective protocol analysis model is a difficult problem.
The paper "Y. Wang, X. Yun, M. Z. Shafiq, L. Wang, A. X. Liu, Z. Zhang, D. Yao, Y. Zhang, and L. Guo. A semantics aware approach to automated reverse engineering unknown protocols. In 2012 20th IEEE International Conference on Network Protocols (ICNP), pages 1-10. IEEE, 2012" proposes ProDecoder, a multi-step protocol reverse analysis model based on semantic recognition. The method first uses n-grams to divide each network message into a series of message fragments, which serve as the minimum input unit of the model. Second, it extracts protocol keywords with Latent Dirichlet Allocation (LDA), a topic model from the NLP field, and forms a keyword set. It then clusters the message sequences based on the extracted keywords so that messages of the same protocol type fall into the same cluster. Finally, it applies a sequence alignment algorithm within each cluster to infer the protocol format. However, coarse-grained n-gram segmentation ignores the fact that some industrial control protocol fields consist of only a few bits, and it generates a large number of meaningless message fragments that degrade downstream analysis. In addition, a multi-step pipeline is prone to error propagation: deviations in an earlier task directly affect the accuracy of later tasks and bias the final analysis result.
The paper "Y. Ye, Z. Zhang, F. Wang, X. Zhang, D. Xu. NetPlier: Probabilistic network protocol reverse engineering from message traces. In: 28th Annual Network and Distributed System Security Symposium, NDSS 2021, Virtually, February 21-25, 2021" starts from one assumption, namely that a keyword field determines the type of a network message, and proposes NetPlier, an unknown-protocol analysis method based on probabilistic inference. The method first uses a multiple sequence alignment algorithm to obtain a set of aligned fields. It then recasts keyword identification as a search for the field with the highest probability in the aligned field set. Next, it divides the network messages into types according to the identified keyword, and infers the protocol format and state machine within each set of same-type messages. The method thus concentrates on identifying the keyword field and clusters messages solely by it. However, a keyword field does not fully reflect the type characteristics of a message, and given the repetitive, abstract, and unreadable nature of ICP message content, obtaining accurate keywords is time-consuming and error-prone.
In summary, existing protocol reverse analysis methods suffer from unreasonable message segmentation, reliance on a single message type feature, and flawed analysis flows. As a result, the inferred protocol formats and protocol state machines have low accuracy, and the methods are difficult to apply to industrial control protocol analysis and have limited practical and reference value.
Disclosure of Invention
The invention aims to provide a highly accurate reverse analysis method and device for unknown industrial control protocols that is suited to industrial control system networks. The proposed heuristic message segmentation algorithm can divide the original message sequence into fields of arbitrary length, providing finer-grained analysis units and improving the accuracy of downstream tasks. The message graph feature extraction model Mess-GCN is more robust, makes full use of the interaction information among subtasks, captures richer structural features of messages, and achieves a better clustering effect. The method takes practical application conditions into account and requires no manual intervention on the network messages; it is an automatic reverse analysis approach.
The technical scheme of the invention is as follows: an unknown industrial control protocol reverse analysis method comprises the following steps:
a message fragment set is formed from the network messages by a heuristic message segmentation algorithm; the messages and the message fragment set form a data set, and a heterogeneous message graph G(V, E) is constructed from the data set; the heterogeneous message graph G(V, E) is input to the message graph feature extraction neural network model Mess-GCN, which is trained to cluster the messages; the data set to be clustered is input to the trained Mess-GCN and divided into class clusters, and within each message class cluster the Needleman-Wunsch algorithm is used to infer the syntax format and mark the field boundaries.
Forming the message fragment set with the heuristic message segmentation algorithm specifically comprises the following steps:
step 1.1, extracting the application layer data of the network messages to obtain message sequences in hexadecimal format, the message sequences belonging to K types in total;
step 1.2, performing field inference on all message sequences based on heuristic rules to obtain common fields; the common fields comprise a protocol identifier and a length field;
step 1.3, position-coding the remaining part of each message sequence, excluding the common fields, and applying an n-gram to obtain a plurality of message fragments;
step 1.4, de-duplicating the common fields and the message fragments to form the message fragment set.
The heuristic rules are as follows:
rule a) identifying the protocol identifier field: when the first N bytes of the message sequences have the same content, those N bytes are the protocol identifier field;
rule b) identifying the length field: the message sequences are divided into groups by length, and within each group it is checked whether the value Q of M consecutive bytes at the same position is the same; when Q is the same at that position within a group, differs at that position between groups, and equals the message length of the corresponding group, the M consecutive bytes are identified as a length field.
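The two heuristic rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, messages are given as `bytes` objects, the length field is assumed to be big-endian, and the "length of the message" is taken to be the total byte length of the application-layer payload. The text does not fix any of these details.

```python
from collections import defaultdict

def common_prefix_len(messages):
    """Rule a (sketch): number of leading bytes shared by every message."""
    if not messages:
        return 0
    shortest = min(len(m) for m in messages)
    for i in range(shortest):
        if any(m[i] != messages[0][i] for m in messages):
            return i
    return shortest

def find_length_field(messages, widths=(2, 4)):
    """Rule b (sketch): return (offset, width) of a field whose value equals
    the message length in every length group, or None if there is none.

    Assumption: the field is big-endian and counts the whole message."""
    groups = defaultdict(list)
    for m in messages:
        groups[len(m)].append(m)
    if len(groups) < 2:
        return None  # the rule compares groups of different lengths
    shortest = min(groups)
    for width in widths:
        for off in range(shortest - width + 1):
            # Equal to the group length everywhere implies: same value inside
            # a group, different values between groups.
            if all(int.from_bytes(m[off:off + width], "big") == length
                   for length, msgs in groups.items() for m in msgs):
                return off, width
    return None

msgs = [bytes.fromhex("abcd00060102"),
        bytes.fromhex("abcd000801020304"),
        bytes.fromhex("abcd00060909")]
print(common_prefix_len(msgs), find_length_field(msgs))
```

Note that a longest-common-prefix test can absorb constant bytes of a neighboring field (here the high-order length byte), so choosing N is a separate decision from detecting the shared prefix.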
The specific operation of step 1.3 is as follows: the position number of each byte in the original message sequence is concatenated with the content of that byte, and a sliding window of size n is moved over the sequence from left to right to extract the fragments.
The heterogeneous message graph G(V, E) is constructed as follows:
step 2.1, each message and each message fragment is regarded as a graph node, so that the number of nodes |V| = n is the number of messages plus the number of message fragments; an edge is established between nodes when a message fragment appears in a message; the term frequency-inverse document frequency (TF-IDF) is used as the weight of a message-fragment edge, and the pointwise mutual information (PMI) is used as the weight of a fragment-fragment edge, computed as follows:

$$\mathrm{PMI}(s, t) = \log \frac{p(s, t)}{p(s)\, p(t)}, \qquad p(s, t) = \frac{\#W(s, t)}{\#W}, \qquad p(s) = \frac{\#W(s)}{\#W}$$

where #W(s) is the number of sliding windows that contain message fragment s, #W(s, t) is the number of sliding windows that contain both message fragment s and message fragment t, and #W is the total number of sliding windows over the entire message fragment set;
the edge weights are calculated as follows:

$$A_{ij} = \begin{cases} \mathrm{PMI}(i, j), & i, j \text{ are message fragments and } \mathrm{PMI}(i, j) > 0 \\ \text{TF-IDF}_{ij}, & \text{one of } i, j \text{ is a message and the other is a message fragment} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$

where A is the adjacency matrix of the heterogeneous message graph G, whose entries are the calculated edge weights, and i and j denote graph nodes; when i and j are both message fragment nodes and their PMI value is positive, the weight of the edge between them is the PMI value; when i and j are nodes of different types, the TF-IDF value is used as the weight of the edge between them; when i and j are the same node, the edge weight is 1; all other edges have weight 0;
step 2.2, the feature vector of each node is initialized with an m-dimensional one-hot vector, so that the features of all nodes are represented as a matrix $X \in \mathbb{R}^{n \times m}$, and the heterogeneous message graph is generated.
The message graph feature extraction neural network model Mess-GCN consists of two graph convolutional network (GCN) layers. The embedding size of the nodes in the second layer equals the number K of message types, and the output is normalized with the softmax function to generate the category feature of each message node:

$$Z = \mathrm{softmax}\left(\tilde{A}\, \rho\left(\tilde{A} X W_0\right) W_1\right)$$

For the first GCN layer, the new h-dimensional node features are obtained as:

$$H^{(1)} = \rho\left(\tilde{A} X W_0\right)$$

where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the normalized symmetric adjacency matrix (D being the degree matrix of A), X is the node feature matrix, $W_0$ and $W_1$ are the weight matrices of the first and second layers, ρ is an activation function, and $H^{(1)}$ is the output of the layer; the i-th row of Z is the category feature $l_i$ of node i.
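The forward pass of a two-layer GCN of this kind can be sketched without any framework. The code below is a minimal illustration, not the Mess-GCN implementation: the helper names are hypothetical, the weights are fixed toy values rather than trained parameters, ReLU is assumed as the activation, and the adjacency matrix is assumed to already contain the self-loop weight of 1 on its diagonal.

```python
import math

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Return D^-1/2 A D^-1/2, where D is the diagonal degree matrix of A."""
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in adj]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    out = []
    for row in m:
        top = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

def gcn_forward(adj, x, w0, w1):
    """Two-layer GCN: softmax(A_hat * relu(A_hat * X * W0) * W1)."""
    a_hat = normalize_adjacency(adj)
    hidden = relu(matmul(a_hat, matmul(x, w0)))          # first layer, h dims
    return softmax_rows(matmul(a_hat, matmul(hidden, w1)))  # second layer, K dims

# Toy graph: 3 nodes, one-hot features, h = 2 hidden dims, K = 2 classes.
adj = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
x = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
w0 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]
w1 = [[1.0, -1.0], [0.5, 0.5]]
z = gcn_forward(adj, x, w0, w1)
print(z)
```

Each row of `z` is a probability vector over the K classes, which is the shape the clustering stage consumes.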
Training the message graph feature extraction neural network model Mess-GCN specifically comprises the following steps:
step 3.1, the heterogeneous message graph G is input to Mess-GCN, which outputs a K-dimensional feature vector $l_i$ for every message node; the feature vectors of the message fragment nodes do not take part in the subsequent calculation;
step 3.2, the messages are clustered: the feature vectors $l_i$ are used as the input of the message clustering stage, and the loss of the Mess-GCN model is calculated with the following loss function:

$$\min_{w} E(w) = \sum_{i, j} L\left(r_{ij},\, g(x_i, x_j; w)\right)$$

where $g(x_i, x_j; w)$ is the similarity between the category features $l_i$, $l_j$ of samples $x_i$, $x_j$, with cosine similarity as the measure; $r_{ij}$ is an unknown binary variable, with $r_{ij} = 1$ when messages $x_i$, $x_j$ belong to the same type and $r_{ij} = 0$ otherwise; $L(r_{ij}, g(x_i, x_j; w))$ is the loss between $r_{ij}$ and $g(x_i, x_j; w)$; and w denotes the model parameters of Mess-GCN; the value of $r_{ij}$ is determined as follows:

$$r_{ij} = \begin{cases} 1, & g(x_i, x_j; w) > \mu(\lambda) \\ 0, & g(x_i, x_j; w) \le \eta(\lambda) \\ \text{None}, & \text{otherwise} \end{cases}$$

where λ is a hyperparameter of the Mess-GCN model that controls the selection of samples; μ(λ) and η(λ) are the thresholds for selecting similar samples and dissimilar samples, respectively, and satisfy the constraint μ(λ) ≥ η(λ); when $g(x_i, x_j; w) > \mu(\lambda)$, the two samples $x_i$, $x_j$ are similar; when $g(x_i, x_j; w) \le \eta(\lambda)$, the two samples are dissimilar; when $r_{ij}$ is None, the current sample $(x_i, x_j, r_{ij})$ does not take part in the training of the Mess-GCN model;
step 3.3, updating the weight matrix parameters of the Mess-GCN model, wherein the iterative rule of lambda is as follows:
wherein η is the learning rate of λ;
step 3.4, repeating steps 3.1 to 3.3 until the Mess-GCN model reaches the maximum number of training rounds.
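The pair-selection strategy of step 3.2 can be illustrated as follows. This is a hedged sketch, not the patented training loop: the function names are hypothetical, the thresholds default to the values 0.99 and 0.95 given in the embodiment, and the per-pair loss is assumed to be binary cross-entropy, which the text does not state.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_label(similarity, upper, lower):
    """Return 1 (similar), 0 (dissimilar), or None (pair is skipped)."""
    if similarity > upper:
        return 1
    if similarity <= lower:
        return 0
    return None

def pairwise_loss(features, upper=0.99, lower=0.95, eps=1e-7):
    """Mean loss over the selected pairs.

    Assumption: L(r, g) is binary cross-entropy; the source only names L."""
    total, used = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            g = cosine(features[i], features[j])
            r = pair_label(g, upper, lower)
            if r is None:
                continue  # ambiguous pairs do not take part in training
            g = min(max(g, eps), 1.0 - eps)  # keep log() finite
            total += -(r * math.log(g) + (1 - r) * math.log(1.0 - g))
            used += 1
    return total / used if used else 0.0

# Category features as produced by a softmax layer (non-negative entries).
feats = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
print(pairwise_loss(feats))
```

Because only confidently similar or dissimilar pairs contribute, the set of training pairs changes as the thresholds move during training.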
The class cluster division outputs the class label corresponding to each message, so that all messages are divided into class clusters according to their labels; the category is assigned as follows:

$$t_i = \arg\max_{h}\; l_{ih}, \qquad h = 1, \dots, K$$

where $t_i$ is the class cluster to which the current sample belongs and h is the index of the elements in the category feature $l_i$.
The specific steps of the Needleman-Wunsch algorithm are as follows:
step 4.1, for any two messages in the current class cluster, with lengths $\beta_1$ and $\beta_2$ respectively, a scoring matrix of size $(\beta_1 + 1) \times (\beta_2 + 1)$ is established with its elements initialized to 0;
step 4.2, each pair of characters from the two message sequences is compared, and the scoring matrix F is filled according to the following scoring rule:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + S_{ij} \\ F(i-1, j) + W \\ F(i, j-1) + W \end{cases}$$

where $S_{ij}$ and W are the character-match reward and the gap penalty, respectively;
step 4.3, the scoring matrix is traversed along a path from its lower right corner to its upper left corner to construct the aligned sequences;
step 4.4, steps 4.1 to 4.3 are repeated until all message sequences in the test message data set are aligned, and the boundaries of the protocol fields are obtained by taking the gaps as separators.
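Steps 4.1 to 4.3 correspond to the textbook Needleman-Wunsch procedure, sketched below for two strings. The scores (match +1, mismatch -1, gap -1) are illustrative assumptions, since the text gives no values, and the first row and column are filled with cumulative gap penalties, which is the usual convention rather than something the text specifies.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # leading gaps in b
    for j in range(1, m + 1):
        score[0][j] = j * gap          # leading gaps in a
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the lower right corner to the upper left corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1]); out_b.append(b[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

print(needleman_wunsch("abcd", "abd"))  # ('abcd', 'ab-d', 2)
```

For messages, `a` and `b` would be the hexadecimal character strings of two messages from the same class cluster; the `-` positions are the gaps that step 4.4 uses as separators.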
An unknown industrial control protocol reverse analysis device, comprising:
the network module is used for capturing network traffic data;
a memory for storing captured network traffic data and a computer program;
a processor for executing a computer program stored in the memory, the processor being configured to, when the computer program is executed:
obtaining message sequence data to be analyzed, the message sequence data being network traffic data that contains only the application layer protocol; segmenting the message sequence data to form a message fragment set, and constructing a heterogeneous message graph from the message sequence data and the message fragments;
inputting the constructed message graph to the message graph feature extraction neural network model Mess-GCN, extracting the adjacency relations among all nodes, and generating the feature vectors of the nodes; clustering based on the generated feature vectors and training Mess-GCN with the clustering loss function; finally, determining the category to which each message belongs, dividing the class clusters, inferring the syntax format within each message class cluster with the Needleman-Wunsch algorithm, and marking the field boundaries.
The beneficial effects of the invention are as follows. Industrial control protocol traffic is periodic, and its message content is disordered and abstract, with feature patterns that are hard to extract. In view of these characteristics, the invention provides a reverse analysis method for unknown industrial control protocols that is suited to industrial control systems. Its three key points are:
(1) Unlike traditional methods that focus on the sequence features of messages, the invention is the first to assume that ICP messages have graph-structured features. Based on this assumption it designs the message graph feature extraction model Mess-GCN, which uses a graph neural network to jointly learn low-dimensional embeddings of fields and messages and thereby reduces the time complexity of message clustering.
(2) Feature extraction and clustering are jointly optimized using the feedback between subtasks that traditional sequential analysis flows ignore. Based on a similarity measure, the clustering problem is converted into a classification problem of deciding whether a pair of message sequences belongs to the same message type, and this classification is optimized jointly with latent representation learning.
(3) Because some industrial control protocol fields consist of only a few bits, a heuristic message segmentation algorithm is proposed that provides the model with a finer-grained and more reasonable analysis unit.
Drawings
FIG. 1 is a flow chart of a reverse analysis method of an unknown industrial control protocol;
FIG. 2 is a schematic diagram of a neural network model for feature extraction of a message graph;
FIG. 3 is a schematic diagram of an unknown industrial control protocol reverse analysis device.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples.
In this embodiment, four public industrial network traffic data sets from different industrial scenarios are used to evaluate the proposed method comprehensively. Their protocol types are Modbus/TCP, DNP3.0, S7Comm, and EtherNet/IP/CIP. Each data set contains more than 10 kinds of industrial control commands and more than 1000 messages. The data distribution is shown in Table 1.
Table 1 industrial control protocol network traffic data
Each network traffic data set is de-duplicated and parsed with Wireshark, and the resulting field division is used as the ground truth for evaluating the accuracy of the method. In addition, 80% of the data is used for training and 20% for testing.
The reverse analysis method for an unknown protocol of an industrial control network comprises the following steps:
Step one: a message fragment set is formed from the network messages by the heuristic message segmentation algorithm, as follows:
Step 1.1, the application layer data of the network messages is extracted to obtain message sequences in hexadecimal format.
Step 1.2, field inference is first performed on all message sequences based on heuristic rules to obtain the common fields.
Some of the heuristic rules are as follows:
a) Identifying the protocol identifier field: because the content, length, and position of the protocol identifier field are fixed, it can be identified by checking whether the first 4 bytes of the message sequences are identical;
b) Identifying the length field: the messages are first divided into groups by length, and within each group it is checked whether the values of 2 or 4 consecutive bytes at the same position are the same. If the value is the same at that position within a group, differs at that position between groups, and is exactly the length of the messages in the group, then those 2 or 4 consecutive bytes can be identified as a length field.
Step 1.3, the remaining part of each message sequence is position-coded, and a 4-gram is applied to obtain a series of message fragments. Take the following original hexadecimal message sequence:
54 32 00 4e
After position coding, the sequence becomes: 054 132 200 34e. With n = 4, the 4-gram generates the following fragments: 5432, 4320, 3200, 2004, 004e.
Step 1.4, the common fields and the message fragments are de-duplicated to form the message fragment set, whose size is denoted v.
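The worked example of steps 1.3 and 1.4 can be reproduced with a short Python sketch. The function names are hypothetical. The fragments listed in the example correspond to a window of 4 hexadecimal characters sliding one character at a time over the raw byte content, so the sketch keeps position coding and n-gram extraction as two separate helpers; how the position prefix is carried into the fragments is not spelled out in the text and is left open here.

```python
def position_encode(hex_bytes):
    """Prefix the hex text of each byte with its index in the message."""
    return [f"{i}{b}" for i, b in enumerate(hex_bytes)]

def char_ngrams(text, n=4):
    """Slide a window of n characters over text, left to right, one step at a time."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fragment_set(common_fields, fragments):
    """De-duplicate common fields and fragments, keeping first-seen order."""
    return list(dict.fromkeys(list(common_fields) + list(fragments)))

message = "54 32 00 4e".split()
print(position_encode(message))          # ['054', '132', '200', '34e']
print(char_ngrams("".join(message), 4))  # ['5432', '4320', '3200', '2004', '004e']
```

Working at the level of hexadecimal characters rather than whole bytes lets a fragment start in the middle of a byte, which matches the stated goal of handling fields that occupy only a few bits.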
Step two: the messages and the message fragment set form a data set, and a heterogeneous message graph G(V, E) is constructed from the data set:
Step 2.1, each message and each fragment is regarded as a node (the number of nodes |V| = n is the number of messages plus the number of fragments), and edges are established between nodes according to whether a fragment appears in a message. The term frequency-inverse document frequency (TF-IDF) is used as the weight of a message-fragment edge, and the pointwise mutual information (PMI) as the weight of a fragment-fragment edge, computed as follows:

$$\mathrm{PMI}(s, t) = \log \frac{p(s, t)}{p(s)\, p(t)}, \qquad p(s, t) = \frac{\#W(s, t)}{\#W}, \qquad p(s) = \frac{\#W(s)}{\#W}$$

where #W(s) is the number of sliding windows that contain fragment s (the sliding window size is 15), #W(s, t) is the number of sliding windows that contain both fragment s and fragment t, and #W is the total number of sliding windows over the whole message set. In summary, the edge weights are calculated as follows:

$$A_{ij} = \begin{cases} \mathrm{PMI}(i, j), & i, j \text{ are fragments and } \mathrm{PMI}(i, j) > 0 \\ \text{TF-IDF}_{ij}, & \text{one of } i, j \text{ is a message and the other is a fragment} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$

where A is the adjacency matrix of graph G.
Step 2.2, after the graph is constructed, the feature vector of each node is initialized with a v-dimensional one-hot vector, so that the features of all nodes can be expressed as a matrix $X \in \mathbb{R}^{n \times v}$.
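The two kinds of edge weight in step 2.1 can be computed as sketched below. The function names are hypothetical, each message is assumed to be given as a non-empty list of its fragments, and the TF-IDF variant (relative term frequency times log(N / df)) is an assumption because the text does not state which variant is used.

```python
import math
from collections import Counter
from itertools import combinations

def sliding_windows(tokens, size):
    """All windows of `size` tokens; a short message yields one window."""
    if len(tokens) <= size:
        return [tokens]
    return [tokens[i:i + size] for i in range(len(tokens) - size + 1)]

def pmi_weights(docs, window=15):
    """Positive PMI for fragment pairs, keyed by the sorted pair (s, t)."""
    n_windows = 0
    single, pair = Counter(), Counter()
    for tokens in docs:
        for w in sliding_windows(tokens, window):
            n_windows += 1
            uniq = sorted(set(w))
            single.update(uniq)                  # #W(s)
            pair.update(combinations(uniq, 2))   # #W(s, t)
    weights = {}
    for (s, t), count in pair.items():
        # log( p(s,t) / (p(s) p(t)) ) with p(.) = count / #W
        pmi = math.log(count * n_windows / (single[s] * single[t]))
        if pmi > 0:                              # only positive PMI becomes an edge
            weights[(s, t)] = pmi
    return weights

def tfidf_weights(docs):
    """TF-IDF for (message index, fragment) pairs; the variant is an assumption."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = {}
    for d, tokens in enumerate(docs):
        for s, count in Counter(tokens).items():
            weights[(d, s)] = (count / len(tokens)) * math.log(len(docs) / df[s])
    return weights

docs = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "a"]]
print(pmi_weights(docs, window=2))
print(tfidf_weights(docs)[(2, "d")])
```

With this TF-IDF variant a fragment that occurs in every message gets weight 0, so in practice a smoothed inverse document frequency may be preferable.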
Step three: the heterogeneous message graph G(V, E) is input to the message graph feature extraction neural network model Mess-GCN for message clustering, and Mess-GCN is trained; its structure is shown in FIG. 2. The specific steps are as follows:
Step 3.1, the graph G is input to a multi-layer GCN to capture the feature information of neighboring nodes. For one GCN layer, the new h-dimensional node features are computed as:

$$H^{(1)} = \rho\left(\tilde{A} X W_0\right)$$

where $\tilde{A} = D^{-1/2} A D^{-1/2}$ is the normalized symmetric adjacency matrix (D being the degree matrix of A), X is the node feature matrix, $W_0$ is a weight matrix, ρ is an activation function such as the ReLU function ρ(x) = max(0, x), and $H^{(1)}$ is the output of the layer. To capture information from higher-order neighbors, the invention builds a two-layer GCN: the node embedding dimension of the first layer is 64, the node embedding dimension of the second layer is the number k of message categories, and the output is normalized with the softmax function to generate the category feature of each message sequence:

$$Z = \mathrm{softmax}\left(\tilde{A}\, \rho\left(\tilde{A} X W_0\right) W_1\right)$$

where $W_1$ is the weight matrix of the second layer and the i-th row of Z is the category feature $l_i$ of node i.
Step 3.2, the messages are clustered: the category features $l_i$ are used as the input of the message clustering stage, whose loss function is as follows:

$$\min_{w} E(w) = \sum_{i, j} L\left(r_{ij},\, g(x_i, x_j; w)\right)$$

where $g(x_i, x_j; w)$ is the similarity between the category features (i.e. $l_i$, $l_j$) of samples $x_i$, $x_j$; the invention uses cosine similarity. $r_{ij}$ is an unknown binary variable: $r_{ij} = 1$ when messages $x_i$, $x_j$ belong to the same type, and $r_{ij} = 0$ otherwise. $L(r_{ij}, g(x_i, x_j; w))$ is the loss between $r_{ij}$ and $g(x_i, x_j; w)$, and w denotes the model parameters. The value of $r_{ij}$ is determined as follows:

$$r_{ij} = \begin{cases} 1, & g(x_i, x_j; w) > \mu(\lambda) \\ 0, & g(x_i, x_j; w) \le \eta(\lambda) \\ \text{None}, & \text{otherwise} \end{cases}$$

where λ is a hyperparameter of the Mess-GCN model that controls the selection of samples. μ(λ) and η(λ) are the thresholds for selecting similar samples and dissimilar samples, respectively; they satisfy the constraint μ(λ) ≥ η(λ) and take the values 0.99 and 0.95. When $r_{ij}$ is None, the current sample $(x_i, x_j, r_{ij})$ does not take part in the training of the model.
Step 3.3, updating model weight matrix parameters, wherein the iterative rule of lambda is as follows:
where η=0.01 is the learning rate of λ.
Step 3.4, steps 3.1 to 3.3 are repeated until the model reaches the maximum of 400 training rounds.
Step 3.5, the class label corresponding to each message sequence is output, so that the message set is divided into class clusters. The category is assigned as follows:

$$t_i = \arg\max_{h}\; l_{ih}, \qquad h = 1, \dots, k$$

where $t_i$ is the class cluster to which the current sample belongs and h is the index of the elements in the category feature $l_i$.
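The label assignment of step 3.5 amounts to taking the index of the largest element of each category feature and grouping messages by that index. A minimal sketch, with hypothetical function names and zero-based indices:

```python
def assign_clusters(class_features):
    """For each category feature l_i, return the index h of its largest element."""
    return [max(range(len(row)), key=row.__getitem__) for row in class_features]

def group_by_cluster(labels):
    """Map each cluster label to the list of message indices assigned to it."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    return clusters

features = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1], [0.2, 0.5, 0.3]]
labels = assign_clusters(features)
print(labels)                    # [1, 0, 1]
print(group_by_cluster(labels))  # {1: [0, 2], 0: [1]}
```

The resulting groups are the class clusters within which the sequence alignment of step four is run.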
Step four: the trained model weight matrix parameters are loaded to obtain the class cluster division of the test data; within each message class cluster, the Needleman-Wunsch algorithm is used to infer the syntax format and mark the field boundaries. The specific steps of the Needleman-Wunsch algorithm are as follows:
Step 4.1, taking two messages in the current class cluster as an example and assuming their lengths are n and m respectively, a scoring matrix of size (n+1) × (m+1) is established with its elements initialized to 0.
Step 4.2, each pair of characters from the two message sequences is compared, and the scoring matrix F is filled according to the following scoring rule:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + S_{ij} \\ F(i-1, j) + W \\ F(i, j-1) + W \end{cases}$$

where $S_{ij}$ and W are the character-match reward and the gap penalty, respectively.
Step 4.3, the scoring matrix is traversed along a path from its lower right corner to its upper left corner to construct the aligned sequences.
Step 4.4, steps 4.1 to 4.3 are repeated until all message sequences in the test message data set are aligned, and the boundaries of the protocol fields are obtained by taking the gaps as separators.
To verify the effectiveness of the proposed framework, a set of experiments compares the model with two protocol reverse analysis methods, the classical Netzob and the more recent NetPlier, on the format inference results for data sets of different protocol types.
One protocol is trained and analyzed at a time, with 80% of the data used for training and 20% for testing. Specifically, a heterogeneous message graph (containing message nodes and field nodes) is first constructed for the whole data set, and the corresponding training-set and test-set masks are generated. The model is then trained on the training data; the trained model is loaded and the test data is input to obtain the type label of each message in the test set, and the test set is divided into class clusters according to these labels. Finally, a sequence alignment algorithm is applied within each class cluster to obtain the field boundaries of each message sequence. The invention uses accuracy (ACC), adjusted Rand index (ARI), and normalized mutual information (NMI) as the evaluation indexes of the message clustering stage, and correctness (Corr) and perfection (Perf) as the evaluation indexes of protocol field boundary division. The experimental results are shown in Tables 2 and 3.
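Of the clustering indexes named above, the adjusted Rand index is compact enough to sketch directly. The code uses the standard pair-counting formula and a hypothetical function name; it is an illustration of the metric, not the evaluation code behind the reported results, and it assumes at least two samples.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(true_labels)
    contingency = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in Counter(true_labels).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)   # chance-level agreement
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (sum_cells - expected) / (maximum - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5
```

The first call shows that the index ignores how clusters are named, which matters here because the predicted cluster indices carry no inherent meaning.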
TABLE 2 message clustering effects
TABLE 3 Industrial control protocol Format inference effects
For message clustering, the invention extracts the graph structural features within individual messages and learns them jointly with the clustering stage, which yields a better hidden-layer representation of messages and a better clustering result. Compared with traditional methods that extract message features manually and cluster messages through a multi-step pipeline, the proposed clustering model outperforms the baseline models overall. The protocol specifications of Modbus and DNP3.0 are much simpler than those of S7Comm and EtherNet/IP/CIP, so their internal structural features are less rich, whereas the proposed model focuses mainly on the internal structural features of individual messages. As a result, some individual metrics are lower on the Modbus and DNP3.0 data sets, although the overall result is still better than that of the baseline methods. On the S7Comm and EtherNet/IP/CIP data sets in particular, every metric improves by at least 15%. In addition, to show the benefit of the changed processing flow, feature extraction and clustering were trained separately and applied to each protocol in a pipeline manner, giving the comparison method Pipeline. All of its metrics are weaker than those of the invention, but its overall performance is still better than that of the baseline methods.
To further demonstrate the advantage of the clustering method, protocol formats were inferred from the clustering results obtained. As Table 3 shows, the method outperforms the baseline methods; as with the clustering results, some individual metrics are lower on the Modbus and DNP3.0 data sets, while all metrics improve on the S7Comm and EtherNet/IP/CIP data sets. Owing to the inherent limitations of protocol reverse analysis based on network messages, all methods score low on the Perf metric.
In summary, the invention adopts joint optimization of graph structural features and subtasks, and proposes a message graph feature extraction model and a progressive message clustering model. Both contribute positively to the final protocol reverse analysis result, and the protocol format inference accuracy across the data sets reaches up to 77.12%.
The above preferred embodiments are only for illustrating the technical concept and features of the present invention, and are intended to enable those skilled in the art to understand the present invention and to implement the same, but are not intended to limit the scope of the present invention, and all equivalent changes or modifications made according to the essence of the present invention are included in the scope of the present invention.

Claims (10)

1. A reverse analysis method for an unknown industrial control protocol, characterized by comprising the following steps:
forming a message fragment set from network messages based on a heuristic message segmentation algorithm; forming a data set from the messages and the message fragment set, and constructing a heterogeneous message graph G(V, E) from the data set; inputting the heterogeneous message graph G(V, E) into a message graph feature extraction neural network model Mess-GCN to perform message clustering and to train the Mess-GCN; and inputting the data set to be clustered into the trained Mess-GCN to divide it into class clusters, inferring the syntax format within each message class cluster with the Needleman-Wunsch algorithm, and marking the field boundaries.
2. The unknown industrial control protocol reverse analysis method according to claim 1, wherein forming the message fragment set from network messages based on the heuristic message segmentation algorithm specifically comprises:
step 1.1, extracting the application-layer data of the network messages to obtain message sequences in hexadecimal format, the message sequences being of K types in total;
step 1.2, performing field inference on all message sequences based on heuristic rules to obtain common fields, the common fields comprising a protocol identifier and a length field;
step 1.3, position-coding the remainder of each message sequence outside the common fields, and applying an n-gram to obtain a plurality of message fragments;
step 1.4, de-duplicating the common fields and the message fragments to form the message fragment set.
3. The unknown industrial control protocol reverse analysis method according to claim 2, wherein the heuristic rules are as follows:
rule a) identifying the protocol identifier field: when the character content of the first N bytes is identical across the message sequences, those first N bytes are the protocol identifier field;
rule b) identifying the length field: dividing the message sequences into groups by length, and checking, within each group, whether the value Q of M consecutive bytes at the same position is identical; when Q is identical at that position within each group, differs between groups, and equals the message length of the corresponding group, those M consecutive bytes are identified as the length field.
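The two heuristic rules of this claim can be sketched in Python as follows. This is only an illustrative reading: the function names are hypothetical, the field value Q is assumed to be a big-endian unsigned integer, and Q is compared against the total message length exactly as the rule is worded.

```python
from collections import defaultdict


def common_prefix_len(messages):
    """Rule a: count the leading bytes that are identical in every message."""
    shortest = min(len(m) for m in messages)
    for i in range(shortest):
        if any(m[i] != messages[0][i] for m in messages):
            return i
    return shortest


def find_length_field(messages, width=1):
    """Rule b: offsets where `width` bytes equal the message length in every group."""
    groups = defaultdict(list)
    for m in messages:
        groups[len(m)].append(m)
    if len(groups) < 2:
        # The value must differ between groups, so one group is not enough.
        return []
    shortest = min(groups)
    hits = []
    for offset in range(shortest - width + 1):
        # If Q equals the group length everywhere, it is automatically the same
        # within a group and different between groups.
        if all(int.from_bytes(m[offset:offset + width], "big") == length
               for length, members in groups.items() for m in members):
            hits.append(offset)
    return hits


sample = [bytes([0xAA, 0xBB, 4, 1]),
          bytes([0xAA, 0xBB, 5, 1, 2]),
          bytes([0xAA, 0xBB, 4, 9])]
prefix_len = common_prefix_len(sample)
length_offsets = find_length_field(sample, width=1)
```

Real protocols often encode the number of remaining bytes rather than the total length, so a practical check might also test the length minus a constant offset; that variation is outside what the rule states.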
4. The unknown industrial control protocol reverse analysis method according to claim 2, wherein step 1.3 is specifically performed as follows: concatenating the position index of each byte in the original message sequence with the character content of that byte, and then sliding a window of size n from left to right over the sequence to extract the values.
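A minimal sketch of the position coding and n-gram extraction described in this claim is given below. The function name, the `position:hexbyte` token format, and the `|` joiner are assumptions chosen for readability; the claim only requires that the position index and the byte content be concatenated before the sliding window is applied.

```python
def positional_ngrams(payload, start=0, n=2):
    """Tag each byte from `start` onward with its position, then take n-grams."""
    # Position index concatenated with the byte content, e.g. "3:0d".
    tokens = [f"{pos}:{payload[pos]:02x}" for pos in range(start, len(payload))]
    # Slide a window of size n from left to right over the tagged bytes.
    return ["|".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# `start` skips the common fields already identified by the heuristic rules.
fragments = positional_ngrams(bytes.fromhex("0a0b0c0d"), start=1, n=2)
```

Because the position is part of each token, identical byte values at different offsets produce different fragments, which keeps positional information in the graph nodes.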
5. The reverse analysis method according to any one of claims 2 to 4, wherein constructing the heterogeneous message graph G(V, E) specifically comprises:
step 2.1, treating each message and each message fragment as a graph node, the number of nodes |V| = n being the number of messages plus the number of message fragments; establishing an edge between nodes when a message fragment appears in a message; using the term frequency-inverse document frequency (TF-IDF) as the weight of message-fragment edges and the pointwise mutual information (PMI) as the weight of fragment-fragment edges, computed as follows:
$$\mathrm{PMI}(s,t) = \log\frac{p(s,t)}{p(s)\,p(t)}, \qquad p(s,t) = \frac{\#W(s,t)}{\#W}, \qquad p(s) = \frac{\#W(s)}{\#W}$$
where #W(s) is the number of sliding windows that contain message fragment s, #W(s, t) is the number of sliding windows that contain both message fragment s and message fragment t, and #W is the total number of sliding windows over the entire message fragment set;
the edge weights are calculated as follows:
$$A_{ij} = \begin{cases} \mathrm{PMI}(i,j), & i, j \text{ are message fragment nodes and } \mathrm{PMI}(i,j) > 0 \\ \text{TF-IDF}_{ij}, & i, j \text{ are nodes of different types} \\ 1, & i = j \\ 0, & \text{otherwise} \end{cases}$$
where A denotes the adjacency matrix of the heterogeneous message graph G, whose entries are the calculated edge weights, and i and j denote graph nodes; when i and j are both message fragment nodes and their PMI value is positive, the weight of the edge between them is the PMI value; when i and j are nodes of different types, the weight of the edge between them is the TF-IDF value; when i and j are the same node, the edge weight is 1; for any edge that meets none of these conditions, the weight is 0;
step 2.2, initializing the feature vector of each node with an m-dimensional one-hot vector, and expressing the features of all nodes as a matrix, thereby generating the heterogeneous message graph.
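The fragment-fragment edge weights of step 2.1 can be sketched in Python as follows, assuming the usual definition of pointwise mutual information over sliding windows. The function name and the input format (a list of windows, each a list of fragment identifiers) are hypothetical. Only pairs with a positive PMI are kept, matching the rule that other edges receive weight 0.

```python
from collections import Counter
from itertools import combinations
from math import log


def pmi_edges(windows):
    """Return {(s, t): PMI} for fragment pairs whose PMI is positive."""
    total = len(windows)                      # #W
    single = Counter()                        # #W(s)
    pair = Counter()                          # #W(s, t)
    for window in windows:
        items = sorted(set(window))           # count each fragment once per window
        single.update(items)
        pair.update(combinations(items, 2))   # keys are ordered pairs (s < t)
    edges = {}
    for (s, t), count in pair.items():
        p_st = count / total
        p_s, p_t = single[s] / total, single[t] / total
        pmi = log(p_st / (p_s * p_t))
        if pmi > 0:                           # non-positive PMI means no edge
            edges[(s, t)] = pmi
    return edges


toy_windows = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "c"]]
toy_edges = pmi_edges(toy_windows)
```

In the toy input the pair ("a", "c") co-occurs less often than chance would predict, so it receives no edge, while ("a", "b") and ("c", "d") do.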
6. The unknown industrial control protocol reverse analysis method according to claim 5, wherein the message graph feature extraction neural network model Mess-GCN consists of two graph convolutional network (GCN) layers, the node embedding size of the second GCN layer equals the number K of message types, and normalization is performed by a softmax function to generate the category feature of each message node:
for the first GCN layer, new h-dimensional node features are obtained:
where the adjacency matrix is normalized symmetrically, W_0 is a weight matrix, ρ denotes an activation function, and the second GCN layer serves as the output layer.
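As a hedged illustration of the two-layer structure described in this claim, the standard graph convolutional network formulation can be written as below. The symbols D (degree matrix), X (node feature matrix), W_1 (second-layer weights), L^{(1)} (first-layer output), and Z (softmax output) are assumptions introduced for the sketch; only the normalized symmetric adjacency matrix, W_0, and ρ are named in the claim.

```latex
\tilde{A} = D^{-1/2} A D^{-1/2}, \qquad D_{ii} = \sum_{j} A_{ij}

L^{(1)} = \rho\big(\tilde{A}\, X\, W_0\big), \qquad
X \in \mathbb{R}^{n \times m},\; W_0 \in \mathbb{R}^{m \times h}

Z = \mathrm{softmax}\big(\tilde{A}\, L^{(1)}\, W_1\big), \qquad
W_1 \in \mathbb{R}^{h \times K}
```

Under these assumptions each row of Z is a K-dimensional probability vector, which fits the requirement that the second-layer embedding size equal the number K of message types.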
7. The unknown industrial control protocol reverse analysis method according to claim 6, wherein training the message graph feature extraction neural network model Mess-GCN specifically comprises:
step 3.1, inputting the heterogeneous message graph G into the Mess-GCN and outputting the K-dimensional feature vectors l_i corresponding to all message nodes; the feature vectors of message fragment nodes do not participate in the subsequent calculation;
step 3.2, clustering the messages: taking the feature vectors l_i as the input of the message clustering stage, the loss of the Mess-GCN model is calculated, the loss function being as follows:
where g(x_i, x_j; w) denotes the similarity between the category features l_i and l_j of samples x_i and x_j, computed as cosine similarity; r_ij is an unknown binary variable, with r_ij = 1 indicating that messages x_i and x_j belong to the same type and r_ij = 0 otherwise; L(r_ij, g(x_i, x_j; w)) is the loss between r_ij and g(x_i, x_j; w); w denotes the model parameters of the Mess-GCN; the value of r_ij is determined as follows:
where λ is a hyperparameter of the Mess-GCN model that controls the selection of samples; μ(λ) and η(λ) are the threshold for selecting similar samples and the threshold for selecting dissimilar samples, respectively, and satisfy the constraint μ(λ) ≥ η(λ); when g(x_i, x_j; w) > μ(λ), the two samples x_i and x_j are similar; when g(x_i, x_j; w) ≤ η(λ), the two samples are dissimilar; when r_ij is None, the current sample (x_i, x_j, r_ij) does not participate in the training of the Mess-GCN model;
step 3.3, updating the weight matrix parameters of the Mess-GCN model, the iteration rule for λ being as follows:
where η is the learning rate of λ;
step 3.4, repeating steps 3.1-3.3 until the Mess-GCN model reaches the maximum number of training epochs.
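The sample-selection rule of step 3.2 can be sketched in Python as follows. The cosine similarity and the two-threshold labelling follow the wording of the claim. The binary cross-entropy used as the pairwise loss is an assumption, as are the function names and the plain numeric thresholds `upper` and `lower`, which stand in for μ(λ) and η(λ).

```python
from math import log, sqrt


def cosine(u, v):
    """Cosine similarity between two category feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def pair_label(similarity, upper, lower):
    """r_ij: 1 for similar pairs, 0 for dissimilar pairs, None if undecided."""
    if similarity > upper:
        return 1
    if similarity <= lower:
        return 0
    return None


def pairwise_loss(features, upper, lower, eps=1e-12):
    """Sum an assumed binary cross-entropy over the pairs selected for training."""
    total, used = 0.0, 0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            g = cosine(features[i], features[j])
            r = pair_label(g, upper, lower)
            if r is None:
                continue  # undecided pairs are left out of this training round
            g = min(max(g, eps), 1 - eps)  # keep the logarithms finite
            total += -(r * log(g) + (1 - r) * log(1 - g))
            used += 1
    return total, used


# Toy softmax-like outputs for three messages (hypothetical values).
toy_features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
toy_loss, toy_pairs = pairwise_loss(toy_features, upper=0.95, lower=0.3)
```

With these toy values one pair falls between the two thresholds and is skipped, which illustrates how the thresholds admit only confidently labelled pairs in a given round.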
8. The unknown industrial control protocol reverse analysis method according to claim 6, wherein the class cluster division specifically comprises outputting the category label corresponding to each message, so that all messages are divided into different class clusters according to their labels; the category is assigned as follows:
$$t_i = \arg\max_{h}\; l_{ih}$$
where t_i denotes the class cluster to which the current sample belongs, and h is the index of the elements of the category feature l_i.
9. The unknown industrial control protocol reverse analysis method according to claim 8, wherein the Needleman-Wunsch algorithm specifically comprises the following steps:
step 4.1, for any two messages in the current class cluster, of lengths β_1 and β_2 respectively, creating a scoring matrix of size (β_1 + 1) × (β_2 + 1) with all elements initialized to 0;
step 4.2, comparing the characters of the two message sequences position by position, then scoring and filling the scoring matrix according to the scoring rule, which is as follows:
$$F(i,j) = \max\big\{\,F(i-1,j-1) + S_{ij},\; F(i-1,j) + W,\; F(i,j-1) + W\,\big\}$$
where F(i, j) is the entry of the scoring matrix at row i and column j, and S_ij and W denote the reward for a character match and the gap penalty, respectively;
step 4.3, tracing back along a path from the lower-right corner to the upper-left corner of the scoring matrix to construct the aligned sequences;
step 4.4, repeating steps 4.1-4.3 until all message sequences in the test message data set are aligned, and obtaining the protocol field boundaries by using the gaps as separators.
10. An unknown industrial control protocol reverse analysis device, comprising:
a network module for capturing network traffic data;
a memory for storing the captured network traffic data and a computer program;
a processor for executing the computer program stored in the memory, the processor being configured, when the computer program is executed, to:
obtain the message sequence data to be analyzed, the message sequence data being network traffic data that contains only the application-layer protocol; segment the message sequence data to form a message fragment set, and construct a heterogeneous message graph from the message sequence data and the message fragments;
input the constructed message graph into the message graph feature extraction neural network model Mess-GCN, extract the adjacency relations among all nodes, and generate the feature vectors of the nodes; cluster based on the generated feature vectors and train the Mess-GCN using its loss function; and finally determine the category to which the data belongs, divide the class clusters, infer the syntax format within each message class cluster with the Needleman-Wunsch algorithm, and mark the field boundaries.
CN202311243871.1A 2023-09-26 2023-09-26 Reverse analysis method and device for unknown industrial control protocol Pending CN117354207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311243871.1A CN117354207A (en) 2023-09-26 2023-09-26 Reverse analysis method and device for unknown industrial control protocol

Publications (1)

Publication Number Publication Date
CN117354207A true CN117354207A (en) 2024-01-05

Family

ID=89368247

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311243871.1A Pending CN117354207A (en) 2023-09-26 2023-09-26 Reverse analysis method and device for unknown industrial control protocol

Country Status (1)

Country Link
CN (1) CN117354207A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117640476A (en) * 2024-01-23 2024-03-01 中国人民解放军61660部队 Small sample application layer protocol identification method based on relational network

Similar Documents

Publication Publication Date Title
CN112508085B (en) Social network link prediction method based on perceptual neural network
CN112035669A (en) Social media multi-modal rumor detection method based on propagation heterogeneous graph modeling
CN111327608B (en) Application layer malicious request detection method and system based on cascade deep neural network
US20120210426A1 (en) Analysis system for unknown application layer protocols
CN111314279B (en) Unknown protocol reverse method based on network flow
CN114553983B (en) Deep learning-based high-efficiency industrial control protocol analysis method
CN117354207A (en) Reverse analysis method and device for unknown industrial control protocol
CN112468347A (en) Security management method and device for cloud platform, electronic equipment and storage medium
Thaler et al. Towards a neural language model for signature extraction from forensic logs
CN111651566A (en) Multi-task small sample learning-based referee document dispute focus extraction method
CN115659966A (en) Rumor detection method and system based on dynamic heteromorphic graph and multi-level attention
CN115357904A (en) Multi-class vulnerability detection method based on program slice and graph neural network
CN116561748A (en) Log abnormality detection device for component subsequence correlation sensing
CN116502162A (en) Abnormal computing power federal detection method, system and medium in edge computing power network
CN117670571B (en) Incremental social media event detection method based on heterogeneous message graph relation embedding
CN104468276A (en) Network traffic identification method based on random sampling multiple classifiers
CN112015890A (en) Movie scenario abstract generation method and device
CN113852605B (en) Protocol format automatic inference method and system based on relation reasoning
CN115334179B (en) Unknown protocol reverse analysis method based on named entity recognition
CN111159370A (en) Short-session new problem generation method, storage medium and man-machine interaction device
CN114386436B (en) Text data analysis method, model training method, device and computer equipment
CN113177164B (en) Multi-platform collaborative new media content monitoring and management system based on big data
CN115767546A (en) 5G network security situation assessment method for quantifying node risks
CN115878800A (en) Double-graph neural network fusing co-occurrence graph and dependency graph and construction method thereof
CN113342982B (en) Enterprise industry classification method integrating Roberta and external knowledge base

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination