CN115242424A

CN115242424A - Private network protocol classification method based on state machine subgraph isomorphic matching

Info

Publication number: CN115242424A
Application number: CN202210607396.0A
Authority: CN
Inventors: 陈烨; 宋宇波; 蔡义涵
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2022-05-31
Filing date: 2022-05-31
Publication date: 2022-10-25

Abstract

The invention discloses a private protocol classification method based on subgraph isomorphism matching of protocol state machines. In the private protocol state machine generation stage, the captured device network traffic is first parsed to extract the binary sequence of the protocol application layer; protocol field contents are then parsed from the binary sequence with a sequence alignment algorithm; similar fields are merged with a clustering algorithm to obtain the private protocol message types and message formats; finally, states are merged over the device interaction messages using an augmented prefix tree structure to generate the protocol state machine. In the state machine matching and classification stage, the state machine graph is matched against the state machines of known standard protocols with a subgraph isomorphism matching algorithm, achieving rapid classification of the private protocol. The invention helps a device protocol analyst quickly locate the standard protocol that a network device manufacturer referred to when designing its self-defined private protocol.

Description

Private network protocol classification method based on state machine subgraph isomorphic matching

Technical Field

The invention belongs to the field of network security, and particularly relates to a private protocol identification method based on isomorphic matching of a protocol state machine sub-graph.

Background

The future industrial internet is a brand-new industrial ecology, key infrastructure and novel application mode formed by the deep integration of new-generation information and communication network technology with industrial manufacturing. Through the safe and reliable intelligent interconnection of people, machines and things, it connects all production elements, all industrial chains and all value chains, drives a fundamental change in the production mode and enterprise form of the manufacturing industry, forms a brand-new industrial manufacturing and service system, and markedly raises the level of digital, networked and intelligent development of the manufacturing industry. However, security problems are gradually being exposed, and increasingly frequent industrial internet security incidents indicate that the industrial internet contains a large number of security vulnerabilities and hidden dangers. To reduce future security incidents in industrial control systems and make their data transmission safer, it is of great significance to take the transmission protocols used in industrial control systems as the target, study their possible security problems, uncover possible vulnerabilities in industrial control protocols, and thereby maintain the security of industrial control systems.

For commercial and security reasons, most protocols used in the industrial internet have no published standard documents, and these nonstandard, unknown private industrial control protocols pose great challenges to network behavior management, intrusion detection, fuzz testing and similar tasks in the industrial internet. Security research on these protocols requires knowledge about them, and a common way to obtain it is protocol reverse engineering.

A typical example of protocol reverse engineering is the Protocol Informatics (PI) project started by Marshall Beddoe in 2004. The PI project searches a large number of network data streams for sequences with special significance and obtains the protocol structure through comparative analysis of a large amount of network data. Scholars have since carried out much research based on the PI project; ScriptGen, the tool by Leita et al. for automatically extracting Honeyd configuration scripts, is based on the algorithm flow of the PI project. Another significant message sequence analysis method is Discoverer, the protocol reverse engineering scheme proposed by Cui et al., which performs sequence alignment with fields as the basic unit and is therefore more targeted. Later, as artificial intelligence technologies such as natural language processing and machine learning matured, protocol reverse analysis developed rapidly: the research objects expanded from text protocols to binary protocols, and the range of methods grew, including methods based on sequence alignment such as Netzob and MSA-HMM, methods based on probabilistic models such as Biprominer, ProGraph and HsMM, methods based on frequent sets such as AutoReEngine, and methods based on semantic analysis such as FieldHunter. In recent years, new methods have been proposed, such as NetPlier, which uses a probabilistic protocol keyword extraction algorithm based on a clustering effect metric.

In recent years, interest in graphs has grown steadily in the scientific communities active in pattern analysis, pattern recognition and computer vision, and applications using graphs have multiplied. Graphs are commonly used to give a structural description of an image by decomposing the image into parts and associating the nodes and edges of the graph with the components and their relationships. From the point of view of pattern recognition, the most important problem in graph processing is graph matching, that is, graph or subgraph comparison. Subgraph isomorphism is a very important problem in graph theory, and many papers addressing it have been published over the years. Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ and a mapping $M \subseteq V_1 \times V_2$, G1 and G2 are said to be isomorphic if and only if M is a bijection and the corresponding edges are also in bijection. If M is a bijection between a subgraph of G1 and a subgraph of G2, the subgraph of G1 and the subgraph of G2 are said to be isomorphic. The backtracking algorithm proposed by Ullmann is the earliest algorithm able to find subgraph isomorphisms, and it significantly reduces the size of the search space. Other algorithms include the Nauty algorithm, the SD algorithm, the VF algorithm, and the VF2 algorithm, an improvement on the VF algorithm.

Existing nonstandard, unknown private industrial control protocols are generally adapted from relatively mature public protocols, so the efficiency of protocol security research can be improved if the public protocol underlying an unknown protocol can be identified. Conventional methods cannot identify the public protocol behind an unknown protocol quickly and effectively. To solve this problem, the invention exploits the fact that the graph structures of the state machine of the protocol under analysis and of the standard protocol it refers to are highly similar, i.e. the state machine obtained by reverse analysis of captured traffic is only a sub-state machine of the original protocol, and provides a private protocol classification method based on subgraph isomorphism matching of protocol state machines. The invention uses subgraph isomorphism matching to accomplish protocol identification.

Disclosure of Invention

Purpose of the invention: the technical problem to be solved is how to help a device protocol analyst quickly locate the standard protocol that a network device manufacturer referred to when designing its self-defined private protocol, thereby assisting further analysis of the protocol. The invention provides a private network protocol classification method based on state machine subgraph isomorphism matching, which identifies the protocol type by analyzing only the captured private protocol traffic.

The above purpose is realized by the following technical scheme:

a method for classifying private network protocols based on state machine subgraph isomorphic matching comprises a private protocol state machine generation stage and a state machine matching classification stage, and comprises the following specific steps:

stage one: protocol state machine generation phase

Step 1: message alignment

Step 1.1: analyzing the captured device network flow, and extracting a binary sequence of a protocol application layer part;

Step 1.2: analyzing the protocol field contents from the binary sequence with a sequence alignment algorithm to obtain aligned similar protocol fields, which serve as key field candidates;

Step 2: Message clustering

The step further utilizes a clustering algorithm to realize merging of similar fields to obtain the type and the format of the private protocol message;

step 2.1: establishing a matching score table

Firstly, calculating the matching score of each message sequence with other message sequences by the method of the step 1.2, and establishing a matching score table for each message sequence;

step 2.2: hierarchical clustering

The message sequence with the highest score is selected from the matching score table together with the matching score of the two sequences, and the two messages are assigned to the same class; the triple of each message sequence is then determined in the same way, and clustering is performed with a hierarchical clustering algorithm;

Step 3: Key field extraction

Key fields are extracted as follows: the distribution of each field within each class produced by clustering is counted by a probability distribution statistical method, the variance of the distribution is calculated, and the field with the minimum variance is determined to be the message key field, which identifies the message type produced by clustering;

Step 4: Protocol state machine inference

The purpose of State Machine inference is to infer a Finite State Machine (FSM), which can recognize different message classes and fully describe the states of a protocol session and the conditions and paths of State transitions;

step 4.1: constructing augmented prefix trees

According to the results of message clustering and key field extraction, the messages in the input session set are built into an Augmented Prefix Tree (APTA) according to message order and message type.

The augmented prefix tree is a finite automaton that is not completely deterministic, and its state transition graph is represented as a tree. The tree is built from an initial state: the data stream is split into sessions, each session forms one branch of the tree, and messages that coincide in two sessions share the same branch. All states in the augmented prefix tree are labeled null by default, which easily leads to excessive merging in the subsequent state merging process; each state therefore needs to be labeled.
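As an illustration only, the construction described above can be sketched as follows. The class name `Node`, the helper names and the message type names are hypothetical and are not taken from the invention; each session is assumed to be already reduced to a list of message types.

```python
class Node:
    """One state of the prefix tree."""
    def __init__(self):
        self.children = {}   # message type -> child Node
        self.label = None    # "null" by default; filled in by state labeling

def build_prefix_tree(sessions):
    """Build a prefix tree in which sessions with a common prefix share a branch."""
    root = Node()
    for session in sessions:
        node = root
        for msg_type in session:
            if msg_type not in node.children:
                node.children[msg_type] = Node()
            node = node.children[msg_type]
    return root

def count_states(node):
    """Number of states in the tree rooted at node."""
    return 1 + sum(count_states(child) for child in node.children.values())

# Hypothetical sessions, one list of message types per session.
sessions = [["HELLO", "AUTH", "DATA"],
            ["HELLO", "AUTH", "BYE"],
            ["HELLO", "BYE"]]
tree = build_prefix_tree(sessions)
```

The three sessions share the `HELLO` branch, and the first two also share `AUTH`, so the tree has fewer states than the total number of messages.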

Step 4.2: State labeling

The purpose of state labeling is to prevent states from being merged excessively. State labeling extracts a prerequisite model from the observed sessions, expressed by a regular expression of the following form:

$.^{*}\,(r_1 \mid \dots \mid r_k)\,(a_1 \mid \dots \mid a_j)^{*}, \quad (r, a_1, \dots, a_j \in M)$

In the formula, "." stands for a message of any type and "*" for any number of repetitions of the preceding part; states with such a model are said to have prerequisites. $r_1, \dots, r_k$ denote the message types that must be received, and $a_1, \dots, a_j$ denote the message types that may be received after r. A prerequisite states that, before executing a message sequence q, the server only needs to receive a message of any one of the types $r_1, \dots, r_k$ in order to reach the state that accepts the message sequence q;

once the prerequisites of all message sequences are expressed, labeling all states in the augmented prefix tree, wherein the label of each state is a set of all allowed input information types of the state;
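A minimal sketch of how such a prerequisite could be checked and turned into a state label is given below. It assumes that every message type is encoded as a single character and that the label of a state is computed from the message history on the path from the root; the function names and the example types `L`, `D`, `K` are hypothetical.

```python
import re

def prerequisite_regex(required, allowed_after):
    """Compile .*(r1|...|rk)(a1|...|aj)* for single-character message types."""
    r = "|".join(re.escape(t) for t in required)
    a = "|".join(re.escape(t) for t in allowed_after)
    tail = f"({a})*" if allowed_after else ""
    return re.compile(f".*({r}){tail}")

def state_label(history, prerequisites):
    """Set of message types whose prerequisite is satisfied by the history.

    prerequisites maps a message type to a compiled regex, or to None when
    the type has no prerequisite (assumption: it is then always allowed).
    """
    text = "".join(history)
    return frozenset(t for t, rx in prerequisites.items()
                     if rx is None or rx.fullmatch(text))

# Hypothetical protocol: D (data) and K (keepalive) require a prior L (login).
needs_login = prerequisite_regex(["L"], ["D", "K"])
prerequisites = {"L": None, "D": needs_login, "K": needs_login}
```

With these definitions the root state is labeled with `L` only, while any state reached after a login is labeled with all three types, so the two states would not be merged later.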

step 4.3: state machine simplification

Stage two: state machine matching classification phase

The subgraph isomorphism problem can be described as follows. Given two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ and a mapping $M \subseteq V_1 \times V_2$, where $V_1$ and $V_2$ denote the node sets of G1 and G2 and $E_1$ and $E_2$ denote their edge sets, G1 and G2 are said to be isomorphic if and only if M is a bijection and the corresponding edges are also in bijection. If M is a bijection between a subgraph of G1 and a subgraph of G2, the subgraph of G1 and the subgraph of G2 are said to be isomorphic;

The subgraph isomorphism matching algorithm uses a State Space Representation (SSR) of the matching process, allows syntactic and semantic comparison of the node pairs to be matched at the same time, and prunes the search tree with feasibility rules. A state space expression consists of a state equation and an output equation; it describes the controlled system completely in the state space, reflects the changes of all independent variables of the system, thereby determines all internal motion states of the system, and handles initial conditions conveniently. With the state space representation, each state s of the matching process is associated with a partial mapping M(s), which contains only a subset of the final mapping M. For the two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ being matched, the parts made up of the nodes associated with the partial mapping M(s) are denoted G1(s) and G2(s). By the definition used in the algorithm, the transition from a state s to a successor state s' adds a pair of matched nodes (n, m) to the partial graphs associated with s, followed by a consistency check that produces s'. The algorithm introduces a set of rules for this consistency check; it can be proved that, for both graph isomorphism and graph-subgraph isomorphism, the partial graphs G1(s) and G2(s) associated with M(s) are isomorphic once the consistency conditions are satisfied. In addition, a set of k-look-ahead rules is added, which detect in advance whether a consistent state s has no consistent successor state after k steps; this further reduces the number of states generated and greatly lowers the time complexity of the algorithm.

Step 1: converting the protocol state machine obtained in stage one into a directed graph;

Step 2: matching the directed graph against the state machine of a standard protocol with the subgraph isomorphism algorithm; if the match succeeds, the protocol state machine is subgraph-isomorphic to the standard state machine, i.e. the captured protocol belongs to that standard protocol.

Step 2.1: the graphs to be checked for isomorphism or subgraph isomorphism are G1 and G2, the initial state of the algorithm is s0, and the initial mapping M(s0) is the empty set.

Step 2.2: judging whether the mapping M(s) associated with the current state s contains all nodes of G2, in which case the current partial mapping M(s) satisfies the isomorphism; if so, M(s) is output and the algorithm ends; otherwise, the procedure goes to step 2.3;

Step 2.3: computing the candidate node pair set P(s) of the graphs G1 and G2, and selecting a node pair (n, m) from P(s), where n belongs to G1 and m belongs to G2;

Step 2.4: traversing the node pairs in P(s) and, depending on whether the feasibility rule F(s, n, m) of the current state is true, deciding whether to add the node pair (n, m) to the mapping and update s to s';

(1) if F(s, n, m) returns true, the partial mapping M(s') of the new state s' still satisfies the isomorphism after the node pair (n, m) is added to the current state s;

(2) if F(s, n, m) returns false, (n, m) should not be added to the current state, which prunes the search;

the consistency check function F(s, n, m) is implemented by five rules: a look-ahead node check rule, a successor node check rule, an adding node rule, a deleting node rule and a new state check rule;

after the node pairs in P(s) have been traversed, the procedure jumps back to step 2.2, which decides whether the algorithm ends or continues; whether G1 and G2 are isomorphic or subgraph-isomorphic is verified by whether the mapping M(s) associated with the final state s contains all nodes of G2.
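The matching loop of steps 2.1 to 2.4 can be sketched, under simplifying assumptions, as a plain backtracking search. The sketch assumes that G2 is the state machine inferred from captured traffic and G1 the standard protocol state machine, represents each graph as a dictionary from node to successor set, checks only the edge consistency of each added pair (no pruning rules), and enumerates candidates in a fixed order instead of building P(s). All names and the example graphs are hypothetical.

```python
def find_subgraph_isomorphism(g1, g2):
    """Return a dict mapping every node of g2 to a node of g1, or None."""
    nodes2 = sorted(g2)

    def feasible(mapping, n, m):
        # Self-loops must agree.
        if (n in g1[n]) != (m in g2[m]):
            return False
        for m_prev, n_prev in mapping.items():
            # Edges from already matched nodes into the new pair must agree.
            if (n in g1[n_prev]) != (m in g2[m_prev]):
                return False
            # Edges from the new pair to already matched nodes must agree.
            if (n_prev in g1[n]) != (m_prev in g2[m]):
                return False
        return True

    def extend(mapping):
        if len(mapping) == len(nodes2):      # M(s) covers all nodes of G2
            return dict(mapping)
        m = nodes2[len(mapping)]
        used = set(mapping.values())
        for n in g1:
            if n not in used and feasible(mapping, n, m):
                mapping[m] = n               # move to successor state s'
                result = extend(mapping)
                if result is not None:
                    return result
                del mapping[m]               # backtrack
        return None

    return extend({})

def classify(private_graph, standard_graphs):
    """Name of the first standard graph that contains private_graph, else None."""
    for name, standard in standard_graphs.items():
        if find_subgraph_isomorphism(standard, private_graph) is not None:
            return name
    return None

standard_y = {0: {1}, 1: {2, 3}, 2: {0}, 3: set()}
standard_x = {0: {1}, 1: {0}}
private = {"a": {"b"}, "b": {"c"}, "c": set()}
```

In this example the private graph is a three-state chain; it matches the part 0, 1, 3 of `standard_y` but no part of `standard_x`, so `classify` would report `standard_y`.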

Further, the basic idea of alignment and similarity calculation by the sequence alignment algorithm in step 1.2 of stage one is: scores are used to judge how well two input character strings match; a negative score is given if the two characters at corresponding positions differ, a positive score if they are the same, and the total score of the two inputs is finally summed;

the scoring rules for match detection with the sequence alignment algorithm are as follows, the algorithm defining three cases, MATCH, DISMATCH and INDEL:

two basic fields match (MATCH): a positive score is given;

two basic fields do not match (DISMATCH): a negative score is given;

the element at the corresponding position of one sequence is empty (INDEL): a negative score is also given.

Based on these penalty rules, a scoring matrix is constructed by the following formula:

message alignment is carried out on the unknown protocol, and clustering puts messages of the same type into the same class, so that the state machine can then be inferred from the message sequences;

Further, the hierarchical clustering in step 2.2 of stage one is performed by a pairwise comparison clustering method. To measure the distance between the clusters produced, a similarity measure criterion must be determined; assuming $C_i$ and $C_j$ are two clusters, the distance between $C_i$ and $C_j$ in the hierarchical clustering algorithm according to the similarity measure criterion is:

Step 2.2.1: Initialization

In the initial stage, each sequence forms a class of its own and serves as a leaf node of the tree.

Step 2.2.2: Generating the system spanning tree

The two closest sequences, as computed by the similarity measure criterion, are merged to define a new node, and this process is repeated until all sequences have been added, yielding the system spanning tree.
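The two steps above can be sketched as a small agglomerative procedure. Because the cluster distance formula is not reproduced in this text, the sketch assumes average linkage over pairwise matching scores (a higher score meaning closer clusters); this choice, the function names and the score values are assumptions made for illustration.

```python
from itertools import combinations

def average_similarity(ci, cj, scores):
    """Mean pairwise matching score between two clusters (assumed criterion)."""
    return sum(scores[a][b] for a in ci for b in cj) / (len(ci) * len(cj))

def build_tree(scores):
    """Merge the two closest clusters repeatedly; return the tree as nested tuples."""
    # Step 2.2.1: every sequence starts as its own cluster and leaf node.
    clusters = [((i,), i) for i in range(len(scores))]
    # Step 2.2.2: merge until a single tree remains.
    while len(clusters) > 1:
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(
                clusters[p[0]][0], clusters[p[1]][0], scores),
        )
        merged = (clusters[x][0] + clusters[y][0],
                  (clusters[x][1], clusters[y][1]))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return clusters[0][1]

# Hypothetical symmetric score table for three sequences (indices 0..2).
scores = [[0, 7, 15],
          [7, 0, 9],
          [15, 9, 0]]
tree = build_tree(scores)
```

Sequences 0 and 2 have the highest mutual score and are merged first; sequence 1 joins last. Message classes would then be obtained by cutting the tree at a chosen similarity threshold, which this sketch leaves out.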

Further, the state machine simplification method described in step 4.3 of stage one performs state simplification on a red-blue node framework; the purpose of marking states as red or blue nodes is to determine the merging order. That is, all nodes are divided into two categories, red nodes and blue nodes, and the specific simplification steps are as follows:

(1) Starting from the root node of the APTA, the root node is marked red, all child nodes of the root node are marked blue, and the remaining nodes are left unmarked;

(2) An attempt is made to merge a red node and a blue node; merging is possible if the labels assigned to both in the state labeling stage are the same. When merging, all subtrees of the blue node are first traversed and added to the subtree set of the red node, and all newly added subtrees are treated as candidates and marked blue. If merging fails because the labels differ, the blue node is promoted to a red node and all of its subtrees are marked blue.

(3) Steps (1) and (2) are repeated for all blue nodes that have not been merged. During processing, a blue node must be compared with all red nodes to decide whether it can be merged; if not, the blue node is promoted to a red node.

(4) After all blue nodes have been merged or promoted, the resulting state transition diagram is the result of state machine simplification and is a minimized DFA.
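A simplified sketch of steps (1) to (4) follows. It merges a blue node into a red node whenever their labels are equal and otherwise promotes it, exactly as described, but it omits the compatibility checks and subtree folding that a complete red-blue implementation would need, so it does not by itself guarantee a deterministic or minimal automaton. The `Node` class and the labels are hypothetical.

```python
from collections import deque

class Node:
    """Labeled prefix tree state."""
    def __init__(self, label):
        self.label = label
        self.children = {}   # message type -> child Node

def simplify(root):
    """Return (labels of red states, set of (source, message, target) transitions)."""
    reds = [root.label]                      # (1) the root is red
    transitions = set()
    blue = deque((0, msg, child) for msg, child in root.children.items())
    while blue:                              # (3) repeat for all blue nodes
        parent, msg, node = blue.popleft()
        if node.label in reds:               # (2) same label: merge into red
            target = reds.index(node.label)
        else:                                # different label: promote to red
            reds.append(node.label)
            target = len(reds) - 1
        transitions.add((parent, msg, target))
        # The subtrees of the processed node become blue candidates.
        blue.extend((target, m, c) for m, c in node.children.items())
    return reds, transitions                 # (4) simplified transition graph

# Hypothetical labeled tree: A -x-> B -y-> A -x-> B
root = Node("A")
n1 = root.children["x"] = Node("B")
n2 = n1.children["y"] = Node("A")
n3 = n2.children["x"] = Node("B")
reds, transitions = simplify(root)
```

The four tree states collapse into two red states connected by a loop, which is the kind of compact graph that is later handed to the matching stage.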

Further, the construction rule of the candidate point pair set P(s) described in stage two step 2.3 is as follows:

The out-terminal set $T_1^{out}(s)$ is the set of nodes of $G_1$ that are not in $M_1(s)$ but are successors of nodes in $M_1(s)$, where $M_1(s)$ is the part of the mapping M(s) of state s that is associated with G1. The in-terminal set $T_1^{in}(s)$ is the set of nodes that are not in $M_1(s)$ but are predecessors of nodes in $M_1(s)$. $T_2^{out}(s)$ and $T_2^{in}(s)$ are defined in the same way, where $M_2(s)$ is the part of the mapping M(s) of state s that is associated with G2;

(1) If $T_1^{out}(s)$ and $T_2^{out}(s)$ are both non-empty, $P(s) = T_1^{out}(s) \times \{\min T_2^{out}(s)\}$, where $\min T_2^{out}(s)$ denotes the node of $T_2^{out}(s)$ with the smallest label (any ordering may be used);

(2) If $T_1^{out}(s)$ and $T_2^{out}(s)$ are both empty, and $T_1^{in}(s)$ and $T_2^{in}(s)$ are both non-empty, $P(s) = T_1^{in}(s) \times \{\min T_2^{in}(s)\}$;

(3) If all four terminal sets are empty, $P(s) = (V_1 - M_1(s)) \times \{\min(V_2 - M_2(s))\}$;

(4) When only one of the in-terminal sets or only one of the out-terminal sets is empty, it can be shown that state s cannot lead to the final isomorphism, and therefore state s need not be analyzed any further.
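The four construction rules can be written down directly. In the sketch below a graph is a dictionary from node to successor set in which every node appears as a key, and `m1` and `m2` are the sets of nodes of G1 and G2 already covered by M(s); the function names and the example graphs are hypothetical.

```python
def predecessors(succ):
    """Invert a successor dictionary (every node must be a key of succ)."""
    pred = {u: set() for u in succ}
    for u, targets in succ.items():
        for v in targets:
            pred[v].add(u)
    return pred

def terminal_sets(succ, mapped):
    """Out- and in-terminal sets of one graph for the mapped node set."""
    pred = predecessors(succ)
    t_out = {v for u in mapped for v in succ[u]} - mapped
    t_in = {v for u in mapped for v in pred[u]} - mapped
    return t_out, t_in

def candidate_pairs(succ1, succ2, m1, m2):
    """Candidate pair set P(s) following rules (1) to (4)."""
    out1, in1 = terminal_sets(succ1, m1)
    out2, in2 = terminal_sets(succ2, m2)
    if out1 and out2:                                    # rule (1)
        return {(n, min(out2)) for n in out1}
    if not out1 and not out2 and in1 and in2:            # rule (2)
        return {(n, min(in2)) for n in in1}
    if not (out1 or out2 or in1 or in2):                 # rule (3)
        rest1, rest2 = set(succ1) - m1, set(succ2) - m2
        return {(n, min(rest2)) for n in rest1} if rest2 else set()
    return set()                                         # rule (4): prune

g1 = {1: {2}, 2: {3}, 3: set()}
g2 = {"a": {"b"}, "b": {"c"}, "c": set()}
```

For the empty mapping all terminal sets are empty, so rule (3) pairs every node of G1 with the smallest node of G2; once nodes 1 and `a` are mapped, rule (1) yields the single candidate `(2, "b")`.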

Further, the look-ahead node check rule in step 2.4 of stage two: checks the consistency of the partial solution obtained by adding the candidate pair (n, m) under consideration to the current partial solution M(s).

Successor node check rule: checks the consistency of the partial solution obtained after adding the candidate pair (n, m) under consideration to the current partial solution M(s).

Adding node rule: prunes the search tree while node pairs are being added, improving search efficiency.

Deleting node rule: prunes the search tree while node pairs are being deleted, improving search efficiency.

New state check rule: prunes the search tree while the newly generated mapping is being checked, improving search efficiency.

The technical scheme of the invention has the following advantages:

By combining reverse generation of the protocol state machine with subgraph isomorphism matching, the invention realizes a private network protocol classification method based on state machine subgraph isomorphism matching, enables private protocols defined by Internet of Things device manufacturers to be classified and identified, and helps a device protocol analyst quickly locate the standard protocol that a network device manufacturer referred to when designing its self-defined private protocol, thereby assisting further analysis of the protocol.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 shows the alignment process of building the scoring matrix in an embodiment of the present invention.

FIG. 3 shows the result obtained from the scoring matrix by following the backtracking path in an embodiment of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention discloses a private protocol identification method based on isomorphic matching of a protocol state machine sub-graph, as shown in figure 1, the method comprises the following steps:

stage one: protocol state machine generation phase

Step 1: message alignment

Step 1.1: The captured device network traffic is parsed, and the binary sequence of the protocol application layer part is extracted.
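As a hedged illustration of this step, the sketch below strips the link, network and transport headers from a single captured frame. It assumes Ethernet II framing and IPv4 carrying TCP or UDP, none of which is stated in the text, and the function name `application_payload` is hypothetical; reading frames from a capture file and reassembling TCP streams are left out.

```python
import struct

def application_payload(frame: bytes) -> bytes:
    """Return the TCP/UDP payload of an Ethernet II + IPv4 frame, else b""."""
    if len(frame) < 14 + 20:
        return b""
    ethertype = struct.unpack("!H", frame[12:14])[0]
    if ethertype != 0x0800:               # only IPv4 is handled in this sketch
        return b""
    ip = frame[14:]
    header_len = (ip[0] & 0x0F) * 4       # IPv4 header length in bytes
    total_len = struct.unpack("!H", ip[2:4])[0]
    protocol = ip[9]
    segment = ip[header_len:total_len]    # drops any Ethernet padding
    if protocol == 6:                     # TCP: variable-length header
        if len(segment) < 20:
            return b""
        data_offset = (segment[12] >> 4) * 4
        return segment[data_offset:]
    if protocol == 17:                    # UDP: fixed 8-byte header
        return segment[8:]
    return b""
```

The returned bytes are the binary sequence that the alignment step operates on.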

Step 1.2: The protocol field contents are parsed from the binary sequence with a sequence alignment algorithm to obtain aligned similar protocol fields, which serve as key field candidates.

The basic idea of alignment and similarity calculation by the sequence alignment algorithm is as follows: scores are used to judge how well two input character strings match; a negative score is given if the two characters at corresponding positions differ, a positive score if they are the same, and the total score of the two inputs is finally summed.

Two basic fields match (MATCH): score +1;

two basic fields do not match (DISMATCH): score -1;

the element at the corresponding position of one sequence is empty (INDEL): score -1.

Based on the penalty rules, a scoring matrix is constructed by the following formula. When two sequences are aligned, a two-dimensional matrix is built: row 0 and column 0 of the matrix are first filled with the characters of S1 and S2 respectively, i and j denote the row and column index of the matrix, and indel, match and mismatch are the score changes:

For example, take the sequence S1 = GGATCGA and the sequence S2 = GAATTCAGTTA, where "-" denotes a space. A two-dimensional matrix is built by first filling row 0 and column 0 with S1 and S2 respectively; when filling, two character positions are left reserved, and cell (1,1) is set to 0. Each subsequent cell in the vertical and horizontal direction then adds one INDEL penalty to the previous cell, the rest of the scoring matrix is computed with the formula, and backtracking finally starts from the bottom-right corner of the matrix and proceeds upward. A step to the upper left corresponds to MATCH/DISMATCH; a step to the left indicates an INDEL for S1 and a step upward an INDEL for S2, written as '-'. The result shown in FIG. 2 is a globally optimal solution, and the bottom-right cell of the matrix gives the similarity of the two sequences. Following the backtracking path finally yields the result shown in FIG. 3.
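The matrix construction and backtracking can be sketched as below. Since the scoring formula itself is not reproduced in this text, the sketch assumes the standard global alignment recurrence F(i, j) = max(F(i-1, j-1) + match or mismatch, F(i-1, j) + indel, F(i, j-1) + indel) with the scores +1, -1, -1 given above. The matrix here holds scores only, with S1 running along the columns and S2 along the rows; the function name `align` is hypothetical.

```python
MATCH, DISMATCH, INDEL = 1, -1, -1

def align(s1, s2):
    """Globally align s1 and s2; return (score, aligned s1, aligned s2)."""
    rows, cols = len(s2) + 1, len(s1) + 1
    f = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):                 # first row: only INDELs
        f[0][j] = f[0][j - 1] + INDEL
    for i in range(1, rows):                 # first column: only INDELs
        f[i][0] = f[i - 1][0] + INDEL
    for i in range(1, rows):
        for j in range(1, cols):
            step = MATCH if s1[j - 1] == s2[i - 1] else DISMATCH
            f[i][j] = max(f[i - 1][j - 1] + step,
                          f[i - 1][j] + INDEL,
                          f[i][j - 1] + INDEL)
    # Backtrack from the bottom-right corner to the top-left corner.
    a1, a2 = [], []
    i, j = rows - 1, cols - 1
    while i > 0 or j > 0:
        step = MATCH if i > 0 and j > 0 and s1[j - 1] == s2[i - 1] else DISMATCH
        if i > 0 and j > 0 and f[i][j] == f[i - 1][j - 1] + step:
            a1.append(s1[j - 1]); a2.append(s2[i - 1]); i -= 1; j -= 1
        elif i > 0 and f[i][j] == f[i - 1][j] + INDEL:
            a1.append("-"); a2.append(s2[i - 1]); i -= 1   # gap placed in s1
        else:
            a1.append(s1[j - 1]); a2.append("-"); j -= 1   # gap placed in s2
    return f[-1][-1], "".join(reversed(a1)), "".join(reversed(a2))

score, aligned1, aligned2 = align("GGATCGA", "GAATTCAGTTA")
```

The two returned strings have equal length, and the score in the bottom-right cell equals the sum of the per-column scores of the alignment.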

Step 2: Message clustering

The step further utilizes a clustering algorithm to realize the merging of similar fields to obtain the proprietary protocol message type and the message format.

Step 2.1: establishing a matching score table

For each message sequence, a matching score with other message sequences is calculated by the method of step 1.2, and a matching score table is established for each message sequence.

The matching score table is shown in Table 1. The message sequence with the highest score is then selected from the table to construct a triple $\langle S_i, S_j, H_{ij} \rangle$ $(i \neq j)$, where $S_j$ is the message sequence with the highest matching score against the message sequence $S_i$ and $H_{ij}$ is the matching score of the two sequences. As can be seen from the table, the sequence with the highest matching score against $S_1$ is $S_3$, with a score of 15, so the triple is $\langle S_1, S_3, 15 \rangle$. The triple of each message sequence is then determined in the same way.

Table 1: Example matching scores between sequences
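The construction of the triples from a score table can be sketched as follows. The table values are hypothetical, chosen only so that the best match of the first sequence is the third sequence with score 15 as stated above; indices are zero-based, so index 0 stands for S1.

```python
def best_match_triples(scores):
    """For each sequence i, return (i, j, H_ij) for its best-scoring partner j != i."""
    triples = []
    for i, row in enumerate(scores):
        candidates = [(h, j) for j, h in enumerate(row) if j != i]
        h, j = max(candidates, key=lambda c: c[0])
        triples.append((i, j, h))
    return triples

# Hypothetical matching score table; the diagonal is ignored.
scores = [[0, 7, 15],
          [7, 0, 9],
          [15, 9, 0]]
triples = best_match_triples(scores)
```

The first triple pairs sequence 0 with sequence 2 at score 15, corresponding to the triple for S1 and S3 in the text.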

Step 2.2: hierarchical clustering

The message sequence with the highest score is selected from the matching score table together with the matching score of the two sequences, and the two messages are assigned to the same class. The triple of each message sequence is determined in this way, and clustering is performed with a hierarchical clustering algorithm. The hierarchical clustering uses a pairwise comparison clustering method. To measure the distance between the clusters produced, a similarity measure criterion must be determined. Assuming $C_i$ and $C_j$ are two clusters, the distance between $C_i$ and $C_j$ in the hierarchical clustering algorithm according to the similarity measure criterion is:

Step 2.2.1: Initialization

In the initial stage, each sequence forms a class of its own and serves as a leaf node of the tree.

Step 2.2.2: generating a system spanning tree

Step 3: Key field extraction

Key fields are extracted as follows: the distribution of each field within each class produced by clustering is counted by a probability distribution statistical method, the variance of the distribution is calculated, and the field with the minimum variance is determined to be the message key field, which identifies the message type produced by clustering.
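One possible reading of this step is sketched below. It treats every byte offset as a candidate field, computes the variance of the byte values within each class, and picks the offset with the smallest mean within-class variance. The additional requirement that the dominant value differ between classes is an assumption added here so that a constant header byte is not selected; it is not stated in the text. The function name and the sample messages are hypothetical.

```python
from collections import Counter
from statistics import mean, pvariance

def key_field_offset(clusters):
    """Return the byte offset chosen as key field, or None if none qualifies."""
    width = min(len(msg) for cluster in clusters for msg in cluster)
    best_offset, best_var = None, None
    for offset in range(width):
        columns = [[msg[offset] for msg in cluster] for cluster in clusters]
        dominant = {Counter(col).most_common(1)[0][0] for col in columns}
        if len(dominant) < len(clusters):
            continue   # assumption: a key field must separate the classes
        var = mean(pvariance(col) for col in columns)
        if best_var is None or var < best_var:
            best_offset, best_var = offset, var
    return best_offset

# Hypothetical classes: byte 0 is a constant magic value, byte 1 the type code.
clusters = [[b"\xaa\x01\x10", b"\xaa\x01\x20"],
            [b"\xaa\x02\x30", b"\xaa\x02\x31"]]
```

Here offset 1 is selected: it is constant within each class and differs between the classes, whereas offset 0 never changes and offset 2 varies inside each class.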

Step 4: Protocol state machine inference

The purpose of State Machine inference is to infer a Finite State Machine (FSM) that recognizes different message classes and fully describes the states of a protocol session and the conditions and paths of State transitions.

Step 4.1: constructing augmented prefix trees

Step 4.2: State labeling

The purpose of state labeling is to prevent states from being merged excessively. State labeling extracts a prerequisite model from the observed sessions, which can be expressed by a regular expression of the following form:

$.^{*}\,(r_1 \mid \dots \mid r_k)\,(a_1 \mid \dots \mid a_j)^{*}$

In the formula, "." denotes a message of any type and "*" denotes any number of repetitions of the preceding part. A prerequisite states that, before executing a message sequence m, the server only needs to receive a message of any one of the types $r_1, \dots, r_k$ in order to reach the state that accepts the message sequence m.

Once all the prerequisites for a message sequence are expressed, all the states in the augmented prefix tree are labeled.

The label for each state is the set of all allowed input information types for that state.

Step 4.3: state machine simplification

The state machine simplification method performs state simplification on a red-blue node framework, i.e. all nodes are divided into two categories, red nodes and blue nodes. The specific simplification steps are as follows:

(1) Starting from the root node of the APTA, the root node is marked red, all child nodes of the root node are marked blue, and the remaining nodes are left unmarked.

(2) An attempt is made to merge a red node and a blue node; merging is possible if the labels assigned to both in the state labeling stage are the same. When merging, all subtrees of the blue node are first traversed and added to the subtree set of the red node, and all newly added subtrees are treated as candidates and marked blue. If merging fails because the labels differ, the blue node is promoted to a red node and all of its subtrees are marked blue.

(3) Steps (1) and (2) are repeated for all blue nodes that have not been merged. During processing, a blue node must be compared with all red nodes to decide whether it can be merged; if not, the blue node is promoted to a red node.

(4) After all blue nodes have been merged or promoted, the resulting state transition diagram is the result of state machine simplification and is a minimized DFA.

Stage two: state machine matching classification phase

If a subgraph of G1 and a subgraph of G2 are in bijection, the two subgraphs are said to be isomorphic.

The subgraph isomorphism matching algorithm uses a State Space Representation (SSR) of the matching process, allows syntactic and semantic comparison of the node pairs to be matched at the same time, and prunes the search tree with feasibility rules. A state space expression consists of a state equation and an output equation; it describes the controlled system completely in the state space, reflects the changes of all independent variables of the system, thereby determines all internal motion states of the system, and handles initial conditions conveniently. With the state space representation, each state s of the matching process is associated with a partial mapping M(s), which contains only a subset of the final mapping M. For the two graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ being matched, the parts made up of the nodes associated with the partial mapping M(s) are denoted G1(s) and G2(s). By the definition used in the algorithm, the transition from a state s to a successor state s' adds a pair of matched nodes (n, m) to the partial graphs associated with s, followed by a consistency check that produces s'. The algorithm introduces a set of rules for this consistency check; it can be proved that, for both graph isomorphism and graph-subgraph isomorphism, the partial graphs G1(s) and G2(s) associated with M(s) are isomorphic once the consistency conditions are satisfied. In addition, a set of k-look-ahead rules is added, which detect in advance whether a consistent state s has no consistent successor state after k steps; this further reduces the number of states generated and greatly lowers the time complexity of the algorithm.

Step 2: Match the generated protocol state machine against the standard state machine of a known protocol using the subgraph isomorphism algorithm. If the matching succeeds, the protocol state machine is subgraph isomorphic to the standard state machine, that is, the captured protocol belongs to that protocol.

Step 2.2: Determine whether the current state s covers all points of G2; such a state indicates that the current partial mapping M(s) satisfies isomorphism. If the mapping M(s) associated with the current state s contains all points in G2, output M(s) and end the algorithm. Otherwise, go to step 2.3.

Step 2.3: Compute the candidate point pair set P(s) of the graphs G1 and G2 to be verified, and select a point pair (n, m) from P(s), where n belongs to G1 and m belongs to G2.

The construction rule of the candidate point pair set P(s) is as follows:

Define the out-terminal set $T_1^{out}(s)$ as the set of nodes of G1 that are not in $M_1(s)$ but are successors of nodes in $M_1(s)$. Define the in-terminal set $T_1^{in}(s)$ as the set of nodes of G1 that are not in $M_1(s)$ but are predecessors of nodes in $M_1(s)$. $T_2^{out}(s)$ and $T_2^{in}(s)$ are defined in the same way for G2.

(1) If $T_1^{out}(s)$ and $T_2^{out}(s)$ are both non-empty, $P(s) = T_1^{out}(s) \times \{\min T_2^{out}(s)\}$, where $\min T_2^{out}(s)$ denotes the node with the smallest label in $T_2^{out}(s)$ (under any ordering method).

(2) If $T_1^{out}(s)$ and $T_2^{out}(s)$ are both empty, and $T_1^{in}(s)$ and $T_2^{in}(s)$ are both non-empty, $P(s) = T_1^{in}(s) \times \{\min T_2^{in}(s)\}$.

(3) If all four terminal sets are empty, $P(s) = (V_1 - M_1(s)) \times \{\min(V_2 - M_2(s))\}$.

(4) When only one of the in-terminal sets or only one of the out-terminal sets is empty, it can be shown that state s cannot lead to a final isomorphism, and therefore state s does not need to be analyzed any further.
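The construction rules above can be sketched in Python as follows. This is a minimal illustration that applies the rules literally as stated, not the patent's implementation; the helper names `build_graph`, `terminal_sets` and `candidate_pairs`, and the representation of a graph as a `(nodes, succ, pred)` triple, are assumptions made for this example.

```python
def build_graph(nodes, edges):
    """Return (node set, successor map, predecessor map) for a directed graph."""
    succ = {v: set() for v in nodes}
    pred = {v: set() for v in nodes}
    for u, v in edges:
        succ[u].add(v)
        pred[v].add(u)
    return set(nodes), succ, pred


def terminal_sets(succ, pred, mapped):
    """Return (T_out, T_in): unmapped successors / predecessors of mapped nodes."""
    t_out = {v for u in mapped for v in succ[u]} - mapped
    t_in = {v for u in mapped for v in pred[u]} - mapped
    return t_out, t_in


def candidate_pairs(g1, g2, m1, m2):
    """Build P(s); m1 and m2 are the nodes of G1 and G2 already in M(s)."""
    nodes1, succ1, pred1 = g1
    nodes2, succ2, pred2 = g2
    t1_out, t1_in = terminal_sets(succ1, pred1, m1)
    t2_out, t2_in = terminal_sets(succ2, pred2, m2)

    # Both out-terminal sets non-empty: pair T1_out with the smallest node of T2_out.
    if t1_out and t2_out:
        return {(n, min(t2_out)) for n in t1_out}
    if not t1_out and not t2_out:
        # Both out-terminal sets empty, both in-terminal sets non-empty.
        if t1_in and t2_in:
            return {(n, min(t2_in)) for n in t1_in}
        # All four terminal sets empty: fall back to all unmapped nodes.
        if not t1_in and not t2_in:
            rest2 = nodes2 - m2
            if not rest2:
                return set()
            return {(n, min(rest2)) for n in nodes1 - m1}
    # Exactly one set of a pair is empty: the state is abandoned.
    return set()
```

For a path 0 → 1 → 2 as G1 and a single edge a → b as G2, the empty mapping yields the pairs (0, a), (1, a) and (2, a); once 0 and a are mapped, the only candidate is (1, b).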

Step 2.4: Traverse the point pairs in P(s). For each point pair (n, m), decide whether to add it to the mapping and update s to s' according to whether the feasibility rule F(s, n, m) holds for the current state.

(1) If the return value of F(s, n, m) is true, the partial mapping M(s') of the new state s' still satisfies isomorphism after the point pair (n, m) is added to the current state s;

(2) If the return value of F(s, n, m) is false, the point pair (n, m) should not be added to the current state, which prunes the search.

The consistency check function F(s, n, m) is implemented through a look-ahead node check rule, a successor node check rule, an adding-node rule, a deleting-node rule and a new-state check rule.

Look-ahead node check rule: checks the consistency of the partial solution obtained by adding the candidate pair (n, m) under consideration to the current partial solution M(s).

Deleting-node rule: used to prune the search tree in the process of deleting point pairs, which improves search efficiency.

The look-ahead node check rule, the successor node check rule, the adding-node rule, the deleting-node rule and the new-state check rule are denoted R_pred, R_succ, R_in, R_out and R_new, respectively, and are implemented as follows:

Let $T_1(s) = T_1^{in}(s) \cup T_1^{out}(s)$ and $\tilde{N}_1(s) = V_1 - M_1(s) - T_1(s)$; $T_2(s)$ and $\tilde{N}_2(s)$ are defined in the same way. Then:
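Assuming the five rules follow the standard VF2 formulation for graph-subgraph matching, they can be sketched as below. This is an assumption rather than a quotation of the patent's formulas. The symbols Pred(G, v) and Succ(G, v), the predecessor and successor sets of node v in G, are introduced here for illustration, with $T_i(s) = T_i^{in}(s) \cup T_i^{out}(s)$ and $\tilde{N}_i(s) = V_i - M_i(s) - T_i(s)$. For exact isomorphism, the $\ge$ comparisons become equalities.

```latex
\begin{align*}
F(s,n,m) \iff{}& R_{pred} \wedge R_{succ} \wedge R_{in} \wedge R_{out} \wedge R_{new} \\
R_{pred}(s,n,m) \iff{}& \bigl(\forall n' \in M_1(s) \cap \mathrm{Pred}(G_1,n)\;
      \exists m' \in \mathrm{Pred}(G_2,m) : (n',m') \in M(s)\bigr) \\
  \wedge{}& \bigl(\forall m' \in M_2(s) \cap \mathrm{Pred}(G_2,m)\;
      \exists n' \in \mathrm{Pred}(G_1,n) : (n',m') \in M(s)\bigr) \\
R_{succ}(s,n,m) \iff{}& \bigl(\forall n' \in M_1(s) \cap \mathrm{Succ}(G_1,n)\;
      \exists m' \in \mathrm{Succ}(G_2,m) : (n',m') \in M(s)\bigr) \\
  \wedge{}& \bigl(\forall m' \in M_2(s) \cap \mathrm{Succ}(G_2,m)\;
      \exists n' \in \mathrm{Succ}(G_1,n) : (n',m') \in M(s)\bigr) \\
R_{in}(s,n,m) \iff{}& |\mathrm{Succ}(G_1,n) \cap T_1^{in}(s)| \ge |\mathrm{Succ}(G_2,m) \cap T_2^{in}(s)| \\
  \wedge{}& |\mathrm{Pred}(G_1,n) \cap T_1^{in}(s)| \ge |\mathrm{Pred}(G_2,m) \cap T_2^{in}(s)| \\
R_{out}(s,n,m) \iff{}& |\mathrm{Succ}(G_1,n) \cap T_1^{out}(s)| \ge |\mathrm{Succ}(G_2,m) \cap T_2^{out}(s)| \\
  \wedge{}& |\mathrm{Pred}(G_1,n) \cap T_1^{out}(s)| \ge |\mathrm{Pred}(G_2,m) \cap T_2^{out}(s)| \\
R_{new}(s,n,m) \iff{}& |\tilde{N}_1(s) \cap \mathrm{Pred}(G_1,n)| \ge |\tilde{N}_2(s) \cap \mathrm{Pred}(G_2,m)| \\
  \wedge{}& |\tilde{N}_1(s) \cap \mathrm{Succ}(G_1,n)| \ge |\tilde{N}_2(s) \cap \mathrm{Succ}(G_2,m)|
\end{align*}
```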

After the point pairs in P(s) have been traversed, jump to step 2.2, where it is decided whether the algorithm ends or continues. Whether G1 and G2 are isomorphic or subgraph isomorphic is verified by checking whether the mapping M(s) associated with the final state s contains all points in G2.
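The loop formed by steps 2.2 to 2.4 can be illustrated with a simplified backtracking matcher. This sketch is an assumption-laden simplification, not the patent's algorithm. It picks candidates in plain sorted order instead of using the terminal sets, and applies only an edge-consistency check in the spirit of R_pred and R_succ. It ignores transition labels and omits the look-ahead pruning rules. The function name `subgraph_match` and the edge-list input format are hypothetical.

```python
def subgraph_match(g1_nodes, g1_edges, g2_nodes, g2_edges):
    """Return a mapping {m: n} that embeds G2 into G1, or None if none exists."""
    e1, e2 = set(g1_edges), set(g2_edges)
    order = sorted(g2_nodes)          # fixed order in which G2 nodes are matched
    targets = sorted(g1_nodes)

    def feasible(mapping, n, m):
        # Edges between (n, m) and every already mapped pair must agree
        # in both graphs and in both directions.
        for m2, n2 in mapping.items():
            if ((m, m2) in e2) != ((n, n2) in e1):
                return False
            if ((m2, m) in e2) != ((n2, n) in e1):
                return False
        # Self-loops must agree as well.
        return ((m, m) in e2) == ((n, n) in e1)

    def search(mapping):
        # Step 2.2: stop when M(s) covers every point of G2.
        if len(mapping) == len(order):
            return dict(mapping)
        m = order[len(mapping)]
        used = set(mapping.values())
        # Steps 2.3 and 2.4: try each candidate pair (n, m) and keep it
        # only if the feasibility check passes.
        for n in targets:
            if n not in used and feasible(mapping, n, m):
                mapping[m] = n
                result = search(mapping)
                if result is not None:
                    return result
                del mapping[m]        # backtrack
        return None

    return search({})
```

With G1 as the path 0 → 1 → 2 → 3 and G2 as the path a → b → c, the sketch returns the mapping a→0, b→1, c→2; if G2 is instead a directed triangle, no embedding exists and the result is `None`.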

Claims

1. A private network protocol classification method based on state machine subgraph isomorphic matching is characterized by comprising the following steps:

Stage one: protocol state machine generation phase

Step 1: message alignment

Step 1.2: analyzing the content of the protocol fields from the binary sequence by using a sequence alignment algorithm to obtain the aligned similar protocol fields as candidate key fields;

Step 2: message clustering

Step 2.1: establishing a matching score table

Step 2.2: hierarchical clustering

Selecting from the matching score table the pair of message sequences with the highest matching score and grouping the two messages into one class, then examining the triple of each message sequence in the same way, and clustering by using a hierarchical clustering algorithm;

Step 3: key field extraction

Extracting key fields: counting the distribution of each field within each class produced by clustering by means of probability distribution statistics, calculating the variance of each distribution, and taking the field with the minimum variance as the message key field, which identifies the message type of the class produced by clustering;

Step 4: protocol state machine inference

The purpose of state machine inference is to infer a finite-state automaton that can recognize the different message classes and completely describe the states of a protocol session together with the conditions and paths of the state transitions;

Step 4.1: constructing an augmented prefix tree

Constructing an augmented prefix tree from the message sequences and message types of the sessions in the input session set, based on the results of message clustering and key field extraction;

Step 4.2: state labeling

State labeling extracts a precondition model from the observed sessions, expressed as a regular expression of the following form:

.*(r₁|…|r_k)(a₁|…|a_j)*, (r, a₁, …, a_j ∈ M)

where "." stands for a message of any type and "*" stands for any number of repetitions of the preceding part; states with such a model are said to have a precondition. r₁, …, r_k denote the message types that must be received, and a₁, …, a_j denote the message types that may be received after r. A precondition indicates that, before processing a message sequence q, the server only needs to have received a message of any one of the types r₁, …, r_k in order to reach the state that accepts q;

Step 4.3: state machine simplification

Stage two: state machine matching and classification phase

G1 and G2 are said to be isomorphic if and only if the mapping M is a bijection and the corresponding edges are also in bijection, where V1 denotes the set of points in G1, V2 denotes the set of points in G2, E1 denotes the set of edges in G1, and E2 denotes the set of edges in G2;

if a subgraph of G1 is in bijection with a subgraph of G2, G1 and G2 are said to be subgraph isomorphic;

Step 2: matching the generated protocol state machine against the standard state machine of a known protocol by using the subgraph isomorphism algorithm; if the matching succeeds, the protocol state machine is subgraph isomorphic to the standard state machine, that is, the captured protocol belongs to that protocol;

Step 2.1: the graphs to be checked for isomorphism or subgraph isomorphism are G1 and G2, the initial state of the algorithm is s0, and the initial isomorphism mapping M(s0) is the empty set;

Step 2.2: determining whether the current state s covers all points of G2, such a state indicating that the current partial mapping M(s) satisfies isomorphism; if the mapping M(s) associated with the current state s contains all points in G2, outputting M(s) and ending the algorithm; otherwise, going to step 2.3;

Step 2.4: traversing the point pairs in P(s) and, for each point pair (n, m), deciding whether to add it to the mapping and update s to s' according to whether the feasibility rule F(s, n, m) holds for the current state;

(1) if the return value of F(s, n, m) is true, the partial mapping M(s') of the new state s' still satisfies isomorphism after the point pair (n, m) is added to the current state s;

(2) if the return value of F(s, n, m) is false, the point pair (n, m) should not be added to the current state, which prunes the search;

after the point pairs in P(s) have been traversed, jumping to step 2.2, where it is decided whether the algorithm ends or continues; whether G1 and G2 are isomorphic or subgraph isomorphic is verified by checking whether the mapping M(s) associated with the final state s contains all points in G2.
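The prefix tree construction of step 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions: each session is already reduced to a list of message-type labels, and the "augmentation" is modeled as a visit counter plus an end-of-session flag per node. That choice, the function names and the dictionary layout are assumptions for this example, not details given in the text.

```python
def build_prefix_tree(sessions):
    """Build a prefix tree from sessions, each a list of message-type labels."""
    def new_node():
        # "count" and "end" are the assumed augmentation data.
        return {"children": {}, "count": 0, "end": False}

    root = new_node()
    for session in sessions:
        node = root
        node["count"] += 1
        for msg_type in session:
            # Reuse the child for a shared prefix, create it otherwise.
            if msg_type not in node["children"]:
                node["children"][msg_type] = new_node()
            node = node["children"][msg_type]
            node["count"] += 1
        node["end"] = True            # a session terminates in this state
    return root


def count_states(node):
    """Count the nodes (candidate states) in the tree."""
    return 1 + sum(count_states(c) for c in node["children"].values())
```

Sessions that share a prefix share the corresponding path, so the three sessions HELLO-AUTH-DATA, HELLO-AUTH-BYE and HELLO-BYE produce six states before any merging.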

2. The method for classifying the private network protocol based on state machine subgraph isomorphic matching according to claim 1, wherein the basic idea of the alignment and similarity calculation of the sequence alignment algorithm in step 1.2 of stage one is as follows: scores are used to judge the degree of matching between two input character strings; a negative score is given if the two characters at corresponding positions differ, a positive score is given if they are the same, and the total score of the two inputs is finally computed;

two basic fields match (MATCH): a positive score is obtained;

two basic fields do not match (DISMATCH): a negative score is obtained;

the element at the corresponding position of one sequence is empty (INDEL): a negative score is also obtained;

the messages of the unknown protocol are aligned and then clustered so that messages of the same type fall into the same class, in preparation for the next step of inferring the state machine from the message sequences.
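The scoring scheme of this claim can be illustrated with a global alignment score computed by dynamic programming. The sketch assumes a Needleman-Wunsch-style recurrence and the score values +1, -1 and -1, none of which are specified in the text; `alignment_score` is a hypothetical name.

```python
MATCH = 1       # assumed positive score for equal bytes
MISMATCH = -1   # assumed negative score (DISMATCH in the claim)
INDEL = -1      # assumed negative score for a gap


def alignment_score(a: bytes, b: bytes) -> int:
    """Return the best global alignment score of two byte sequences."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * INDEL for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * INDEL]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (MATCH if a[i - 1] == b[j - 1] else MISMATCH)
            up = prev[j] + INDEL        # gap in b
            left = cur[j - 1] + INDEL   # gap in a
            cur.append(max(diag, up, left))
        prev = cur
    return prev[-1]
```

Two identical three-byte messages score 3 under these values, while `b"abc"` against `b"ac"` scores 1 (two matches and one gap). Scores of this kind could populate the matching score table used for clustering.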

3. The method according to claim 1, wherein the hierarchical clustering in step 2.2 of stage one is performed by pairwise comparison clustering, and a similarity measure is defined for measuring the distance between the clusters produced by clustering; letting C_i and C_j be two clusters, the distance between C_i and C_j in the hierarchical clustering algorithm according to the similarity measure is:

step 2.2.1: initialization

In the initial stage, each sequence is divided into a class and is respectively used as a leaf node of a tree;

step 2.2.2: generating a system spanning tree
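The bottom-up procedure of this claim can be sketched as agglomerative clustering. Because the distance formula between C_i and C_j is not reproduced in this text, the sketch assumes average linkage (the mean pairwise distance between members). It also assumes a stopping threshold and returns flat clusters instead of a full tree. The names `hierarchical_cluster` and `threshold` are hypothetical.

```python
def hierarchical_cluster(dist, threshold):
    """Merge clusters bottom-up until the closest pair is farther than threshold.

    dist is a symmetric matrix (list of lists) of pairwise sequence distances.
    """
    # Initialization: every sequence starts as its own cluster (a leaf).
    clusters = [[i] for i in range(len(dist))]

    def linkage(ci, cj):
        # Assumed average linkage between two clusters.
        return sum(dist[a][b] for a in ci for b in cj) / (len(ci) * len(cj))

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        # Merge the closest pair into one cluster (an inner node of the tree).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For four sequences where 0 and 1 are close, 2 and 3 are close, and the two groups are far apart, a threshold between the two scales yields the clusters [0, 1] and [2, 3].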

4. The method according to claim 1, wherein the state machine simplification in step 4.3 of stage one performs state simplification on a red-blue node framework; the purpose of marking states as red or blue nodes is to determine the merging order, that is, to divide all nodes into two classes, red nodes and blue nodes; the specific simplification steps are as follows:

second, attempting to merge a red node and a blue node: if the two nodes received the same label in the state labeling stage, they can be merged; during merging, all subtrees of the blue node are first traversed and added to the subtree set of the red node, and all newly added subtrees are taken as candidates and marked blue; if merging fails because the labels differ, the blue node is promoted to a red node and all of its subtrees are marked blue.

Repeating steps 1 and 2 for all blue nodes that have not been merged; during this process, a blue node must be compared with all red nodes to determine whether it can be merged, and if it cannot be merged with any red node, it is promoted to a red node;

5. The method for classifying private network protocols based on isomorphic matching of state machine subgraphs according to claim 1, characterized in that the construction rules of the candidate point pair set P(s) in step 2.3 of stage two are as follows:

the out-terminal set $T_1^{out}(s)$ is the set of nodes of G1 that are not in $M_1(s)$ but are successors of nodes in $M_1(s)$; the in-terminal set $T_1^{in}(s)$ is the set of nodes of G1 that are not in $M_1(s)$ but are predecessors of nodes in $M_1(s)$; $T_2^{out}(s)$ and $T_2^{in}(s)$ are defined in the same way for G2;

$M_2(s)$ is the portion of the mapping M(s) corresponding to state s that is associated with G2;

(1) if $T_1^{out}(s)$ and $T_2^{out}(s)$ are both non-empty, $P(s) = T_1^{out}(s) \times \{\min T_2^{out}(s)\}$, where $\min T_2^{out}(s)$ denotes the node with the smallest label in $T_2^{out}(s)$;

(2) if $T_1^{out}(s)$ and $T_2^{out}(s)$ are both empty, and $T_1^{in}(s)$ and $T_2^{in}(s)$ are both non-empty, $P(s) = T_1^{in}(s) \times \{\min T_2^{in}(s)\}$;

(3) if all four terminal sets are empty, $P(s) = (V_1 - M_1(s)) \times \{\min(V_2 - M_2(s))\}$;

(4) when only one of the in-terminal sets or only one of the out-terminal sets is empty, it can be shown that state s cannot lead to a final isomorphism, and therefore state s does not need to be analyzed any further.

6. The method according to claim 1, wherein the rules used in step 2.4 of stage two are as follows: look-ahead node check rule: checks the consistency of the partial solution obtained by adding the candidate pair (n, m) under consideration to the current partial solution M(s);

successor node check rule: checks the consistency of the partial solution obtained after adding the candidate pair (n, m) under consideration to the current partial solution M(s);

adding-node rule: used to prune the search tree in the process of adding point pairs, which improves search efficiency;

deleting-node rule: used to prune the search tree in the process of deleting point pairs, which improves search efficiency;