CN112866229A - High-speed network traffic identification method and system based on state diagram - Google Patents

High-speed network traffic identification method and system based on state diagram

Info

Publication number
CN112866229A
Authority
CN
China
Prior art keywords
node
message
protocol
flow
identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110042132.0A
Other languages
Chinese (zh)
Other versions
CN112866229B (en)
Inventor
李振兴
陈曙晖
王飞
王小峰
庞立会
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202110042132.0A
Publication of CN112866229A
Application granted
Publication of CN112866229B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer And Data Communications (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to a high-speed network traffic identification method and system based on a state diagram. The method comprises: extracting protocol application features within the effective identification domain of the protocol from flow data; determining a feature description mode according to the protocol application features; determining a state diagram according to the feature description mode and the protocol application features; acquiring the flow to be identified; determining the five-tuple information of the message and the four-layer load position of the message according to the flow to be identified; performing upper layer protocol identification according to the five-tuple information of the message and the four-layer load position of the message; and matching protocol domains against the state diagram according to the five-tuple information of the message and the four-layer load position of the message to complete the identification of the message flow. The invention improves application identification performance and matching efficiency.

Description

High-speed network traffic identification method and system based on state diagram
Technical Field
The invention relates to the technical field of computer network security, in particular to a high-speed network flow identification method and system based on a state diagram.
Background
With the rise of the next-generation Internet, network traffic has become increasingly complex. Real-time identification of high-speed network traffic is the basis of network monitoring and security, particularly for service-oriented applications such as fine-grained identification and restoration of network traffic and user behavior analysis. Most existing network traffic classification and identification systems hard-code their features or describe them in a complicated way, and the feature sets they support are small. The main identification approaches are port-based, payload-matching-based and flow-behavior-feature-based. However, with the development of Internet technology, the rise of CDN technology, and the growth of networks from 40G to 100G, these systems face several problems and challenges. Port-based identification has large errors. Payload matching can identify traffic accurately, but the whole load must be matched against the full packet to obtain a result, so system performance is low. Flow behavior features suit only a few specific kinds of traffic analysis and are complex to implement. The currently popular traffic identification techniques based on machine learning are still under study, cannot yet identify a large number of applications effectively, and have no effective deployment at present. Therefore, researching a mature and effective high-speed network application identification technique that substantially improves identification performance, accuracy and flexibility for the current network environment is highly meaningful for real-time analysis of network traffic and plays an important role in ensuring network security.
Disclosure of Invention
The invention aims to provide a high-speed network flow identification method and a high-speed network flow identification system based on a state diagram, which improve the application identification processing performance and improve the matching efficiency.
In order to achieve the purpose, the invention provides the following scheme:
a high-speed network traffic identification method based on a state diagram comprises the following steps:
extracting the protocol application characteristics in the effective identification domain of the protocol according to the flow data;
determining a feature description mode according to the protocol application features; the feature description mode comprises: a direction description supported by each attribute domain, and logical relationships between protocol attributes;
determining a state diagram according to the feature description mode and the protocol application features; the state diagram comprises state nodes with a plurality of different functions; each functional node carries one type of action; the state diagram is linked from a root node to a termination node: a protocol attribute index serves as the root node, and each node action links to the next node in the state diagram until a termination node is reached; the uppermost layer of the state diagram is a root node index table, and the middle layer is a regular expression matching node layer; the state nodes with different functions comprise a feature matching node, a condition matching node, an identification termination node, a failure termination node, a character search node, a load forward matching node and a port table matching node;
acquiring flow to be identified;
determining message five-tuple information and a message four-layer load position according to the flow to be identified;
carrying out upper layer protocol identification according to the five-tuple information of the message and the four-layer load position of the message;
and matching a protocol domain with the state diagram according to the five-tuple information of the message and the four-layer load position of the message to finish the identification of the message flow.
Optionally, the determining a state diagram according to the feature description mode and the protocol application feature specifically includes:
generating state nodes with different functions according to the characteristic description mode and the protocol application characteristic;
and connecting the state nodes according to the protocol attribute logical relationship, and determining the state diagram.
Optionally, the determining, according to the traffic to be identified, five-tuple information of the packet and a four-layer load position of the packet specifically includes:
preprocessing the flow to be identified, and determining the five-tuple information of the message and the four-layer load position of the message; the preprocessing comprises: parsing layers L2-L4 of the message, IP fragment reassembly, five-tuple flow classification, flow table management and TCP order preservation.
Optionally, the matching between the protocol domain and the state diagram according to the five-tuple information of the packet and the four-layer load position of the packet is performed to complete the identification of the packet flow, and the method specifically includes:
indexing a root node according to the five-tuple information of the message and the four-layer load position of the message, and traversing the state diagram from that root node to complete the identification of the message flow.
A state diagram based high speed network traffic identification system, comprising:
the protocol application feature extraction module is used for extracting the protocol application features in the effective identification domain of the protocol according to the flow data;
the feature description module is used for determining a feature description mode according to the protocol application features; the feature description mode comprises: a direction description supported by each attribute domain, and logical relationships between protocol attributes;
the feature encoder module is used for determining a state diagram according to the feature description mode and the protocol application features; the state diagram comprises state nodes with a plurality of different functions; each functional node carries one type of action; the state diagram is linked from a root node to a termination node: a protocol attribute index serves as the root node, and each node action links to the next node in the state diagram until a termination node is reached; the uppermost layer of the state diagram is a root node index table, and the middle layer is a regular expression matching node layer; the state nodes with different functions comprise a feature matching node, a condition matching node, an identification termination node, a failure termination node, a character search node, a load forward matching node and a port table matching node;
the flow acquiring module to be identified is used for acquiring the flow to be identified;
the flow processing module is used for determining the five-tuple information of the message and the four-layer load position of the message according to the flow to be identified;
the protocol application first identification module is used for identifying an upper layer protocol according to the five-tuple information of the message and the four-layer load position of the message;
and the protocol application second identification module is used for matching the protocol domain with the state diagram according to the five-tuple information of the message and the four-layer load position of the message to finish the identification of the message flow.
Optionally, the feature encoder module specifically includes:
the state node generating unit is used for generating state nodes with different functions according to the characteristic description mode and the protocol application characteristic;
and the state diagram determining unit is used for connecting the state nodes according to the protocol attribute logical relationship and determining the state diagram.
Optionally, the stream processing module specifically includes:
the flow processing unit is used for preprocessing the flow to be identified and determining the five-tuple information of the message and the four-layer load position of the message; the preprocessing comprises: parsing layers L2-L4 of the message, IP fragment reassembly, five-tuple flow classification, flow table management and TCP order preservation.
Optionally, the protocol application second identification module specifically includes:
the protocol application identification unit is used for indexing a root node according to the five-tuple information of the message and the four-layer load position of the message, and traversing the state diagram from that root node to complete the identification of the message flow.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a high-speed network traffic identification method and system based on a state diagram. The state diagram is determined according to the feature description mode and the protocol application features; that is, the feature matching process is organized as a graph so that traffic features are matched with very low time and space complexity, which meets the requirement of real-time identification of high-speed network traffic. Converting the manually described features into an optimized state diagram that a computer can process with higher performance means that preprocessing such as feature organization and abnormal feature handling is completed in the offline compiling stage. This greatly reduces the special and complicated feature processing needed during application identification and greatly improves application identification performance. The state diagram is organized in contiguous memory, which improves the cache hit rate, reduces the number of memory accesses, and thereby improves matching efficiency.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a high-speed network traffic identification method based on a state diagram according to the present invention;
FIG. 2 is a diagram illustrating a specific definition of feature matching nodes according to the present invention;
FIG. 3 is a diagram illustrating a specific definition of a condition matching node according to the present invention;
FIG. 4 is a diagram illustrating the detailed definition of a condition table provided in the present invention;
FIG. 5 is a diagram illustrating specific definitions of an identified termination node according to the present invention;
FIG. 6 is a diagram illustrating a detailed definition of a failed termination node according to the present invention;
FIG. 7 is a diagram illustrating a specific definition of a character search node according to the present invention;
FIG. 8 is a schematic diagram of a specific definition of a load forward matching node provided by the present invention;
FIG. 9 is a diagram illustrating specific definitions of port table matching nodes according to the present invention;
FIG. 10 is a schematic diagram of a state diagram organization provided by the present invention;
FIG. 11 is a flowchart illustrating a method for determining a feature description for a protocol application according to the present invention;
FIG. 12 is a schematic flowchart of determining the state diagram provided by the present invention;
FIG. 13 is a schematic diagram illustrating an upper protocol identification process of a TCP protocol according to the present invention;
FIG. 14 is a schematic diagram illustrating the identification process of the upper layer application of the UDP protocol provided in the present invention;
FIG. 15 is a flowchart illustrating the process of identifying the HTTP protocol as an upper layer application according to the present invention;
FIG. 16 is a schematic diagram of a message flow identification process provided by the present invention;
FIG. 17 is a schematic structural diagram of a high-speed network traffic identification system based on a state diagram according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a high-speed network flow identification method and a high-speed network flow identification system based on a state diagram, which improve the application identification processing performance and improve the matching efficiency.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
FIG. 1 is a schematic flowchart of the high-speed network traffic identification method based on a state diagram provided by the invention. As shown in FIG. 1, the method includes:
S101, extracting the protocol application features in the effective identification domain of the protocol according to the flow data.
S102, determining a feature description mode according to the protocol application features; the feature description mode comprises: a direction description supported by each attribute domain, and logical relationships between protocol attributes.
The method supports feature descriptions for protocols such as HTTP, SSL/TLS, TCP, UDP, DNS, RTSP, RTMP and SIP, and describes application features in JSON format. Each attribute field supports a direction description such as cts or stc, where cts denotes matching in the client-to-server direction and stc denotes matching in the server-to-client direction; if no direction is given, both directions are matched. To describe application features more accurately, the module also supports and/or relationships between attribute fields of the same protocol. The supported protocol feature fields are listed in Table 1:
TABLE 1
[Table 1 appears as images (BDA0002896310940000061, BDA0002896310940000071) in the original publication.]
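As an illustration of the description format, the following minimal sketch shows what a JSON feature entry with direction markers might look like and how an entry without a direction could be expanded to both directions. Only the `cts`/`stc` markers, the "no direction means both" rule and the and/or relationship come from the text; the key names (`appid`, `protocol`, `relation`, `attributes`, `field`, `pattern`) and the helper functions are hypothetical, because the contents of Table 1 are not available here.

```python
import json

# Hypothetical feature description: all key names are illustrative assumptions.
FEATURE_JSON = """
{
  "appid": 1001,
  "protocol": "HTTP",
  "relation": "and",
  "attributes": [
    {"field": "host", "direction": "cts", "pattern": "example.com"},
    {"field": "server", "pattern": "nginx"}
  ]
}
"""

BOTH_DIRECTIONS = ("cts", "stc")


def directions(attribute):
    """Return the directions an attribute applies to; none given means both."""
    direction = attribute.get("direction")
    if direction is None:
        return BOTH_DIRECTIONS
    if direction not in BOTH_DIRECTIONS:
        raise ValueError(f"unknown direction: {direction!r}")
    return (direction,)


def expand(feature):
    """Flatten a feature into (field, direction, pattern) matching entries."""
    return [(a["field"], d, a["pattern"])
            for a in feature["attributes"] for d in directions(a)]


feature = json.loads(FEATURE_JSON)
entries = expand(feature)
```

Expanding the undirected attribute into one entry per direction mirrors the statement that 24 attributes give 48 matching nodes across the two directions.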
S103, determining a state diagram according to the feature description mode and the protocol application features; the state diagram comprises state nodes with a plurality of different functions; each functional node carries one type of action; the state diagram is linked from a root node to a termination node: a protocol attribute index serves as the root node, and each node action links to the next node in the state diagram until a termination node is reached; the uppermost layer of the state diagram is a root node index table, and the middle layer is a regular expression matching node layer; the state nodes with different functions comprise a feature matching node, a condition matching node, an identification termination node, a failure termination node, a character search node, a load forward matching node and a port table matching node.
Feature matching node: the function of this node is to perform feature matching on the message. The Signature Index field is filled with the attribute value corresponding to the feature matching node, and the node uses the Signature Index to determine the starting position in the feature library for matching. After a successful match it jumps to the designated node; otherwise it jumps to the node at the Failure Node Offset position. The specific definition of the node is shown in FIG. 2.
Condition matching node (condition_match_Node): the function of this node is to match AND relationships between features. A condition comparison node is set at compile time. The node uses the stateID to query the corresponding condition table and judges whether a relationship holds; if the Nth relationship holds, it jumps to the Nth Success Node Offset for further processing, and otherwise it jumps to the Failure Node Offset. The specific definition of the node is shown in FIG. 3.
Condition table: this is a static table that records the AND relationships between attributes; the condition matching node looks it up to complete AND matching. Number State indicates how many AND relationships there are, and Index State indicates the start position of the AND relationships. condition b1 is the ID of the condition with which the current condition has a binary AND relationship, and condition c1 is the ID of the condition with which it has a ternary AND relationship; if there is none, the field is filled with 0xFFFF to indicate that matching is finished. The specific definition of the condition table is shown in FIG. 4.
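The lookup performed by the condition matching node can be sketched as follows. This is a simplified illustration under stated assumptions, not the layout of FIG. 4: the dictionary layout, the names `CONDITION_TABLE` and `and_satisfied`, and the example condition IDs are hypothetical. Only the `0xFFFF` filler and the binary/ternary AND partners (condition b1, condition c1) are taken from the text.

```python
NO_CONDITION = 0xFFFF  # filler from the text: no further AND partner

# Hypothetical layout: condition ID -> (b1, c1), the other condition IDs that
# must also have matched for the AND relationship to hold.
CONDITION_TABLE = {
    1: (2, NO_CONDITION),  # binary AND: 1 and 2
    3: (4, 5),             # ternary AND: 3 and 4 and 5
}


def and_satisfied(condition_id, matched):
    """Return True if condition_id and all of its AND partners are in `matched`."""
    if condition_id not in matched:
        return False
    for partner in CONDITION_TABLE.get(condition_id, ()):
        if partner == NO_CONDITION:
            break  # 0xFFFF marks the end of the partner list
        if partner not in matched:
            return False
    return True
```

Here `matched` stands for the set of condition IDs already recorded for the flow, so the check is a few table reads rather than a re-scan of the message.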
Identification termination node (setappid_node): the function of this node is to mark the application ID (appid) that has been identified at the current node. It is a termination state node. The specific definition of the node is shown in FIG. 5.
Failure termination node (failure_node): the function of this node is to mark termination when matching against the feature library fails, indicating that the message flow has been processed but could not be identified. The specific definition of the node is shown in FIG. 6.
Character search node: the function of this node is to indicate that special characters must be matched during matching. The special characters are carried as a character set, and the character set to be matched is stored in the charset map. The specific definition of the node is shown in FIG. 7.
Load forward matching node (stream_and_cmp_Node): the function of this node is to record the sequential matching relationship between the two directions of a TCP/UDP flow (more specifically, the load in one direction, client to server or server to client, is matched first, and then the load in the other direction is matched). If the matching succeeds, it jumps to the node at the Success Node position; otherwise it jumps to the node at the Fail Node position. The specific definition of the node is shown in FIG. 8.
Port table matching node (Port_and_cmp_Node): the function of this node is to record the TCP/UDP port index identifier. After a TCP/UDP load match is hit, traversal jumps to this node for port matching. A hit here means that the AND relationship between port and load is satisfied, and traversal jumps to the node at the Port successful Node Offset position; otherwise it jumps to the node at the Stcfail Node Offset or Ctsfail Node Offset position. The specific definition of the node is shown in FIG. 9.
In the feature compiling process, a domain separator (for example, the HTTP domain separator '\r\n') is appended according to the protocol attribute, so that a domain can be delimited during matching even when no feature is hit, and the domain boundaries of the message need not be determined by backtracking. The top layer of the generated state diagram is a root node index table. The middle layer is the regular expression matching node layer; the module defines 24 attributes, which gives 48 feature regular expression matching nodes across the two directions. The lower layer consists of the other functional nodes, which complete the remaining matching functions, and the bottom layer consists of the termination nodes. The organization of the compiled state diagram is shown in FIG. 10.
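The effect of appending the separator at compile time can be sketched with Python's standard `re` module. This is an illustrative assumption about how such a compiled feature might behave, not the patented compiler: the helper names `compile_feature` and `match_field` and the group names `hit` and `end` are hypothetical. A single forward scan stops either at the feature or at the end of the domain, so the caller never needs to go back to find where the domain ended.

```python
import re


def compile_feature(feature, separator=rb"\r\n"):
    """Compile a feature so one scan stops at the feature or at the domain end."""
    return re.compile(rb"(?P<hit>" + feature + rb")|(?P<end>" + separator + rb")")


def match_field(pattern, payload, start):
    """Scan one domain from `start`; return (outcome, offset after the match)."""
    m = pattern.search(payload, start)
    if m is None:
        # Neither feature nor separator seen yet: the domain continues
        # in the next message.
        return ("incomplete", len(payload))
    return (m.lastgroup, m.end())


payload = b"User-Agent: curl/8.1\r\nHost: example.com\r\n"
value_start = len(b"User-Agent: ")
hit = match_field(compile_feature(rb"curl/\d+"), payload, value_start)
miss = match_field(compile_feature(rb"wget"), payload, value_start)
```

In the miss case the returned offset is the start of the next domain, which is where parsing of the following attribute can resume.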
S103 specifically comprises the following steps:
generating state nodes with different functions according to the feature description mode and the protocol application features;
and connecting the state nodes according to the protocol attribute logical relationships to determine the state diagram.
S104, acquiring the flow to be identified.
S105, determining the five-tuple information of the message and the four-layer load position of the message according to the flow to be identified.
S105 specifically comprises the following steps:
preprocessing the flow to be identified, and determining the five-tuple information of the message and the four-layer load position of the message; the preprocessing comprises: parsing layers L2-L4 of the message, IP fragment reassembly, five-tuple flow classification, flow table management and TCP order preservation. The flow processing state is saved so that processing can resume when the next message arrives.
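The five-tuple classification and flow table steps can be sketched as below. This is a minimal sketch under stated assumptions: the names `FiveTuple` and `FlowTable`, the contents of the per-flow state, and the rule that the sender of the first packet seen is treated as the client are all hypothetical simplifications; L2-L4 parsing, IP fragment reassembly and TCP reordering are omitted.

```python
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "src_ip src_port dst_ip dst_port proto")


class FlowTable:
    """Maps both directions of a flow to one shared state record."""

    def __init__(self):
        self._flows = {}

    def lookup(self, pkt):
        """Return (flow_state, direction) for a packet's five-tuple.

        Assumption: the sender of the first packet seen is the client.
        """
        reverse = FiveTuple(pkt.dst_ip, pkt.dst_port,
                            pkt.src_ip, pkt.src_port, pkt.proto)
        if reverse in self._flows:
            return self._flows[reverse], "stc"
        state = self._flows.setdefault(pkt, {"node": None, "appid": None})
        return state, "cts"


table = FlowTable()
request = FiveTuple("10.0.0.1", 40000, "10.0.0.2", 80, "TCP")
reply = FiveTuple("10.0.0.2", 80, "10.0.0.1", 40000, "TCP")

state, direction = table.lookup(request)
state["node"] = 7  # save where matching stopped, to resume on the next message
reply_state, reply_direction = table.lookup(reply)
```

Because both directions resolve to the same record, a matching state saved while handling a client message is visible when the server's reply arrives.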
S106, carrying out upper layer protocol identification according to the five-tuple information of the message and the four-layer load position of the message.
S107, matching a protocol domain with the state diagram according to the five-tuple information of the message and the four-layer load position of the message, and finishing the identification of the message flow.
S107 specifically comprises the following steps:
indexing a root node according to the five-tuple information of the message and the four-layer load position of the message, and traversing the state diagram from that root node to complete the identification of the message flow.
The method implements protocol parsing for HTTP, SSL/TLS, TCP, UDP, DNS, RTSP, RTMP, SIP, FTP and the like, matching of the relevant domains, and identification methods such as sub-stream analysis and association. After a protocol is parsed to a domain of interest, the root node is indexed and the state diagram generated by the feature compiler module is traversed. The classified message flow is identified starting from the root node and following the node jump actions of the state diagram until a leaf node is reached. During traversal, operations such as message content matching and relationship comparison are executed according to the node type, the next node is selected according to the execution result, and traversal from the current root node exits once a termination node has been processed. If a node has not finished matching when the end of the message is reached, the current matching state is stored in the flow table and restored when the next message arrives so that matching can continue.
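The traversal described above can be sketched as a loop over a flat node list. This is a simplified illustration, not the node encoding of FIG. 2 to FIG. 10: the tuple layout, the constants `FEATURE`, `SET_APPID` and `FAILURE`, the example patterns and the appid value are hypothetical, and list indices stand in for the node offsets.

```python
import re

# Hypothetical node kinds and encoding; "offsets" are indices into GRAPH.
FEATURE, SET_APPID, FAILURE = "feature", "setappid", "failure"

GRAPH = [
    (FEATURE, re.compile(rb"example\.com"), 1, 3),  # 0: root for one attribute
    (FEATURE, re.compile(rb"^www\."), 2, 3),        # 1: second feature check
    (SET_APPID, 1001),                              # 2: identification termination
    (FAILURE,),                                     # 3: failure termination
]


def traverse(graph, root, field):
    """Follow node jump actions from `root` until a termination node is reached."""
    node = graph[root]
    while True:
        kind = node[0]
        if kind == FEATURE:
            _, pattern, success, failure = node
            node = graph[success if pattern.search(field) else failure]
        elif kind == SET_APPID:
            return node[1]  # identified application ID
        else:
            return None     # processed but not identified
```

Each step only moves forward along a success or failure link, so the number of steps is bounded by the depth of the graph rather than by the number of features.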
The module is further described below through upper layer application identification for the mainstream network protocols TCP, UDP and HTTP; other protocols are handled similarly and are not described separately. The identification function of the TCP/UDP branch is mainly based on port and load, and these two attributes can be combined in AND/OR relationships. First, the port is used as the key to search the port lookup table; on a hit, the root node of the state diagram for the current attribute is returned, and the graph is traversed from that root node to identify the protocol application on the TCP/UDP branch. If the result is RTMP, FTP, SSL/TLS, HTTP, RTSP, DNS, SIP or the like, the root node of the state diagram is indexed by the protocol attribute and the state diagram is traversed again, until the upper layer application is identified or state diagram matching terminates.
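The port-keyed first step can be sketched as follows. This is only an illustration of a port AND load relationship: the table contents, the use of a fixed payload prefix in place of a state diagram root node, and the function name `identify_branch` are hypothetical simplifications.

```python
# Hypothetical port lookup table: port -> (protocol, payload prefix to confirm).
# In the described system a hit would return a state diagram root node instead.
PORT_TABLE = {
    80: ("HTTP", b"GET "),
    5060: ("SIP", b"INVITE "),
}


def identify_branch(dst_port, payload):
    """Use the port as the key, then require the load to match as well (AND)."""
    entry = PORT_TABLE.get(dst_port)
    if entry is None:
        return None
    protocol, prefix = entry
    return protocol if payload.startswith(prefix) else None
```

For example, `identify_branch(80, b"GET / HTTP/1.1\r\n")` yields `"HTTP"`, after which traversal would continue from the root node indexed by the HTTP attribute.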
The HTTP protocol branch identification function supports AND/OR relationships among multiple HTTP attribute domains; the module supports an AND of at most 3 attributes. While the HTTP header is parsed, an AC (multi-pattern matching) algorithm is used to identify each attribute field and determine which specific HTTP attribute the current message load position belongs to (for example, User_Agent or Host). The root node is then indexed by that attribute and feature matching is performed on the content of the domain. When the state diagram jumps to a condition matching node, the condition state is stored in the flow table; when subsequent attributes or messages are matched, the AND relationship is completed by querying the condition table according to that state. Domain parsing and matching continue in this way, and state diagram traversal exits when the application carried over HTTP is identified, or when all HTTP protocol domains have been processed without identification. Because the domain separator (such as "\r\n") was appended in the feature compiling stage, an attribute match that does not hit any feature hits the separator at the end of the domain and returns.
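How an AND relationship across HTTP attributes can be completed from state kept per flow is sketched below. This is a simplified stand-in: a dictionary lookup on the header name replaces the AC multi-pattern matching step, plain substring tests replace the regular expression nodes, and the names `RULES` and `feed_header_line` as well as the appid value are hypothetical.

```python
# Hypothetical rule: appid 2001 requires both attributes to match
# (an AND of 2; the text allows at most 3).
RULES = {2001: {"host": b"example.com", "user-agent": b"curl"}}


def feed_header_line(flow_state, line):
    """Process one header line; return an appid once all AND conditions matched.

    `flow_state` persists per flow, so conditions may be satisfied by header
    lines that arrive in different messages.
    """
    name, _, value = line.partition(b":")
    attr = name.strip().lower().decode("ascii", "replace")
    for appid, conditions in RULES.items():
        needle = conditions.get(attr)
        if needle is not None and needle in value:
            matched = flow_state.setdefault(appid, set())
            matched.add(attr)
            if matched == set(conditions):
                return appid
    return None


flow_state = {}
first = feed_header_line(flow_state, b"Host: www.example.com")
second = feed_header_line(flow_state, b"User-Agent: curl/8.1")
```

The first line only records a partial match in the flow state; the identification completes when the second attribute arrives.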
More specifically, the four modules described above work together to form a complete application identification method. The feature description module describes the extracted network application features and generates a rule-based feature description file; the feature compiler module compiles this file into a binary state diagram file; the application identification module loads the binary state diagram file and performs identification on the packet data processed by the flow processing module, finally identifying the protocol and the application.
S107 and S106 support multiple regular expression engines, such as Hyperscan or a DFA state machine, which can be selected flexibly according to system requirements.
The method preprocesses and compiles the features into a deterministic directed state diagram, and during matching it jumps toward a termination node according to the attribute matching results. A protocol and application identification system using the method needs no backtracking, can identify a packet with time complexity close to O(1), and can raise the cache hit rate and reduce the number of memory accesses. Combined with a high-performance regular expression matching engine, the method can greatly improve the performance of application identification, and the identification features can be updated flexibly, meeting the requirements of high-speed network traffic identification.
The method is further explained using the TCP, UDP, and HTTP protocols; other protocols are processed similarly. The core processing flow comprises the following steps:
S1: As shown in Fig. 11, determine the feature description mode according to the protocol application features.
S2: Generate the binary state diagram file. Following the node and structure design of the compiler module, first parse the feature file described in JSON format, then generate the corresponding nodes, and finally link the nodes according to their relationships and write the state diagram file in TLV format. HTTP feature compilation is taken as the example (other protocol attributes are compiled similarly); the detailed flow is shown in Fig. 12.
S3: Load the generated binary state diagram file into memory.
S4: If traffic arrives, the flow processing module parses the flow information and the layer-4 payload position and sends the data to the identification module, which executes S5; otherwise, wait for traffic.
S5: If the traffic is TCP, execute S6; if the traffic is UDP, execute S7.
S6: As shown in Fig. 13, identify the upper-layer application over TCP. If the upper-layer protocol carried by TCP is HTTP, execute S8; if it is another upper-layer protocol such as RTMP, FTP, SSL/TLS, or RTSP, enter the identification step for that protocol, whose flow is similar to the HTTP identification flow and is not described separately. More specifically, as shown in the figure: if the port number is 80 or 8080, check at specific payload positions whether the protocol is HTTP, and if so execute S8; if the port number is 443 or 9443, check at specific payload positions whether the protocol is SSL/TLS, and if so perform SSL/TLS processing; if the port number is 21, check at specific payload positions whether the protocol is FTP, and if so perform FTP processing. Otherwise, execute S9 using the root node indexed by the TCP port, or the TCP root node. If S9 identifies a protocol such as HTTP, RTMP, FTP, SSL/TLS, or RTSP, enter the corresponding protocol processing; otherwise identification is complete and S10 is executed.
S7: As shown in Fig. 14, identify the upper-layer application over UDP. If the upper-layer protocol carried by UDP is DNS, SIP, or the like, enter the identification step for that protocol, whose flow is similar to the HTTP identification flow and is not described separately. More specifically, as shown in the figure: if the UDP port is 53, check at specific payload positions whether the protocol is DNS, and if so enter the DNS processing flow. Otherwise, execute S9 using the root node indexed by the UDP port, or the UDP root node. If S9 identifies the SIP protocol, enter the SIP processing flow; otherwise identification is complete and S10 is executed.
S8: As shown in Fig. 15, identify the upper-layer application carried over HTTP. As shown in the figure, the flow first enters HTTP attribute identification. If the field is the URI field, S9 is executed with the root node indexed by HTTPURI; if it is the Content-Type field, with the root node indexed by HTTPCONTENTTYPE; if it is the Cookie field, with the root node indexed by HTTPCOOKIE; if it is the Host field, with the root node indexed by HTTPHOST; if it is the Referer field, with the root node indexed by HTTPREFERER; if it is the User-Agent field, with the root node indexed by HTTPUSERAGENT; if it is the Bundleid field, with the root node indexed by HTTPBUNDLEID; if it is the X-Requested-With field, with the root node indexed by HTTPXREQUESTEDWITH; if it is the Server field, with the root node indexed by HTTPSERVER; if it is the Location field, with the root node indexed by HTTPLOCATION. Any other attribute field is of no interest and is skipped. If the HTTP header has not ended, header field identification continues until the upper-layer application is identified; otherwise, S9 is executed with the root node indexed by the HTTP payload, and S10 is executed afterwards.
S9: Traverse the protocol application feature state diagram. This is the lowest-level processing of the method: traversal starts at the root node and proceeds until a termination node is reached, completing feature matching and returning the identification result. The detailed flow is shown in Fig. 16.
S10: Record the identification result in the flow table to complete identification.
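The traversal in S9 can be sketched in Python as a loop that follows node actions from a root node to a termination node while the payload offset only moves forward. The node encoding, node names, and example patterns below are assumptions for illustration; the node kinds loosely follow the feature matching, character search, identification termination, and failure termination nodes named in this document, and Python's `re` module stands in for the regular expression engine.

```python
import re

# Hypothetical state diagram: node id -> node description (all names invented).
GRAPH = {
    "root_http_host": {"kind": "feature", "pattern": rb"api\.example\.com",
                       "hit": "done_example", "miss": "fail"},
    "root_tcp": {"kind": "char_search", "char": b"\n",
                 "hit": "after_line", "miss": "fail"},
    "after_line": {"kind": "feature", "pattern": rb"HELLO",
                   "hit": "done_hello", "miss": "fail"},
    "done_example": {"kind": "identified", "app": "example_app"},
    "done_hello": {"kind": "identified", "app": "hello_proto"},
    "fail": {"kind": "failed"},
}


def traverse(graph: dict, root: str, payload: bytes):
    """Walk from root to a termination node; assumes the graph is acyclic."""
    node_id, offset = root, 0
    while True:
        node = graph[node_id]
        kind = node["kind"]
        if kind == "identified":
            return node["app"]
        if kind == "failed":
            return None
        if kind == "feature":
            # Anchored match at the current offset, so earlier bytes are never rescanned.
            m = re.compile(node["pattern"]).match(payload, offset)
            if m:
                offset, node_id = m.end(), node["hit"]
            else:
                node_id = node["miss"]
        elif kind == "char_search":
            # Skip ahead to just past the next occurrence of the given byte.
            pos = payload.find(node["char"], offset)
            if pos >= 0:
                offset, node_id = pos + 1, node["hit"]
            else:
                node_id = node["miss"]
        else:
            raise ValueError(f"unknown node kind: {kind}")
```

Each node has exactly one successor per outcome, so the walk never revisits a decision, which is the property the document relies on when it says no backtracking is needed.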
Compared with the prior art, the method converts the feature set into the traversal of a feature graph and supports AND/OR relationships between port and payload in both directions, or between payloads in both directions. When payloads are matched, less content needs to be matched to reach a more precise identification, which improves both the identification accuracy and the processing performance of the system. Traffic from hundreds of apps was collected and labeled for testing; the identification accuracy was roughly 80%-95%, and an application identification system using the method on a general-purpose multi-core processor met the 100G link performance test requirements at a domestic deployment site.
These advantages come mainly from two technical points of the application identification method:
First, the feature compilation process is optimized through preprocessing. It converts the manually described features into a state diagram that a computer can process with higher performance, and preprocessing such as feature organization and abnormal-feature handling is completed in the offline compilation stage. This greatly reduces special-case and complex feature handling during identification and improves identification performance.
Second, during state diagram traversal, the organized state diagram is traversed mainly according to the payload matching results. Only part of the payload needs to be matched: according to the result of running the state diagram, the remaining payload of the packet can be skipped, or matching can move to a new position for a new feature. The identification result is therefore obtained without matching the entire payload of the packet flow and without backtracking, so identification completes with matching performance close to O(1). In addition, the state diagram is organized in contiguous memory, which maximizes the cache hit rate, reduces the number of memory accesses, and improves matching efficiency.
Fig. 17 is a schematic structural diagram of the high-speed network traffic identification system based on a state diagram. As shown in Fig. 17, the system provided by the present invention comprises: a protocol application feature extraction module 1701, a feature description module 1702, a feature encoder module 1703, a to-be-identified traffic acquisition module 1704, a flow processing module 1705, a first protocol application identification module 1706, and a second protocol application identification module 1707.
The protocol application feature extraction module 1701 is configured to extract protocol application features within the valid identification fields of a protocol according to traffic data.
The feature description module 1702 is configured to determine a feature description mode according to the protocol application features; the feature description mode comprises: a direction description supported by each attribute field, and the logical relationships among protocol attributes.
The feature encoder module 1703 is configured to determine a state diagram according to the feature description mode and the protocol application features; the state diagram comprises a plurality of state nodes with different functions; each functional node carries one type of action; the state diagram is linked from a root node to a termination node, with a protocol attribute index serving as the root node and each node action linking to the next node in the state diagram until the termination node; the top layer of the state diagram is a root node index table, and the middle layer is a regular expression matching node layer; the state nodes with different functions comprise a feature matching node, a condition matching node, an identification termination node, a failure termination node, a character search node, a payload forward matching node, and a port table matching node.
The to-be-identified traffic acquisition module 1704 is configured to acquire the traffic to be identified.
The flow processing module 1705 is configured to determine packet five-tuple information and a packet layer-4 payload position according to the traffic to be identified.
The first protocol application identification module 1706 is configured to perform upper-layer protocol identification according to the packet five-tuple information and the packet layer-4 payload position.
The second protocol application identification module 1707 is configured to match protocol fields against the state diagram according to the packet five-tuple information and the packet layer-4 payload position, completing identification of the packet flow.
The feature encoder module 1703 specifically comprises: a state node generating unit and a state diagram determining unit.
The state node generating unit is configured to generate state nodes with different functions according to the feature description mode and the protocol application features.
The state diagram determining unit is configured to connect the state nodes according to the logical relationships among protocol attributes and thereby determine the state diagram.
The flow processing module 1705 specifically comprises: a flow processing unit.
The flow processing unit is configured to preprocess the traffic to be identified and determine the packet five-tuple information and the packet layer-4 payload position; the preprocessing comprises: L2-L4 parsing of packets, IP fragment reassembly, five-tuple classification, flow table management, and TCP order preservation.
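To make "five-tuple information" and "layer-4 payload position" concrete, here is a minimal Python sketch that extracts both from a raw IPv4 packet. It assumes the buffer starts at the IPv4 header (layer 2 already stripped) and ignores fragmentation, reordering, and IPv6, all of which the preprocessing described above would have to handle; the function name is hypothetical.

```python
import socket
import struct


def parse_ipv4(packet: bytes):
    """Return (five_tuple, l4_payload_offset) for TCP or UDP over IPv4, else None.

    Assumes `packet` begins at the IPv4 header; fragments are not reassembled.
    """
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return None
    ihl = (packet[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20:
        return None
    proto = packet[9]
    src = socket.inet_ntoa(packet[12:16])
    dst = socket.inet_ntoa(packet[16:20])
    if proto == 6:  # TCP: header length comes from the data-offset field
        if len(packet) < ihl + 20:
            return None
        sport, dport = struct.unpack_from("!HH", packet, ihl)
        payload_at = ihl + (packet[ihl + 12] >> 4) * 4
    elif proto == 17:  # UDP: fixed 8-byte header
        if len(packet) < ihl + 8:
            return None
        sport, dport = struct.unpack_from("!HH", packet, ihl)
        payload_at = ihl + 8
    else:
        return None
    return (src, dst, sport, dport, proto), payload_at
```

The returned tuple can serve as a flow table key, and the offset tells the identification stage where layer-4 payload matching should begin.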
The second protocol application identification module 1707 specifically comprises: a protocol application identification unit.
The protocol application identification unit is configured to index the root node according to the packet five-tuple information and the packet layer-4 payload position and to traverse the state diagram, completing identification of the packet flow.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are the same or similar can be cross-referenced. Because the system disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant details can be found in the description of the method.
Specific examples have been used here to explain the principles and embodiments of the present invention; they are provided only to aid understanding of the method and its core concept. A person skilled in the art may, following the idea of the present invention, change the specific embodiments and the scope of application. Accordingly, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A high-speed network traffic identification method based on a state diagram, characterized by comprising:
extracting protocol application features within the valid identification fields of a protocol according to traffic data;
determining a feature description mode according to the protocol application features; the feature description mode comprises: a direction description supported by each attribute field, and the logical relationships among protocol attributes;
determining a state diagram according to the feature description mode and the protocol application features; the state diagram comprises a plurality of state nodes with different functions; each functional node carries one type of action; the state diagram is linked from a root node to a termination node, with a protocol attribute index serving as the root node and each node action linking to the next node in the state diagram until the termination node; the top layer of the state diagram is a root node index table, and the middle layer is a regular expression matching node layer; the state nodes with different functions comprise a feature matching node, a condition matching node, an identification termination node, a failure termination node, a character search node, a payload forward matching node, and a port table matching node;
acquiring traffic to be identified;
determining packet five-tuple information and a packet layer-4 payload position according to the traffic to be identified;
performing upper-layer protocol identification according to the packet five-tuple information and the packet layer-4 payload position;
and matching protocol fields against the state diagram according to the packet five-tuple information and the packet layer-4 payload position to complete identification of the packet flow.
2. The high-speed network traffic identification method based on a state diagram according to claim 1, wherein determining the state diagram according to the feature description mode and the protocol application features specifically comprises:
generating state nodes with different functions according to the feature description mode and the protocol application features;
and connecting the state nodes according to the logical relationships among protocol attributes to determine the state diagram.
3. The method according to claim 1, wherein determining the packet five-tuple information and the packet layer-4 payload position according to the traffic to be identified specifically comprises:
preprocessing the traffic to be identified and determining the packet five-tuple information and the packet layer-4 payload position; the preprocessing comprises: L2-L4 parsing of packets, IP fragment reassembly, five-tuple classification, flow table management, and TCP order preservation.
4. The method according to claim 1, wherein matching protocol fields against the state diagram according to the packet five-tuple information and the packet layer-4 payload position to complete identification of the packet flow specifically comprises:
indexing the root node according to the packet five-tuple information and the packet layer-4 payload position and traversing the state diagram to complete identification of the packet flow.
5. A high-speed network traffic identification system based on a state diagram, comprising:
a protocol application feature extraction module, configured to extract protocol application features within the valid identification fields of a protocol according to traffic data;
a feature description module, configured to determine a feature description mode according to the protocol application features; the feature description mode comprises: a direction description supported by each attribute field, and the logical relationships among protocol attributes;
a feature encoder module, configured to determine a state diagram according to the feature description mode and the protocol application features; the state diagram comprises a plurality of state nodes with different functions; each functional node carries one type of action; the state diagram is linked from a root node to a termination node, with a protocol attribute index serving as the root node and each node action linking to the next node in the state diagram until the termination node; the top layer of the state diagram is a root node index table, and the middle layer is a regular expression matching node layer; the state nodes with different functions comprise a feature matching node, a condition matching node, an identification termination node, a failure termination node, a character search node, a payload forward matching node, and a port table matching node;
a to-be-identified traffic acquisition module, configured to acquire traffic to be identified;
a flow processing module, configured to determine packet five-tuple information and a packet layer-4 payload position according to the traffic to be identified;
a first protocol application identification module, configured to perform upper-layer protocol identification according to the packet five-tuple information and the packet layer-4 payload position;
and a second protocol application identification module, configured to match protocol fields against the state diagram according to the packet five-tuple information and the packet layer-4 payload position to complete identification of the packet flow.
6. The system according to claim 5, wherein the feature encoder module specifically comprises:
a state node generating unit, configured to generate state nodes with different functions according to the feature description mode and the protocol application features;
and a state diagram determining unit, configured to connect the state nodes according to the logical relationships among protocol attributes to determine the state diagram.
7. The system according to claim 5, wherein the flow processing module specifically comprises:
a flow processing unit, configured to preprocess the traffic to be identified and determine the packet five-tuple information and the packet layer-4 payload position; the preprocessing comprises: L2-L4 parsing of packets, IP fragment reassembly, five-tuple classification, flow table management, and TCP order preservation.
8. The high-speed network traffic identification system based on a state diagram according to claim 5, wherein the second protocol application identification module specifically comprises:
a protocol application identification unit, configured to index the root node according to the packet five-tuple information and the packet layer-4 payload position and to traverse the state diagram to complete identification of the packet flow.
CN202110042132.0A 2021-01-13 2021-01-13 High-speed network traffic identification method and system based on state diagram Active CN112866229B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110042132.0A CN112866229B (en) 2021-01-13 2021-01-13 High-speed network traffic identification method and system based on state diagram

Publications (2)

Publication Number Publication Date
CN112866229A true CN112866229A (en) 2021-05-28
CN112866229B CN112866229B (en) 2022-09-06

Family

ID=76003282

Country Status (1)

Country Link
CN (1) CN112866229B (en)

Also Published As

Publication number Publication date
CN112866229B (en) 2022-09-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant