CN110061869B - Network track classification method and device based on keywords - Google Patents


Info

Publication number
CN110061869B
Authority
CN
China
Prior art keywords
track
flow
cluster
adopting
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910281096.6A
Other languages
Chinese (zh)
Other versions
CN110061869A (en)
Inventor
孟博
何旭东
王德军
李子茂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South Central Minzu University
Original Assignee
South Central University for Nationalities
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South Central University for Nationalities
Priority to CN201910281096.6A
Publication of CN110061869A
Application granted
Publication of CN110061869B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08 Configuration management of networks or network elements
    • H04L41/0896 Bandwidth or capacity management, i.e. automatically increasing or decreasing capacities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Hardware Design (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a keyword-based network track classification method and device. The classification method first obtains first track clusters by combining flow statistical features with the K-means method; each of the first track clusters is used as the input of a track segmentation method, which divides the tracks into fixed fields (IF) and variable fields (VF) and calculates the length and the position information of each fixed field. Curve fitting is then carried out by an IF distribution fitting method to obtain an IF position distribution curve. Next, an IF classification method obtains the types of IF contained in each extreme value interval and the number of IFs of each type; these are input into a track clustering method, which outputs second track clusters. A keyword inference method then infers separators, keywords are separated from the IFs according to the separators, and a signature database is finally formed. The invention can improve the classification accuracy and greatly improve the classification efficiency.

Description

Network track classification method and device based on keywords
Technical Field
The invention relates to the technical field of information security, in particular to a network track classification method and device based on keywords.
Background
The classification of network traffic is the basis for ensuring the security of cyberspace. Traffic classification identifies different types of network protocol flows and is of great significance in fields such as communication security, network management, network attack and defense, intrusion detection and protocol reverse engineering.
With the development of the Internet, the 5G era of the Internet of Everything is approaching. Terminals such as computers, mobile phones and sensors generate a large amount of traffic, and classifying and managing this traffic challenges existing traffic classification schemes. Traffic classification is crucial to network management tasks such as monitoring network resources, discovering and handling network failures in time, and guaranteeing network quality of service and network efficiency. On the one hand, for security purposes, traffic classification, filtering and the detection of malicious activities all require knowledge of the types of application flows in the network, and network operators can react quickly to potential events based on malicious traffic detection. On the other hand, new classes of applications (e.g., P2P, VoIP and video streaming) have grown dramatically on the Internet. These applications are particularly difficult to classify and often have strict resource requirements for bandwidth (e.g., P2P) or QoS (e.g., low delay and jitter for VoIP applications), which poses challenges to network operators.
In the prior art, a widely used method for classifying network traffic is deep packet inspection, which classifies network traffic by identifying the signatures and fingerprints contained in a network protocol flow using pattern matching.
In the process of implementing the invention, the applicant of the invention finds that at least the following technical problems exist in the prior art:
because new network protocols keep appearing and network protocol versions keep being replaced, the deep packet inspection method requires the signature database to be maintained manually; in addition, because signatures cannot be obtained directly for private protocol flows or zero-day application protocol flows, the classification efficiency of the deep packet inspection method drops sharply. As network traffic grows, the traditional deep packet inspection method also suffers from high computational complexity and can hardly meet the requirements of high-bandwidth networks.
That is to say, the deep packet inspection method in the prior art requires manual maintenance of the signature database and therefore has the technical problems of heavy workload, long time consumption, low efficiency and difficulty of application to high-bandwidth networks. An efficient keyword-based network track classification method is therefore of great significance.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for classifying network tracks based on keywords, so as to solve or at least partially solve the technical problems of large workload and low efficiency of the prior art.
The invention provides a keyword-based network track classification method in a first aspect, which comprises the following steps:
step S1: preliminarily classifying an input mixed protocol track by a K-means method based on flow statistical features to obtain K first track clusters; in each cluster, arranging the track flows in reverse order of length, comparing protocol tracks of similar length pairwise by the Needleman-Wunsch method, dividing the tracks into fixed fields IF and variable fields VF, and calculating the length IF_l and the position information IF_s of each fixed field;
step S2: weighting the length IF_l of the fixed field by an IF_l weighting method to obtain the weight IF_w of the IF; performing curve fitting with IF_w and IF_s as input by a curve fitting method to obtain an IF position distribution curve;
step S3: solving the extreme values of the fitted IF position distribution curve by a curve extreme value solving method, extracting the IFs in each extreme value interval of the curve, and carrying out IF classification statistics based on the Levenshtein distance; then outputting the types of IF contained in each extreme value interval and the number of IFs of each type;
step S4: marking all tracks according to the most numerous IF in each extreme value interval, clustering by the K-means method, and outputting second track clusters;
step S5: selecting the track cluster corresponding to a target security protocol from the second track clusters of step S4; then, taking the track cluster of the target security protocol as input, sequentially executing steps S1 to S3 to obtain the types of IF contained in the extreme value intervals corresponding to the target security protocol and the number of IFs of each type; inferring separators by a separator inference method that compares the heads and tails of adjacent IFs, separating keywords from the IFs according to the separators, and finally forming a signature database;
step S6: marking the track flow to be processed using the formed signature database, converting it into vectors, and classifying the vectors by using them as the input of the K-means method.
In one implementation, step S1 specifically includes:
step S1.1: inputting the mixed protocol track <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
step S1.2: taking <Cluster, FlowID, Flow> as the input of a track reverse ordering method: for the track flows in each first track cluster Cluster, calculating the flow length Flow_length, then sorting the flows by Flow_length from small to large using quicksort to form a queue Sequence<Cluster, FlowID, Flow>;
step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then measuring the length of each IF to obtain IF_l and the distance from the IF to the start of Flow_i to obtain IF_s.
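As an illustration of step S1.3, the following is a minimal sketch of pairwise alignment with the Needleman-Wunsch method and of reading IF, IF_l and IF_s off the alignment. The scoring values (match 1, mismatch -1, gap -1), the function names and the sample flows are assumptions made for illustration; the patent does not specify them.

```python
def needleman_wunsch(a: bytes, b: bytes, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b. Returns a list of (i, j) index pairs,
    where None on either side stands for a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    pairs.reverse()
    return pairs


def fixed_fields(flow_i: bytes, flow_j: bytes):
    """Return (IF, IF_l, IF_s) for every run of identical aligned bytes.
    IF_s is the offset of the field from the start of flow_i."""
    fields, start, prev = [], None, None
    for i, j in needleman_wunsch(flow_i, flow_j) + [(None, None)]:
        same = i is not None and j is not None and flow_i[i] == flow_j[j]
        if same:
            if start is None:
                start = i
            prev = i
        elif start is not None:
            fields.append((flow_i[start:prev + 1], prev + 1 - start, start))
            start = None
    return fields


fields = fixed_fields(b"GET /a HTTP/1.1", b"GET /bc HTTP/1.1")
```

Everything between two fixed fields in this sketch corresponds to a variable field VF; a real implementation would likely also need a minimum IF length to suppress accidental single-byte matches.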
In one implementation, step S2 specifically includes:
step S2.1: weighting the length IF_l of the fixed field by the IF_l weighting method to obtain the weight IF_w of the IF, wherein the weights are assigned as follows:
$$\mathit{IF\_w} = \begin{cases} 0, & 1 \le \mathit{IF\_l} \le 8 \\ 1, & 9 \le \mathit{IF\_l} \le 16 \\ 2, & 17 \le \mathit{IF\_l} \le 24 \\ 3, & \mathit{IF\_l} \ge 25 \end{cases}$$
the larger the length IF_l of a fixed field, the higher the probability that a keyword appears in it and the larger the assigned weight: the weight is 0 when the length of the fixed field is between 1 byte and 8 bytes, 1 when it is between 9 bytes and 16 bytes, 2 when it is between 17 bytes and 24 bytes, and 3 when it is greater than or equal to 25 bytes;
step S2.2: eliminating background noise from the weights calculated in step S2.1 by a noise elimination method to obtain corrected IF_l weights;
step S2.3: fitting the corrected IF_l weights and the lengths of the fixed fields with a B-spline curve of preset degree to obtain the IF position distribution curve, wherein the curve equation of the B-spline is given by formula (1):
$$p(u) = \sum_{i=0}^{n} d_i \, N_{i,k}(u) \qquad (1)$$
in formula (1), d_i (i = 0, 1, ..., n) denotes the control points and N_{i,k}(u) (i = 0, 1, ..., n) denotes the k-th order normalized B-spline basis functions.
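The weighting rule of step S2.1 and the evaluation of formula (1) can be sketched as follows. This is only an illustration: the function names and the clamped knot vector are assumptions, k is treated here as the polynomial degree, and the basis functions are computed with the standard Cox-de Boor recursion. The sketch evaluates a curve from given control points; it does not show how the control points are fitted to the weighted IF data.

```python
def if_weight(if_l: int) -> int:
    """Weight IF_w for a fixed field of length IF_l bytes (step S2.1)."""
    if if_l < 1:
        raise ValueError("IF_l must be at least 1 byte")
    if if_l <= 8:
        return 0
    if if_l <= 16:
        return 1
    if if_l <= 24:
        return 2
    return 3


def basis(i: int, k: int, u: float, knots) -> float:
    """Cox-de Boor recursion for the basis function N_{i,k}(u) of degree k.
    Intervals are half-open, so u equal to the last knot evaluates to 0."""
    if k == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    d1 = knots[i + k] - knots[i]
    if d1 > 0:
        left = (u - knots[i]) / d1 * basis(i, k - 1, u, knots)
    d2 = knots[i + k + 1] - knots[i + 1]
    if d2 > 0:
        right = (knots[i + k + 1] - u) / d2 * basis(i + 1, k - 1, u, knots)
    return left + right


def bspline_point(u: float, control, k: int, knots) -> float:
    """Evaluate p(u) = sum_i d_i * N_{i,k}(u) for scalar control points d_i."""
    return sum(d * basis(i, k, u, knots) for i, d in enumerate(control))


# Hypothetical example: cubic curve, 5 control points, clamped knot vector.
knots = [0, 0, 0, 0, 1, 2, 2, 2, 2]
weights = [if_weight(n) for n in (4, 12, 20, 40)]
value = bspline_point(0.5, [5.0, 1.0, 2.0, 3.0, 4.0], 3, knots)
```

With n + 1 control points and degree k the knot vector needs n + k + 2 entries, which is why nine knots accompany the five control points above.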
In one implementation, step S3 specifically includes:
step S3.1: a derivative function f' (x) of the IF position distribution curve is obtained, and an array x having intervals L in a defined domain is obtainediL is smaller, and solve for f' (x)i) Then f' (x) is screened outi)×f′(xi+1) X is less than or equal to 0iForming an array, and then combining xiThe array is used as a starting point of a Newton CG method, and an extreme point is solved;
step S3.2: sorting the IF from large to small by adopting a quick sorting method to obtain sorted IF for the IF _ l in each IF distribution interval, then selecting unmarked IF from the first IF, calculating Levenshtein distance backwards in sequence, marking the IF as a known class when the Levenshtein distance value meets a threshold value, adding the IF into a new class IF not, repeatedly executing the step of marking the class by adopting the method based on the Levenshtein distance until all the IF are classified, and finally outputting the IF distribution interval, the IF type and the IF number.
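The class marking of step S3.2 can be sketched as a greedy grouping around representatives. The threshold value of 2, the function names and the sample fields below are assumptions for illustration; the patent only states that the Levenshtein distance must satisfy a threshold.

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify_ifs(ifs, threshold=2):
    """Group IFs into classes. IFs are sorted by length, longest first;
    each unmarked IF becomes the representative of a new class and absorbs
    every later unmarked IF within `threshold` edits of it."""
    ordered = sorted(ifs, key=len, reverse=True)
    marked = [False] * len(ordered)
    classes = []
    for p, rep in enumerate(ordered):
        if marked[p]:
            continue
        marked[p] = True
        members = [rep]
        for q in range(p + 1, len(ordered)):
            if not marked[q] and levenshtein(rep, ordered[q]) <= threshold:
                marked[q] = True
                members.append(ordered[q])
        classes.append((rep, members))
    return classes


sample = [b"HTTP/1.1", b"HTTP/1.0", b"SSH-2.0", b"HTTP/1.1", b"SSH-2.0-"]
counts = [(rep, len(members)) for rep, members in classify_ifs(sample)]
```

The list `counts` then carries the IF type (its representative) and the number of IFs of that type for one distribution interval.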
In one implementation, step S4 specifically includes:
step S4.1: taking the IF distribution intervals, the IF types and the IF counts as input, selecting the most numerous IF in each IF distribution interval, marking all tracks according to the selected IFs and their IF distribution intervals, and converting the tracks into IF distribution vectors IFVector;
step S4.2: inputting the IF distribution vectors IFVector and obtaining the second track clusters by the K-means method.
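A minimal sketch of steps S4.1 and S4.2 follows. The binary encoding (1.0 if the dominant IF of an interval occurs inside that interval, else 0.0), the naive choice of the first k vectors as initial centres, and all names and sample data are assumptions for illustration; the patent does not fix the vector layout or the K-means initialisation.

```python
def if_vector(flow: bytes, dominant_ifs):
    """Convert a flow into an IF distribution vector.
    dominant_ifs is a list of (start, end, field): the most numerous IF
    of each distribution interval [start, end)."""
    vec = []
    for start, end, field in dominant_ifs:
        pos = flow.find(field, start)
        vec.append(1.0 if pos != -1 and pos < end else 0.0)
    return vec


def kmeans(vectors, k, iters=20):
    """Plain Lloyd iterations; returns one cluster label per vector."""
    centers = [list(v) for v in vectors[:k]]  # naive initialisation (assumption)
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


flows = [b"GET / HTTP/1.1\r\n", b"SSH-2.0-OpenSSH\r\n",
         b"GET /x HTTP/1.1\r\n", b"SSH-2.0-dropbear\r\n"]
dominant = [(0, 4, b"GET "), (0, 4, b"SSH-")]
vectors = [if_vector(f, dominant) for f in flows]
labels = kmeans(vectors, 2)
```

In practice a library K-means with a better initialisation would normally replace the hand-written loop; it is written out here only to keep the example self-contained.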
In one implementation, step S5 specifically includes:
S5.1: taking the second track cluster generated in step S4 as input and executing the track segmentation method of step S1 to obtain the corresponding fixed fields IF, the length IF_l of each fixed field and the position information IF_s of each fixed field;
S5.2: taking the fixed fields IF, the length IF_l of each fixed field and the position information IF_s of each fixed field obtained in step S5.1 as input and executing the IF distribution fitting method of step S2 to obtain the IF position distribution curve;
S5.3: taking the position distribution curve of step S5.2 as input, executing the IF classification method of step S3, and outputting all the IFs in each IF distribution interval;
S5.4: using a separator inference method with all the IFs in each distribution interval as input, marking the corresponding tracks with the IFs, counting the characters that appear at the heads and tails of keywords to infer the separators, and extracting the keywords in the IFs using the separators;
S5.5: storing the keywords extracted from the IFs and the inferred separators as the signature database for track flow classification.
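One possible reading of the separator inference in S5.4 is sketched below: a non-alphanumeric byte that appears at the head or tail of a sufficiently large share of the IFs is taken to be a separator, and keywords are whatever remains after splitting on the separators. The share threshold of 0.5, the restriction to single-byte separators, the function names and the sample IFs are all assumptions for illustration and are not taken from the patent.

```python
from collections import Counter


def infer_separators(ifs, min_share=0.5):
    """Return the set of byte values that occur at the head or tail of at
    least `min_share` of the IFs and are not ASCII letters or digits."""
    counts = Counter()
    for field in ifs:
        if not field:
            continue
        for b in {field[0], field[-1]}:
            if not bytes([b]).isalnum():
                counts[b] += 1
    return {b for b, c in counts.items() if c / len(ifs) >= min_share}


def extract_keywords(ifs, separators):
    """Split every IF on the separator bytes and collect non-empty tokens."""
    keywords = set()
    for field in ifs:
        token = bytearray()
        for b in field:
            if b in separators:
                if token:
                    keywords.add(bytes(token))
                    token = bytearray()
            else:
                token.append(b)
        if token:
            keywords.add(bytes(token))
    return keywords


ifs = [b"USER ", b"PASS ", b" OK\n", b"\nQUIT\n"]
separators = infer_separators(ifs)
signature_db = {"keywords": extract_keywords(ifs, separators),
                "separators": separators}
```

The dictionary `signature_db` is a hypothetical in-memory stand-in for the signature database of S5.5.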
Based on the same inventive concept, the second aspect of the present invention provides a keyword-based network trajectory classification apparatus, including:
a track segmentation module, configured to preliminarily classify an input mixed protocol track by a K-means method based on flow statistical features to obtain K first track clusters; in each cluster, to arrange the track flows in reverse order of length, compare protocol tracks of similar length pairwise by the Needleman-Wunsch method, divide the tracks into fixed fields IF and variable fields VF, and calculate the length IF_l and the position information IF_s of each fixed field;
an IF distribution solving module, configured to weight the length IF_l of the fixed field by an IF_l weighting method to obtain the weight IF_w of the IF, and to perform curve fitting with IF_w and IF_s as input by a curve fitting method to obtain an IF position distribution curve;
an IF classification module, configured to solve the extreme values of the fitted IF position distribution curve by a curve extreme value solving method, extract the IFs in each extreme value interval of the curve, and perform IF classification statistics based on the Levenshtein distance; and then to output the types of IF contained in each extreme value interval and the number of IFs of each type;
a track clustering module, configured to mark all tracks according to the most numerous IF in each extreme value interval, cluster by the K-means method, and output second track clusters;
a keyword inference module, configured to select the track cluster corresponding to a target security protocol from the second track clusters of the track clustering module; then, taking the track cluster of the target security protocol as input, to pass it through the track segmentation module, the IF distribution solving module and the IF classification module in turn to obtain the types of IF contained in the extreme value intervals corresponding to the target security protocol and the number of IFs of each type; and to infer separators by a separator inference method that compares the heads and tails of adjacent IFs, separate keywords from the IFs according to the separators, and finally form a signature database;
and a track classification module, configured to mark the track flow to be processed using the formed signature database, convert it into vectors, and classify the vectors by using them as the input of the K-means method.
In one implementation, the trajectory segmentation module is specifically configured to perform the following steps:
step S1.1: inputting the mixed protocol track <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
step S1.2: taking <Cluster, FlowID, Flow> as the input of a track reverse ordering method: for the track flows in each first track cluster Cluster, calculating the flow length Flow_length, then sorting the flows by Flow_length from small to large using quicksort to form a queue Sequence<Cluster, FlowID, Flow>;
step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then measuring the length of each IF to obtain IF_l and the distance from the IF to the start of Flow_i to obtain IF_s.
Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.
Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect when executing the program.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a keyword-based network track classification method, which takes a mixed network track as input and outputs the type of a target protocol track and a signature database. First, a track segmentation method divides the tracks into fixed fields IF and variable fields VF and calculates the length IF_l and the position information IF_s of each fixed field; curve fitting is then carried out by an IF distribution fitting method to obtain an IF position distribution curve; an IF classification method obtains the types of IF contained in each extreme value interval and the number of IFs of each type, a track clustering method outputs second track clusters, a keyword inference method infers separators, keywords are separated from the IFs according to the separators, and a signature database is formed; finally, the track flow to be processed is marked using the formed signature database and converted into vectors, and the vectors are classified by using them as the input of the K-means method.
Compared with the deep packet inspection method in the prior art, the method provided by the invention first classifies by statistical features through the track segmentation method, obtaining mixed track sets whose classification is not yet accurate. Then, through the IF distribution fitting method, the IF classification method and the track clustering method, the IFs in the peak intervals are marked and converted into vectors: from the mixed track sets output by the K-means method in step S1, the IF distribution curve yields the types and numbers of IF in each extreme value interval; all tracks are then marked according to the most numerous IF in each extreme value interval and clustered by the K-means method, and the clustering result is output. That is, a more accurate classification result is obtained through step S4. The invention further infers separators by the keyword inference method in step S5; through the peak-interval IFs and the K-means method, clusters of high purity (a cluster here refers to a class) can be output, which improves the accuracy of classification.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow diagram of a method for keyword-based network trajectory classification in one embodiment;
FIG. 2 is a general flow diagram of a keyword-based network trajectory classification method in one embodiment;
FIG. 3 is a flowchart illustrating the track segmentation method in step S1;
FIG. 4 is a schematic flow chart of the IF distribution fitting method in step S2;
FIG. 5 is a flowchart illustrating the IF classification method in step S3;
FIG. 6 is a flowchart illustrating the trajectory clustering method in step S4;
FIG. 7 is a flowchart illustrating the keyword inference method in step S5;
FIG. 8 is a schematic diagram of a trajectory segmentation algorithm;
FIG. 9 is a flow chart of a noise cancellation method;
FIG. 10 is a code diagram of an IF classification statistics algorithm;
FIG. 11 is a block diagram of an apparatus for keyword based network trajectory classification in one embodiment;
FIG. 12 is a block diagram of a computer-readable storage medium in an embodiment of the invention;
FIG. 13 is a block diagram of a computer device in an embodiment of the present invention.
Detailed Description
The embodiment of the invention provides a keyword-based network track classification method, which classifies the network tracks to be processed using a signature database formed by a track segmentation method, an IF distribution fitting method, an IF classification method, a track clustering method and a keyword inference method. That is, the invention forms the signature database by extracting the IFs in clusters of high purity, then uses only the signature database for marking and uses K-means for classification, thereby improving the classification accuracy, greatly improving the classification efficiency, and solving the technical problems of heavy workload and low efficiency of the prior-art method.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment provides a network track classification method based on keywords, please refer to fig. 1, and the method includes:
Step S1: preliminarily classifying an input mixed protocol track by a K-means method based on flow statistical features to obtain K first track clusters; in each cluster, arranging the track flows in reverse order of length, comparing protocol tracks of similar length pairwise by the Needleman-Wunsch method, dividing the tracks into fixed fields IF and variable fields VF, and calculating the length IF_l and the position information IF_s of each fixed field.
Specifically, in step S1, preliminarily classified first track clusters are obtained by a track segmentation method. The input of the track segmentation method is a mixed network track flow. First, the flows are initially divided into clusters by the K-means method based on flow statistical features; then, within each cluster, the track flows are arranged in reverse order of length and protocol tracks of similar length are compared pairwise by the Needleman-Wunsch method, which marks the fixed fields in the tracks; finally, each track is divided into fixed fields IF and variable fields VF, and the length IF_l and the position information IF_s of each IF are calculated, where the position information IF_s of a fixed field is the distance from the fixed field to the first character of the track. A schematic flow chart of the track segmentation method is shown in FIG. 3.
Step S2: weighting the length IF_l of the fixed field by an IF_l weighting method to obtain the weight IF_w of the IF; performing curve fitting with IF_w and IF_s as input by a curve fitting method to obtain an IF position distribution curve.
Specifically, in step S2, an IF position distribution curve is obtained from the output of step S1 by an IF distribution fitting method. As shown in FIG. 4, the IF distribution fitting method first takes IF_l and IF_s as input; then weights IF_l by the IF weighting method; optionally, a noise elimination method may be used to remove background noise; finally, curve fitting is performed on the weighted IFs by a curve fitting method, and the IF distribution curve is output.
Step S3: obtaining an IF position distribution curve according to fitting, solving a curve extreme value by adopting a curve extreme value solving method, extracting IF in each extreme value interval of the curve, and carrying out IF classification statistics based on Levenshtein distance; and then outputting the type of the IF contained in each extremum interval and the number of various types of IF.
Specifically, referring to FIG. 5, in step S3 the IF classification method is used, on the basis of step S2, to obtain the types of IF contained in each extreme value interval and the number of IFs of each type. Each peak interval contains several different IFs: because the input tracks are of different types, different tracks contribute different IFs to the same peak interval. Step S3 outputs all the IFs corresponding to each extreme value interval, together with their IF_l and IF_s.
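The extreme value search behind step S3 (screen a grid for sign changes of the derivative, then refine each candidate) can be sketched as follows. The patent names the Newton-CG method; for a one-dimensional curve this sketch substitutes a plain Newton iteration on f'(x) = 0 with a finite-difference second derivative, which is an assumption, as are the grid step, the tolerances, the function name and the cosine test derivative.

```python
import math


def find_extrema(df, lo, hi, step=0.1, tol=1e-10, max_iter=50, h=1e-6):
    """Locate extreme points of f on [lo, hi] given its derivative df.
    Grid points where df changes sign (or touches zero) are used as
    starting points for Newton's iteration on df(x) = 0."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    starts = [grid[i] for i in range(n) if df(grid[i]) * df(grid[i + 1]) <= 0]
    roots = []
    for x in starts:
        for _ in range(max_iter):
            d2 = (df(x + h) - df(x - h)) / (2 * h)  # numerical f''(x)
            if d2 == 0:
                break
            nxt = x - df(x) / d2
            converged = abs(nxt - x) < tol
            x = nxt
            if converged:
                break
        # Adjacent starting points may converge to the same extreme point.
        if not any(abs(x - r) < 1e-6 for r in roots):
            roots.append(x)
    return roots


# Hypothetical check: f(x) = sin(x), so f'(x) = cos(x) on [0, 7].
extrema = find_extrema(math.cos, 0.0, 7.0)
```

The intervals between consecutive extreme points then serve as the extreme value intervals from which the IFs are extracted.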
Step S4: and marking all tracks according to the IF with the largest number in each extreme value, clustering by adopting a K-Means method, and outputting a second track cluster.
Specifically, the track clustering method takes as input the sets of IFs in each distribution interval and selects the most numerous IF in each extreme value interval; it then marks all tracks according to the selected IFs and converts the tracks into vectors; finally, it clusters the vectors by the K-means method and outputs the clustering result. A schematic diagram of the track clustering method is shown in FIG. 6.
Since a network track contains many IFs and VFs, the IF distribution differs from one network track to another. Step S1 uses statistical features such as length, offset and arrival time. These are only features of the track as a whole; if it is desired to identify which class a track belongs to, the features of the IFs contained in the track are required. The IF features include the type of IF, IF_l (the IF length) and IF_s (the distance from the IF to the first character of the track).
Step S5: selecting the track cluster corresponding to the target security protocol from the second track clusters of step S4; then, taking the track cluster of the target security protocol as input, sequentially executing steps S1 to S3 to obtain the types of IF contained in the extremum intervals corresponding to the target security protocol and the number of IFs of each type; inferring the separators by comparing the heads and tails of adjacent IFs using a separator inference method; separating the keywords from the IFs according to the separators; and finally forming a signature database.
Specifically, steps S1 to S3 are executed in sequence with the track cluster of the target security protocol as input; that is, the track segmentation method of step S1, the IF distribution fitting method of step S2, and the IF classification method of step S3 are performed in turn. A schematic diagram of the keyword inference method is shown in fig. 7. First, the track clusters are input and the track cluster of the target security protocol is selected. Second, the selected track cluster of the security protocol is input into the track segmentation method, and the counts of the various IFs at the peaks are obtained through the IF distribution fitting method and the IF classification method. Finally, the separators are inferred by comparing the heads and tails of adjacent IFs using the separator inference method, the keywords are separated from the IFs, and the keywords and separators together form a new signature database.
Step S6: marking the track flows to be processed using the resulting signature database, converting them into vectors, and classifying the vectors by using them as the input of the K-means method.
Specifically, once the signature database has been formed, it can be used to mark the track flows, which are then converted into vectors and classified by using the vectors as the input of the K-means method. Because the signature database contains features that discriminate between classes, subsequent classification can be carried out through it, which improves both the classification accuracy and the classification efficiency.
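As a rough illustration of the marking in step S6, the sketch below converts a track flow into a binary vector with one component per signature keyword. The function name `mark_flow`, the 0/1 encoding, and the sample keywords and flows are assumptions made for illustration; the text above does not specify the layout of the vector.

```python
def mark_flow(flow, signatures):
    """Return a 0/1 vector: component i is 1 if signatures[i] occurs in the flow."""
    return [1 if sig in flow else 0 for sig in signatures]


# Hypothetical signature database (keywords of the kind inferred in step S5).
signatures = ["report", "login", "status"]

# Hypothetical track flows, represented here as plain strings.
flows = ["report/dc00321", "login/user42", "ping"]

vectors = [mark_flow(flow, signatures) for flow in flows]
```

Vectors of this kind could then be passed to any K-means implementation as described in step S6.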
In general, please refer to fig. 2, which shows the overall flow of the keyword-based network track classification method according to an embodiment, comprising steps S1 to S5.
In one embodiment, step S1 specifically includes:
step S1.1: inputting mixed protocol tracks <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
step S1.2: taking <Cluster, FlowID, Flow> as the input of the track reverse-ordering method, calculating the length Flow_length of each track flow in each first track cluster Cluster, and then sorting the flows by Flow_length from small to large using the quicksort method to form a queue Sequence<Cluster, FlowID, Flow>;
step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then computing the length of each IF to obtain IF_l and the distance from the IF to the starting point of Flow_i to obtain IF_s.
Specifically, the flow statistical feature set contains 18 groups of features, such as the average packet length, the standard deviation of the inter-packet arrival time, the total flow length (in bytes and/or packets), and the Fourier transform of the inter-packet arrival time. The flow-based statistical feature set is shown in table 1. The track flows are stored in files in PCAP format; the 18 groups of statistical features of each flow can be obtained by simple statistical methods, and the flow is thereby converted into a feature vector <FlowID, Vector>, where Vector contains the 18 groups of feature values shown in table 1.
TABLE 1 statistical feature set of streams
Figure GDA0003416445580000101
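A few of the flow statistical features named above can be computed as in the following minimal sketch. It assumes that each flow has already been parsed from its PCAP file into a list of (arrival time, packet length) pairs; the function name `flow_features`, the use of the population standard deviation, and the sample values are illustrative assumptions, and only four of the 18 feature groups are shown.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """packets: list of (arrival_time_seconds, length_bytes) in arrival order."""
    times = [t for t, _ in packets]
    lengths = [n for _, n in packets]
    # Inter-packet arrival times between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return [
        mean(lengths),                  # average packet length
        pstdev(gaps) if gaps else 0.0,  # std. dev. of inter-packet arrival time
        sum(lengths),                   # total flow length in bytes
        len(packets),                   # total flow length in packets
    ]


# Hypothetical flow with three packets.
vector = flow_features([(0.0, 100), (0.5, 200), (1.5, 300)])
```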
The input of the K-means clustering method is <FlowID, Vector>. The number of classes K can be preset; the clustering result is then observed to determine the value of K, and the track clusters <Cluster, FlowID> are obtained. Any existing K-means clustering method can be used, so the specific process is not described in detail.
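For readers unfamiliar with K-means, a minimal sketch of the standard Lloyd iteration is given below. It is not taken from the patent: using the first k vectors as initial centroids is a simplification chosen to keep the example deterministic, and the two-dimensional sample vectors are hypothetical stand-ins for the 18-dimensional feature vectors.

```python
def kmeans(vectors, k, max_iter=100):
    """Plain Lloyd's k-means; the first k vectors serve as initial centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [None] * len(vectors)
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


labels = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
```

In practice K would be tuned by observing the clustering result, as described above.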
The idea of the invention is that security protocol tracks of similar length pass through similar states, so the positions of the keywords they generate are more concentrated. The invention sorts the tracks in reverse order of length, groups adjacent tracks in pairs, compares security protocol tracks of similar total length using the Needleman-Wunsch method while preferentially aligning longer keywords, and finally outputs the IFs and counts the keyword positions. The specific algorithm is shown in fig. 8.
Specifically, the Needleman-Wunsch (N-W) method is widely used in bioinformatics. It is one of the earliest methods for comparing biological sequences using dynamic programming, and it searches for homology between proteins or genes by aligning their sequences. The N-W method is adopted here because it can handle large amounts of structured and complex data.
In the embodiment of the invention, the input of the message segmentation method is a set of security protocol flows, each of which contains several keywords, separators, and variable fields. For example, in report/dc00321, report is a keyword, / is a separator, and dc00321 is a variable field. The N-W method, commonly used in DNA or protein alignment studies, fills a two-dimensional matrix using similarity rewards and gap penalties, selects a backtracking path, and then outputs the alignment result. For example, aligning report/dc00321 with report/dc04369 yields two parts: the fixed field IF, report/dc0, and the variable fields VF, 0321 and 4369. Because the separator does not change and the first three characters of the variable field change little, the N-W method cannot tell them apart from the keyword. Even if a three-way alignment is used, for example of report/dc00321, report/dc04369, and report/da45813, the IF of the alignment result is still report/dc0; multi-way alignment alone can only accommodate the majority of identical sequences.
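The alignment in this example can be reproduced with the following sketch of the classical N-W method. It uses the textbook scoring (match +1, mismatch -1, gap -1), not the improved scoring system of the invention; the function names and the `min_len` filter, which discards matched runs shorter than two characters, are assumptions made for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return (char_a, char_b) pairs, None marking a gap."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i][j - 1] + gap,
                              score[i - 1][j] + gap)
    # Backtrack from the bottom-right cell, preferring the diagonal move.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                pairs.append((a[i - 1], b[j - 1]))
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    return pairs[::-1]


def fixed_fields(pairs, min_len=2):
    """Return (IF, IF_s, IF_l) for each run of matches of at least min_len characters."""
    fields, run, start, pos_a = [], [], 0, 0
    for ca, cb in pairs:
        if ca is not None and ca == cb:
            if not run:
                start = pos_a  # offset of the run within the first flow
            run.append(ca)
        else:
            if len(run) >= min_len:
                fields.append(("".join(run), start, len(run)))
            run = []
        if ca is not None:
            pos_a += 1
    if len(run) >= min_len:
        fields.append(("".join(run), start, len(run)))
    return fields


pairs = needleman_wunsch("report/dc00321", "report/dc04369")
fields = fixed_fields(pairs)  # [("report/dc0", 0, 10)]
```

With this scoring the two flows align position by position. The isolated match on the character 3 inside the variable fields is dropped by the `min_len` filter, leaving report/dc0 with IF_s = 0 and IF_l = 10.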
The track set FlowSet is input and sorted in reverse order of flow length to obtain DFlowSet; DFlow1 and DFlow2 are then taken from DFlowSet one pair at a time until DFlowSet is empty; each pair is input into the Needleman-Wunsch method to obtain the IFs, and IF_l, IF_s, and VF are recorded.
Preferably, the invention also improves the Needleman-Wunsch method to raise the classification accuracy. The scoring system of the improved Needleman-Wunsch method follows three principles: match as many characters as possible, align consecutive fields preferentially, and generate gaps only in correspondence with the first sequence. $M_{ij}$ is the value in row i, column j of the scoring matrix and is calculated by formula (2), where w is the gap penalty and $S_{ij}$ is the similarity between the i-th character of message a and the j-th character of message b. Backtracking starts from $M_{ij}$ and, at each step, selects the largest of the upper, left, and upper-left cells; if cells of equal value are encountered, the upper-left cell is preferred. Through the running-match score, $S_{ij}$ lets the longest and most similar fields be matched together preferentially, and gaps are generated only in correspondence with the first sequence so that the positions of the keywords can be output directly from the matching result. The improved scoring rule for $S_{ij}$ is shown in formula (3).
$$M_{ij} = \max\{\, M_{i-1,j-1} + S_{ij},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,\} \qquad (2)$$
Figure GDA0003416445580000121
In formula (3), $a_{i-1}$ denotes the (i-1)-th character of message a, and $b_{j-1}$ likewise denotes the (j-1)-th character of message b; k and p are constants that control the running-match score.
In one embodiment, step S2 specifically includes:
step S2.1: weighting the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF, wherein the IF_l weighting method assigns weights as follows:
Figure GDA0003416445580000122
the larger the length IF_l of the fixed field, the greater the probability that a keyword appears in the fixed field and the larger the assigned weight: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; when it is between 9 bytes and 16 bytes, the weight is 1; when it is between 17 bytes and 24 bytes, the weight is 2; and when it is greater than or equal to 25 bytes, the weight is 3;
step S2.2: eliminating background noise from the weights calculated in step S2.1 using the noise elimination method to obtain the corrected IF_l weights;
step S2.3: fitting the corrected IF_l weights and the length of the fixed field with a B-spline curve of a preset degree to obtain the IF position distribution curve, wherein the curve equation of the B-spline is shown in formula (1):
Figure GDA0003416445580000123
in formula (1), $d_i$ ($i = 0, 1, \ldots, n$) denotes the control points, and $N_{i,k}(u)$ ($i = 0, 1, \ldots, n$) denotes the k-th order normalized B-spline basis functions.
Specifically, the larger the length (IF_l) of a fixed field, the greater the probability that a keyword appears in it, and the larger the assigned weight. When the length of the fixed field reaches 32 bytes, indicating that 4 consecutive characters appear in both tracks, the weight is set to its maximum of 3.
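The weight assignment of step S2.1 can be written as a small function. The thresholds below are the ones stated in step S2.1; the function name `if_weight` and the handling of non-positive lengths (weight 0) are assumptions for illustration.

```python
def if_weight(if_l):
    """Map the fixed-field length IF_l to the weight IF_w of step S2.1."""
    if if_l >= 25:
        return 3
    if if_l >= 17:
        return 2
    if if_l >= 9:
        return 1
    return 0  # lengths 1 to 8 (and, by assumption, anything shorter)


weights = [if_weight(n) for n in (1, 8, 9, 16, 17, 24, 25, 32)]
```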
In a specific implementation, the inherent limitations of the Needleman-Wunsch method can produce false matches, which affect the experimental results; a noise elimination method is therefore applied, and its flow chart is shown in fig. 9. The method first deletes the weights of the IFs whose weight is 1, and then determines the data type of each IF: if the IF is a base-10 number, IF_w-k is 0.192/IF_w is IF_w, where the value of k is 2; if the IF is a base-16 number,
Figure GDA0003416445580000131
where k is 1.5. Empirical calculation gives 0.192 as the average noise generated by two random base-10 numbers and 0.107 as the average noise generated by two random base-16 numbers.
In step S2.3, curve fitting means computing, from a set of discrete points, a distribution function that is convenient for computer processing; such a function can eliminate erroneous data and interference, compensate for missing data, and predict future trends.
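To make the B-spline curve equation concrete, the sketch below evaluates a curve of the form "sum of control points times basis functions" using the standard Cox-de Boor recursion. It only evaluates a curve for given control points; how the control points are fitted to the weighted IF data is not shown. Treating k as the polynomial degree, the clamped uniform knot vector, and the sample control points are all assumptions for illustration.

```python
def basis(i, k, u, knots):
    """Cox-de Boor recursion for the B-spline basis function N_{i,k}(u) of degree k."""
    if k == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    value = 0.0
    left = knots[i + k] - knots[i]
    if left > 0:
        value += (u - knots[i]) / left * basis(i, k - 1, u, knots)
    right = knots[i + k + 1] - knots[i + 1]
    if right > 0:
        value += (knots[i + k + 1] - u) / right * basis(i + 1, k - 1, u, knots)
    return value


def bspline(u, control, k, knots):
    """Evaluate the curve as the sum of d_i * N_{i,k}(u) over all control points d_i."""
    return sum(d * basis(i, k, u, knots) for i, d in enumerate(control))


def clamped_knots(n_control, k):
    """Clamped uniform knot vector on [0, 1] with n_control + k + 1 knots."""
    inner = n_control - k - 1
    return ([0.0] * (k + 1)
            + [(j + 1) / (inner + 1) for j in range(inner)]
            + [1.0] * (k + 1))


# Hypothetical control points (for example, smoothed weights along IF positions).
control = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0]
knots = clamped_knots(len(control), 3)
curve = [bspline(step / 10, control, 3, knots) for step in range(10)]
```

Because the half-open intervals in `basis` exclude the right end of the knot range, the curve is sampled on [0, 1) here.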
In one embodiment, step S3 specifically includes:
step S3.1: obtaining the derivative $f'(x)$ of the IF position distribution curve; taking an array $x_i$ of points spaced at intervals of L within the domain of definition, where L is small, and evaluating $f'(x_i)$; then selecting the $x_i$ that satisfy $f'(x_i) \times f'(x_{i+1}) \le 0$ to form an array; and then using this array of $x_i$ as the starting points of the Newton-CG method to solve for the extreme points;
step S3.2: for the IF_l values in each IF distribution interval, sorting the IFs from large to small using the quicksort method to obtain the sorted IFs; then, starting from the first IF, selecting an unmarked IF and calculating the Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance meets the threshold, marking that IF as belonging to the known class, and otherwise adding it to a new class; repeating this Levenshtein-distance-based marking step until all the IFs are classified; and finally outputting the IF distribution intervals, the IF types, and the IF counts.
Specifically, the curve fitted by the B-spline is a complex polynomial curve, and the Newton-CG method is an optimization algorithm used to find the extreme values of the curve.
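The screening and refinement of step S3.1 can be sketched as follows. As a simplification, a plain one-dimensional Newton iteration on the derivative stands in for the Newton-CG method, and the derivatives are supplied analytically; the function name `find_extrema`, the tolerance, and the test curve f(x) = x^3 - 3x are assumptions for illustration.

```python
def find_extrema(df, d2f, lo, hi, step, tol=1e-12, max_iter=50):
    """df, d2f: first and second derivative of the curve on [lo, hi]."""
    xs = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    # Keep every grid point whose derivative changes sign before the next point.
    starts = [x0 for x0, x1 in zip(xs, xs[1:]) if df(x0) * df(x1) <= 0]
    found = {}
    for x in starts:
        # Newton iteration on f'(x) = 0, starting from the screened point.
        for _ in range(max_iter):
            curvature = d2f(x)
            if curvature == 0:
                break
            delta = df(x) / curvature
            x -= delta
            if abs(delta) < tol:
                break
        kind = "max" if d2f(x) < 0 else "min"
        found[round(x, 6)] = kind  # rounding merges duplicate solutions
    return sorted(found.items())


# Hypothetical curve f(x) = x**3 - 3*x, with f'(x) = 3x^2 - 3 and f''(x) = 6x.
extrema = find_extrema(lambda x: 3 * x * x - 3, lambda x: 6 * x, -2.0, 2.0, 0.3)
```

For this test curve the sketch finds a maximum at x = -1 and a minimum at x = 1.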
In the IF classification statistical method of step S3.2, $x_{a,i}$ denotes a minimum point and $x_{b,i+1}$ denotes a maximum point. Experiments show that the IFs are concentrated in the interval
Figure GDA0003416445580000132
In the present invention, this interval is called the IF distribution interval. The IF classification statistical method takes the IF distribution intervals, IF_s, IF_l, and the IFs as input. First, for the IF_l values in each IF distribution interval, the IFs are sorted from large to small using the quicksort method to obtain the sorted IFs. Second, starting from the first IF, an unmarked IF is selected and the Levenshtein distance to each subsequent IF is calculated in turn; when the Levenshtein distance meets the threshold, that IF is considered to belong to the known class and is marked accordingly, and otherwise it is added to a new class. Third, the second step is repeated until all the IFs are classified. Finally, the IF distribution intervals, the IF types, and the IF counts are output. The algorithm of the IF classification statistical method is shown in fig. 10.
The Levenshtein distance is one of the common edit distances. The Levenshtein distance method takes two character strings A and B as input, converts string A into string B by operating on one character at a time, and outputs the minimum number of operations required.
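The classification procedure described above can be sketched as follows. The Levenshtein function is the standard dynamic-programming formulation; interpreting "meets the threshold" as "distance less than or equal to the threshold", the function names, the threshold value, and the sample IFs are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]


def classify_ifs(ifs, threshold):
    """Greedy grouping: each unmarked IF opens a class and absorbs later close IFs."""
    ordered = sorted(ifs, key=len, reverse=True)  # longest IF first
    labels = [None] * len(ordered)
    classes = []
    for i, representative in enumerate(ordered):
        if labels[i] is not None:
            continue
        labels[i] = len(classes)
        members = [representative]
        for j in range(i + 1, len(ordered)):
            if labels[j] is None and levenshtein(representative, ordered[j]) <= threshold:
                labels[j] = labels[i]
                members.append(ordered[j])
        classes.append(members)
    return classes


# Hypothetical IFs found in one IF distribution interval.
classes = classify_ifs(["report/dc0", "report/dc1", "login/user", "report/da4"], 2)
counts = [len(members) for members in classes]
```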
In one embodiment, step S4 specifically includes:
step S4.1: taking the IF distribution intervals, the IF types, and the IF counts as input, selecting the most numerous IF in each IF distribution interval, marking all tracks according to the selected IFs and the IF distribution intervals, and converting the tracks into IF distribution vectors IFVector;
step S4.2: inputting the IF distribution vectors IFVector and obtaining the second track clusters using the K-means method.
In one embodiment, step S5 specifically includes:
step S5.1: taking the second track cluster generated in step S4 as input, executing the track segmentation method of step S1 to obtain the corresponding fixed fields IF, the length IF_l of each fixed field, and the position information IF_s of each fixed field;
step S5.2: taking the fixed fields IF, the lengths IF_l, and the position information IF_s obtained in step S5.1 as input, executing the IF distribution fitting method of step S2 to obtain the IF position distribution curve;
step S5.3: taking the position distribution curve obtained in step S5.2 as input, executing the IF classification method of step S3, and outputting all the IFs in each IF distribution interval;
step S5.4: using the separator inference method, inputting all the IFs in each distribution interval, marking the corresponding tracks with the IFs, inferring the separators by counting how often they appear at the head and tail of the keywords, and extracting the keywords from the IFs with the help of the separators;
step S5.5: storing the keywords extracted from the IFs and the inferred separators as the signature database for track flow classification.
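One possible reading of the separator inference in step S5.4 is sketched below: non-alphanumeric characters that occur repeatedly at the head or tail of the IFs are taken as separators, and the IFs are then split on them to obtain keywords. The restriction to non-alphanumeric characters, the `min_count` cut-off, the function names, and the sample IFs are all assumptions for illustration; the patent does not state these details.

```python
import re
from collections import Counter


def infer_separators(ifs, min_count=2):
    """Count non-alphanumeric characters at the head and tail of each IF."""
    counts = Counter()
    for s in ifs:
        ends = {0, len(s) - 1} if s else set()  # one position for 1-char IFs
        for pos in ends:
            if not s[pos].isalnum():
                counts[s[pos]] += 1
    return {ch for ch, count in counts.items() if count >= min_count}


def extract_keywords(ifs, separators):
    """Split every IF on the inferred separators and collect the non-empty parts."""
    if not separators:
        return sorted(set(ifs))
    pattern = "[" + "".join(re.escape(ch) for ch in separators) + "]"
    words = set()
    for s in ifs:
        words.update(part for part in re.split(pattern, s) if part)
    return sorted(words)


# Hypothetical IFs output by step S5.3.
ifs = ["report/", "/login", "status/", "user:"]
separators = infer_separators(ifs)
keywords = extract_keywords(ifs, separators)
signature_database = {"keywords": keywords, "separators": sorted(separators)}
```

In this example the colon occurs only once and therefore stays part of the keyword user: under the assumed cut-off.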
Example two
The present embodiment provides a keyword-based network track classification device which, referring to fig. 11, includes:
the track segmentation module 201, configured to preliminarily classify the input mixed protocol tracks using a K-means method based on flow statistical features to obtain K first track clusters; and, within each cluster, to sort the track flows in reverse order of length, compare protocol tracks of similar length pairwise using the Needleman-Wunsch method, divide the tracks into fixed fields IF and variable fields VF, and calculate the length IF_l and the position information IF_s of each fixed field;
the IF distribution solving module 202, configured to weight the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF, and to perform curve fitting on the IFs with IF_w and IF_s as input to obtain the IF position distribution curve;
the IF classification module 203, configured to solve for the extrema of the fitted IF position distribution curve using a curve extremum solving method, extract the IFs in each extremum interval of the curve, perform IF classification statistics based on the Levenshtein distance, and then output the types of IF contained in each extremum interval and the number of IFs of each type;
the track clustering module 204, configured to mark all tracks according to the most numerous IF in each extremum interval, cluster them using the K-means method, and output a second track cluster;
the keyword inference module 205, configured to select the track cluster corresponding to the target security protocol from the second track clusters of the track clustering module; then, taking the track cluster of the target security protocol as input, to pass it sequentially through the track segmentation module, the IF distribution solving module, and the IF classification module to obtain the types of IF contained in the extremum intervals corresponding to the target security protocol and the number of IFs of each type; to infer the separators by comparing the heads and tails of adjacent IFs using a separator inference method; to separate the keywords from the IFs according to the separators; and finally to form a signature database;
and the track classification module 206, configured to mark the track flows to be processed using the resulting signature database, convert them into vectors, and classify the vectors by using them as the input of the K-means method.
In one implementation, the trajectory segmentation module 201 is specifically configured to perform the following steps:
step S1.1: inputting mixed protocol tracks <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
step S1.2: taking <Cluster, FlowID, Flow> as the input of the track reverse-ordering method, calculating the length Flow_length of each track flow in each first track cluster Cluster, and then sorting the flows by Flow_length from small to large using the quicksort method to form a queue Sequence<Cluster, FlowID, Flow>;
step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then computing the length of each IF to obtain IF_l and the distance from the IF to the starting point of Flow_i to obtain IF_s.
In one implementation, the IF distribution solving module 202 is specifically configured to perform the following steps:
step S2.1: weighting the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF, wherein the IF_l weighting method assigns weights as follows:
Figure GDA0003416445580000161
the larger the length IF_l of the fixed field, the greater the probability that a keyword appears in the fixed field and the larger the assigned weight: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; when it is between 9 bytes and 16 bytes, the weight is 1; when it is between 17 bytes and 24 bytes, the weight is 2; and when it is greater than or equal to 25 bytes, the weight is 3;
step S2.2: eliminating background noise from the weights calculated in step S2.1 using the noise elimination method to obtain the corrected IF_l weights;
step S2.3: fitting the corrected IF_l weights and the length of the fixed field with a B-spline curve of a preset degree to obtain the IF position distribution curve, wherein the curve equation of the B-spline is shown in formula (1):
Figure GDA0003416445580000162
in formula (1), $d_i$ ($i = 0, 1, \ldots, n$) denotes the control points, and $N_{i,k}(u)$ ($i = 0, 1, \ldots, n$) denotes the k-th order normalized B-spline basis functions.
In one implementation, the IF classification module 203 is specifically configured to perform the following steps:
step S3.1: obtaining the derivative $f'(x)$ of the IF position distribution curve; taking an array $x_i$ of points spaced at intervals of L within the domain of definition, where L is small, and evaluating $f'(x_i)$; then selecting the $x_i$ that satisfy $f'(x_i) \times f'(x_{i+1}) \le 0$ to form an array; and then using this array of $x_i$ as the starting points of the Newton-CG method to solve for the extreme points;
step S3.2: for the IF_l values in each IF distribution interval, sorting the IFs from large to small using the quicksort method to obtain the sorted IFs; then, starting from the first IF, selecting an unmarked IF and calculating the Levenshtein distance to each subsequent IF in turn; when the Levenshtein distance meets the threshold, marking that IF as belonging to the known class, and otherwise adding it to a new class; repeating this Levenshtein-distance-based marking step until all the IFs are classified; and finally outputting the IF distribution intervals, the IF types, and the IF counts.
In one implementation, the track clustering module 204 is specifically configured to perform the following steps:
step S4.1: taking the IF distribution intervals, the IF types, and the IF counts as input, selecting the most numerous IF in each IF distribution interval, marking all tracks according to the selected IFs and the IF distribution intervals, and converting the tracks into IF distribution vectors IFVector;
step S4.2: inputting the IF distribution vectors IFVector and obtaining the second track clusters using the K-means method.
In one implementation, the keyword inference module 205 is specifically configured to perform the following steps:
step S5.1: taking the second track cluster generated in step S4 as input, executing the track segmentation method of step S1 to obtain the corresponding fixed fields IF, the length IF_l of each fixed field, and the position information IF_s of each fixed field;
step S5.2: taking the fixed fields IF, the lengths IF_l, and the position information IF_s obtained in step S5.1 as input, executing the IF distribution fitting method of step S2 to obtain the IF position distribution curve;
step S5.3: taking the position distribution curve obtained in step S5.2 as input, executing the IF classification method of step S3, and outputting all the IFs in each IF distribution interval;
step S5.4: using the separator inference method, inputting all the IFs in each distribution interval, marking the corresponding tracks with the IFs, inferring the separators by counting how often they appear at the head and tail of the keywords, and extracting the keywords from the IFs with the help of the separators;
step S5.5: storing the keywords extracted from the IFs and the inferred separators as the signature database for track flow classification.
Since the device described in the second embodiment of the present invention is the device used to implement the keyword-based network track classification method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, so details are not repeated here. All devices used to carry out the method of the first embodiment fall within the protection scope of the present invention.
Example three
Based on the same inventive concept, the present application further provides a computer-readable storage medium 300 (see fig. 12) on which a computer program 311 is stored; when executed, the program implements the method of the first embodiment.
Since the computer-readable storage medium described in the third embodiment of the present invention is the storage medium used to implement the keyword-based network track classification method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, so details are not repeated here. Any computer-readable storage medium used to carry out the method of the first embodiment falls within the protection scope of the present invention.
Example four
Based on the same inventive concept, the present application further provides a computer device (see fig. 13), which includes a memory 401, a processor 402, and a computer program 403 stored in the memory and executable on the processor; when the processor 402 executes the program, the method of the first embodiment is implemented.
Since the computer device described in the fourth embodiment of the present invention is the computer device used to implement the keyword-based network track classification method of the first embodiment, a person skilled in the art can understand its specific structure and variations from the method described in the first embodiment, so details are not repeated here. All computer devices used to carry out the method of the first embodiment fall within the protection scope of the present invention.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims (10)

1. A network track classification method based on keywords is characterized by comprising the following steps:
step S1: preliminarily classifying the input mixed protocol tracks using a K-means method based on flow statistical features to obtain K first track clusters; within each cluster, sorting the track flows in reverse order of length, comparing protocol tracks of similar length pairwise using the Needleman-Wunsch method, dividing the tracks into fixed fields IF and variable fields VF, and calculating the length IF_l and the position information IF_s of each fixed field;
step S2: weighting the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF; and performing curve fitting on the IFs with IF_w and IF_s as input to obtain the IF position distribution curve;
step S3: solving for the extrema of the fitted IF position distribution curve using a curve extremum solving method, extracting the IFs in each extremum interval of the curve, and performing IF classification statistics based on the Levenshtein distance; then outputting the types of IF contained in each extremum interval and the number of IFs of each type;
step S4: marking all tracks according to the most numerous IF in each extremum interval, clustering them using the K-means method, and outputting a second track cluster;
step S5: selecting the track cluster corresponding to the target security protocol from the second track clusters of step S4; then, taking the track cluster of the target security protocol as input, sequentially executing steps S1 to S3 to obtain the types of IF contained in the extremum intervals corresponding to the target security protocol and the number of IFs of each type; inferring the separators by comparing the heads and tails of adjacent IFs using a separator inference method; separating the keywords from the IFs according to the separators; and finally forming a signature database;
step S6: marking the track flows to be processed using the resulting signature database, converting them into vectors, and classifying the vectors by using them as the input of the K-means method.
2. The method according to claim 1, wherein step S1 specifically comprises:
step S1.1: inputting mixed protocol tracks <Flow, FlowID>, marking the network tracks based on flow statistical features to obtain feature vectors <FlowID, Vector>, and using the feature vectors as the input of the K-means clustering method to obtain K first track clusters <Cluster, FlowID>;
step S1.2: taking <Cluster, FlowID, Flow> as the input of the track reverse-ordering method, calculating the length Flow_length of each track flow in each first track cluster Cluster, and then sorting the flows by Flow_length from small to large using the quicksort method to form a queue Sequence<Cluster, FlowID, Flow>;
step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then computing the length of each IF to obtain IF_l and the distance from the IF to the starting point of Flow_i to obtain IF_s.
3. The method according to claim 1, wherein step S2 specifically comprises:
step S2.1: weighting the length IF_l of the fixed field using the IF_l weighting method to obtain the weight IF_w of the IF, wherein the IF_l weighting method assigns weights as follows:
Figure FDA0003416445570000021
the larger the length IF_l of the fixed field, the greater the probability that a keyword appears in the fixed field and the larger the assigned weight: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; when it is between 9 bytes and 16 bytes, the weight is 1; when it is between 17 bytes and 24 bytes, the weight is 2; and when it is greater than or equal to 25 bytes, the weight is 3;
step S2.2: eliminating background noise from the weights calculated in step S2.1 using the noise elimination method to obtain the corrected IF_l weights;
step S2.3: fitting the corrected IF_l weights and the length of the fixed field with a B-spline curve of a preset degree to obtain the IF position distribution curve, wherein the curve equation of the B-spline is shown in formula (1):
Figure FDA0003416445570000022
in formula (1), $d_i$ ($i = 0, 1, \ldots, n$) denotes the control points, and $N_{i,k}(u)$ ($i = 0, 1, \ldots, n$) denotes the k-th order normalized B-spline basis functions.
4. The method according to claim 3, wherein step S3 specifically comprises:
step S3.1: a derivative function f' (x) of the IF position distribution curve is obtained, and an array x having intervals L in a defined domain is obtainediAnd solve for f' (x)i) And then screening againGo out f' (x)i)×f′(xi+1) X is less than or equal to 0iForming an array, and then combining xiThe array is used as a starting point of a Newton CG method, and an extreme point is solved;
step S3.2: for the IFs in each IF distribution interval, sorting the IFs by IF_l from large to small with a quick sorting method to obtain the sorted IFs; then, starting from the first IF, selecting an unmarked IF and calculating the Levenshtein distance to each subsequent IF in turn, marking an IF as belonging to the known class when its Levenshtein distance meets a threshold and otherwise adding it to a new class; repeating this Levenshtein-distance-based marking step until all IFs are classified; and finally outputting the IF distribution intervals, the IF types and the number of IFs of each type.
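The two steps of this claim can be illustrated with a short sketch. The first function screens a grid for sign changes of the derivative, which yields candidate starting points; the Newton-CG refinement itself is not shown. The second and third functions group fixed fields by Levenshtein distance. The function names, the reading of "meets a threshold" as "distance less than or equal to the threshold", and the use of Python's built-in sort in place of quick sort are assumptions of this example.

```python
def sign_change_starts(fprime, lo, hi, step):
    """Return grid points x_i where f'(x_i) * f'(x_{i+1}) <= 0."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]
    return [grid[i] for i in range(n)
            if fprime(grid[i]) * fprime(grid[i + 1]) <= 0]


def levenshtein(a, b):
    """Edit distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def classify_ifs(ifs, threshold):
    """Group fixed fields: longest first, each unmarked IF seeds a class."""
    ordered = sorted(ifs, key=len, reverse=True)
    marked = [False] * len(ordered)
    classes = []
    for i, seed in enumerate(ordered):
        if marked[i]:
            continue
        marked[i] = True
        members = [seed]
        for j in range(i + 1, len(ordered)):
            if not marked[j] and levenshtein(seed, ordered[j]) <= threshold:
                marked[j] = True
                members.append(ordered[j])
        classes.append(members)
    return classes
```

The number of classes and the size of each class correspond to the IF types and IF counts that the step outputs for one distribution interval.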
5. The method according to claim 4, wherein step S4 specifically comprises:
step S4.1: taking the IF distribution intervals, the IF types and the IF numbers as input, selecting the most numerous IF in each IF distribution interval, marking all tracks according to the selected IFs and their IF distribution intervals, and converting the tracks into IF distribution vectors IFVector;
step S4.2: taking the IF distribution vectors IFVector as input and obtaining a second track cluster by the K-means method.
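A minimal sketch of steps S4.1 and S4.2 follows. It assumes that each selected IF is described by a hypothetical tuple (interval start, interval end, pattern), that a track is marked 1.0 in a dimension when the pattern begins inside that interval, and that K-means is initialised deterministically from the first k vectors. None of these details are fixed by the claims.

```python
def to_if_vector(flow, selected):
    """Convert one flow into an IF distribution vector.

    selected: list of (start, end, pattern); a dimension is 1.0 when the
    pattern starts at an offset in [start, end) of the flow.
    """
    vector = []
    for start, end, pattern in selected:
        idx = flow.find(pattern, start)  # lowest match at or after start
        vector.append(1.0 if idx != -1 and idx < end else 0.0)
    return vector


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    centroids = [list(v) for v in vectors[:k]]  # assumed initialisation
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels
```

In practice a library implementation with randomised or k-means++ initialisation would normally be preferred; the fixed initialisation here only keeps the example reproducible.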
6. The method according to claim 1, wherein step S5 specifically comprises:
step S5.1: executing the track division method of step S1 with the second track cluster generated in step S4 as input, to obtain the corresponding fixed fields IF, the length IF_l of each fixed field, and the position information IF_s of each fixed field;
step S5.2: taking the fixed fields IF, the length IF_l of each fixed field and the position information IF_s of each fixed field obtained in step S5.1 as input, executing the IF distribution fitting method of step S2 to obtain an IF position distribution curve;
step S5.3: taking the position distribution curve obtained in step S5.2 as input, executing the IF classification method of step S3, and outputting all the IFs in each IF distribution interval;
step S5.4: applying a separator reasoning method to all the IFs in each distribution interval, marking the corresponding tracks with the IFs, counting the characters that appear at the head and the tail of the keywords to deduce the separators, and extracting the keywords in the IFs with the aid of the separators;
step S5.5: storing the keywords extracted from the IFs and the inferred separators as a signature database for track flow classification.
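Step S5.4 does not spell out how the separators are counted, so the following is only one possible reading, given as a sketch. It assumes that a separator is a single non-alphanumeric character, that the character occurring most often at the head or tail of the IFs is taken as the separator, and that keywords are the non-empty pieces left after splitting on it. The function names are hypothetical.

```python
from collections import Counter


def infer_separator(ifs):
    """Return the most frequent non-alphanumeric head/tail character."""
    counts = Counter()
    for field in ifs:
        for ch in (field[:1], field[-1:]):
            if ch and not ch.isalnum():
                counts[ch] += 1
    return counts.most_common(1)[0][0] if counts else None


def extract_keywords(ifs, separator):
    """Split every IF on the separator and collect the non-empty pieces."""
    keywords = set()
    for field in ifs:
        parts = field.split(separator) if separator else [field]
        keywords.update(p for p in parts if p)
    return sorted(keywords)
```

The sorted keyword list together with the inferred separator is what would be stored as the signature database in step S5.5.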
7. A keyword-based network trajectory classification device is characterized by comprising:
a track segmentation module, configured to perform a preliminary classification of the input mixed protocol tracks with a K-means method based on flow statistical characteristics to obtain K first track clusters; and, in each cluster, to arrange the track flows in reverse order of length, compare protocol tracks of similar length pairwise by the Needleman-Wunsch method, divide the tracks into fixed fields IF and variable fields VF, and calculate the length IF_l and the position information IF_s of each fixed field;
an IF distribution solving module, configured to weight the length IF_l of the fixed field by an IF_l weighting method to obtain the weight IF_w of the IF, and to perform curve fitting on the IF by a curve fitting method with IF_w and IF_s as input to obtain an IF position distribution curve;
an IF classification module, configured to solve the extrema of the fitted IF position distribution curve by a curve extremum solving method, extract the IFs in each extremum interval of the curve, perform IF classification statistics based on the Levenshtein distance, and then output the types of IF contained in each extremum interval and the number of IFs of each type;
a track clustering module, configured to mark all tracks according to the most numerous IF in each extremum interval, cluster them by the K-means method, and output a second track cluster;
a keyword inference module, configured to select the track cluster corresponding to the target security protocol from the second track cluster of the track clustering module; then to input the track cluster of the target security protocol into the track segmentation module, the IF distribution solving module and the IF classification module in sequence, to obtain the types of IF and the number of IFs of each type contained in the extremum intervals corresponding to the target security protocol; and to apply a separator reasoning method that deduces separators by comparing the heads and tails of adjacent IFs, separates keywords from the IFs according to the separators, and finally forms a signature database;
and a track classification module, configured to mark a track flow to be processed by using the formed signature database, convert the track flow into a vector, and classify the converted vector by using it as the input of the K-means method.
8. The apparatus of claim 7, wherein the trajectory segmentation module is specifically configured to perform the steps of:
step S1.1: inputting a mixed protocol track <Flow, FlowID>, marking the network track based on flow statistical characteristics to obtain a feature vector <FlowID, Vector>, and obtaining K first track clusters <Cluster, FlowID> by using the feature vector as the input of a K-means clustering method;
step S1.2: taking <Cluster, FlowID, Flow> as the input of a track reverse ordering method, calculating the length Flow_length of each track flow in each first track cluster Cluster, then sorting the flows by Flow_length from small to large with a quick sorting method to form a queue Sequence<Cluster, FlowID, Flow>;
step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then counting the length of each IF to obtain IF_l, and measuring the distance from the IF to the starting point of Flow_i to obtain IF_s.
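The queue handling of steps S1.2 and S1.3 can be sketched as follows for a single cluster. The sketch assumes that flows are held in a dictionary keyed by FlowID, that "taking two flows from the head of the queue" removes both so that pairs are disjoint, and that the sort direction is selectable because the step is called a reverse ordering while sorting from small to large. Python's built-in sort stands in for quick sort, and the function names are hypothetical.

```python
from collections import deque


def build_queue(flows, descending=False):
    """Order (FlowID, Flow) items of one cluster by Flow_length."""
    return deque(sorted(flows.items(),
                        key=lambda item: len(item[1]),
                        reverse=descending))


def pop_pairs(queue):
    """Yield successive <Flow_i, Flow_j> pairs from the head of the queue."""
    while len(queue) >= 2:
        yield queue.popleft(), queue.popleft()
```

Because the queue is ordered by length, each pair consists of flows of similar length, which is what the pairwise Needleman-Wunsch comparison expects.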
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 6.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.
CN201910281096.6A 2019-04-09 2019-04-09 Network track classification method and device based on keywords Active CN110061869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910281096.6A CN110061869B (en) 2019-04-09 2019-04-09 Network track classification method and device based on keywords

Publications (2)

Publication Number Publication Date
CN110061869A CN110061869A (en) 2019-07-26
CN110061869B true CN110061869B (en) 2022-04-15

Family

ID=67317603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910281096.6A Active CN110061869B (en) 2019-04-09 2019-04-09 Network track classification method and device based on keywords

Country Status (1)

Country Link
CN (1) CN110061869B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115021965B (en) * 2022-05-06 2024-04-02 中南民族大学 Method and system for generating attack data of intrusion detection system based on generation type countermeasure network

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102523167A (en) * 2011-12-23 2012-06-27 中山大学 Optimal segmentation method of unknown application layer protocol message format
CN106452868A (en) * 2016-10-12 2017-02-22 中国电子科技集团公司第三十研究所 Network traffic statistics implement method supporting multi-dimensional aggregation classification
US9614773B1 (en) * 2014-03-13 2017-04-04 Juniper Networks, Inc. Systems and methods for automatically correcting classification signatures
CN109040081A (en) * 2018-08-10 2018-12-18 哈尔滨工业大学(威海) A kind of protocol fields conversed analysis system and method based on BWT
CN109447100A (en) * 2018-08-30 2019-03-08 天津理工大学 A kind of three-dimensional point cloud recognition methods based on the detection of B-spline surface similitude
CN109460469A (en) * 2018-10-25 2019-03-12 中南民族大学 A kind of method for digging and device of the security protocol format based on network path

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107819698A (en) * 2017-11-10 2018-03-20 北京邮电大学 A kind of net flow assorted method based on semi-supervised learning, computer equipment
CN107846326B (en) * 2017-11-10 2020-11-10 北京邮电大学 Self-adaptive semi-supervised network traffic classification method, system and equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
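Survey of security analysis of security protocol implementations; Meng Bo et al.; Journal of Shandong University (Natural Science); 2018-01-31; Vol. 53, No. 1; pp. 1-18 *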

Also Published As

Publication number Publication date
CN110061869A (en) 2019-07-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190726

Assignee: Xiangyang Goode Cultural Technology Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980041350

Denomination of invention: A Keyword Based Network Trajectory Classification Method and Device

Granted publication date: 20220415

License type: Common License

Record date: 20230908

Application publication date: 20190726

Assignee: Hubei Fengyun Technology Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980041308

Denomination of invention: A Keyword Based Network Trajectory Classification Method and Device

Granted publication date: 20220415

License type: Common License

Record date: 20230908

EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20190726

Assignee: Anhui Xiangshang Technology Service Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980054625

Denomination of invention: A Keyword based Network Trajectory Classification Method and Device

Granted publication date: 20220415

License type: Common License

Record date: 20240103

Application publication date: 20190726

Assignee: Anhui Xiangzhi Information Technology Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980054624

Denomination of invention: A Keyword based Network Trajectory Classification Method and Device

Granted publication date: 20220415

License type: Common License

Record date: 20240103

Application publication date: 20190726

Assignee: HEFEI MUZHI INFORMATION TECHNOLOGY CO.,LTD.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980054622

Denomination of invention: A Keyword based Network Trajectory Classification Method and Device

Granted publication date: 20220415

License type: Common License

Record date: 20240103

Application publication date: 20190726

Assignee: Anhui Terze Technology Co.,Ltd.

Assignor: SOUTH CENTRAL University FOR NATIONALITIES

Contract record no.: X2023980054620

Denomination of invention: A Keyword based Network Trajectory Classification Method and Device

Granted publication date: 20220415

License type: Common License

Record date: 20240103