CN110061869B

CN110061869B - Network track classification method and device based on keywords

Info

Publication number: CN110061869B
Application number: CN201910281096.6A
Authority: CN
Inventors: 孟博; 何旭东; 王德军; 李子茂
Original assignee: South Central University for Nationalities
Current assignee: South Central Minzu University
Priority date: 2019-04-09
Filing date: 2019-04-09
Publication date: 2022-04-15
Anticipated expiration: 2039-04-09
Also published as: CN110061869A

Abstract

The invention provides a keyword-based network track classification method and device, wherein the classification method comprises the steps of firstly obtaining a first track cluster by combining flow statistics characteristics with a K-means method, dividing each cluster of the first track cluster into a fixed field IF and a variable field VF as the input of a track division method, and calculating the length of each fixed field and the position information of the fixed field; then carrying out curve fitting by adopting an IF distribution fitting method to obtain an IF position distribution curve; and then obtaining the type of the IF contained in each extreme value interval and the quantity of various types of IF by adopting an IF classification method, inputting the type of the IF contained in each extreme value interval and the quantity of various types of IF into a track clustering method, outputting a second track cluster, deducing a separator by adopting a keyword deduction method, separating a keyword from the IF according to the separator, and finally forming a signature database. The invention can improve the classification accuracy and greatly improve the classification efficiency.

Description

Network track classification method and device based on keywords

Technical Field

The invention relates to the technical field of information security, in particular to a network track classification method and device based on keywords.

Background

The classification of network traffic is the basis for ensuring the security of cyberspace. Traffic classification identifies the different types of network protocol flows and is of great significance to fields such as communication security, network management, network attack and defense, intrusion detection, and protocol reverse engineering.

With the development of the Internet, the 5G era of the Internet of Everything is approaching. Terminals such as computers, mobile phones and sensors generate a large amount of traffic, and classifying and managing this traffic poses challenges to existing traffic classification schemes. Traffic classification is crucial to network management tasks such as monitoring network resources, discovering and handling network failures in time, and guaranteeing network quality of service and network efficiency. On the one hand, for security purposes, traffic classification, filtering, and the detection of malicious activity all require knowledge of the types of application flows in the network, and network operators can react quickly to potential incidents based on malicious traffic detection. On the other hand, the share of new classes of applications in the Internet (e.g., P2P, VoIP, and video streaming) has increased dramatically. These applications are particularly difficult to classify and often have strict resource requirements for bandwidth (e.g., P2P) or QoS (e.g., low delay and jitter for VoIP applications), which pose challenges to network operators.

In the prior art, a widely used method for classifying network traffic is deep packet inspection, which classifies network traffic by identifying the signatures and fingerprints contained in a network protocol flow and applying pattern matching.

In the process of implementing the invention, the applicant of the invention finds that at least the following technical problems exist in the prior art:

Because new network protocols continuously appear and protocol versions are replaced, the deep packet inspection method requires manual maintenance of a signature database. In addition, its classification efficiency drops sharply because signatures cannot be obtained directly for private protocol flows or zero-day application protocol flows. As network traffic grows, the traditional deep packet inspection method also suffers from high computational complexity and struggles to meet the requirements of high-bandwidth networks.

That is to say, the deep packet inspection method in the prior art requires manual maintenance of the signature database, and has the technical problems of large workload, long time consumption, low efficiency and difficulty of application to high-bandwidth networks. Therefore, an efficient keyword-based network track classification method is of great significance.

Disclosure of Invention

In view of this, the present invention provides a method and an apparatus for classifying network tracks based on keywords, so as to solve or at least partially solve the technical problems of large workload and low efficiency of the prior art.

The invention provides a keyword-based network track classification method in a first aspect, which comprises the following steps:

step S1: preliminarily classifying an input mixed protocol track by a K-means method based on flow statistical characteristics to obtain K first track clusters; in each cluster, sorting the track flows by length in reverse order, comparing protocol tracks of similar length pairwise by the Needleman-Wunsch method, dividing the tracks into fixed fields IF and variable fields VF, and calculating the length IF_l of each fixed field and the position information IF_s of each fixed field;

step S2: weighting the length IF_l of the fixed field by an IF_l weighting method to obtain a weight IF_w of the IF; performing curve fitting with IF_w and IF_s as input to obtain an IF position distribution curve;

step S3: solving the extreme values of the fitted IF position distribution curve by a curve extreme value solving method, extracting the IFs in each extreme value interval of the curve, and carrying out IF classification statistics based on the Levenshtein distance; then outputting the types of IF contained in each extreme value interval and the number of IFs of each type;

step S4: marking all tracks according to the most numerous IF in each extreme value interval, clustering by the K-means method, and outputting second track clusters;

step S5: selecting the track cluster corresponding to the target security protocol from the second track clusters of step S4; then, taking the track cluster of the target security protocol as input, sequentially executing steps S1 to S3 to obtain the types of IF contained in the extreme value intervals corresponding to the target security protocol and the number of IFs of each type; inferring separators by a separator inference method that compares the heads and tails of adjacent IFs; separating keywords from the IFs according to the separators; and finally forming a signature database;

step S6: marking the track flow to be processed using the formed signature database, converting it into vectors, and classifying the converted vectors as the input of the K-means method.

In one implementation, step S1 specifically includes:

step S1.1: inputting a mixed protocol track <Flow, FlowID>, marking the network track based on flow statistical characteristics to obtain a feature vector <FlowID, Vector>, and obtaining K first track clusters <Cluster, FlowID> by taking the feature vector as the input of a K-means clustering method;

step S1.2: taking <Cluster, FlowID, Flow> as the input of a track reverse ordering method, calculating the length Flow_length of each track flow in each first track cluster Cluster, then sorting the flows by Flow_length from small to large using a quick sort method to form a queue Sequence<Cluster, FlowID, Flow>;

step S1.3: inputting Sequence<Cluster, FlowID, Flow> and taking two flows, numbered i and j, from the head of the queue as <Flow_i, Flow_j>; then using <Flow_i, Flow_j> as the input of the Needleman-Wunsch method to obtain the common fixed fields IF and the variable fields VF of Flow_i and Flow_j; then counting the length of each IF to obtain IF_l and the distance between the IF and the starting point of Flow_i to obtain IF_s.

In one implementation, step S2 specifically includes:

step S2.1: weighting the length IF_l of the fixed field by an IF_l weighting method to obtain a weight IF_w of the IF, wherein the IF_l weighting method assigns weights as follows:

the larger the length IF_l of the fixed field, the more likely a keyword appears in the fixed field and the larger the assigned weight: when the length of the fixed field is between 1 byte and 8 bytes, the weight is 0; when it is between 9 bytes and 16 bytes, the weight is 1; when it is between 17 bytes and 24 bytes, the weight is 2; and when it is greater than or equal to 25 bytes, the weight is 3;

step S2.2: eliminating background noise from the weights calculated in step S2.1 by a noise elimination method to obtain corrected IF_l weights;

step S2.3: fitting the corrected IF_l weight and the length of the fixed field with a B-spline curve of a preset degree to obtain the IF position distribution curve, wherein the curve equation of the B-spline is shown in formula (1):

$$p(u) = \sum_{i=0}^{n} d_i \, N_{i,k}(u) \qquad (1)$$

in formula (1), $d_i$ ($i = 0, 1, \ldots, n$) denotes the control points, and $N_{i,k}(u)$ ($i = 0, 1, \ldots, n$) denotes the normalized B-spline basis functions of degree k.
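As an illustration of formula (1), the following minimal sketch evaluates a B-spline curve with the Cox-de Boor recursion. The clamped uniform knot vector, the function names and the sample control points are assumptions made for the example; the source does not state which knot vector or degree is used.

```python
def clamped_knots(n_ctrl, k):
    """Clamped uniform knot vector on [0, 1] (an assumed choice).

    n_ctrl control points of degree k need n_ctrl + k + 1 knots.
    """
    spans = n_ctrl - k  # number of non-empty knot spans; requires n_ctrl > k
    return [0.0] * k + [j / spans for j in range(spans + 1)] + [1.0] * k


def basis(i, k, u, knots):
    """Cox-de Boor recursion for the basis function N_{i,k}(u)."""
    if k == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + k] != knots[i]:
        left = ((u - knots[i]) / (knots[i + k] - knots[i])
                * basis(i, k - 1, u, knots))
    if knots[i + k + 1] != knots[i + 1]:
        right = ((knots[i + k + 1] - u) / (knots[i + k + 1] - knots[i + 1])
                 * basis(i + 1, k - 1, u, knots))
    return left + right


def bspline_point(ctrl, k, u):
    """Evaluate p(u) = sum_i d_i * N_{i,k}(u) for u in [0, 1)."""
    knots = clamped_knots(len(ctrl), k)
    return sum(d * basis(i, k, u, knots) for i, d in enumerate(ctrl))


# Hypothetical control values d_0..d_4 (e.g. weights along the track).
control = [2.0, 1.0, 3.0, 2.0, 0.0]
curve = [bspline_point(control, 3, u / 10) for u in range(10)]
```

With a clamped knot vector the curve starts at the first control point, and the basis functions sum to 1 everywhere on [0, 1), which is what makes the fitted curve a weighted average of the control points.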

In one implementation, step S3 specifically includes:

step S3.1: a derivative function f' (x) of the IF position distribution curve is obtained, and an array x having intervals L in a defined domain is obtained_iL is smaller, and solve for f' (x)_i) Then f' (x) is screened out_i)×f′(x_i+1) X is less than or equal to 0_iForming an array, and then combining x_iThe array is used as a starting point of a Newton CG method, and an extreme point is solved;

step S3.2: sorting the IF from large to small by adopting a quick sorting method to obtain sorted IF for the IF _ l in each IF distribution interval, then selecting unmarked IF from the first IF, calculating Levenshtein distance backwards in sequence, marking the IF as a known class when the Levenshtein distance value meets a threshold value, adding the IF into a new class IF not, repeatedly executing the step of marking the class by adopting the method based on the Levenshtein distance until all the IF are classified, and finally outputting the IF distribution interval, the IF type and the IF number.

In one implementation, step S4 specifically includes:

step S4.1: taking the IF distribution interval, the IF type and the IF number as input, selecting the IF with the maximum number in each IF distribution interval, marking all tracks according to the selected IF and IF distribution intervals, and converting the tracks into IF distribution vectors IFVector;

step S4.2: and inputting an IF distribution vector IFvector, and obtaining a second track cluster by adopting a K-means method.

In one implementation, step S5 specifically includes:

s5.1: executing the track segmentation method of step S1 with the second track cluster generated in step S4 as input, to obtain the corresponding fixed fields IF, the length IF_l of each fixed field, and the position information IF_s of each fixed field;

s5.2: taking the fixed fields IF, the length IF_l of each fixed field and the position information IF_s of each fixed field obtained in step S5.1 as input, executing the IF distribution fitting method of step S2 to obtain an IF position distribution curve;

s5.3: taking the position distribution curve obtained in step S5.2 as input, executing the IF classification method of step S3, and outputting all the IFs in each IF distribution interval;

s5.4: inputting all IF in each distribution interval by adopting a separator reasoning method, marking corresponding tracks by using the IF, counting and deducing separators according to the appearance of the separators at the head and the tail of the keywords, and extracting the keywords in the IF by combining the separators;

s5.5: and storing the extracted keywords and the inferred separators in the IF as a signature database of the track flow classification.

Based on the same inventive concept, the second aspect of the present invention provides a keyword-based network trajectory classification apparatus, including:

a track segmentation module, configured to preliminarily classify an input mixed protocol track by a K-means method based on flow statistical characteristics to obtain K first track clusters; and, in each cluster, to sort the track flows by length in reverse order, compare protocol tracks of similar length pairwise by the Needleman-Wunsch method, divide the tracks into fixed fields IF and variable fields VF, and calculate the length IF_l of each fixed field and the position information IF_s of each fixed field;

an IF distribution solving module, configured to weight the length IF_l of the fixed field by an IF_l weighting method to obtain a weight IF_w of the IF, and to perform curve fitting with IF_w and IF_s as input to obtain an IF position distribution curve;

the IF classification module is used for obtaining an IF position distribution curve according to fitting, solving curve extremum by adopting a curve extremum solving method, extracting IF in each extremum interval of the curve, and performing IF classification statistics based on Levenshtein distance; then outputting the type of the IF contained in each extreme value interval and the number of various types of IF;

the track clustering module is used for marking all tracks according to the maximum IF in each extreme value, clustering by adopting a K-Means method and outputting a second track cluster;

the keyword inference module is used for selecting a track cluster corresponding to the target safety protocol according to the second track cluster in the track clustering module; then, taking the track cluster of the target safety protocol as input, sequentially inputting the track cluster into a track segmentation module, an IF distribution solving module and an IF classification module to obtain the type of IF and the number of various IF contained in an extreme value interval corresponding to the target safety protocol, adopting a separator reasoning method, deducing separators by comparing the head and the tail of adjacent IF, separating keywords from IF according to the separators, and finally forming a signature database;

and the track classification module is used for marking a track flow to be processed by utilizing the formed signature database, converting the track flow into a vector and classifying the converted vector as the input of the k-means method.

In one implementation, the trajectory segmentation module is specifically configured to perform the following steps:

Based on the same inventive concept, a third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed, performs the method of the first aspect.

Based on the same inventive concept, a fourth aspect of the present invention provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method according to the first aspect when executing the program.

One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:

The invention provides a network track classification method based on keywords, which takes a mixed network track as input and outputs the type of a target protocol track and a signature database. Firstly, a track is divided into fixed fields IF and variable fields VF by a track segmentation method, and the length IF_l of each fixed field and the position information IF_s of each fixed field are calculated; then curve fitting is carried out by an IF distribution fitting method to obtain an IF position distribution curve; the types of IF contained in each extreme value interval and the number of IFs of each type are obtained by an IF classification method, second track clusters are output by a track clustering method, separators are inferred by a keyword inference method, keywords are separated from the IFs according to the separators, and a signature database is finally formed; finally, the track flow to be processed is marked using the formed signature database and converted into vectors, and the converted vectors are classified as the input of the K-means method.

Compared with the deep packet inspection method in the prior art, the method provided by the invention first classifies tracks by their statistical characteristics in the track segmentation method, obtaining a mixed track set whose classification is not yet accurate. Then, through the IF distribution fitting method, the IF classification method and the track clustering method, the IFs in the peak intervals are marked and converted into vectors: taking the mixed track set output by the K-means method in step S1 as input, the IF distribution curve is used to obtain the number and the distance of the various IFs in each extreme value interval; all tracks are then marked according to the most numerous IF in each extreme value interval; and clustering is performed by the K-means method to output a more accurate clustering result. That is, a more accurate classification result can be obtained through step S4. The invention further infers the separators through the keyword inference method in step S5 and, from the peak-interval IFs and the clusters (a cluster refers to a class) output by the K-means method, can output clusters of quite high purity, which improves the classification accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow diagram of a method for keyword-based network trajectory classification in one embodiment;

FIG. 2 is a general flow diagram of a keyword-based network trajectory classification method in one embodiment;

FIG. 3 is a flowchart illustrating the track segmentation method in step S1;

FIG. 4 is a schematic flow chart of the IF distribution fitting method in step S2;

FIG. 5 is a flowchart illustrating the IF classification method in step S3;

FIG. 6 is a flowchart illustrating the trajectory clustering method in step S4;

FIG. 7 is a flowchart illustrating the keyword inference method in step S5;

FIG. 8 is a schematic diagram of a trajectory segmentation algorithm;

FIG. 9 is a flow chart of a noise cancellation method;

FIG. 10 is a code diagram of an IF classification statistics algorithm;

FIG. 11 is a block diagram of an apparatus for keyword based network trajectory classification in one embodiment;

FIG. 12 is a block diagram of a computer-readable storage medium in an embodiment of the invention;

FIG. 13 is a block diagram of a computer device in an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a keyword-based network track classification method, which classifies the network tracks to be processed using a signature database formed by a track segmentation method, an IF distribution fitting method, an IF classification method, a track clustering method and a keyword inference method. That is, the invention forms the signature database by extracting the IFs from clusters of quite high purity, and then uses only the signature database for marking and K-means for classification, thereby improving the classification accuracy, greatly improving the classification efficiency, and solving the technical problems of large workload and low efficiency in the prior art.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The embodiment provides a network track classification method based on keywords, please refer to fig. 1, and the method includes:

Step S1: preliminarily classifying an input mixed protocol track by a K-means method based on flow statistical characteristics to obtain K first track clusters; in each cluster, sorting the track flows by length in reverse order, comparing protocol tracks of similar length pairwise by the Needleman-Wunsch method, dividing the tracks into fixed fields IF and variable fields VF, and calculating the length IF_l of each fixed field and the position information IF_s of each fixed field.

Specifically, in step S1, the preliminarily classified first track clusters are obtained by a track segmentation method. The input of the track segmentation method is a mixed network track flow. First, the flows are divided into initial clusters by a K-means method based on flow statistical characteristics. Then, within each cluster, the track flows are sorted by length in reverse order and protocol tracks of similar length are compared pairwise by the Needleman-Wunsch method, which can mark the fixed fields in the tracks. Finally, each track is divided into fixed fields IF and variable fields VF, and the length IF_l and the position information IF_s of each IF are calculated, where the position information IF_s of a fixed field is the distance from the fixed field to the first character of the track. A schematic flow chart of the track segmentation method is shown in fig. 3.

Step S2: weighting the length IF_l of the fixed field by an IF_l weighting method to obtain a weight IF_w of the IF; performing curve fitting with IF_w and IF_s as input to obtain an IF position distribution curve.

Specifically, in step S2, an IF position distribution curve is obtained on the basis of step S1 by an IF distribution fitting method. As shown in fig. 4, the IF distribution fitting method first takes IF_l and IF_s as input; then weights IF_l by the IF_l weighting method; optionally removes background noise by a noise elimination method; and finally performs curve fitting on the weighted IFs and outputs the IF distribution curve.

Step S3: obtaining an IF position distribution curve according to fitting, solving a curve extreme value by adopting a curve extreme value solving method, extracting IF in each extreme value interval of the curve, and carrying out IF classification statistics based on Levenshtein distance; and then outputting the type of the IF contained in each extremum interval and the number of various types of IF.

Specifically, referring to fig. 5, in step S3 the IF classification method is used, on the basis of step S2, to obtain the types of IF contained in each extreme value interval and the number of IFs of each type. Each peak interval may contain several different IFs: because the input tracks are of different types, different tracks contribute different IFs to the same peak interval. Step S3 outputs all the IFs corresponding to each extreme value interval, together with their IF_l and IF_s.

Step S4: and marking all tracks according to the IF with the largest number in each extreme value, clustering by adopting a K-Means method, and outputting a second track cluster.

Specifically, inputting various IF sets in a distribution interval by a track clustering method, and selecting the IF with the largest quantity in each maximum value area; then, marking all tracks according to the selected IF, and converting the tracks into vectors; finally, clustering by adopting a K-Means method and outputting a clustering result. A schematic diagram of the trajectory clustering method is shown in fig. 6.

Since a network track contains many IFs and VFs, the IF distribution varies from one network track to another. Step S1 uses statistical characteristics such as length, offset and arrival time. These are merely features of the track as a whole; to identify which class a track belongs to, the features of the IFs contained in the track are required. The IF features include the type of the IF, IF_l (the length of the IF) and IF_s (the distance from the IF to the first character of the track).

Step S5: selecting a track cluster corresponding to the target security protocol according to the second track cluster in the step S4; and then, taking the track cluster of the target safety protocol as input, sequentially executing the steps S1-S3 to obtain the type of the IF and the number of various types of IF contained in the extreme value interval corresponding to the target safety protocol, deducing separators by comparing the head and the tail of adjacent IF by adopting a separator reasoning method, separating keywords from the IF according to the separators, and finally forming a signature database.

Specifically, steps S1 to S3 are executed in sequence with the track cluster of the target security protocol as input; that is, the track segmentation method of step S1, the IF distribution fitting method of step S2 and the IF classification method of step S3 are performed. A schematic diagram of the keyword inference method is shown in fig. 7. First, the track clusters are input and the track cluster of the target security protocol is selected; second, the selected track cluster is input to the track segmentation method, and the counts of the various IFs at each peak are obtained through the IF distribution fitting method and the IF classification method; finally, the separators are inferred by a separator inference method that compares the heads and tails of adjacent IFs, the keywords are separated from the IFs, and the keywords and separators are taken as a new signature database.

In particular, after the signature database is formed, it can be used to mark the track flows, which are then converted into vectors and classified as the input of the K-means method. Because the signature database contains the features that distinguish the classes, subsequent classification can be performed through the signature database, which improves the classification accuracy and can greatly improve the classification efficiency.
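The separator inference step can be sketched as follows. This is a simplified illustration, not the procedure of fig. 7: it assumes that a separator is a non-alphanumeric character that appears at the head or tail of the fixed fields at least `min_count` times. The function names, the `min_count` parameter and the sample fields are hypothetical.

```python
import re
from collections import Counter


def infer_separators(fixed_fields, min_count=2):
    """Count non-alphanumeric characters at the head and tail of each IF.

    Characters seen at least min_count times are treated as separators
    (an assumed criterion).
    """
    counts = Counter()
    for field in fixed_fields:
        if not field:
            continue
        for ch in (field[0], field[-1]):
            if not ch.isalnum():
                counts[ch] += 1
    return {ch for ch, n in counts.items() if n >= min_count}


def extract_keywords(fixed_fields, separators):
    """Split every IF on the inferred separators and collect the pieces."""
    if not separators:
        return sorted(set(fixed_fields))
    pattern = "[" + "".join(re.escape(ch) for ch in sorted(separators)) + "]"
    keywords = set()
    for field in fixed_fields:
        keywords.update(part for part in re.split(pattern, field) if part)
    return sorted(keywords)


# Hypothetical fixed fields from one IF distribution interval.
fields = ["report/", "/login/", "user:", ":admin/"]
separators = infer_separators(fields)
signature_db = {"keywords": extract_keywords(fields, separators),
                "separators": sorted(separators)}
```

Here `signature_db` stands in for the signature database; how the real database is stored is not specified in the source.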
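A minimal sketch of this final classification step is given below. It assumes that a track flow is marked by recording, for each keyword in the signature database, whether the keyword occurs in the flow, and it uses a small deterministic K-means (initialised from the first k distinct vectors) in place of a library implementation. The encoding, the function names and the sample flows are all assumptions for illustration.

```python
def signature_vector(flow, keywords):
    """Mark a flow: 1.0 where a signature keyword occurs, else 0.0."""
    return [1.0 if kw in flow else 0.0 for kw in keywords]


def squared_distance(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd iteration with a deterministic initialisation."""
    centroids = []
    for v in vectors:  # first k distinct vectors become the centroids
        if v not in centroids:
            centroids.append(list(v))
        if len(centroids) == k:
            break
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(len(centroids)),
                key=lambda c, v=v: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical keywords and flows.
keywords = ["GET", "HTTP", "USER", "PASS"]
flows = ["GET /index HTTP/1.1", "GET /a HTTP/1.1", "USER bob", "PASS x"]
vectors = [signature_vector(f, keywords) for f in flows]
labels = kmeans(vectors, k=2)
```

In this toy input the two HTTP-like flows fall into one cluster and the two login-like flows into the other.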

In general, please refer to fig. 2, which is a general flow of the keyword-based network trajectory classification method according to an embodiment. The steps S1 to S5 are included.

In one embodiment, step S1 specifically includes:

Specifically, the flow statistical feature set has 18 groups of features, such as the average packet length, the standard deviation of the inter-packet arrival time, the total traffic length (in bytes and/or packets), and the Fourier transform of the inter-packet arrival time. The flow-based statistical feature set is shown in table 1. The track flows are stored in a file in PCAP format; the 18 groups of statistical features of each flow can be obtained by simple statistics, and the flow is converted into a feature vector <FlowID, Vector>, where Vector contains the 18 groups of feature values shown in table 1.

TABLE 1 statistical feature set of streams
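As a hedged illustration of how such statistics can be computed, the sketch below derives a few of the named features from a list of packets. Only features mentioned in the text are included; the function name, the dictionary keys and the `(timestamp, length)` input format are assumptions, and reading the PCAP file itself is left out.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple per-flow statistics.

    packets: list of (timestamp_seconds, length_bytes) tuples, ordered by
    time, for a single flow (an assumed input format).
    """
    lengths = [length for _, length in packets]
    times = [t for t, _ in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "mean_packet_length": mean(lengths),
        "total_bytes": sum(lengths),
        "packet_count": len(packets),
        "mean_interarrival": mean(gaps) if gaps else 0.0,
        "std_interarrival": pstdev(gaps) if gaps else 0.0,
    }


# Hypothetical flow with three packets.
features = flow_features([(0.0, 100), (1.0, 200), (3.0, 300)])
vector = [features[name] for name in sorted(features)]
```

The resulting `vector` plays the role of the Vector part of <FlowID, Vector>, restricted to the handful of features computed here.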

The input of the K-means clustering method is <FlowID, Vector>. The number of classes K can be preset and then determined by observing the clustering results, which yields the track clusters <Cluster, FlowID>. Any existing K-means clustering method can be used, so the specific process is not described in detail.

The idea of the invention is that security protocol tracks of similar length pass through similar states, so the positions of the keywords they produce are more concentrated. The invention sorts the tracks in reverse order of length, groups every two adjacent tracks, compares security protocol tracks of similar total length by the Needleman-Wunsch method, aligns longer keywords preferentially, and finally outputs the IFs and counts the positions of the keywords. The specific algorithm is shown in fig. 8.

In particular, the Needleman-Wunsch (N-W) method is commonly used in bioinformatics. It is one of the earliest methods for comparing biological sequences using dynamic programming, and it searches for homology between proteins or genes by aligning their sequences. Because the N-W method can handle large amounts of structured and complex data, it is used here to align protocol tracks.

In the embodiment of the invention, the input of the message segmentation method is a set of security protocol flows, each containing a number of keywords, separators and variable fields. For example, in report/dc00321, report is a keyword, / is a separator, and dc00321 is a variable field. The N-W method fills a two-dimensional matrix using similarity rewards and gap penalties, selects a backtracking path, and then outputs the alignment result. For example, aligning report/dc00321 with report/dc04369 yields two parts: the fixed field IF, report/dc0, and the variable fields VF, 0321 and 4369. Because the separator does not change and the first three characters of the variable field change little, the N-W method cannot separate them from the keyword. Even if three-way alignment is used, for example with report/dc00321, report/dc04369 and report/da45813, the IF of the alignment result is still report/dc0; multi-way alignment alone can only reflect what most of the sequences have in common.

The track set FlowSet is input and first sorted by flow length in reverse order to obtain DFlowSet; DFlow1 and DFlow2 are then taken out of DFlowSet, one pair at a time, until DFlowSet is empty; each pair DFlow1 and DFlow2 is input to the Needleman-Wunsch method to obtain the IFs, and IF_l, IF_s and VF are counted.
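The ordering and pairing described above can be sketched in a few lines. The function name is hypothetical, and leaving an unpaired last flow out when the set has an odd size is an assumption; the source does not say how that case is handled.

```python
def pair_by_length(flow_set):
    """Sort flows by length, longest first, and pair adjacent flows."""
    ordered = sorted(flow_set, key=len, reverse=True)  # DFlowSet
    # Take (DFlow1, DFlow2) pairs; an odd last flow stays unpaired (assumed).
    return [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]


# Hypothetical flows, shown as short strings.
pairs = pair_by_length(["ab", "abcdef", "abc", "abcde", "a"])
```

Each resulting pair would then be handed to the alignment step to obtain the IFs.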

Preferably, the invention also improves the Needleman_Wunsch method to raise the classification accuracy. The scoring system of the improved Needleman_Wunsch method follows three principles: match as many characters as possible, align consecutive fields preferentially, and generate gaps only in the first sequence. $M_{ij}$ is the value in row $i$, column $j$ of the scoring matrix and is calculated by equation (2), where $w$ is the gap penalty and $S_{ij}$ is the similarity between the $i$-th character of message $a$ and the $j$-th character of message $b$; the scoring criterion for $S_{ij}$ is given by equation (1). Backtracking starts from $M_{ij}$ and at each step selects whichever of the upper, left and upper-left cells of $M_{ij}$ holds the largest value; if cells of equal value are encountered, the left and upper-left cells are preferred. Through the consecutive-match score, $S_{ij}$ allows the longest and most similar fields to be matched together preferentially, and gaps are generated only in the first sequence so that the position of the keyword can be output directly from the matching result. The improved scoring system $S_{ij}$ is shown in equation (3).

$$M_{ij} = \max\{\, M_{i-1,j-1} + S_{ij},\; M_{i,j-1} + w,\; M_{i-1,j} + w \,\} \tag{2}$$

In equation (3), $a_{i-1}$ denotes the $(i-1)$-th character of message $a$, $b_{j-1}$ likewise denotes the $(j-1)$-th character of message $b$, and $k$ and $p$ are constants that control the consecutive-match score.
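Recurrence (2) and the backtracking rule can be illustrated with a minimal sketch. It makes several assumptions: a plain match/mismatch score stands in for $S_{ij}$ (the improved score with constants k and p is not modeled), the values match = 1, mismatch = -1 and w = -1 are arbitrary, and the helper `fixed_fields` with its `min_len` cutoff is hypothetical.

```python
from typing import List, Optional, Tuple

Aligned = List[Optional[str]]  # None marks a gap


def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1,
                     w: int = -1) -> Tuple[Aligned, Aligned]:
    """Fill M by recurrence (2), then backtrack from the bottom-right cell."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * w
    for j in range(1, m + 1):
        M[0][j] = j * w

    def s(i: int, j: int) -> int:
        # Stand-in for S_ij: plain match/mismatch score (assumption).
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j - 1] + s(i, j),
                          M[i][j - 1] + w,
                          M[i - 1][j] + w)

    al_a: Aligned = []
    al_b: Aligned = []
    i, j = n, m
    while i > 0 or j > 0:
        # Ties are broken in the order upper-left, left, up.
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + s(i, j):
            al_a.append(a[i - 1])
            al_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif j > 0 and M[i][j] == M[i][j - 1] + w:
            al_a.append(None)          # gap in the first sequence
            al_b.append(b[j - 1])
            j -= 1
        else:
            al_a.append(a[i - 1])
            al_b.append(None)          # gap in the second sequence
            i -= 1
    return al_a[::-1], al_b[::-1]


def fixed_fields(al_a: Aligned, al_b: Aligned,
                 min_len: int = 2) -> List[Tuple[int, int, str]]:
    """Return (IF_s, IF_l, IF) for each run of matching characters."""
    fields, run, start, pos = [], [], 0, 0  # pos: offset in the first sequence
    for ca, cb in zip(al_a, al_b):
        if ca is not None and ca == cb:
            if not run:
                start = pos
            run.append(ca)
        else:
            if len(run) >= min_len:
                fields.append((start, len(run), "".join(run)))
            run = []
        if ca is not None:
            pos += 1
    if len(run) >= min_len:
        fields.append((start, len(run), "".join(run)))
    return fields


al_a, al_b = needleman_wunsch("report/dc00321", "report/dc04369")
fields = fixed_fields(al_a, al_b)
```

On the two example flows the alignment contains no gaps, and the only matching run of at least two characters is report/dc0 at position 0; the isolated matching character 3 is dropped by the `min_len` cutoff.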

In one embodiment, step S2 specifically includes:

Specifically, the longer the fixed field (IF_l), the more likely it is that a keyword occurs in the fixed field, and the larger the weight assigned to it. When the length of the fixed field reaches 32 bytes, which indicates that 4 consecutive characters appear in both tracks, the weight is set to its maximum value of 3.

In a specific implementation, an inherent defect of the Needleman_Wunsch method can produce false matches, which affect the experimental results. A flow chart of the noise elimination method is shown in fig. 9. The method first discards every IF whose weight is 1 and then determines the data type of each remaining IF. If the IF is a decimal number, the value IF_w − k × 0.192 / IF_w is assigned to IF_w, where k is 2; if the IF is a hexadecimal number, k is 1.5. Empirical calculation shows that 0.192 is the average noise generated by two decimal random numbers and 0.107 is the average noise generated by two hexadecimal random numbers.

In step 2.3, curve fitting means computing, from a set of discrete points, a distribution function that is convenient for computer processing; the fitted function can suppress erroneous data and interference, compensate for missing data, and predict future trends.

In one embodiment, step S3 specifically includes:

Specifically, the curve fitted by a B-spline is a complex polynomial curve, and the Newton CG method is an optimization algorithm used to find the extreme values of that curve.

Step 3.2: IF classification statistical method. Let $x_{a,i}$ denote a minimum point and $x_{b,i+1}$ denote a maximum point. Experiments show that the IFs are densely distributed

In the present invention, this interval is called the IF distribution interval. The IF classification statistical method takes the IF distribution intervals, IF_s, IF_l and the IFs as input. First, within each IF distribution interval, the IFs are sorted by IF_l in descending order using quicksort. Second, starting from the first IF, an unmarked IF is selected and its Levenshtein distance to each of the following IFs is calculated in turn; when the distance meets the threshold, the IF is considered to belong to the known class and is marked as such, and otherwise it is added to a new class. Third, the second step is repeated until all IFs are classified. Finally, the IF distribution intervals, the IF types and the number of IFs of each type are output. The algorithm of the IF classification statistical method is shown in fig. 10.

The Levenshtein distance is a commonly used edit distance. The method takes two character strings A and B as input, converts A into B by operating on one character at a time, and outputs the minimum number of operations required.
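The classification step can be sketched as below. The sketch is illustrative only: the threshold value of 2, the reading of "meets the threshold" as "distance is at most the threshold", and the sample IFs are all assumptions, and Python's built-in `sorted` is used in place of the quicksort named in the text.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def classify_ifs(ifs: List[str], threshold: int = 2) -> List[List[str]]:
    """Group IFs greedily: each unmarked IF seeds a class and absorbs close IFs."""
    ordered = sorted(ifs, key=len, reverse=True)  # longest IF first
    marked = [False] * len(ordered)
    classes = []
    for i, seed in enumerate(ordered):
        if marked[i]:
            continue
        marked[i] = True
        members = [seed]
        for j in range(i + 1, len(ordered)):
            # Assumption: "meets the threshold" means distance <= threshold.
            if not marked[j] and levenshtein(seed, ordered[j]) <= threshold:
                marked[j] = True
                members.append(ordered[j])
        classes.append(members)
    return classes


classes = classify_ifs(["report/dc0", "report/da4", "login/", "report/dc", "logout/"])
counts = [len(c) for c in classes]  # number of IFs of each type
```

Each inner list corresponds to one IF type, and its length gives the number of IFs of that type within the interval.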

In one embodiment, step S4 specifically includes:

In one embodiment, step S5 specifically includes:

Example two

The present embodiment provides a keyword-based network trajectory classification device. Referring to fig. 11, the device includes:

a trajectory segmentation module 201, configured to preliminarily classify the input mixed protocol tracks using a K-means method based on flow statistics characteristics to obtain K first track clusters; and, within each cluster, to sort the track flows by length in descending order, compare protocol tracks of similar length pairwise using the Needleman_Wunsch method, divide the tracks into fixed fields IF and variable fields VF, and calculate the length IF_l of each fixed field and the position information IF_s of the fixed field;

an IF distribution solving module 202, configured to weight the length IF_l of the fixed field using an IF_l weighting method to obtain the weight IF_w of the IF, and to perform curve fitting on the IFs, with IF_w and IF_s as input, using a curve fitting method to obtain an IF position distribution curve;

an IF classification module 203, configured to solve the extreme values of the fitted IF position distribution curve using a curve extremum solving method, extract the IFs in each extreme value interval of the curve, perform IF classification statistics based on the Levenshtein distance, and then output the types of IF contained in each extreme value interval and the number of IFs of each type;

a track clustering module 204, configured to mark all tracks according to the maximum IF in each extreme value interval, cluster them using the K-means method, and output a second track cluster;

a keyword inference module 205, configured to select the track cluster corresponding to the target security protocol from the second track cluster output by the track clustering module; then, taking the track cluster of the target security protocol as input, to pass it in turn through the trajectory segmentation module, the IF distribution solving module and the IF classification module to obtain the types of IF and the number of IFs of each type contained in the extreme value intervals corresponding to the target security protocol; to infer the separators by comparing the heads and tails of adjacent IFs using a separator inference method; to separate the keywords from the IFs according to the separators; and finally to form a signature database;

and a track classification module 206, configured to mark a track flow to be processed using the formed signature database, convert the track flow into a vector, and classify it by using the converted vector as the input of the K-means method.
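The final classification stage (marking a flow against the signature database, converting it to a vector, and clustering with K-means) might look like the sketch below. Every concrete detail here is an assumption made for illustration: the signature database contents, the choice of a keyword-presence vector, and the deterministic choice of the first k distinct vectors as initial centroids are not taken from the text.

```python
from typing import List


def flow_to_vector(flow: str, signature_db: List[str]) -> List[int]:
    """Mark a flow: 1 if the keyword occurs in it, 0 otherwise (assumed encoding)."""
    return [1 if keyword in flow else 0 for keyword in signature_db]


def nearest(v: List[float], centroids: List[List[float]]) -> int:
    """Index of the centroid with the smallest squared Euclidean distance to v."""
    dists = [sum((x - y) ** 2 for x, y in zip(v, c)) for c in centroids]
    return dists.index(min(dists))


def kmeans(vectors: List[List[int]], k: int, iters: int = 20) -> List[int]:
    """Plain Lloyd iteration; initial centroids are the first k distinct vectors."""
    centroids: List[List[float]] = []
    for v in vectors:
        if list(v) not in centroids:
            centroids.append([float(x) for x in v])
        if len(centroids) == k:
            break
    labels = None
    for _ in range(iters):
        new_labels = [nearest(v, centroids) for v in vectors]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


signature_db = ["report", "login", "dc"]  # hypothetical keywords
flows = ["report/dc00321", "report/dc04369", "login/u1", "login/u2"]
vectors = [flow_to_vector(f, signature_db) for f in flows]
labels = kmeans(vectors, k=2)
```

With these sample flows, the two report flows share one vector and the two login flows share another, so the two clusters separate after a single assignment pass.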

In one implementation, the trajectory segmentation module 201 is specifically configured to perform the following steps:

In one implementation, the IF distribution solving module 202 is specifically configured to perform the following steps:

In one implementation, the IF classification module 203 is specifically configured to perform the following steps:

step S3.1: obtain the derivative function $f'(x)$ of the IF position distribution curve; take an array $x_i$ of points spaced at interval $L$ within the domain, where $L$ is small; evaluate $f'(x_i)$; screen out the $x_i$ satisfying $f'(x_i) \times f'(x_{i+1}) \le 0$ to form an array; and then use this array of $x_i$ as the starting points of the Newton CG method to solve for the extreme points;
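Step S3.1 can be sketched in a few lines. Two substitutions are made and should be read as assumptions: a sine curve stands in for the fitted IF position distribution curve, and a plain one-dimensional Newton iteration on $f'(x) = 0$ stands in for the Newton CG method. The function name `find_extrema` and the spacing L = 0.5 are also hypothetical.

```python
import math
from typing import Callable, List, Tuple


def find_extrema(df: Callable[[float], float], d2f: Callable[[float], float],
                 lo: float, hi: float, L: float = 0.5,
                 tol: float = 1e-10, max_iter: int = 50) -> List[Tuple[float, str]]:
    """Screen grid points where f' changes sign, then refine each one."""
    n = int(round((hi - lo) / L))
    xs = [lo + i * L for i in range(n + 1)]
    # Keep x_i whenever f'(x_i) * f'(x_{i+1}) <= 0.
    starts = [xs[i] for i in range(n) if df(xs[i]) * df(xs[i + 1]) <= 0]
    extrema = []
    for x in starts:
        for _ in range(max_iter):
            step = df(x) / d2f(x)  # Newton step towards f'(x) = 0
            x -= step
            if abs(step) < tol:
                break
        # The sign of the second derivative tells maxima from minima.
        extrema.append((x, "max" if d2f(x) < 0 else "min"))
    return extrema


# Stand-in curve f(x) = sin(x) on [0, 7]: f'(x) = cos(x), f''(x) = -sin(x).
extrema = find_extrema(math.cos, lambda x: -math.sin(x), 0.0, 7.0)
```

A smaller L reduces the chance of missing two extrema that fall between neighbouring grid points, at the cost of more derivative evaluations. In practice a library routine such as SciPy's `scipy.optimize.minimize` with `method="Newton-CG"` could take the place of the hand-written iteration.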

In one implementation, the track clustering module 204 is specifically configured to perform the following steps:

In one implementation, the keyword inference module 205 is specifically configured to perform the following steps:

Since the device described in the second embodiment of the present invention is the device used to implement the keyword-based network trajectory classification method of the first embodiment, a person skilled in the art can understand the specific structure and variations of the device on the basis of the method described in the first embodiment, and the details are therefore not repeated here. All devices used to carry out the method of the first embodiment of the present invention fall within the protection scope of the present invention.

Example three

Based on the same inventive concept, the present application further provides a computer-readable storage medium 300 (see fig. 12) on which a computer program 311 is stored; when executed, the program implements the method of the first embodiment.

Since the computer-readable storage medium described in the third embodiment of the present invention is the medium used to implement the keyword-based network trajectory classification method of the first embodiment, a person skilled in the art can understand the specific structure and variations of the computer-readable storage medium on the basis of the method described in the first embodiment, and the details are therefore not repeated here. Any computer-readable storage medium used to carry out the method of the first embodiment of the present invention falls within the protection scope of the present invention.

Example four

Based on the same inventive concept, the present application further provides a computer device (see fig. 13), which includes a memory 401, a processor 402, and a computer program 403 stored in the memory and executable on the processor; when the processor 402 executes the program, the method of the first embodiment is implemented.

Since the computer device described in the fourth embodiment of the present invention is the computer device used to implement the keyword-based network trajectory classification method of the first embodiment, a person skilled in the art can understand the specific structure and variations of the computer device on the basis of the method described in the first embodiment, and the details are therefore not repeated here. All computer devices used to carry out the method of the first embodiment of the present invention fall within the protection scope of the present invention.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass such modifications and variations.

Claims

1. A network track classification method based on keywords is characterized by comprising the following steps:

2. The method according to claim 1, wherein step S1 specifically comprises:

3. The method according to claim 1, wherein step S2 specifically comprises:

4. The method according to claim 3, wherein step S3 specifically comprises:

step S3.1: obtain the derivative function $f'(x)$ of the IF position distribution curve; take an array $x_i$ of points spaced at interval $L$ within the domain; evaluate $f'(x_i)$; screen out the $x_i$ satisfying $f'(x_i) \times f'(x_{i+1}) \le 0$ to form an array; and then use this array of $x_i$ as the starting points of the Newton CG method to solve for the extreme points;

5. The method according to claim 4, wherein step S4 specifically comprises:

6. The method according to claim 1, wherein step S5 specifically comprises:

7. A keyword-based network trajectory classification device is characterized by comprising:

8. The apparatus of claim 7, wherein the trajectory segmentation module is specifically configured to perform the steps of:

9. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed, implements the method of any one of claims 1 to 6.

10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 6 when executing the program.