CN109309630B - Network traffic classification method and system and electronic equipment - Google Patents
- Publication number
- CN109309630B (application CN201811113686.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/24—Traffic characterised by specific attributes, e.g. priority or QoS
- H04L47/2441—Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The application relates to a network traffic classification method, a network traffic classification system and electronic equipment. The method comprises the following steps: step a: collecting network traffic data and labeling the network traffic data; step b: extracting a bidirectional flow feature set from the labeled network traffic data; step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result for the network traffic data through the classification model. Because the network traffic is classified using the bidirectional flow features in the network traffic data, a large number of new Internet applications can be accurately identified and classified, which improves classification accuracy and helps ensure high precision and high performance in network traffic classification.
Description
Technical Field
The present application relates to the field of network traffic classification technologies, and in particular, to a method, a system, and an electronic device for classifying network traffic.
Background
With the rapid spread of the Internet, the emergence of a large number of new applications has made modern network environments increasingly complex and diverse. Traffic classification and network application identification play an important role in network management services and security systems, such as quality of service, intrusion detection systems, and traffic management systems. If the traffic in a network system can be accurately classified and its applications identified, network security and the efficiency of network management services improve greatly, and system time and memory overhead can be reduced.
At present, existing network traffic classification methods mainly include:
Firstly, network traffic classification based on representation learning: the acquired network traffic data is preprocessed, features are extracted from the preprocessed data with a representation learning algorithm, network flow vectors are generated from the network traffic data, and the data is classified according to the network flow vectors, so that network traffic can be classified efficiently.
Secondly, network traffic classification based on semi-supervised learning: acquiring network flows of labeled and unlabeled types, and extracting a preset fixed number of flow features from each network flow to obtain a network flow feature vector; calculating, from the labeled types, the information gain of each flow feature, and weighting each flow feature according to its information gain; mixing the labeled and unlabeled network flows, and clustering the mixed network flows with a k-means algorithm to obtain k clusters; obtaining the number of labeled network flow feature vectors in each of the k clusters, and determining the proportion value of each type in each cluster, wherein the proportion value equals the ratio of the number of labeled network flow feature vectors of that type to the total number of labeled network flow feature vectors in the cluster; when the total number of labeled network flow feature vectors in a cluster is smaller than a preset network flow threshold, judging the cluster to be an unknown protocol cluster, and otherwise assigning the cluster the type with the largest proportion among its labeled network flow feature vectors; repeating the two preceding steps until a flow type has been determined for all k clusters; and using the flow clusters with determined flow types as training data to train a traffic classifier online. By exploiting semi-supervised learning, the method achieves better accuracy and stability than a traditional supervised learning algorithm that trains the model only on labeled data.
Thirdly, adaptive semi-supervised network traffic classification: acquiring network flows of labeled and unlabeled types, and extracting a preset fixed number of flow features from each network flow to obtain a network flow feature vector; calculating, from the labeled network flow feature vectors, the centroid of the feature vector set of each type to obtain a vector set M; taking the vector set M as the initial center points of k-means clustering, performing adaptive semi-supervised k-means clustering on the mixed set X of labeled and unlabeled network flow feature vectors, and outputting the k-means clusters; mapping the network flows in each output cluster to a flow type according to the maximum posterior probability of the labeled network flow feature vectors in that cluster, to obtain flow clusters of known type; and using the flow clusters of known type as training data to train a traffic classifier online.
In summary, existing network traffic classification methods focus mainly on the algorithm level: they propose various optimizations and improvements to the classification algorithm used in the training phase, but do not solve the problem of how to extract a large, relevant and effective feature set from network data packets, and therefore cannot accurately identify and classify the large number of new applications on the Internet.
Disclosure of Invention
The application provides a network traffic classification method, a network traffic classification system and electronic equipment, and aims to solve at least one of the technical problems in the prior art to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
A network traffic classification method comprises the following steps:
step a: collecting network traffic data and labeling the network traffic data;
step b: extracting a bidirectional flow feature set from the labeled network traffic data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result for the network traffic data through the classification model.
The technical solution adopted by the embodiment of the application further comprises: in step a, collecting the network traffic data and labeling the network traffic data specifically comprises:
step a1: selecting application categories in the network traffic;
step a2: collecting the network traffic data packets corresponding to each application and the system network log for the corresponding time period;
step a3: parsing the network traffic data packets to find the natural attributes of each application, as well as the IP addresses and transport protocols it uses to communicate with other applications;
step a4: extracting from the system network log the IP endpoints and the number of transmitted packets associated with each application, and associating and fusing them with the IP addresses and transport protocols to complete the labeling of the network traffic data.
The technical solution adopted by the embodiment of the application further comprises: in step b, extracting the bidirectional flow feature set from the labeled network traffic data specifically comprises:
step b1: analyzing the labeled network traffic data and, based on the different port numbers in the data, separately counting the bidirectional network flow information for each pair {source IP address -> destination IP address} and {destination IP address -> source IP address};
step b2: finding the forward network flows for each pair {source IP address -> destination IP address}, and extracting all forward network flow feature sets from the forward network flows;
step b3: finding the reverse network flows for each pair {destination IP address -> source IP address}, and extracting all reverse network flow feature sets from the reverse network flows;
step b4: combining the forward and reverse network flow feature sets of each pair {source IP address, destination IP address} to form a bidirectional flow feature set of M-dimensional features.
The technical solution adopted by the embodiment of the application further comprises: step b further comprises optimizing the bidirectional flow feature set by using a maximum variance interpretation mechanism.
The technical solution adopted by the embodiment of the application further comprises: optimizing the bidirectional flow feature set by using the maximum variance interpretation mechanism specifically comprises:
step b5: standardizing the network traffic data;
step b6: calculating, on the network traffic data, the mean of each feature in the bidirectional flow feature set;
step b7: subtracting the corresponding feature mean from the standardized network traffic data to obtain a new value for each feature, and normalizing the variance of the new values;
step b8: calculating the covariance matrix of the bidirectional flow feature set, and sorting the features from small to large by the variance of each feature on the main diagonal of the covariance matrix, to obtain the N-dimensional features with the highest and closest degree of association in the bidirectional flow feature set;
step b9: calculating the eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues by size, and selecting the eigenvectors corresponding to the first N optimized bidirectional flow features;
step b10: projecting the network traffic data onto the N eigenvectors;
step b11: thereby optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
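Steps b5 to b7 amount to per-feature standardization. The following is a minimal sketch in plain Python, assuming the feature set is held as a list of rows with one row per flow; the function name `standardize` and the sample values are illustrative and do not come from the application.

```python
from statistics import fmean, pstdev

def standardize(rows):
    """Return rows with every feature column shifted to mean 0 and scaled to variance 1."""
    columns = list(zip(*rows))                     # one tuple per feature
    means = [fmean(col) for col in columns]
    # Population standard deviation; a constant column uses 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - mu) / sigma for value, mu, sigma in zip(row, means, stds)]
        for row in rows
    ]

# Hypothetical flows: [packet count, mean packet size in bytes]
flows = [[10, 100.0], [20, 300.0], [30, 500.0]]
scaled = standardize(flows)
```

After this step the two columns are on the same scale, so a feature measured in bytes no longer dominates one measured in packet counts when variances are compared.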
Another technical solution adopted by the embodiment of the application is a network traffic classification system, comprising:
a data acquisition module: used for collecting network traffic data;
a data preprocessing module: used for labeling the network traffic data;
a feature extraction module: used for extracting a bidirectional flow feature set from the labeled network traffic data;
a model construction module: used for constructing a classification model based on the bidirectional flow feature set, and outputting a classification result for the network traffic data through the classification model.
The technical solution adopted by the embodiment of the application further comprises:
the data acquisition module collects the network traffic data specifically by: selecting application categories in the network traffic, and collecting the network traffic data packets corresponding to each application and the system network log for the corresponding time period;
the data preprocessing module labels the network traffic data specifically by: parsing the network traffic data packets to find the natural attributes of each application, as well as the IP addresses and transport protocols it uses to communicate with other applications; and extracting from the system network log the IP endpoints and the number of transmitted packets associated with each application, and associating and fusing them with the IP addresses and transport protocols to complete the labeling of the network traffic data.
The technical solution adopted by the embodiment of the application further comprises: the feature extraction module extracts the bidirectional flow feature set from the labeled network traffic data specifically by:
analyzing the labeled network traffic data and, based on the different port numbers in the data, separately counting the bidirectional network flow information for each pair {source IP address -> destination IP address} and {destination IP address -> source IP address};
finding the forward network flows for each pair {source IP address -> destination IP address}, and extracting all forward network flow feature sets from the forward network flows;
finding the reverse network flows for each pair {destination IP address -> source IP address}, and extracting all reverse network flow feature sets from the reverse network flows;
combining the forward and reverse network flow feature sets of each pair {source IP address, destination IP address} to form a bidirectional flow feature set of M-dimensional features.
The technical solution adopted by the embodiment of the application further comprises a feature optimization module, which is used for optimizing the bidirectional flow feature set by using a maximum variance interpretation mechanism.
The technical solution adopted by the embodiment of the application further comprises: the feature optimization module optimizes the bidirectional flow feature set by using the maximum variance interpretation mechanism specifically by:
standardizing the network traffic data;
calculating, on the network traffic data, the mean of each feature in the bidirectional flow feature set;
subtracting the corresponding feature mean from the standardized network traffic data to obtain a new value for each feature, and normalizing the variance of the new values;
calculating the covariance matrix of the bidirectional flow feature set, and sorting the features from small to large by the variance of each feature on the main diagonal of the covariance matrix, to obtain the N-dimensional features with the highest and closest degree of association in the bidirectional flow feature set;
calculating the eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues by size, and selecting the eigenvectors corresponding to the first N optimized bidirectional flow features;
projecting the network traffic data onto the N eigenvectors;
thereby optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
Yet another technical solution adopted by the embodiment of the application is an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the following operations of the network traffic classification method described above:
step a: collecting network traffic data and labeling the network traffic data;
step b: extracting a bidirectional flow feature set from the labeled network traffic data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result for the network traffic data through the classification model.
Compared with the prior art, the embodiment of the application has the following advantages: the network traffic classification method, system and electronic equipment of the embodiment classify network traffic using the bidirectional flow features in the network traffic data, and can accurately identify and classify a large number of new Internet applications; in addition, the maximum variance interpretation mechanism is used to optimize and associate the bidirectional flow features, which keeps the bidirectional flow features highly cohesive, improves classification accuracy, and helps ensure high precision and high performance in network traffic classification.
Drawings
Fig. 1 is a flowchart of a network traffic classification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the process of collecting and labeling network traffic data;
Fig. 3 is a schematic diagram of the bidirectional flow feature set extraction and optimization process;
Fig. 4 is a schematic structural diagram of a network traffic classification system according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a hardware device for the network traffic classification method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
Please refer to Fig. 1, which is a flowchart of a network traffic classification method according to an embodiment of the present application. The network traffic classification method of the embodiment comprises the following steps:
step 100: collecting network traffic data and labeling the network traffic data;
in step 100, the process of collecting and labeling the network traffic data is shown in Fig. 2, and the specific steps are as follows:
step 101: selecting application categories in the network traffic;
step 102: continuously capturing the traffic of the fixed application categories through high-performance network monitoring software;
step 103: collecting the network traffic data packets corresponding to each application category and the system network log for the corresponding time period;
step 104: parsing the network traffic data packets to find the natural attributes of each application and the key information it uses to communicate with other applications, such as IP addresses and transport protocols;
step 105: extracting from the system network log the IP endpoints and the number of transmitted packets associated with each application, and associating and fusing them with the IP addresses and transport protocols to complete the labeling of the network traffic data.
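The association in step 105 can be pictured as a join between parsed packets and system log entries on the pair (IP endpoint, transport protocol). The sketch below is only an illustration under that assumption: the application does not specify a log format, so the dictionary fields (`ip`, `proto`, `app`, `src`, `dst`, `size`), the function name `label_packets`, and the sample addresses are all hypothetical.

```python
from collections import Counter

def label_packets(packets, log_entries):
    """Attach an application label to each packet by matching (IP endpoint, protocol)."""
    # Index the system network log: (ip, protocol) -> application name.
    endpoint_to_app = {(e["ip"], e["proto"]): e["app"] for e in log_entries}
    labeled = []
    for pkt in packets:
        # A packet is attributed to an application if either endpoint appears in the log.
        app = (endpoint_to_app.get((pkt["src"], pkt["proto"]))
               or endpoint_to_app.get((pkt["dst"], pkt["proto"]))
               or "unknown")
        labeled.append({**pkt, "label": app})
    return labeled

log_entries = [
    {"app": "video", "ip": "203.0.113.5", "proto": "UDP"},
    {"app": "mail", "ip": "198.51.100.7", "proto": "TCP"},
]
packets = [
    {"src": "10.0.0.2", "dst": "203.0.113.5", "proto": "UDP", "size": 1200},
    {"src": "198.51.100.7", "dst": "10.0.0.2", "proto": "TCP", "size": 400},
    {"src": "10.0.0.2", "dst": "192.0.2.9", "proto": "TCP", "size": 60},
]
labeled = label_packets(packets, log_entries)
packets_per_app = Counter(p["label"] for p in labeled)
```

The per-application packet counts in `packets_per_app` could then be compared with the transmitted packet numbers recorded in the log as a consistency check on the labeling.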
Step 200: extracting a bidirectional flow feature set from the labeled network traffic data, and optimizing the bidirectional flow feature set by using a maximum variance interpretation mechanism;
in step 200, the process of extracting and optimizing the bidirectional flow feature set is shown in Fig. 3, and specifically comprises the following steps:
step 201: analyzing the labeled network traffic data and, based on the different port numbers in the data, separately counting the bidirectional (forward and reverse) network flow information for each pair {source IP address -> destination IP address} and {destination IP address -> source IP address}, wherein each pair {source IP address, destination IP address} has network flow information in two opposite directions;
step 202: finding the forward network flows for each pair {source IP address -> destination IP address}, and extracting all forward network flow feature sets F1 from each forward network flow;
step 203: finding the reverse network flows for each pair {destination IP address -> source IP address}, and extracting all reverse network flow feature sets F2 from each reverse network flow;
step 204: combining the forward and reverse network flow feature sets {F1, F2} of each pair {source IP address, destination IP address} to form a bidirectional flow feature set F of M-dimensional features, denoted F = {F1, F2};
in step 204, all the forward and reverse network flow feature sets are combined so that they can be optimized uniformly.
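Steps 201 to 204 can be sketched as grouping packets by IP pair, computing the same statistics per direction, and concatenating them. This is a simplified illustration, not the application's feature set: port numbers are omitted, the three per-direction statistics are chosen arbitrarily, and the rule that the first sender in a pair is the "source" is an assumption, as is the function name `bidirectional_features`.

```python
from collections import defaultdict

def bidirectional_features(packets):
    """Build one feature row per IP pair: forward features F1 followed by reverse features F2."""
    flows = defaultdict(lambda: {"fwd": [], "rev": []})
    for src, dst, size in packets:
        # The endpoint that sends first in a pair is treated as the source (an assumption).
        if (dst, src) in flows:
            flows[(dst, src)]["rev"].append(size)
        else:
            flows[(src, dst)]["fwd"].append(size)

    def stats(sizes):
        # Per-direction features: packet count, total bytes, mean packet size.
        total = sum(sizes)
        return [len(sizes), total, total / len(sizes) if sizes else 0.0]

    # F = {F1, F2}: forward statistics concatenated with reverse statistics (M = 6 here).
    return {pair: stats(d["fwd"]) + stats(d["rev"]) for pair, d in flows.items()}

packets = [
    ("10.0.0.2", "203.0.113.5", 100),
    ("203.0.113.5", "10.0.0.2", 1500),
    ("10.0.0.2", "203.0.113.5", 60),
    ("203.0.113.5", "10.0.0.2", 1300),
]
features = bidirectional_features(packets)
```

In this toy capture the forward direction carries small requests and the reverse direction carries large responses, an asymmetry that a feature set built from one direction only would not expose.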
Step 205: standardizing the network traffic data so that the data set has a mean of 0 and a variance of 1; the standardization formula is: x' = (x - u) / δ, where u is the mean of all network traffic data and δ is the standard deviation of all network traffic data;
step 206: calculating, on the network traffic data, the mean of each feature in the bidirectional flow feature set F;
step 207: subtracting the corresponding feature mean from the standardized network traffic data to obtain a new value for each feature, and normalizing the variance of the new values;
step 208: calculating the covariance matrix of the bidirectional flow feature set F, and sorting the features from small to large by the variance of each feature on the main diagonal of the covariance matrix, to obtain the N-dimensional features with the highest and closest degree of association in the bidirectional flow feature set F;
in step 208, the main diagonal of the covariance matrix holds the variance of each feature, and the off-diagonal entries hold the covariance between each pair of features. A covariance greater than 0 indicates that the two features are positively correlated; a covariance less than 0 indicates that they are negatively correlated; a covariance equal to 0 indicates that the two features are linearly uncorrelated; the larger the absolute value of the covariance, the closer the association between the two features, and vice versa. From these conditions, the N-dimensional features with the highest and closest degree of association in the bidirectional flow feature set F can be determined. The application uses the maximum variance interpretation mechanism to preferentially combine the most closely associated features of the bidirectional network flow feature sets in the network traffic data, and to screen out the feature set that best reflects the network traffic categories.
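A small worked example may help in reading such a matrix: each feature's variance sits on the main diagonal, and the sign of an off-diagonal entry shows whether two features move together or in opposite directions. The sketch below computes a population covariance matrix in plain Python; the function name `covariance_matrix` and the three sample features are illustrative assumptions.

```python
def covariance_matrix(rows):
    """Population covariance matrix of the feature columns of `rows`."""
    n = len(rows)
    q = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(q)]
    centered = [[row[j] - means[j] for j in range(q)] for row in rows]
    return [
        [sum(r[i] * r[j] for r in centered) / n for j in range(q)]
        for i in range(q)
    ]

# Hypothetical features per flow: packets sent, bytes sent, idle time
rows = [[1.0, 2.0, 9.0], [2.0, 4.0, 7.0], [3.0, 6.0, 5.0]]
cov = covariance_matrix(rows)
variances = [cov[i][i] for i in range(3)]   # main diagonal: one variance per feature
```

Here `cov[0][1]` is positive because packets and bytes rise together, while `cov[0][2]` is negative because idle time falls as the packet count rises.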
Step 209: calculating the eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues by size, and selecting the eigenvectors corresponding to the first N optimized bidirectional flow features;
step 210: projecting the network traffic data onto the selected N eigenvectors: assuming the network traffic data has p samples and q features, the sample matrix obtained by subtracting the feature means from the network traffic data is DataTransform (p × q), the covariance matrix of the bidirectional flow feature set is q × q, and the matrix formed by the N selected eigenvectors is EigenVectors (q × N), the projected network traffic data is: OptimizeData (p × N) = DataTransform (p × q) × EigenVectors (q × N);
in step 210, projecting the network traffic data onto the eigenvectors corresponding to the optimized bidirectional flow features increases the degree of aggregation of the data, reduces the influence of noisy data, and improves classification accuracy.
Step 211: the M-dimensional bidirectional flow feature set of the network traffic data is thereby optimized into an N-dimensional bidirectional flow feature set.
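The eigenvector selection and projection steps resemble principal component analysis. The sketch below assumes that reading and takes "the first N" to mean the N largest eigenvalues. To stay dependency-free it substitutes power iteration with deflation for a full eigen-decomposition, which is a swapped-in technique and not something the application specifies; all function names and the sample data are illustrative.

```python
import math

def center(rows):
    """Subtract each feature's mean from every row (the DataTransform matrix)."""
    n, q = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(q)]
    return [[r[j] - means[j] for j in range(q)] for r in rows]

def covariance(centered):
    """q x q population covariance matrix of already-centered rows."""
    n, q = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / n for j in range(q)] for i in range(q)]

def leading_eigenpairs(cov, n_components, iterations=200):
    """Approximate the largest eigenpairs of a symmetric matrix by power iteration with deflation."""
    q = len(cov)
    matrix = [row[:] for row in cov]
    pairs = []
    for k in range(n_components):
        vec = [1.0 + 0.1 * (i + k) for i in range(q)]   # arbitrary deterministic start vector
        for _ in range(iterations):
            nxt = [sum(matrix[i][j] * vec[j] for j in range(q)) for i in range(q)]
            norm = math.sqrt(sum(v * v for v in nxt))
            if norm == 0.0:
                break
            vec = [v / norm for v in nxt]
        value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(q)) for i in range(q))
        pairs.append((value, vec))
        # Deflate: remove the component just found so the next pass finds the next largest.
        for i in range(q):
            for j in range(q):
                matrix[i][j] -= value * vec[i] * vec[j]
    return pairs

def project(centered, eigenvectors):
    """OptimizeData (p x N) = DataTransform (p x q) x EigenVectors (q x N)."""
    return [[sum(x * e for x, e in zip(row, vec)) for vec in eigenvectors] for row in centered]

rows = [[1.0, 2.0, 9.0], [2.0, 4.0, 7.0], [3.0, 6.0, 5.0]]   # p = 3 samples, q = M = 3 features
centered = center(rows)
pairs = leading_eigenpairs(covariance(centered), n_components=1)   # keep N = 1 direction
reduced = project(centered, [vec for _, vec in pairs])             # p x N matrix
```

Because the three sample features vary along a single line, one eigenvector captures all of the variance, and the three-dimensional rows collapse to one coordinate each without losing information. A production implementation would more likely call a linear algebra library than iterate by hand.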
Step 300: based on the optimized bidirectional flow feature set, constructing a classification model with a supervised random forest algorithm, and outputting a classification result for the network traffic data through the classification model;
in step 300, the model is built with a supervised random forest algorithm: the optimized bidirectional flow feature set is input into the classification model for classification training, and the model is tuned according to an evaluation of its performance. In the verification stage, the trained classification model is tested on a test data set; the test results show that the classification model built on the optimized bidirectional flow feature set achieves high classification accuracy, so classification efficiency can be improved while high classification accuracy is maintained, improving overall performance.
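The application does not give the forest's structure or hyperparameters, so the following is only a heavily simplified stand-in that shows the two ideas behind a random forest: bootstrap sampling and majority voting. Each "tree" here is a single-split stump on one randomly chosen feature, which is far weaker than a real random forest; the function names, the tree count, the seed and the sample flows are all assumptions.

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature; returns (feature, threshold, left_label, right_label)."""
    best = None
    values = sorted(set(row[feature] for row in rows))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(r != right_label for r in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:   # all sampled values identical: always predict the majority label
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])

def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]   # bootstrap sample with replacement
        feature = rng.randrange(len(rows[0]))            # random feature choice
        forest.append(train_stump([rows[i] for i in idx], [labels[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Classify one feature row by majority vote over all stumps."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical optimized features per flow: [mean packet interval, mean packet size]
rows = [[1.0, 100.0], [1.5, 120.0], [2.0, 110.0], [1.2, 130.0],
        [8.0, 900.0], [8.5, 950.0], [9.0, 1000.0], [8.8, 980.0]]
labels = ["web"] * 4 + ["video"] * 4
forest = train_forest(rows, labels)
```

Calling `predict(forest, [1.3, 115.0])` returns the label voted for by most stumps. In practice an established implementation with full-depth trees, such as scikit-learn's `RandomForestClassifier`, would be the more usual choice.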
Please refer to fig. 4, which is a block diagram of a network traffic classification system according to an embodiment of the present application. The network traffic classification system comprises a data acquisition module, a data preprocessing module, a feature extraction module, a feature optimization module and a model construction module.
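The data flow between the five modules can be sketched as a simple chain in which each module consumes the previous module's output. The class name `TrafficClassificationSystem`, its field names and the stub stages below are hypothetical; the application describes the modules' responsibilities but not their interfaces.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class TrafficClassificationSystem:
    acquire: Callable[[], Any]          # data acquisition module
    preprocess: Callable[[Any], Any]    # data preprocessing (labeling) module
    extract: Callable[[Any], Any]       # feature extraction module
    optimize: Callable[[Any], Any]      # feature optimization module
    build_model: Callable[[Any], Any]   # model construction module

    def run(self):
        """Run the modules in order and return the model construction module's output."""
        raw = self.acquire()
        labeled = self.preprocess(raw)
        features = self.extract(labeled)
        reduced = self.optimize(features)
        return self.build_model(reduced)

calls = []

def stub(name):
    """Create a placeholder stage that only records that it was called."""
    def stage(*args):
        calls.append(name)
        return name
    return stage

system = TrafficClassificationSystem(
    acquire=stub("acquire"),
    preprocess=stub("preprocess"),
    extract=stub("extract"),
    optimize=stub("optimize"),
    build_model=stub("build_model"),
)
result = system.run()
```

Keeping each module behind a plain callable means a stage such as the feature optimizer can be replaced or skipped without touching the others.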
A data acquisition module: the system is used for collecting network flow data; the network flow data acquisition mode comprises the following steps: selecting application types in the network flow, continuously capturing fixed application type flow through high-performance network monitoring software, and collecting network flow data packets corresponding to each application type and system network logs corresponding to a time period.
A data preprocessing module: the system is used for labeling the network flow data; the labeling process of the network traffic data specifically comprises the following steps: analyzing the network flow data packet, and finding out the natural attribute of each application and key information communicated with other applications, such as an IP address, a transmission protocol and the like; and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining the IP address and the transmission protocol to finish the labeling processing of the network flow data.
A feature extraction module: used for extracting the bidirectional flow feature set from the labeled network traffic data; specifically, the bidirectional flow feature set is extracted as follows:
a. analyzing the labeled network traffic data and, based on the different port numbers in the data, separately counting the bidirectional (forward and reverse) network flow information for each pair {source IP address -> destination IP address} and {destination IP address -> source IP address}, wherein each pair {source IP address, destination IP address} has network flow information in two opposite directions;
b. finding the forward network flows for each pair {source IP address -> destination IP address}, and extracting all forward network flow feature sets F1 from each forward network flow;
c. finding the reverse network flows for each pair {destination IP address -> source IP address}, and extracting all reverse network flow feature sets F2 from each reverse network flow;
d. combining the forward and reverse network flow feature sets {F1, F2} of each pair {source IP address, destination IP address} to form a bidirectional flow feature set F of M-dimensional features, denoted F = {F1, F2}.
A feature optimization module: used for optimizing the extracted bidirectional flow feature set by using a maximum variance interpretation mechanism; specifically, the bidirectional flow feature set is optimized as follows:
a. standardizing the network traffic data so that the data set has a mean of 0 and a variance of 1; the standardization formula is: x' = (x - u) / δ, where u is the mean of all network traffic data and δ is the standard deviation of all network traffic data;
b. calculating, on the network traffic data, the mean of each feature in the bidirectional flow feature set F;
c. subtracting the corresponding feature mean from the standardized network traffic data to obtain a new value for each feature, and normalizing the variance of the new values;
d. calculating the covariance matrix of the bidirectional flow feature set F, and sorting the features from small to large by the variance of each feature on the main diagonal of the covariance matrix, to obtain the N-dimensional features with the highest and closest degree of association in the bidirectional flow feature set F; the main diagonal of the covariance matrix holds the variance of each feature, and the off-diagonal entries hold the covariance between each pair of features. A covariance greater than 0 indicates that the two features are positively correlated; a covariance less than 0 indicates that they are negatively correlated; a covariance equal to 0 indicates that the two features are linearly uncorrelated; the larger the absolute value of the covariance, the closer the association between the two features, and vice versa. From these conditions, the N-dimensional features with the highest and closest degree of association in the bidirectional flow feature set F can be determined. The application uses the maximum variance interpretation mechanism to preferentially combine the most closely associated features of the bidirectional network flow feature sets in the network traffic data, and to screen out the feature set that best reflects the network traffic categories.
e. Calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
f. projecting the network flow data to the selected N eigenvectors: assuming that the sample number of the network traffic data is p, the feature number is q, a sample matrix obtained by subtracting a feature mean value from the network traffic data is DataTransform (p × q), a covariance matrix of a bidirectional flow feature set is p × q, and a matrix formed by N selected feature vectors is EigenVectors (q × N), the projected network traffic data is: OptimizeData (p × N) ═ DataTransform (p × q) X EigenVectors (q × N);
g. and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
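The optimization procedure is close to a standard principal component projection, and a small sketch may make the matrix shapes concrete. The function names below are hypothetical, and the source does not say how the eigenvectors are computed; power iteration with deflation is used here only to keep the example free of external dependencies. The variance-based ordering of step d is omitted, because after standardization every feature has unit variance.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def standardize(data):
    """Steps a-c: scale every feature column to mean 0 and variance 1."""
    p, q = len(data), len(data[0])  # assumes at least two samples
    means = [sum(row[j] for row in data) / p for j in range(q)]
    # A constant column gets a divisor of 1.0 to avoid division by zero.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in data) / (p - 1)) or 1.0
        for j in range(q)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(q)] for row in data]


def covariance(centered):
    """q x q covariance matrix of mean-centred data (p rows, q columns)."""
    p, q = len(centered), len(centered[0])
    return [
        [sum(row[i] * row[j] for row in centered) / (p - 1) for j in range(q)]
        for i in range(q)
    ]


def top_eigenpairs(cov, n, iters=200):
    """Step e: n leading (eigenvalue, eigenvector) pairs of a symmetric
    positive semi-definite matrix, by power iteration with deflation."""
    q = len(cov)
    a = [row[:] for row in cov]
    pairs = []
    for _ in range(n):
        v = unit([float(j + 1) for j in range(q)])  # arbitrary start vector
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(q)) for i in range(q)]
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break  # the remaining matrix is numerically zero
            v = unit(w)
        lam = sum(v[i] * a[i][j] * v[j] for i in range(q) for j in range(q))
        pairs.append((lam, v))
        # Deflation: remove the component that was just found.
        a = [[a[i][j] - lam * v[i] * v[j] for j in range(q)] for i in range(q)]
    return pairs


def optimize_features(data, n):
    """Steps a-g: reduce p x q data to p x n."""
    transformed = standardize(data)  # DataTransform (p x q)
    vectors = [v for _, v in top_eigenpairs(covariance(transformed), n)]
    # OptimizeData (p x n) = DataTransform (p x q) x EigenVectors (q x n)
    return [[sum(x * e for x, e in zip(row, v)) for v in vectors]
            for row in transformed]
```

Calling `optimize_features(data, n)` on a list of p rows with q numeric features returns p rows with n values each. With a numerical library available, a symmetric eigensolver would normally replace the power iteration.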
A model construction module: used for constructing a classification model from the optimized bidirectional flow feature set by using a supervised-learning random forest algorithm, and outputting the classification result of the network traffic data through the classification model. The optimized bidirectional flow feature set is input into the classification model for classification training, and the performance of the classification model is tuned through performance evaluation of the model. In the verification stage, the trained classification model is tested with a test data set; the test results show that the classification model constructed from the optimized bidirectional flow feature set achieves high classification accuracy, so that classification efficiency is improved while high classification accuracy is maintained, and overall performance is improved.
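The source names only a supervised random forest and gives no hyperparameters, so the following is a deliberately tiny stand-in rather than the patented model: a bagged ensemble of one-level decision trees (stumps), each trained on a bootstrap sample with a randomly chosen feature subset and combined by majority vote. The function names, the tree count, the seed, and the class labels used with it are assumptions for illustration; a production system would more likely rely on a library implementation with full-depth trees.

```python
import random
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y, features):
    """Find the single-feature threshold split with the fewest training errors."""
    best = None
    for f in features:
        values = sorted({row[f] for row in X})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between neighbouring values
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab, r_lab = majority(left), majority(right)
            errors = (sum(lab != l_lab for lab in left)
                      + sum(lab != r_lab for lab in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no split possible: always predict the majority class
        return (features[0], float("inf"), majority(y), majority(y))
    return best[1:]


def train_forest(X, y, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and random features."""
    rng = random.Random(seed)
    n, q = len(X), len(X[0])
    k = max(1, int(q ** 0.5))  # number of features each stump may examine
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feats = rng.sample(range(q), k)
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Classify one feature vector by majority vote over all stumps."""
    votes = [l_lab if row[f] <= t else r_lab for f, t, l_lab, r_lab in forest]
    return majority(votes)


def accuracy(forest, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(forest, row) == lab for row, lab in zip(X, y)) / len(y)
```

In use, `train_forest` would receive the N-dimensional optimized feature vectors with their application labels, and `accuracy` would be computed on a held-out test set as the performance evaluation described above.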
Fig. 5 is a schematic structural diagram of the hardware device for the network traffic classification method according to an embodiment of the present application. As shown in Fig. 5, the device includes one or more processors and a memory. Taking one processor as an example, the device may further include an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, as exemplified by the bus connection in fig. 5.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications and data processing of the electronic device, i.e., implements the processing method of the above-described method embodiment, by executing the non-transitory software program, instructions and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processing system over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations of any of the above method embodiments:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium having stored thereon computer-executable instructions that may perform the following operations:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
Embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the following:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model.
The network traffic classification method, system and electronic device of the embodiments of the present application classify network traffic by using the bidirectional flow features in the network traffic data, and can accurately identify and classify a large number of new applications on the Internet. Meanwhile, the maximum variance interpretation mechanism is used to optimize and associate the bidirectional flow features, which ensures high cohesion of the bidirectional flow features, improves classification accuracy, and effectively ensures high precision and high performance of network traffic classification.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (7)
1. A network traffic classification method, characterized by comprising the following steps:
step a: collecting network flow data and labeling the network flow data;
step b: extracting a bidirectional flow characteristic set according to the labeled network flow data;
step c: constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network traffic data through the classification model;
the step b further comprises the following steps: optimizing the bidirectional flow feature set by using a maximum variance interpretation mechanism;
the optimizing the bidirectional flow feature set by using the maximum variance interpretation mechanism specifically comprises:
step b5: performing standard normalization on the network traffic data;
step b6: on the network flow data, calculating the average value of each feature on the bidirectional flow feature set;
step b7: subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
step b8: calculating a covariance matrix of the bidirectional flow feature set, and sequencing the features from small to large according to the variance value of each feature on a main diagonal in the covariance matrix to obtain the N-dimensional features with the highest and closest association degree in the bidirectional flow feature set;
step b9: calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
step b10: projecting the network traffic data onto the N eigenvectors;
step b11: optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
2. The method for classifying network traffic according to claim 1, wherein in the step a, the collecting network traffic data and labeling the network traffic data specifically include:
step a1: selecting an application category in the network traffic;
step a2: collecting a network flow data packet corresponding to each application and a system network log of a corresponding time period;
step a3: analyzing the network flow data packet, and finding out the natural attribute of each application and the IP address and the transmission protocol communicated with other applications;
step a4: extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining an IP address and a transmission protocol to finish the labeling processing of the network flow data.
3. The method according to claim 2, wherein in the step b, the extracting a bidirectional flow feature set according to the labeled network traffic data specifically includes:
step b1: analyzing according to the labeled network traffic data, and respectively counting bidirectional network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data;
step b2: finding out forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets from the forward network flows;
step b3: finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets from the reverse network flows;
step b4: combining the forward and reverse network flow feature sets between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set of the M-dimensional features.
4. A network traffic classification system, comprising:
a data acquisition module: used for collecting network flow data;
a data preprocessing module: used for labeling the network flow data;
a feature extraction module: used for extracting a bidirectional flow feature set according to the network flow data subjected to the labeling processing;
a model construction module: used for constructing a classification model based on the bidirectional flow feature set, and outputting a classification result of the network flow data through the classification model;
the system also comprises a feature optimization module, wherein the feature optimization module is used for optimizing the bidirectional flow feature set by utilizing a maximum variance interpretation mechanism;
the feature optimization module specifically optimizes the bidirectional flow feature set by using a maximum variance interpretation mechanism, and comprises the following steps:
performing standard normalization on the network traffic data;
on the network flow data, calculating the average value of each feature on the bidirectional flow feature set;
subtracting the average value corresponding to each feature from the normalized network flow data to obtain a new result of each feature, and performing variance normalization on the new result of each feature;
calculating a covariance matrix of the bidirectional flow feature set, and sequencing the features from small to large according to the variance value of each feature on a main diagonal in the covariance matrix to obtain the N-dimensional features with the highest and closest association degree in the bidirectional flow feature set;
calculating eigenvalues and eigenvectors of the covariance matrix, sorting the eigenvalues according to sizes, and selecting eigenvectors corresponding to the first N optimized bidirectional flow characteristics;
projecting the network traffic data onto the N eigenvectors;
and optimizing the M-dimensional bidirectional flow feature set of the network traffic data into an N-dimensional bidirectional flow feature set.
5. The network traffic classification system of claim 4,
the data acquisition module specifically acquires network traffic data and comprises: selecting application types in network flow, and collecting a network flow data packet corresponding to each application and a system network log corresponding to a time period;
the data preprocessing module is used for labeling the network traffic data and specifically comprises the following steps: analyzing the network flow data packet, and finding out the natural attribute of each application and the IP address and the transmission protocol communicated with other applications; and extracting the IP end points and the transmission packet number associated with each application in the system network log, and performing association fusion by combining an IP address and a transmission protocol to finish the labeling processing of the network flow data.
6. The network traffic classification system according to claim 5, wherein the extracting, by the feature extraction module, the bidirectional flow feature set according to the labeled network traffic data specifically includes:
analyzing according to the labeled network traffic data, and respectively counting bidirectional network flow information between each pair of { source IP address -> destination IP address } and { destination IP address -> source IP address } based on different port numbers in the network traffic data;
finding out forward network flows between each pair of { source IP address -> destination IP address }, and extracting all forward network flow feature sets from the forward network flows;
finding out reverse network flows between each pair of { destination IP address -> source IP address }, and extracting all reverse network flow feature sets from the reverse network flows;
and combining the forward and reverse network flow feature sets between each pair of { source IP address, destination IP address } to form a bidirectional flow feature set of the M-dimensional features.
7. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of classifying network traffic of any of claims 1 to 3.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811113686.XA CN109309630B (en) | 2018-09-25 | 2018-09-25 | Network traffic classification method and system and electronic equipment |
PCT/CN2018/112401 WO2020062390A1 (en) | 2018-09-25 | 2018-10-29 | Network traffic classification method and system, and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811113686.XA CN109309630B (en) | 2018-09-25 | 2018-09-25 | Network traffic classification method and system and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109309630A CN109309630A (en) | 2019-02-05 |
CN109309630B CN109309630B (en) | 2021-09-21 |
Family
ID=65225067
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811113686.XA Active CN109309630B (en) | 2018-09-25 | 2018-09-25 | Network traffic classification method and system and electronic equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109309630B (en) |
WO (1) | WO2020062390A1 (en) |
Families Citing this family (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110097120B (en) * | 2019-04-30 | 2022-08-26 | 南京邮电大学 | Network flow data classification method, equipment and computer storage medium |
CN110149280B (en) * | 2019-05-27 | 2020-08-28 | 中国科学技术大学 | Network traffic classification method and device |
CN110365603A (en) * | 2019-06-28 | 2019-10-22 | 西安交通大学 | A kind of self adaptive network traffic classification method open based on 5G network capabilities |
CN111698223B (en) * | 2020-05-22 | 2022-02-22 | 哈尔滨工程大学 | Encrypted WEB fingerprint identification method based on automatic feature engineering |
CN113746686A (en) * | 2020-05-27 | 2021-12-03 | 阿里巴巴集团控股有限公司 | Network flow state determination method, computing device and storage medium |
CN111817971B (en) * | 2020-06-12 | 2023-03-24 | 华为技术有限公司 | Data center network flow splicing method based on deep learning |
CN111970305B (en) * | 2020-08-31 | 2022-08-12 | 福州大学 | Abnormal flow detection method based on semi-supervised descent and Tri-LightGBM |
CN112448868B (en) * | 2020-12-02 | 2022-09-30 | 新华三人工智能科技有限公司 | Network traffic data identification method, device and equipment |
CN112804253B (en) * | 2021-02-04 | 2022-07-12 | 湖南大学 | Network flow classification detection method, system and storage medium |
CN112839055B (en) * | 2021-02-04 | 2022-08-23 | 北京六方云信息技术有限公司 | Network application identification method and device for TLS encrypted traffic and electronic equipment |
CN113098735B (en) * | 2021-03-31 | 2022-10-11 | 上海天旦网络科技发展有限公司 | Inference-oriented application flow and index vectorization method and system |
CN113114672B (en) * | 2021-04-12 | 2023-02-28 | 常熟市国瑞科技股份有限公司 | Video transmission data fine measurement method |
CN113141357B (en) * | 2021-04-19 | 2022-02-18 | 湖南大学 | Feature selection method and system for optimizing network intrusion detection performance |
CN112995063B (en) * | 2021-04-19 | 2021-10-08 | 北京智源人工智能研究院 | Flow monitoring method, device, equipment and medium |
CN113315721B (en) * | 2021-05-26 | 2023-01-17 | 恒安嘉新(北京)科技股份公司 | Network data feature processing method, device, equipment and storage medium |
CN113556317B (en) * | 2021-06-07 | 2022-10-11 | 中国科学院信息工程研究所 | Abnormal flow detection method and device based on network flow structural feature fusion |
CN114928560B (en) * | 2022-05-16 | 2023-01-31 | 珠海市鸿瑞信息技术股份有限公司 | Big data based network flow and equipment log cooperative management system and method |
CN115484087A (en) * | 2022-09-07 | 2022-12-16 | 南京邮电大学 | Embedded equipment service identification system |
WO2024065185A1 (en) * | 2022-09-27 | 2024-04-04 | 西门子股份公司 | Device classification method and apparatus, electronic device, and computer-readable storage medium |
CN116647877B (en) * | 2023-06-12 | 2024-03-15 | 广州爱浦路网络技术有限公司 | Flow category verification method and system based on graph convolution model |
CN116662817B (en) * | 2023-07-31 | 2023-11-24 | 北京天防安全科技有限公司 | Asset identification method and system of Internet of things equipment |
CN117221242A (en) * | 2023-09-01 | 2023-12-12 | 安徽慢音科技有限公司 | Network flow direction identification method, device and medium |
CN117197591B (en) * | 2023-11-06 | 2024-03-12 | 青岛创新奇智科技集团股份有限公司 | Data classification method based on machine learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102394827A (en) * | 2011-11-09 | 2012-03-28 | 浙江万里学院 | Hierarchical classification method for internet flow |
CN104052639A (en) * | 2014-07-02 | 2014-09-17 | 山东大学 | Real-time multi-application network flow identification method based on support vector machine |
CN106874879A (en) * | 2017-02-21 | 2017-06-20 | 华南师范大学 | Handwritten Digit Recognition method based on multiple features fusion and deep learning network extraction |
CN107967311A (en) * | 2017-11-20 | 2018-04-27 | 阿里巴巴集团控股有限公司 | A kind of method and apparatus classified to network data flow |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7827011B2 (en) * | 2005-05-03 | 2010-11-02 | Aware, Inc. | Method and system for real-time signal classification |
CN103973589B (en) * | 2013-09-12 | 2017-04-12 | 哈尔滨理工大学 | Network traffic classification method and device |
CN104767692B (en) * | 2015-04-15 | 2018-05-29 | 中国电力科学研究院 | A kind of net flow assorted method |
CN106487535B (en) * | 2015-08-24 | 2020-04-28 | 中兴通讯股份有限公司 | Method and device for classifying network traffic data |
US10785247B2 (en) * | 2017-01-24 | 2020-09-22 | Cisco Technology, Inc. | Service usage model for traffic analysis |
SG10201913257UA (en) * | 2017-03-02 | 2020-02-27 | Univ Singapore Technology & Design | Method and apparatus for determining an identity of an unknown internet-of-things (iot) device in a communication network |
-
2018
- 2018-09-25 CN CN201811113686.XA patent/CN109309630B/en active Active
- 2018-10-29 WO PCT/CN2018/112401 patent/WO2020062390A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN109309630A (en) | 2019-02-05 |
WO2020062390A1 (en) | 2020-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109309630B (en) | Network traffic classification method and system and electronic equipment | |
WO2017124990A1 (en) | Method, system, device and readable storage medium for realizing insurance claim fraud prevention based on consistency between multiple images | |
WO2021037280A2 (en) | Rnn-based anti-money laundering model training method, apparatus and device, and medium | |
US8873840B2 (en) | Reducing false detection rate using local pattern based post-filter | |
CN113435546B (en) | Migratable image recognition method and system based on differentiation confidence level | |
CN109525508B (en) | Encrypted stream identification method and device based on flow similarity comparison and storage medium | |
CN114492768B (en) | Twin capsule network intrusion detection method based on small sample learning | |
CN109639734B (en) | Abnormal flow detection method with computing resource adaptivity | |
WO2020155790A1 (en) | Method and apparatus for extracting claim settlement information, and electronic device | |
US20230215125A1 (en) | Data identification method and apparatus | |
WO2022199185A1 (en) | User operation inspection method and program product | |
CN110458078A (en) | A kind of face image data clustering method, system and equipment | |
US10423817B2 (en) | Latent fingerprint ridge flow map improvement | |
CN111191720B (en) | Service scene identification method and device and electronic equipment | |
CN114553591B (en) | Training method of random forest model, abnormal flow detection method and device | |
CN115600128A (en) | Semi-supervised encrypted traffic classification method and device and storage medium | |
CN116662817B (en) | Asset identification method and system of Internet of things equipment | |
WO2019100348A1 (en) | Image retrieval method and device, and image library generation method and device | |
CN117375896A (en) | Intrusion detection method and system based on multi-scale space-time feature residual fusion | |
Machado et al. | Improving face detection | |
CN116109864A (en) | Garment detection and identification method, device, terminal and computer readable storage medium | |
CN114444514A (en) | Semantic matching model training method, semantic matching method and related device | |
WO2019129293A1 (en) | Feature data generation method and apparatus and feature matching method and apparatus | |
CN106530199A (en) | Multimedia integrated steganography analysis method based on window hypothesis testing | |
CN112417446A (en) | Software defined network anomaly detection architecture |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||