CN111786903B - Network traffic classification method based on constrained fuzzy clustering and particle computation - Google Patents

Publication number
CN111786903B
CN111786903B CN202010465413.2A CN202010465413A CN111786903B CN 111786903 B CN111786903 B CN 111786903B CN 202010465413 A CN202010465413 A CN 202010465413A CN 111786903 B CN111786903 B CN 111786903B
Authority
CN
China
Prior art keywords
flow
traffic
network
network traffic
particle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010465413.2A
Other languages
Chinese (zh)
Other versions
CN111786903A (en)
Inventor
靖旭阳
赵晶晶
闫峥
维托尔德·佩德里茨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010465413.2A
Publication of CN111786903A
Application granted
Publication of CN111786903B

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2441 Traffic characterised by specific attributes, e.g. priority or QoS relying on flow classification, e.g. using integrated services [IntServ]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention belongs to the technical field of network traffic classification and discloses a network traffic classification method based on constrained fuzzy clustering and particle (granular) computation. In the training stage, a labeled data set and an unlabeled data set are merged by using related-flow information; the custom constrained fuzzy C-means (CCFCM) algorithm is run on the merged data set and outputs a group of numerical cluster centers; network traffic particles (NTGs) are constructed around the numerical cluster centers and continuously optimized under the guidance of the principle of reasonable granularity; each traffic particle is mapped to a corresponding traffic class by means of the labeled flows; packet-level and flow-level features are extracted from the NTGs to construct a classification rule base. In the test stage, the particle classifier identifies new network flows or network anomalies by means of the classification rules. Because the network traffic particles can describe the underlying structure of the traffic data in detail, the classification accuracy can be greatly improved.

Description

Network traffic classification method based on constrained fuzzy clustering and particle computation
Technical Field
The invention belongs to the technical field of network traffic classification, and particularly relates to a network traffic classification method based on constrained fuzzy clustering and particle computation.
Background
Network traffic classification aims to identify the categories of traffic generated by different applications and protocols. It gives network administrators a fine-grained or coarse-grained view of network conditions for tasks such as quality-of-service measurement, resource allocation, and intrusion detection, and so helps them manage the network. As new types of network services and network access devices continue to appear, traffic classification has attracted growing attention as a basis for intelligent network management.
Current traffic classification methods fall mainly into five types. The first is correlation-based classification, which aggregates traffic according to its correlation and then applies a machine learning algorithm to the aggregated traffic. The second is feature-based classification, which classifies traffic by analyzing flow-based or packet-based statistical features. The third is behavior-based classification, which uses the interaction behavior of a host to determine the roles of the host in the network and then classifies traffic according to the behavior of those roles. The fourth is port-based classification, which identifies traffic by examining the standard ports used by well-known applications. The fifth is payload-based classification, which uses deep packet inspection to match application or protocol signatures in the packet payload.
These methods have several problems. First, most of them misclassify unknown traffic: they cannot identify traffic classes that were unknown during the training phase and therefore attribute such traffic to known classes, which largely reduces classification accuracy. Second, some methods are not always reliable: payload-based methods become ineffective on encrypted data, and port-based methods become ineffective in the face of dynamic port mechanisms. Third, they cannot combine packet-level and flow-level features to perform traffic classification; their classification rules are based on either packet-level or flow-level traffic characteristics alone. Such methods become ineffective when handling carefully crafted traffic (e.g., anomalous traffic generated by stealthy distributed denial-of-service attacks). Current network management tasks, such as network anomaly detection and network visualization, therefore require information fused from different traffic levels. To overcome these problems, a new traffic classification method is urgently needed that offers high accuracy, the ability to identify unknown traffic classes, the use of features from different traffic levels to describe the underlying traffic data structure at fine granularity, and the ability to cope with a shortage of training data.
The invention is a correlation-based classification method. Like most correlation-based methods, it exploits the correlation between data streams to increase classification accuracy. However, existing correlation-based classification schemes share some of the problems discussed above: for example, they cannot accurately identify unknown traffic, and they do not make full use of traffic information from different levels.
From the above analysis, the problems and defects of the prior art are as follows:
(1) most methods misclassify unknown traffic. They cannot identify traffic classes that were unknown during the training phase and therefore attribute such traffic to known classes, which largely reduces classification accuracy;
(2) some methods are not always reliable. When the network fluctuates or the network environment changes, the accuracy of most methods drops;
(3) they cannot combine packet-level and flow-level features to perform traffic classification.
The difficulty in solving the above problems and defects is as follows: although many methods keep trying to improve the accuracy and reliability of classification, reliable and stable traffic classification still faces many difficulties. First, as networks continue to develop, more and more applications generate massive data traffic, mixed with a large amount of unknown and even malicious traffic, which makes classification very difficult. Second, collecting data sets and labels is also hard: how to obtain a large amount of real, reliable network traffic with correct labels without violating user privacy still requires further research.
The significance of solving these problems and defects is as follows: it helps network users obtain better service and a more secure network environment. It also helps researchers refine the granularity of classification while further improving its accuracy, so that the traffic data structure can be described more precisely.
Disclosure of Invention
In view of the problems in the prior art, the invention provides a network traffic classification method based on constrained fuzzy clustering and particle computation.
The invention is realized as follows. The network traffic classification method based on constrained fuzzy clustering and particle computation comprises the following steps:
in the training phase, merging the labeled data set with the unlabeled data set by using related-flow information;
running CCFCM on the merged data set and outputting a group of numerical cluster centers;
constructing network traffic particles around the numerical cluster centers and continuously optimizing them under the guidance of the principle of reasonable granularity;
mapping each traffic particle to a corresponding traffic class by means of the labeled flows;
extracting packet-level and flow-level features from the NTG and constructing a classification rule base;
in the test phase, identifying new network flows or network anomalies with the particle classifier by means of the classification rules.
Further, the method for constructing the network traffic particles comprises the following steps:
step one, clustering network traffic by using constrained fuzzy clustering;
step two, establishing network traffic information particles and constructing a particle classifier according to the clustering result;
step three, classifying the network traffic based on the particle classifier.
Further, in step one,
first, network traffic is collected, and the following features are extracted from each flow: flow size; flow interval; maximum, minimum, mean and standard deviation of packet size; maximum, minimum, mean and standard deviation of packet arrival interval; number of bytes transmitted; given a set of data streams $S=\{s_1,s_2,\dots,s_n\}$ with labels $L=\{l_1,l_2,\dots,l_n\}$, where $s_i\in\mathbb{R}^q$, $q$ is the feature dimension, $l_i\in\{Class_1,Class_2,\dots,Class_K\}$, $i\in[1,n]$, and $Class_p$ ($p\in[1,K]$) is a traffic class, and given another unlabeled data set $T=\{t_1,t_2,\dots\}$;
secondly, network traffic is clustered, and an enhancement coefficient is used to guide the direction of membership change during clustering, where the enhancement coefficient is calculated only from the ratio of must-link data streams contained in each cluster; the objective function of CCFCM is:
[formula image BDA0002512464830000041]
where $m$ is the fuzzification coefficient, $c$ is the number of clusters, $N$ is the number of data streams, $0\le u_{ik}\le 1$ is the membership degree of data stream $x_k$ in the cluster with center $v_i$, which must satisfy
[formula image BDA0002512464830000042]
$\|\cdot\|$ is the standard Euclidean distance, and $\beta_{ik}$ is the enhancement factor;
$\beta_{ik}$ is calculated as follows:
[formula image BDA0002512464830000043]
where $RL_l$ is a related-stream subset in RLS, $Card(C_i)$ is the number of data streams in the $i$-th cluster that have a must-link relationship with $x_k$, and $Card(RL_l)$ is the number of data streams in $RL_l$.
Further, in step two, network traffic particles are constructed around the cluster centers $v_1,v_2,\dots,v_c$ represented by numerical data, and the traffic particles are expressed as $NTG=\{G_1,G_2,\dots,G_c\}$, where $G_i=\{G_{i1},G_{i2},\dots,G_{iq}\}$, $q$ is the traffic feature dimension, and $i=1,2,\dots,c$.
Further, the construction process of the traffic particles is as follows:
(1) generating network traffic particles: network traffic particles are generated using the ε-information-particle rule; a network traffic particle $G_i$ is structurally similar to a hypercube, and each dimension is calculated by $G_{ij}=[v_{ij}-\frac{\varepsilon}{2}\cdot range_j,\ v_{ij}+\frac{\varepsilon}{2}\cdot range_j]$, where $v_{ij}$ is the value of the numerical center $v_i$ in the $j$-th dimension, $i=1,2,\dots,c$, $j=1,2,\dots,q$, $\varepsilon$ is the size of $G_i$, and $range_j$ is the value range of the original data in the $j$-th dimension;
(2) reconstructing original data points from the network traffic particles: a data point $x_k$ is reconstructed from $NTG=\{G_1,G_2,\dots,G_c\}$; the reconstructed data is an interval value expressed as
[formula image BDA0002512464830000044]
$k=1,2,\dots,N$; if
[formula image BDA0002512464830000045]
then the ability of the network traffic particles to represent $x_k$ is considered good;
(3) optimizing the network traffic particles.
Further, the method for calculating the membership degree comprises the following steps:
(1) if $x_k\in G_i$, $i=1,2,\dots,c$, then the membership degree of $x_k$ in $G_i$ is
[formula image BDA0002512464830000051]
the membership degree in the other traffic particles is
[formula image BDA0002512464830000052]
and $x_k$ is represented by $G_i$:
[formula image BDA0002512464830000053]
(2) if
[formula image BDA00025124648300000511]
a defuzzification operation is performed through membership aggregation and the constructed network traffic information particles to calculate
[formula image BDA0002512464830000054]
The calculation formula is as follows:
[formula image BDA0002512464830000055]
where
[formula image BDA0002512464830000056]
and
[formula image BDA0002512464830000057]
are the lower and upper bounds of $G_i$. For the $j$-th dimension,
[formula image BDA0002512464830000058]
is calculated as follows:
[formula image BDA0002512464830000059]
where
[formula image BDA00025124648300000510]
Further, in step three, the class of each traffic particle is determined from the data points it contains: if a traffic particle does not contain any labeled flow, it is regarded as an unknown class and is not used in the classification stage; for a particle containing at least one labeled flow, its traffic class is assigned by comparing the numbers of flows carrying each label;
packet-level and flow-level rule bases are set up: classification rules for single flows are extracted from each network traffic particle, and identification rules for application behavior are extracted from each traffic class; to classify a single flow, the distance between the flow and each network traffic particle is calculated.
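The mapping from particles to traffic classes described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a particle is represented as a list of `(low, high)` intervals, one per feature, and the names `contains` and `label_particles` are hypothetical. Ties between labels are not addressed in the text and are resolved here by first occurrence.

```python
from collections import Counter


def contains(particle, x):
    """True if point x lies inside the hyperbox `particle` in every dimension."""
    return all(lo <= xj <= hi for (lo, hi), xj in zip(particle, x))


def label_particles(particles, flows, labels):
    """Assign each particle the majority label of the labeled flows it contains.

    `labels[k]` is the class of `flows[k]`, or None for an unlabeled flow.
    A particle with no labeled flow gets None (treated as an unknown class).
    """
    result = []
    for particle in particles:
        votes = Counter(
            lab for x, lab in zip(flows, labels)
            if lab is not None and contains(particle, x)
        )
        result.append(votes.most_common(1)[0][0] if votes else None)
    return result
```

Particles that come back as `None` would be set aside as unknown classes and excluded from the classification stage, as the text states.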
Further, a new flow $y\in\mathbb{R}^q$ is classified according to the following steps:
if $y\in G_i$ ($i=1,2,\dots,r$), i.e., each $y_j$ lies within the traffic particle interval $G_{ij}$ ($j=1,2,\dots,q$), where $G_i\in TC_p$ ($p\in[1,K]$), then flow $y$ is labeled $Class_p$;
if
[formula image BDA0002512464830000061]
the distance between flow $y$ and each traffic particle $G_i$ ($i=1,2,\dots,r$) is calculated, and $y$ is assigned to the traffic class $TC_p$ ($p\in[1,K]$) that contains the particle $G_i$ nearest to $y$.
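The two-case rule above (containment first, nearest particle otherwise) can be sketched in Python. The text does not define the distance between a flow and a particle, so this sketch assumes the Euclidean distance from the point to the hyperbox, which is zero when the point is inside; that choice, the interval-list representation, and the function names are assumptions made for illustration.

```python
import math


def distance_to_particle(particle, y):
    """Euclidean distance from point y to a hyperbox given as (low, high) pairs.

    Assumed metric: per dimension, the gap to the nearest bound (0 if inside).
    """
    return math.sqrt(sum(
        max(lo - yj, 0.0, yj - hi) ** 2
        for (lo, hi), yj in zip(particle, y)
    ))


def classify(y, particles, classes):
    """Return the class of the particle that contains y, else of the nearest one.

    `particles` and `classes` are parallel lists covering only labeled
    particles; unknown-class particles are assumed to be removed beforehand.
    """
    best = min(range(len(particles)),
               key=lambda i: distance_to_particle(particles[i], y))
    return classes[best]
```

Because the assumed distance is zero inside a particle, the containment case and the nearest-particle case are handled by the same `min` call.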
Another object of the present invention is to provide a network anomaly detection and prevention method that implements the network traffic classification method based on constrained fuzzy clustering and particle computation.
Another object of the present invention is to provide a big data analysis method that implements the network traffic classification method based on constrained fuzzy clustering and particle computation.
Combining all of the above technical schemes, the invention classifies network traffic from the perspective of granularity, a new computing paradigm for information processing. Its advantages and positive effects are as follows:
(1) Effectiveness: the invention is effective because it uses related-flow information as prior knowledge for clustering network traffic. The designed "custom constrained fuzzy C-means" (CCFCM) algorithm is a semi-supervised machine learning method that incorporates prior knowledge to obtain results close to the user's expectations. In CCFCM, the invention adjusts membership through an enhancement factor that takes into account the ratio of must-link data points. Compared with other constrained fuzzy C-means algorithms, the update process for the membership matrix and cluster centers of CCFCM is simple and effective.
(2) Accuracy: the optimization of the network traffic particles makes the classification method of the invention more accurate than other existing methods. To mine the underlying structure of the traffic data, the invention first uses CCFCM to obtain a rough description and then, based on the clustering result, constructs network traffic particles using an optimization rule derived from the principle of reasonable granularity. This two-stage mining describes the structure of the traffic data fully and more specifically, which improves the accuracy of the particle classifier.
(3) Robustness: the invention is robust because it can identify unknown traffic classes. Discovering unknown traffic classes is important because the number of unknown classes can greatly affect the accuracy of traffic classification: if unknown traffic is not identified and is wrongly assigned to known classes, the performance of the classifier degrades. The method of the invention can find unknown flows accurately because it provides a detailed description of the traffic structure. This property makes the classification method more robust in the presence of unknown traffic.
(4) Versatility: the invention is versatile in that it performs both traffic classification and anomaly detection, two functions that are highly desirable in current traffic measurement. The invention extracts packet-level and flow-level features from network traffic particles to build two rule bases, one to classify traffic classes and one to identify application behaviors. By adjusting the rule bases, further traffic measurement functions can be realized. The classification method can therefore be applied to discovering abnormal network behavior and can serve as an effective network intrusion and threat detection method.
(5) Extensibility: the method can easily be extended to network security measurement tasks. The invention provides a new way of analyzing and modeling network data by devising CCFCM and applying the theory of granular computing. Similar principles can be used to process more types of traffic data and other types of data, for example, analyzing abnormal traffic data to perform network anomaly detection.
(6) Because the network traffic particles can describe the underlying structure of the traffic data in detail, the classification accuracy can be greatly improved.
(7) In the prior art, classification rules are established directly on cluster centers, and the clusters contain outliers, so the classification is not accurate enough; the accuracy of prior-art classification depends on the quality of the clusters. The present invention constructs network traffic particles around the cluster centers and classifies traffic using rules extracted from each traffic particle.
(8) The rules extracted by the invention are more specific, and the classification accuracy is greatly improved.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is an exemplary diagram of a CCFCM provided by an embodiment of the present invention.
Fig. 2 is a network traffic information particle diagram according to an embodiment of the present invention.
Fig. 3 is a flowchart of a network traffic classification method based on constrained fuzzy clustering and particle computation according to an embodiment of the present invention.
Fig. 4 is a diagram illustrating specific steps performed in an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present invention proposes the concept of network traffic particles and aims to build a particle-based classifier for classifying network traffic. Particles (information granules) and granularity are concepts derived from granular computing, a constantly evolving and powerful theory for solving complex problems, large-scale data mining and fuzzy information processing. The invention designs a novel clustering algorithm, custom Constrained Fuzzy C-Means (CCFCM). The algorithm incorporates prior knowledge about traffic information to improve the accuracy of network traffic clustering. This prior knowledge is the must-link relationship between related streams, i.e., two related data streams must be assigned to the same cluster. Related flows are flows with the same destination address, destination port and transport protocol. Given the stability of the network environment, such flows observed within a short period of time are generated by the same application. After the traffic is clustered, a number of numerically represented cluster centers are obtained. The invention constructs network traffic particles around these cluster centers to improve the quality of the traffic data structure description. Since the construction of network traffic particles is an optimization problem, the particles are continuously optimized and updated according to particle optimization criteria so that they describe the traffic structure in detail. Within each network traffic particle, the invention extracts packet-level and flow-level rules to build different particle classifiers. By combining the rules of the two traffic levels, the classifier can classify various traffic classes effectively and accurately and can further detect network anomalies. The present invention is described in detail below with reference to the accompanying drawings.
The invention realizes two important functions of network traffic measurement: traffic classification and anomaly detection. In the training phase, the invention uses related-flow information to merge the labeled and unlabeled data sets, which both finds the related streams and expands the training data set. CCFCM is run on the merged data set and outputs a set of numerical cluster centers. Network traffic particles (NTGs) are then constructed around these numerical cluster centers and continually optimized under the guidance of the principle of reasonable granularity. This step is the core of the classification method. After the optimization process is completed, the best NTG is obtained. With the labeled streams, each traffic particle is mapped to a corresponding traffic class. Packet-level and flow-level features are extracted from the NTG to build a classification rule base. In the test phase, the particle classifier identifies new network flows or network anomalies by means of the classification rules.
The network traffic classification method based on constrained fuzzy clustering and particle computation comprises the following steps:
Step one, clustering network traffic by using constrained fuzzy clustering.
Step two, establishing network traffic information particles and constructing a particle classifier according to the clustering result.
Step three, classifying the network traffic based on the particle classifier.
Step one specifically comprises:
(1) Collecting network traffic: a flow is defined as a collection of packets that share one or more attributes. These shared attributes, commonly referred to as the flow key, typically include packet header information, packet content, and meta-information. A flow summarizes network traffic information better than an individual packet does. The invention selects the source IP address, destination IP address, source port, destination port and transport protocol as the flow key. The invention extracts the following features from each flow: flow size (number of packets); flow interval (the time interval from the arrival of the first packet to the expiration of the flow); maximum, minimum, mean and standard deviation of packet size; maximum, minimum, mean and standard deviation of packet arrival interval; and the number of bytes transferred.
Related flows are flows with the same destination address, destination port and transport protocol. Because a host providing an application will not change its services within a short time, related flows are generated by the same application. Therefore, related flows should be assigned to the same cluster during clustering. In addition, network traffic information provides no cannot-link constraints, because header information alone cannot determine whether two flows belong to different traffic classes.
Given a set of data streams $S=\{s_1,s_2,\dots,s_n\}$ with labels $L=\{l_1,l_2,\dots,l_n\}$, where $s_i\in\mathbb{R}^q$, $q$ is the feature dimension, $l_i\in\{Class_1,Class_2,\dots,Class_K\}$, $i\in[1,n]$, and $Class_p$ ($p\in[1,K]$) is a traffic class, and given another unlabeled data set $T=\{t_1,t_2,\dots\}$, the invention addresses the shortage of training data by using related-flow information to combine $S$ and $T$ into a larger training data set. The combined training data set comprises two parts: the related-stream sets $RLS=\{RL_1,RL_2,\dots\}$ and an independent stream set $IS$. The invention assigns a must-link relationship to the data streams within the same related-stream set. Moreover, if $RL_l$ includes labeled streams, the invention assigns all flows in $RL_l$ to the class of those labeled flows.
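Building the related-stream sets and propagating labels within them can be sketched as follows. This is a simplified illustration under assumptions: the flow dictionary keys (`dst`, `dport`, `proto`, `label`) and function names are hypothetical, the short time window mentioned in the text is ignored, and a related set whose labeled flows disagree is left unchanged because the text does not say how to resolve that case.

```python
from collections import defaultdict


def build_related_sets(flows):
    """Split flow indices into related-stream sets (RLS) and independent streams (IS).

    Flows sharing destination address, destination port and transport
    protocol are related and therefore carry a must-link relationship.
    """
    groups = defaultdict(list)
    for idx, f in enumerate(flows):
        groups[(f["dst"], f["dport"], f["proto"])].append(idx)
    rls = [g for g in groups.values() if len(g) > 1]
    independent = [g[0] for g in groups.values() if len(g) == 1]
    return rls, independent


def propagate_labels(flows, rls):
    """Give every flow in a related set the label of its labeled members."""
    labels = [f["label"] for f in flows]
    for group in rls:
        known = {labels[i] for i in group if labels[i] is not None}
        if len(known) == 1:  # unambiguous label within the related set
            label = known.pop()
            for i in group:
                labels[i] = label
    return labels
```

Applied to the union of the labeled set S and the unlabeled set T, this enlarges the labeled portion of the training data without any extra manual labeling.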
(2) Clustering network traffic: the goal of custom constrained fuzzy C-means (CCFCM) is to assign each data stream, as far as possible, to the cluster that contains the largest number of data streams having a must-link relationship with it. The invention uses an enhancement coefficient to guide the direction of membership change during clustering. The enhancement coefficient is calculated only from the ratio of must-link data streams contained in each cluster.
The objective function of CCFCM is:
[formula image BDA0002512464830000101]
where $m$ is the fuzzification coefficient, $c$ is the number of clusters, $N$ is the number of data streams, $0\le u_{ik}\le 1$ is the membership degree of data stream $x_k$ in the cluster with center $v_i$, which must satisfy
[formula image BDA0002512464830000102]
$\|\cdot\|$ is the standard Euclidean distance, and $\beta_{ik}$ is the enhancement factor.
$\beta_{ik}$ is calculated as follows:
[formula image BDA0002512464830000103]
where $RL_l$ is a related-stream subset in RLS, $Card(C_i)$ is the number of data streams in the $i$-th cluster that have a must-link relationship with $x_k$, and $Card(RL_l)$ is the number of data streams in $RL_l$, satisfying $Card(C_i)\le Card(RL_l)$.
The effect of $\beta_{ik}$ is to increase the membership of data stream $x_k$ in a cluster that contains data streams having a must-link relationship with it, i.e., to shorten the distance between $x_k$ and that cluster. An example is given in Fig. 1. For two cluster centers $v_1$ and $v_2$, the cluster members are represented by black and green dots. The green dots have a must-link relationship with $x_k$. Cluster $v_1$ contains more green dots than $v_2$. The invention therefore increases the membership of $x_k$ in $v_1$, which draws the two closer together.
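The ratio that underlies the enhancement factor can be illustrated with a short Python sketch. The exact formula for $\beta_{ik}$ is contained in a formula image that is not reproduced here, so this sketch computes only the ratio $Card(C_i)/Card(RL_l)$ and does not claim to be $\beta_{ik}$ itself. It assumes that a stream "belongs" to the cluster where its membership is largest and that $x_k$ is not counted among its own must-link partners; both are assumptions, as is the function name.

```python
def must_link_ratio(memberships, related_sets):
    """Return ratio[i][k] = Card(C_i) / Card(RL_l) for every cluster i and stream k.

    memberships[i][k] is the membership of stream k in cluster i.
    related_sets is a list of lists of stream indices (the sets RL_l).
    Streams outside every related set keep a ratio of 0 for all clusters.
    """
    c, n = len(memberships), len(memberships[0])
    # Assumed hard assignment: the cluster with the largest membership.
    hard = [max(range(c), key=lambda i: memberships[i][k]) for k in range(n)]
    ratio = [[0.0] * n for _ in range(c)]
    for rl in related_sets:
        counts = [0] * c  # how many streams of RL_l fall into each cluster
        for k in rl:
            counts[hard[k]] += 1
        for k in rl:
            for i in range(c):
                linked = counts[i] - (1 if hard[k] == i else 0)  # exclude x_k
                ratio[i][k] = linked / len(rl)
    return ratio
```

In the Fig. 1 scenario, the cluster holding more of the must-link partners of $x_k$ yields the larger ratio, which is the direction in which the membership of $x_k$ is reinforced.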
Fig. 1 is an exemplary diagram of CCFCM. The update formulas for the membership degrees and the cluster centers of CCFCM are as follows:
[formula image BDA0002512464830000111]
[formula image BDA0002512464830000112]
Step two specifically comprises:
the second step specifically comprises:
After CCFCM is completed, the invention obtains the cluster centers represented by numerical data, i.e., $v_1,v_2,\dots,v_c$. Network traffic particles can be constructed around these numerical cluster centers. The invention expresses the traffic particles as $NTG=\{G_1,G_2,\dots,G_c\}$, where $G_i=\{G_{i1},G_{i2},\dots,G_{iq}\}$, $q$ is the traffic feature dimension, and $i=1,2,\dots,c$. The traffic particles are constructed as follows.
(1) Generating network traffic particles: the purpose of this step is to construct traffic information particles based on the original numerical centers. The invention uses the ε-information-particle rule to generate network traffic particles. A network traffic particle $G_i$ is structurally similar to a hypercube, and each dimension is calculated by $G_{ij}=[v_{ij}-\frac{\varepsilon}{2}\cdot range_j,\ v_{ij}+\frac{\varepsilon}{2}\cdot range_j]$, where $v_{ij}$ is the value of the numerical center $v_i$ in the $j$-th dimension, $i=1,2,\dots,c$, $j=1,2,\dots,q$, $\varepsilon$ is the size of $G_i$, and $range_j$ is the range of the original data values in the $j$-th dimension (i.e., the difference between the maximum and minimum values in the $j$-th dimension).
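The interval formula above translates directly into a few lines of Python. In this sketch a particle is represented as a list of `(low, high)` tuples, one per feature; that representation and the function name `build_particles` are assumptions made for illustration.

```python
def build_particles(centers, data, eps):
    """Build one hyperbox per cluster center using the epsilon rule.

    G_ij = [v_ij - eps/2 * range_j, v_ij + eps/2 * range_j], where range_j
    is the max-min spread of the original data in dimension j.
    """
    q = len(centers[0])
    ranges = [max(x[j] for x in data) - min(x[j] for x in data) for j in range(q)]
    return [
        [(v[j] - eps / 2 * ranges[j], v[j] + eps / 2 * ranges[j]) for j in range(q)]
        for v in centers
    ]
```

For example, with data spanning 10 units in the first dimension and ε = 0.2, each particle extends 1 unit on either side of its center in that dimension.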
To refine $G_{ij}$, the invention first finds the points in $G_{ij}$ that are farthest from $v_i$. It then removes the empty regions between these farthest data points and the boundaries, so that these points lie on the boundaries. Fig. 2 shows the form of a two-dimensional traffic information particle. The dotted boundary is the original network traffic particle and the solid boundary is the compressed network traffic particle. The invention thus compresses the size of a traffic particle by discarding unused regions. The compression process makes the network traffic particle more specific.
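The compression step can be sketched as shrinking each particle to the bounding box of the data points it contains, which places the farthest contained points on the boundary in every dimension. This is one reading of the description, not a confirmed implementation; the `(low, high)` interval representation and the function names are assumptions.

```python
def contains(particle, x):
    """True if point x lies inside the hyperbox `particle` in every dimension."""
    return all(lo <= xj <= hi for (lo, hi), xj in zip(particle, x))


def compress_particle(particle, data):
    """Shrink a hyperbox to the tightest box around the data points inside it.

    An empty particle is returned unchanged, since there is nothing to fit.
    """
    inside = [x for x in data if contains(particle, x)]
    if not inside:
        return particle
    return [
        (min(x[j] for x in inside), max(x[j] for x in inside))
        for j in range(len(particle))
    ]
```

Points outside the original particle do not influence the result, so the compressed particle is always a subset of the original one.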
Furthermore, it can be seen from fig. 2 that a network traffic particle is actually a collection of a series of data points. Using particles for representation, the present invention can represent flow information in a readable manner using fewer data points.
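The compression step described above can be read as shrinking each interval until it touches the outermost data points that the particle contains. The Python sketch below follows that reading; it is an interpretation rather than the patented procedure, and the names contains and compress_particle are hypothetical.

```python
def contains(particle, x):
    """True if point x lies inside the particle in every dimension."""
    return all(low <= x_j <= high for (low, high), x_j in zip(particle, x))


def compress_particle(particle, data):
    """Shrink each interval to the extreme values of the contained points."""
    inside = [x for x in data if contains(particle, x)]
    if not inside:
        return list(particle)  # nothing to anchor the new boundary on
    return [
        (min(x[j] for x in inside), max(x[j] for x in inside))
        for j in range(len(particle))
    ]
```

Points outside the original particle are ignored, so the compressed particle never grows beyond the dotted boundary of Fig. 2.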
(2) Reconstructing the original data points from the network traffic particles: the purpose of this step is to reconstruct the original data points from the network traffic particles in order to test the particles' representation capability. For example, suppose a data point x_k is reconstructed from NTG = {G_1, G_2, ..., G_c}; the reconstructed datum is an interval value, expressed as
Figure BDA0002512464830000121
with k = 1, 2, ..., N. If
Figure BDA0002512464830000122
then the representation capability of the network traffic particles for x_k can be considered good. To calculate
Figure BDA0002512464830000123
the invention needs to know the membership degree of x_k in each G_i, i = 1, 2, ..., c. There are two ways to calculate the membership degree.
1) If x_k ∈ G_i for some i = 1, 2, ..., c, then the membership degree of x_k in G_i is
Figure BDA0002512464830000124
and its membership degree in the other traffic particles is
Figure BDA0002512464830000125
This means that G_i can be used to represent x_k. Therefore, in this case,
Figure BDA0002512464830000126
2) If
Figure BDA00025124648300001215
the invention performs a defuzzification operation that aggregates the membership degrees with the constructed network traffic information particles to calculate
Figure BDA0002512464830000127
The calculation formula is as follows:
Figure BDA0002512464830000128
where
Figure BDA0002512464830000129
and
Figure BDA00025124648300001210
are the lower and upper bounds of G_i. For the j-th dimension,
Figure BDA00025124648300001211
is calculated as follows:
Figure BDA00025124648300001212
where
Figure BDA00025124648300001213
(3) Optimizing network traffic particles: this step is very important because it guides the construction of the network traffic particles. The invention first introduces the principle of reasonable granularity, an optimization rule with two performance indicators: coverage and specificity.
Coverage: coverage requires that a traffic particle contain as many original data points as possible. The more data points a network traffic particle contains, the better its representation capability. Through the reconstruction of the original data points performed in sub-step (2), the invention can count the original data points contained in a network traffic particle. Coverage is calculated as follows:
Figure BDA00025124648300001214
where N is the number of original data points and
Figure BDA0002512464830000131
is the number of data points contained in their reconstructions. Coverage increases as ε increases.
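As a rough illustration of coverage, the Python sketch below counts a point as covered when it falls inside at least one particle and divides by N. This is a simplification: it handles only the case where a point lies inside a particle and does not implement the interval reconstruction used for points outside every particle. The function names are hypothetical.

```python
def contains(particle, x):
    """True if point x lies inside the particle in every dimension."""
    return all(low <= x_j <= high for (low, high), x_j in zip(particle, x))


def coverage(particles, data):
    """Fraction of the original data points covered by some particle."""
    covered = sum(1 for x in data if any(contains(p, x) for p in particles))
    return covered / len(data)
```

Widening the intervals (a larger ε) can only add covered points, which is why coverage grows with ε.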
Specificity: specificity requires that a network traffic particle cover original data points that are as specific as possible. This means that the traffic particle should be small, so that it carries a clearly defined traffic semantics. The smaller the network traffic particle, the more similar the data points it contains. Specificity therefore decreases as ε increases. It is defined as follows:
Figure BDA0002512464830000132
Figure BDA0002512464830000133
represents
Figure BDA0002512464830000134
Specificity is calculated as follows:
Figure BDA0002512464830000135
Figure BDA0002512464830000136
where a_j and b_j are the maximum and minimum values of the original data points in the j-th dimension.
From the definitions of coverage and specificity, it follows that the two performance indicators compete with each other: the greater the coverage, the lower the specificity. Both indicators are affected by the size ε of the network traffic particles. How to balance coverage and specificity to obtain the optimal ε is therefore the main motivation for the principle of reasonable granularity. The invention measures this trade-off with the following quality assessment:
QA(ε) = Coverage × Specificity^α
where α is a non-negative parameter. Specificity is more important when α > 1, and coverage is more important when α < 1. When α = 1, coverage and specificity are equally important.
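One simple way to use QA(ε) is to evaluate it over a set of candidate ε values and keep the best one. The Python sketch below assumes that coverage and specificity have already been computed for each candidate; the grid search itself, the function names, and the dictionary layout are illustrative assumptions, since the source does not state how the optimum is searched.

```python
def quality(cov, spec, alpha=1.0):
    """QA = Coverage * Specificity ** alpha."""
    return cov * spec ** alpha


def best_epsilon(candidates, alpha=1.0):
    """Pick the epsilon with the highest quality score.

    candidates : dict mapping epsilon -> (coverage, specificity)
    """
    def score(eps):
        cov, spec = candidates[eps]
        return quality(cov, spec, alpha)

    return max(candidates, key=score)
```

With made-up values such as {0.1: (0.2, 0.9), 0.3: (0.6, 0.7), 0.6: (0.9, 0.4)}, a large α favors the small, specific particle and α = 0 reduces the score to coverage alone, so the largest particle wins.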
The third step specifically comprises:
For each traffic particle in the NTG, the invention first determines its class from the data points the particle contains. If a traffic particle contains no labeled flows, the invention treats it as an unknown class and does not use it in the classification phase. Other techniques, such as deep packet inspection, can be used to analyze it. For a particle containing at least one labeled flow, the invention assigns the traffic class by comparing the numbers of flows carrying each label.
For example, in traffic particle G_i (i = 1, 2, ..., r, r ≤ c), the labeled flows are {lf_i1, lf_i2, ..., lf_iu} and their corresponding classes are {ll_i1, ll_i2, ..., ll_iu}, where ll_io ∈ (Class_1, Class_2, ..., Class_K), o = 1, 2, ..., u. The invention can use the following formula to set the label of traffic particle G_i (denoted LG_i) to Class_p (p ∈ [1, K]):
Figure BDA0002512464830000141
After all traffic particles are labeled, traffic class Class_p (p ∈ [1, K]) can be represented by a set of network traffic particles, as follows:
TC_p = {G_i, i = 1, 2, ..., r | LG_i = Class_p}
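The labeling rule is given as an image, but the text describes it as a comparison of label counts inside each particle. Reading that as a majority vote, a Python sketch might look as follows. The function names are hypothetical, and the tie-breaking behavior (the label seen first wins) is an assumption that the source does not address.

```python
from collections import Counter


def label_particle(labels_in_particle):
    """Return the most frequent label, or None for an unknown particle."""
    if not labels_in_particle:
        return None  # no labeled flow: treated as unknown and skipped later
    return Counter(labels_in_particle).most_common(1)[0][0]


def traffic_classes(particle_labels):
    """Group particle indices by class, mirroring TC_p = {G_i | LG_i = Class_p}."""
    classes = {}
    for i, label in enumerate(particle_labels):
        if label is not None:
            classes.setdefault(label, []).append(i)
    return classes
```

Particles labeled None are left out of every TC_p, which matches the rule that unknown particles are not used in the classification phase.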
Next, the invention sets up rule bases at the packet level and the flow level. The rule base contains two parts: classification rules for individual flows, extracted from each network traffic particle, and identification rules for application behavior, extracted from each traffic class. To classify a single flow, the invention calculates the distance between the flow and each network traffic particle. The particle classifier is implemented as follows.
For a new flow y ∈ R^q, the invention classifies it in the following steps:
If y ∈ G_i (i = 1, 2, ..., r), that is, y_j lies inside G_ij for every j = 1, 2, ..., q, where G_i ∈ TC_p (p ∈ [1, K]), then the invention labels flow y as Class_p.
If
Figure BDA0002512464830000143
the invention needs to calculate the distance between flow y and each traffic particle G_i (i = 1, 2, ..., r); y is then assigned to the class TC_p (p ∈ [1, K]) that contains the G_i nearest to y.
The invention uses dis(y, G_i) to denote the distance from y to G_i (i = 1, 2, ..., r), which is calculated as follows:
Figure BDA0002512464830000142
dis(y_j, G_ij) is the distance between y and G_i (i = 1, 2, ..., r) in the j-th dimension.
Figure BDA0002512464830000151
Next, the invention first calculates, for each traffic class TC_p (p ∈ [1, K]), the shortest distance between y and the particles G_i (i = 1, 2, ..., r) that belong to it. Then, over all traffic classes, the invention labels y with the class TC_p (p ∈ [1, K]) that has the shortest distance to it. The formula is as follows:
Figure BDA0002512464830000152
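The two classification cases can be folded into one nearest-particle rule if a point inside a particle is given distance zero. The Python sketch below does this under explicit assumptions: the distance formulas are images in the source, so the per-dimension distance to an interval and its Euclidean aggregation are stand-ins chosen for illustration, and all function names are hypothetical.

```python
import math


def interval_distance(y_j, low, high):
    """Distance from a scalar to an interval; zero when it lies inside."""
    if y_j < low:
        return low - y_j
    if y_j > high:
        return y_j - high
    return 0.0


def particle_distance(y, particle):
    """Assumed aggregation: Euclidean norm of the per-dimension distances."""
    return math.sqrt(sum(
        interval_distance(y_j, low, high) ** 2
        for y_j, (low, high) in zip(y, particle)
    ))


def classify(y, particles, particle_labels):
    """Label y with the class of its nearest labeled particle."""
    nearest = min(range(len(particles)),
                  key=lambda i: particle_distance(y, particles[i]))
    return particle_labels[nearest]
```

Only labeled particles should be passed in, since unknown particles are excluded from classification.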
The invention provides a novel constrained fuzzy C-means algorithm for clustering traffic, which addresses the problem of an undersized training data set. The invention treats the correlation information that exists among network flows as a must-link relation in semi-supervised learning (that is, two correlated flows must be assigned to the same cluster). Therefore, when clustering network traffic, correlated flows should be kept as close together as possible. Using the correlation information as prior knowledge, CCFCM continuously adjusts the membership degree of a given data point according to the cardinality of its must-link data points in each cluster. CCFCM is more efficient and faster than other constrained FCM algorithms because it updates the membership matrix using only the ratio of must-link data points in each cluster, rather than evaluating the pairwise relationships one by one.
The invention establishes a novel representation of network traffic, called network traffic particles. Each network traffic particle is a multidimensional data set that contains many data points. Because the particles are constructed through an optimization process, they can fully capture the underlying structure of the traffic data. Traffic particles provide many benefits, such as identifying incompatible data, reducing the amount of data that needs to be represented, and supporting rule bases at multiple traffic data levels.
Based on the network traffic particles, the invention establishes two rule bases for network traffic measurement: a packet-level rule base and a flow-level rule base. With these rule bases, many network security measurement functions can be implemented, such as anomalous application detection and malicious traffic identification.
Based on the above discussion, the network traffic classification scheme of the invention proceeds in the following steps. First, the invention uses the traffic correlation information to find correlated flows in the training data set. Then, taking this prior knowledge into account, CCFCM divides the traffic data into several clusters. After CCFCM has run, the invention obtains a group of cluster centers expressed in numerical form. Next, guided by the principle of reasonable granularity, the invention constructs network traffic particles around the numerical cluster centers. In this way, the invention lifts the representation of the traffic data from a numerical format to a granular format. From each traffic particle, the invention extracts packet-level and flow-level features to build two rule bases. Based on these rule bases, the particle classifier can identify new flows, which can be used to detect network anomalies for network security measurement.
As shown in FIG. 4, in the training phase the invention first uses the traffic correlation information to merge the labeled data set with the unlabeled data set. Thus, the invention not only finds the correlated flows but also extends the training data set. CCFCM operates on the merged data set and outputs a set of cluster centers in numerical format. Network traffic particles (NTGs) are then constructed around these numerical cluster centers and are continually optimized under the guidance of the reasonable granularity criterion. This step is the core of the classification method of the invention. After the optimization process is completed, the invention obtains the best NTG. With the help of the labeled flows, each traffic particle is mapped to a corresponding traffic class. Packet-level and flow-level features are extracted from the NTG to build a classification rule base. In the test phase, the particle classifier identifies new network flows by means of the classification rules.
In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The above description is only for the purpose of illustrating the present invention and the appended claims are not to be construed as limiting the scope of the invention, which is intended to cover all modifications, equivalents and improvements that are within the spirit and scope of the invention as defined by the appended claims.

Claims (3)

1. A network traffic classification method based on constrained fuzzy clustering and particle computation, characterized in that the method comprises the following steps:
in the training phase, merging the labeled data set with the unlabeled data set by using the traffic correlation information;
running a custom constrained fuzzy C-means (CCFCM) algorithm on the merged data set, and outputting a group of cluster centers in numerical format;
constructing network traffic particles around the numerical cluster centers, and continuously optimizing them under the guidance of the reasonable granularity criterion;
mapping each obtained best network traffic particle (NTG) to a corresponding traffic class by means of the labeled flows;
extracting packet-level and flow-level features from the NTG, and constructing a classification rule base;
in the testing phase, identifying new network flows or network anomalies with the particle classifier by means of the classification rules;
the method for constructing the network traffic particles comprises the following steps:
step one, clustering the network traffic by using constrained fuzzy clustering;
step two, establishing network traffic information particles and constructing a particle classifier according to the clustering result;
step three, classifying the network traffic based on the particle classifier;
in step one,
first, network traffic is collected, and the following features are extracted from each flow: flow size, flow interval, the maximum, minimum, mean and standard deviation of the packet size, the maximum, minimum, mean and standard deviation of the packet inter-arrival time, and the number of bytes transmitted; a set of data flows S = {s_1, s_2, ..., s_n} with labels L = {l_1, l_2, ..., l_n} is given, where s_i ∈ R^q, q is the feature dimension, l_i ∈ (Class_1, Class_2, ..., Class_K), i ∈ [1, n], and Class_p (p ∈ [1, K]) is a traffic class, together with another unlabeled data set T = {t_1, t_2, ...};
second, the network traffic is clustered, and an enhancement coefficient guides the direction in which the membership degrees change during clustering, wherein the enhancement coefficient is calculated only from the ratio of must-link data flows contained in each cluster, and the objective function of CCFCM is as follows:
Figure FDA0003348241030000021
where m is the fuzzification coefficient, c is the number of clusters, N is the number of data flows, and 0 ≤ u_ik ≤ 1 is the membership degree of data flow x_k in cluster center v_i, which must satisfy
Figure FDA0003348241030000022
||·|| is the standard Euclidean distance, and β_ik is the enhancement coefficient;
β_ik is calculated as follows:
Figure FDA0003348241030000023
if x_k ∈ RL_l and Card(C_i) ≠ 0
where RL_l is a subset of correlated flows in RLS, Card(C_i) is the number of data flows in the i-th cluster that have a must-link relationship with x_k, and Card(RL_l) is the number of data flows in RL_l;
in step two, network traffic particles are constructed around the cluster centers v_1, v_2, ..., v_c represented in numerical form, and the traffic particles are denoted NTG = {G_1, G_2, ..., G_c}, where G_i = {G_i1, G_i2, ..., G_iq}, q is the flow feature dimension, and i = 1, 2, ..., c;
the traffic particles are constructed as follows:
(1) generating network traffic particles: network traffic particles are generated using the ε-information-particle rule; a network traffic particle G_i is structured like a hypercube, and each of its dimensions is computed as G_ij = [v_ij - (ε/2)·range_j, v_ij + (ε/2)·range_j], where v_ij is the value of the numerical center v_i in the j-th dimension, i = 1, 2, ..., c, j = 1, 2, ..., q, ε is the size of G_i, and range_j is the range of the original data values in the j-th dimension;
(2) reconstructing the original data points from the network traffic particles: a data point x_k is reconstructed from NTG = {G_1, G_2, ..., G_c}, and the reconstructed datum is an interval value, expressed as
Figure FDA0003348241030000024
with k = 1, 2, ..., N; if
Figure FDA0003348241030000025
then the representation capability of the network traffic particles for x_k is considered good;
(3) optimizing network traffic particles;
the membership degree is calculated as follows:
(1) if x_k ∈ G_i for some i = 1, 2, ..., c, then the membership degree of x_k in G_i is
Figure FDA0003348241030000031
and its membership degree in the other traffic particles is
Figure FDA0003348241030000032
g = 1, 2, ..., c and g ≠ i, so that x_k is represented by G_i, that is,
Figure FDA0003348241030000033
(2) if
Figure FDA0003348241030000034
i = 1, 2, ..., c, a defuzzification operation that aggregates the membership degrees with the constructed network traffic information particles is performed to calculate
Figure FDA0003348241030000035
The calculation formula is as follows:
Figure FDA0003348241030000036
where
Figure FDA0003348241030000037
and
Figure FDA0003348241030000038
are the lower and upper bounds of G_i; for the j-th dimension,
Figure FDA0003348241030000039
is calculated as follows:
Figure FDA00033482410300000310
where
Figure FDA00033482410300000311
in step three, the class of each particle is determined from the data points it contains; if a traffic particle contains no labeled flows, it is regarded as an unknown class and is not used in the classification phase; for a particle containing at least one labeled flow, its traffic class is assigned by comparing the numbers of flows carrying each label;
rule bases are set up at the packet level and the flow level, comprising classification rules for individual flows extracted from each network traffic particle and identification rules for application behavior extracted from each traffic class; to classify a single flow, the distance between the flow and each network traffic particle is calculated;
a new flow y ∈ R^q is classified according to the following steps:
if y ∈ G_i (i = 1, 2, ..., r), that is, y_j lies inside G_ij (j = 1, 2, ..., q), where G_i ∈ TC_p (p ∈ [1, K]), then flow y is labeled Class_p;
if
Figure FDA00033482410300000312
the distance between flow y and each traffic particle G_i (i = 1, 2, ..., r) is calculated, and y is assigned to the class TC_p (p ∈ [1, K]) that contains the G_i nearest to flow y.
2. A network anomaly detection and prevention method for implementing the network traffic classification method based on constrained fuzzy clustering and particle computation of claim 1.
3. A big data analysis method implementing the constrained fuzzy clustering and particle computation based network traffic classification method of claim 1.
CN202010465413.2A 2020-05-28 2020-05-28 Network traffic classification method based on constrained fuzzy clustering and particle computation Active CN111786903B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010465413.2A CN111786903B (en) 2020-05-28 2020-05-28 Network traffic classification method based on constrained fuzzy clustering and particle computation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010465413.2A CN111786903B (en) 2020-05-28 2020-05-28 Network traffic classification method based on constrained fuzzy clustering and particle computation

Publications (2)

Publication Number Publication Date
CN111786903A CN111786903A (en) 2020-10-16
CN111786903B true CN111786903B (en) 2022-02-25

Family

ID=72753907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010465413.2A Active CN111786903B (en) 2020-05-28 2020-05-28 Network traffic classification method based on constrained fuzzy clustering and particle computation

Country Status (1)

Country Link
CN (1) CN111786903B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101252541B (en) * 2008-04-09 2011-05-04 中国科学院计算技术研究所 Method for establishing network flow classified model and corresponding system thereof
US8817655B2 (en) * 2011-10-20 2014-08-26 Telefonaktiebolaget Lm Ericsson (Publ) Creating and using multiple packet traffic profiling models to profile packet flows
CN106452868B (en) * 2016-10-12 2019-04-05 中国电子科技集团公司第三十研究所 A kind of network flow statistic implementation method for supporting various dimensions polymerization classification
CN109726744B (en) * 2018-12-14 2020-11-10 深圳先进技术研究院 Network traffic classification method
CN109981474A (en) * 2019-03-26 2019-07-05 中国科学院信息工程研究所 A kind of network flow fine grit classification system and method for application-oriented software
CN110311829B (en) * 2019-05-24 2021-03-16 西安电子科技大学 Network traffic classification method based on machine learning acceleration
CN110572382B (en) * 2019-09-02 2021-05-18 西安电子科技大学 Malicious flow detection method based on SMOTE algorithm and ensemble learning
CN110765329B (en) * 2019-10-28 2022-09-23 北京天融信网络安全技术有限公司 Data clustering method and electronic equipment

Also Published As

Publication number Publication date
CN111786903A (en) 2020-10-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant