CN111786903B

CN111786903B - Network traffic classification method based on constrained fuzzy clustering and particle computation

Info

Publication number: CN111786903B
Application number: CN202010465413.2A
Authority: CN
Inventors: 靖旭阳; 赵晶晶; 闫峥; 维托尔德·佩德里茨
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-05-28
Filing date: 2020-05-28
Publication date: 2022-02-25
Anticipated expiration: 2040-05-28
Also published as: CN111786903A

Abstract

The invention belongs to the technical field of network traffic classification and discloses a network traffic classification method based on constrained fuzzy clustering and particle (granular) computation. In the training stage, a labeled data set and an unlabeled data set are merged using related traffic information; CCFCM is run on the merged data set and outputs a group of cluster centers in numerical format; network traffic particles (NTGs) are constructed around the numerical cluster centers and continuously optimized under the guidance of the principle of reasonable granularity; each traffic particle is mapped to a corresponding traffic category by means of the labeled flows; and packet-level and flow-level features are extracted from the NTGs to construct a classification rule base. In the test stage, the particle classifier identifies new network flows or network anomalies by means of the classification rules. Because network traffic particles describe the underlying structure of the traffic data in detail, the classification accuracy can be greatly improved.

Description

Network traffic classification method based on constrained fuzzy clustering and particle computation

Technical Field

The invention belongs to the technical field of network traffic classification, and particularly relates to a network traffic classification method based on constrained fuzzy clustering and particle computation.

Background

Network traffic classification aims to identify the categories of traffic generated by different applications and protocols. It gives network administrators a fine-grained or coarse-grained view of network conditions for tasks such as quality-of-service measurement, resource allocation, and intrusion detection, and thereby helps them manage the network. With the emergence of more and more new types of network services and network access devices, traffic classification has attracted increasing attention as a means of managing networks intelligently.

Current traffic classification methods fall mainly into five types. The first is correlation-based classification, which aggregates traffic according to its correlation and then applies a machine learning algorithm to the aggregated traffic. The second is feature-based classification, which classifies by analyzing flow-based or packet-based statistical features. The third is behavior-based classification, which uses the interaction behavior of a host to determine the host's roles in the network and then classifies according to the behavior of those roles. The fourth is port-based classification, which identifies traffic by examining the standard ports used by well-known applications. The fifth is payload-based classification, which uses deep packet inspection to match application or protocol signatures in the packet payload.

These methods have several problems. First, most of them misclassify unknown traffic: traffic classes that are absent from the training phase cannot be recognized as unknown and are instead attributed to known classes, which greatly reduces classification accuracy. Second, some methods are not always reliable; for example, payload-based classification becomes ineffective on encrypted data, and port-based classification becomes ineffective in the face of dynamic port mechanisms. Third, they cannot combine packet-level and flow-level features to perform traffic classification: their classification rules are based on either packet-level or flow-level traffic characteristics, so they fail on carefully crafted traffic (e.g., anomalous traffic generated by stealthy distributed denial-of-service attacks). Current network management tasks, such as network anomaly detection and network visualization, therefore require information fused from different traffic levels. To overcome these problems, a new traffic classification method is urgently needed that meets the following requirements: high accuracy; the ability to identify unknown traffic classes; the use of features from different traffic levels to describe the underlying traffic data structure at fine granularity; and the ability to cope with a shortage of training data.

The invention is a correlation-based classification method. Like most correlation-based methods, it exploits the correlation between data streams to increase classification accuracy. Existing correlation-based schemes, however, still have some of the problems discussed above, such as the inability to accurately identify unknown traffic and the incomplete use of traffic information from different levels.

Through the above analysis, the problems and defects of the prior art are as follows:

(1) most methods suffer from misclassification of unknown traffic. They cannot identify unknown traffic classes during the training phase, thereby attributing them to known traffic classes. This will affect the accuracy of the classification to a large extent;

(2) some methods are not always reliable. When the network fluctuates or the network environment changes, the accuracy of most methods becomes low;

(3) they cannot be used in conjunction with packet-level and flow-level features to perform traffic classification.

The difficulty in solving the above problems and defects is as follows. Although many methods continue to improve the accuracy and reliability of classification, reliable and stable traffic classification still faces many difficulties. First, as networks continue to develop, more and more applications generate massive data traffic in which unknown and even malicious traffic is mixed, which makes classification very difficult. Second, collecting data sets and labels is also difficult: how to obtain a large amount of real, reliable network traffic with correct labels without violating user privacy still requires further research.

The significance of solving the above problems and defects is as follows: solving them helps network users obtain better service and a more secure network environment, and helps researchers describe the traffic data structure more accurately by making classification finer-grained while further improving its accuracy.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a network flow classification method based on constrained fuzzy clustering and particle computation.

The invention is realized in such a way that the network flow classification method based on the constrained fuzzy clustering and the particle calculation comprises the following steps:

in the training phase, merging the labeled data set with the unlabeled data set by using related traffic information;

running CCFCM on the merged data set and outputting a group of cluster centers in numerical format;

constructing network traffic particles (NTGs) around the numerical cluster centers and continuously optimizing them under the guidance of the principle of reasonable granularity;

mapping each traffic particle to a corresponding traffic category by means of the labeled flows;

extracting packet-level and flow-level features from the NTGs and constructing a classification rule base;

in the test phase, identifying, by the particle classifier, new network flows or network anomalies by means of the classification rules.

Further, the method for constructing the network traffic particles comprises the following steps:

step one, clustering network traffic using constrained fuzzy clustering;

step two, building network traffic information particles and constructing a particle classifier according to the clustering result;

and step three, classifying the network traffic based on the particle classifier.

Further, in step one,

first, network traffic is collected, and the following features are extracted from each flow: flow size; flow interval; the maximum, minimum, mean and standard deviation of packet size; the maximum, minimum, mean and standard deviation of packet inter-arrival time; and the number of bytes transmitted; a set of data streams $S = \{s_1, s_2, \ldots, s_n\}$ with labels $L = \{l_1, l_2, \ldots, l_n\}$ is given, where $s_i \in R^q$, $q$ is the feature dimension, $l_i \in (Class_1, Class_2, \ldots, Class_K)$, $i \in [1, n]$, and $Class_p$ ($p \in [1, K]$) is a traffic class, together with another unlabeled data set $T = \{t_1, t_2, \ldots\}$;

secondly, the network traffic is clustered, and an enhancement coefficient is used to guide the direction of membership change during clustering, the enhancement coefficient being calculated only from the ratio of data streams having a must-link relation in each cluster; the objective function of CCFCM is as follows:

[formula not reproduced]

where $m$ is the fuzzification coefficient, $c$ is the number of clusters, $N$ is the number of data streams, $0 \le u_{ik} \le 1$ is the membership degree of data stream $x_k$ in cluster center $v_i$ and must satisfy [formula not reproduced], $\|\cdot\|$ is the standard Euclidean distance, and $\beta_{ik}$ is the enhancement coefficient;

$\beta_{ik}$ is calculated as follows:

[formula not reproduced]

where $RL_l$ is a related-flow subset in RLS, $Card(C_i)$ is the number of data streams in the $i$-th cluster that have a must-link relation with $x_k$, and $Card(RL_l)$ is the number of data streams in $RL_l$.

Further, in step two, network traffic particles are constructed around the cluster centers $v_1, v_2, \ldots, v_c$ represented by numerical data, and the traffic particles are expressed as $NTG = \{G_1, G_2, \ldots, G_c\}$, where $G_i = \{G_{i1}, G_{i2}, \ldots, G_{iq}\}$, $q$ is the traffic feature dimension, and $i = 1, 2, \ldots, c$.

Further, the construction process of the traffic particles is as follows:

(1) generating network traffic particles: network traffic particles are generated using the ε-information particle rule; a network traffic particle $G_i$ is structurally similar to a hypercube, each dimension of which is calculated as $G_{ij} = [v_{ij} - \frac{\varepsilon}{2} \cdot range_j,\ v_{ij} + \frac{\varepsilon}{2} \cdot range_j]$, where $v_{ij}$ is the value of the numerical center $v_i$ in the $j$-th dimension, $i = 1, 2, \ldots, c$, $j = 1, 2, \ldots, q$, $\varepsilon$ is the size of $G_i$, and $range_j$ is the range of the original data values in the $j$-th dimension;

(2) reconstructing the original data points from the network traffic particles: a data point $x_k$ is reconstructed from $NTG = \{G_1, G_2, \ldots, G_c\}$, the reconstructed data being an interval value [symbol not reproduced], $k = 1, 2, \ldots, N$; if [condition not reproduced], the ability of the network traffic particles to represent $x_k$ is considered good;

(3) optimizing the network traffic particles.

Further, the membership degree is calculated as follows:

(1) if $x_k \in G_i$, $i = 1, 2, \ldots, c$, the membership degree of $x_k$ in $G_i$ is [formula not reproduced] and its membership degree in the other traffic particles is [formula not reproduced]; $x_k$ is then represented by $G_i$;

(2) otherwise, a defuzzification operation is performed by aggregating the membership degrees with the constructed network traffic information particles to calculate the reconstruction of $x_k$, using the following formula:

[formula not reproduced]

where [symbols not reproduced] are the lower and upper bounds of $G_i$. For the $j$-th dimension, [symbol not reproduced] is calculated as follows:

[formula not reproduced]

where [definition not reproduced].

Further, in step three, the category of each traffic particle is determined from the data points it contains: if a traffic particle does not contain any labeled flow, it is regarded as an unknown category and is not used in the classification stage; a particle containing at least one labeled flow is assigned a traffic category by comparing the numbers of flows carrying the different labels;

a packet-level and flow-level rule base is set up, comprising classification rules for single flows, extracted from each network traffic particle, and identification rules for application behavior, extracted from each traffic category; to classify a single flow, the distance between the flow and each network traffic particle is calculated.

Further, a new flow $y \in R^q$ is classified according to the following steps:

if $y \in G_i$ ($i = 1, 2, \ldots, r$), i.e., each $y_j$ lies within $G_{ij}$ ($j = 1, 2, \ldots, q$), where $G_i \in TC_p$ ($p \in [1, K]$), then flow $y$ is labeled $Class_p$;

otherwise, the distance between flow $y$ and each traffic particle $G_i$ ($i = 1, 2, \ldots, r$) is calculated, and $y$ is assigned to the category $TC_p$ ($p \in [1, K]$) that contains the $G_i$ nearest to $y$.

Another objective of the present invention is to provide a network anomaly detection and prevention method for implementing a network traffic classification method based on constrained fuzzy clustering and particle computation.

Another object of the present invention is to provide a big data analysis method implementing a network traffic classification method based on constrained fuzzy clustering and particle computation.

By combining all the technical schemes, the invention aims to classify the network traffic from the granularity perspective, which is a new calculation method for information processing, and the invention has the advantages and positive effects that:

(1) effectiveness: the present invention is effective because it uses the relevant traffic information as a priori knowledge of the clustered network traffic. The designed "custom constrained fuzzy C-means" (CCFCM) algorithm is a semi-supervised learning method using machine learning, which incorporates a priori knowledge to obtain results close to the user's expectations. In CCFCM, the invention adjusts membership by taking into account the ratio of must-link data points, using an enhancement factor. Compared with other constrained fuzzy C-means algorithms, the updating process of the membership matrix and the clustering center of the CCFCM is simple and effective.

(2) Accuracy: the optimization of the network traffic particles makes the classification method of the invention more accurate than existing methods. To mine the underlying structure of the traffic data, the invention first uses CCFCM to obtain a rough description and then constructs network traffic particles from the clustering result using the optimization rule of the principle of reasonable granularity. This two-stage mining describes the structure of the traffic data fully and specifically, which improves the accuracy of the particle classifier.

(3) Robustness: this patent is robust because it has the ability to identify unknown traffic classes. The discovery of unknown traffic classes is important because the number of unknown classes can greatly affect the accuracy of traffic classification. If unknown traffic cannot be identified and wrongly classified into known classes, the performance of the classifier will be degraded. The method of the present invention can accurately find unknown flows because it can provide a detailed description of the flow structure. This property makes the classification method of the present invention more robust in the presence of unknown traffic.

(4) Versatility: the invention is versatile in that it can perform both traffic classification and anomaly detection, two functions that are highly desirable in current traffic measurement. The invention extracts packet-level and flow-level features from the network traffic particles to build two rule bases, one for classifying traffic categories and one for identifying application behavior. By adjusting the rule bases, more traffic measurement functions can be realized. The classification method can thus be applied to discovering abnormal network behavior and used as an effective method for detecting network intrusions and threats.

(5) Extensibility: the method of the invention can easily be extended to network security measurement tasks. By introducing CCFCM and applying the theory of granular computing, the invention provides a new method for analyzing and modeling network data. Similar principles can be used to process more types of traffic data and other types of data, for example analyzing abnormal traffic data to perform network anomaly detection.

(6) Since the network traffic particles can describe the potential structure of the traffic data in detail, the classification accuracy of the traffic can be greatly improved.

(7) In the prior art, classification rules are directly established on a cluster center, wherein outliers are contained, and thus the classification is not accurate enough. The accuracy of the prior art classification depends on the quality of the clusters. The present invention constructs network traffic particles around the cluster center and classifies traffic using rules extracted from each traffic particle.

(8) The extraction rule of the invention is more specific, and the classification accuracy is greatly improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments of the present application will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained from the drawings without creative efforts.

Fig. 1 is an exemplary diagram of a CCFCM provided by an embodiment of the present invention.

Fig. 2 is a network traffic information particle diagram according to an embodiment of the present invention.

Fig. 3 is a flowchart of a network traffic classification method based on constrained fuzzy clustering and particle computation according to an embodiment of the present invention.

Fig. 4 is a diagram illustrating specific steps performed in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

The present invention proposes the concept of network traffic particles and aims to build a particle-based classifier for classifying network traffic. Particles (information granules) and granularity are concepts from granular computing, a continuously evolving and powerful theory for solving complex problems, large-scale data mining and fuzzy information processing. The invention designs a novel clustering algorithm, custom Constrained Fuzzy C-Means (CCFCM), which incorporates a priori knowledge about traffic information to improve the accuracy of network traffic clustering. This a priori knowledge is the must-link relation between related flows, i.e., two related data streams must be clustered into the same cluster. Related flows are flows with the same destination address, destination port and transport protocol; given the stability of the network environment, such flows are generated by the same application within a short period of time. Clustering the traffic yields a number of cluster centers represented by numerical values. The invention constructs network traffic particles around these numerical cluster centers to improve the quality of the description of the traffic data structure. Because constructing network traffic particles is an optimization problem, the particles are continuously optimized and updated according to particle optimization criteria so as to describe the traffic structure in detail. Within each network traffic particle, the invention extracts packet-level and flow-level rules to build different particle classifiers. By combining the rules of these two traffic levels, the classifier can classify various traffic categories effectively and accurately and can further detect network anomalies. The present invention is described in detail below with reference to the accompanying drawings.

The invention realizes two important functions of network traffic measurement: traffic classification and anomaly detection. In the training phase, the invention uses related traffic information to merge the labeled and unlabeled data sets, which both finds related flows and expands the training data set. CCFCM is run on the merged data set and outputs a set of cluster centers in numerical format. Network traffic particles (NTGs) are then constructed around these numerical cluster centers and continually optimized under the guidance of the principle of reasonable granularity. This step is the core of the classification method. When the optimization is complete, the optimal NTGs are obtained. Using the labeled flows, each traffic particle is mapped to a corresponding traffic category. Packet-level and flow-level features are extracted from the NTGs to build a classification rule base. In the test phase, the particle classifier identifies new network flows or network anomalies by means of the classification rules.

The network flow classification method based on the constrained fuzzy clustering and the particle calculation comprises the following steps:

Step one, clustering network traffic using constrained fuzzy clustering.

Step two, building network traffic information particles and constructing a particle classifier according to the clustering result.

Step three, classifying the network traffic based on the particle classifier.

The first step specifically comprises:

(1) Collecting network traffic: a flow is defined as a collection of packets that share one or more attributes. These shared attributes, commonly referred to as the flow key, typically include packet header information, packet content, and meta-information. A flow summarizes network traffic information better than a single packet. The invention selects the source IP address, destination IP address, source port, destination port and transport protocol as the flow key, and extracts the following features from each flow: flow size (number of packets); flow interval (the time from the arrival of the first packet to the expiration of the flow); the maximum, minimum, mean and standard deviation of packet size; the maximum, minimum, mean and standard deviation of packet inter-arrival time; and the number of bytes transferred.
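The feature list above can be illustrated with a small Python sketch. It is not the patented implementation: the packet record layout (`src`, `dst`, `sport`, `dport`, `proto`, `ts`, `size`) and the function name `flow_features` are assumptions made for illustration, and the flow interval is approximated by the time between the first and last observed packet because the flow expiration rule is not specified here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def flow_features(packets):
    """Group packets by 5-tuple flow key and compute per-flow statistics.

    `packets` is an iterable of dicts with the hypothetical keys
    src, dst, sport, dport, proto, ts (seconds) and size (bytes).
    """
    flows = defaultdict(list)
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key].append(p)

    features = {}
    for key, pkts in flows.items():
        pkts.sort(key=lambda p: p["ts"])
        sizes = [p["size"] for p in pkts]
        times = [p["ts"] for p in pkts]
        # Inter-arrival times; a single-packet flow gets one zero gap.
        gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
        features[key] = {
            "flow_size": len(pkts),
            # Assumption: last packet time stands in for flow expiration.
            "flow_interval": times[-1] - times[0],
            "pkt_size": (max(sizes), min(sizes), mean(sizes), pstdev(sizes)),
            "iat": (max(gaps), min(gaps), mean(gaps), pstdev(gaps)),
            "bytes": sum(sizes),
        }
    return features


# Illustrative input: three packets of one TCP flow.
key = ("10.0.0.1", "10.0.0.2", 1234, 443, "TCP")
example_packets = [
    dict(zip(("src", "dst", "sport", "dport", "proto"), key), ts=t, size=s)
    for t, s in [(0.0, 100), (0.5, 300), (2.0, 200)]
]
feats = flow_features(example_packets)[key]
```

Each flow then becomes one feature vector $s_i \in R^q$; how the tuples are flattened into that vector is left open here.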

Related flows are flows with the same destination address, destination port and transport protocol. Because the host providing some applications will not change its services in a short time, the related streams are generated by the same application. Therefore, in the clustering process, related streams should be clustered into the same cluster. In addition, there is no cannot-link constraint relationship in the network traffic information, because the present invention cannot determine whether two flows belong to different traffic classes according to the header information.

Given a set of data streams $S = \{s_1, s_2, \ldots, s_n\}$ with labels $L = \{l_1, l_2, \ldots, l_n\}$, where $s_i \in R^q$, $q$ is the feature dimension, $l_i \in (Class_1, Class_2, \ldots, Class_K)$, $i \in [1, n]$, and $Class_p$ ($p \in [1, K]$) is a traffic class, and another unlabeled data set $T = \{t_1, t_2, \ldots\}$, the invention addresses the shortage of training data by using related-flow information to combine $S$ and $T$ into a larger training data set. The combined training data set comprises two parts: the related-flow sets $RLS = \{RL_1, RL_2, \ldots\}$ and the independent flow set IS. The invention assigns a must-link relation to the data streams within the same related-flow set. In addition, if $RL_l$ contains labeled flows, all flows in $RL_l$ are assigned to the category of those labeled flows.
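The merging step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented procedure: the function name `build_related_sets`, the flow identifiers and the dictionary layout are hypothetical, and a related-flow set whose labeled flows disagree is simply left unchanged, a case the text does not address.

```python
from collections import defaultdict


def build_related_sets(flows, labels):
    """Split flows into related-flow sets (RLS) and independent flows (IS).

    flows:  {flow_id: (src, dst, sport, dport, proto)} for S and T together
    labels: {flow_id: class_name} for the labeled flows only
    Flows sharing destination address, destination port and protocol are
    related and receive a must-link relation.
    """
    groups = defaultdict(list)
    for fid, (_src, dst, _sport, dport, proto) in flows.items():
        groups[(dst, dport, proto)].append(fid)

    rls = [sorted(g) for g in groups.values() if len(g) > 1]
    independent = [g[0] for g in groups.values() if len(g) == 1]

    # Propagate a label to every flow of a related set that has exactly one
    # distinct known class (conflicting sets are left as they are: assumption).
    propagated = dict(labels)
    for rl in rls:
        known = {labels[f] for f in rl if f in labels}
        if len(known) == 1:
            cls = known.pop()
            for f in rl:
                propagated[f] = cls
    return rls, independent, propagated


example_flows = {
    "a": ("10.0.0.1", "93.184.216.34", 50000, 443, "TCP"),
    "b": ("10.0.0.2", "93.184.216.34", 50001, 443, "TCP"),
    "c": ("10.0.0.3", "8.8.8.8", 50002, 53, "UDP"),
}
rls, independent, propagated = build_related_sets(example_flows, {"a": "HTTPS"})
```

In this example flows `a` and `b` form one related-flow set and the unlabeled flow `b` inherits the label of `a`, which is how the training set grows.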

(2) Clustering network traffic: the goal of custom constrained fuzzy C-means (CCFCM) is to assign each data stream, as far as possible, to the cluster that contains the most data streams having a must-link relation with it. The invention uses an enhancement coefficient to guide the direction of membership change during clustering. The enhancement coefficient is calculated only from the ratio of data streams having a must-link relation in each cluster.

The objective function of CCFCM is:

[formula not reproduced]

where $m$ is the fuzzification coefficient, $c$ is the number of clusters, $N$ is the number of data streams, $0 \le u_{ik} \le 1$ is the membership degree of data stream $x_k$ in cluster center $v_i$ and must satisfy [formula not reproduced], $\|\cdot\|$ is the standard Euclidean distance, and $\beta_{ik}$ is the enhancement coefficient.

$\beta_{ik}$ is calculated as follows:

[formula not reproduced]

where $RL_l$ is a related-flow subset in RLS, $Card(C_i)$ is the number of data streams in the $i$-th cluster that have a must-link relation with $x_k$, and $Card(RL_l)$ is the number of data streams in $RL_l$, satisfying $Card(C_i) \le Card(RL_l)$.

The effect of $\beta_{ik}$ is to increase the membership of data stream $x_k$ in the cluster that contains data streams with a must-link relation to it, i.e., to shorten the distance between $x_k$ and that cluster. An example is given in Fig. 1 for two cluster centers $v_1$ and $v_2$, whose members are represented by black and green dots. The green dots have a must-link relation with $x_k$. Cluster $v_1$ contains more green dots than cluster $v_2$. The invention therefore increases the membership of $x_k$ in $v_1$, which shortens the distance between them.
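The exact formula for $\beta_{ik}$ is not reproduced in this text, so the Python sketch below computes only the quantity the description names, the ratio $Card(C_i)/Card(RL_l)$ for one data stream $x_k$. Several points are assumptions made for illustration: the function name `must_link_ratio`, the use of hard cluster assignments (for example, the cluster of largest membership), and the choice not to count $x_k$ itself in $Card(C_i)$.

```python
def must_link_ratio(k, related_set, assignments, n_clusters):
    """Return Card(C_i) / Card(RL_l) for every cluster i, for stream k.

    related_set: ids of all streams in RL_l (including k)
    assignments: {stream_id: cluster index} for the other streams in RL_l
    Card(C_i) counts the streams of RL_l, other than k, currently in cluster i.
    """
    counts = [0] * n_clusters
    for j in related_set:
        if j != k:
            counts[assignments[j]] += 1
    return [cnt / len(related_set) for cnt in counts]


# A situation like Fig. 1: three must-link partners sit in cluster 0
# and one in cluster 1, so cluster 0 gets the larger ratio.
ratios = must_link_ratio(
    k=0,
    related_set=[0, 1, 2, 3, 4],
    assignments={1: 0, 2: 0, 3: 0, 4: 1},
    n_clusters=2,
)
```

A larger ratio for a cluster indicates that the membership of $x_k$ in that cluster should be strengthened; how the ratio enters the objective function is not shown here.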

Fig. 1 is an exemplary diagram of CCFCM. The update formulas for the membership degrees and the cluster centers of CCFCM are as follows:

[formulas not reproduced]

The second step specifically comprises:

After CCFCM is completed, the invention obtains the cluster centers represented by numerical data, namely $v_1, v_2, \ldots, v_c$. Network traffic particles can be constructed around these numerical cluster centers. The invention expresses the traffic particles as $NTG = \{G_1, G_2, \ldots, G_c\}$, where $G_i = \{G_{i1}, G_{i2}, \ldots, G_{iq}\}$, $q$ is the traffic feature dimension, and $i = 1, 2, \ldots, c$. The traffic particles are constructed as follows.

(1) Generating network traffic particles: the purpose of this step is to construct traffic information particles from the original numerical centers. The invention uses the ε-information particle rule to generate network traffic particles. A network traffic particle $G_i$ is structurally similar to a hypercube, each dimension of which is calculated as $G_{ij} = [v_{ij} - \frac{\varepsilon}{2} \cdot range_j,\ v_{ij} + \frac{\varepsilon}{2} \cdot range_j]$, where $v_{ij}$ is the value of the numerical center $v_i$ in the $j$-th dimension, $i = 1, 2, \ldots, c$, $j = 1, 2, \ldots, q$, $\varepsilon$ is the size of $G_i$, and $range_j$ is the range of the original data values in the $j$-th dimension (i.e., the difference between the maximum and minimum values in the $j$-th dimension).
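The interval rule for $G_{ij}$ translates directly into code. The following Python sketch is illustrative only; the function name `build_particles` and the representation of a particle as a list of `(lower, upper)` pairs are assumptions.

```python
def build_particles(centers, data, eps):
    """Build one hyper-box per cluster center.

    Each dimension j of particle i is the interval
    [v_ij - eps/2 * range_j, v_ij + eps/2 * range_j],
    where range_j is max minus min of the data in dimension j.
    """
    q = len(data[0])
    ranges = [max(x[j] for x in data) - min(x[j] for x in data) for j in range(q)]
    return [
        [(v[j] - eps / 2 * ranges[j], v[j] + eps / 2 * ranges[j]) for j in range(q)]
        for v in centers
    ]


# Illustrative data: ranges are 10 and 4, so eps = 0.5 gives half-widths 2.5 and 1.0.
example_data = [[0.0, 0.0], [10.0, 4.0], [2.0, 2.0]]
particles = build_particles(centers=[[2.0, 1.0]], data=example_data, eps=0.5)
```

With $\varepsilon = 1$ every particle spans the full data range in each dimension, and smaller values give proportionally tighter boxes.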

To refine $G_{ij}$, the invention first finds the points in $G_{ij}$ that are farthest from $v_i$. It then removes the empty regions between these farthest data points and the boundaries, so that the points lie on the boundaries. Fig. 2 shows a two-dimensional traffic information particle: the dotted boundary is the original network traffic particle, and the solid boundary is the compressed network traffic particle. Discarding the unused regions compresses the traffic particle and makes it more specific.
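One way to read this compression step is that each side of the box is pulled in to the outermost data point the box contains, which yields the bounding box of the contained points. The Python sketch below implements that reading; it is an interpretation rather than the patented procedure, and the names `contains` and `compress` are hypothetical.

```python
def contains(particle, x):
    """True if point x lies inside the hyper-box in every dimension."""
    return all(lo <= xj <= hi for (lo, hi), xj in zip(particle, x))


def compress(particle, data):
    """Shrink a particle to the bounding box of the data points it contains.

    A particle that contains no data point is returned unchanged.
    """
    inside = [x for x in data if contains(particle, x)]
    if not inside:
        return particle
    return [
        (min(x[j] for x in inside), max(x[j] for x in inside))
        for j in range(len(particle))
    ]


original = [(-0.5, 4.5), (0.0, 2.0)]
points = [[0, 0], [10, 4], [2, 2], [3, 1]]
compressed = compress(original, points)
```

In the example the point `[10, 4]` lies outside the original box and is ignored, and the remaining three points determine the tighter boundaries.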

Furthermore, it can be seen from fig. 2 that a network traffic particle is actually a collection of a series of data points. Using particles for representation, the present invention can represent flow information in a readable manner using fewer data points.

(2) Reconstructing the original data points from the network traffic particles: the purpose of this step is to reconstruct the original data points from the network traffic particles in order to test their representational capacity. Suppose a data point $x_k$ is reconstructed from $NTG = \{G_1, G_2, \ldots, G_c\}$; the reconstructed data is an interval value [symbol not reproduced], $k = 1, 2, \ldots, N$. If [condition not reproduced], the ability of the network traffic particles to represent $x_k$ is considered good. To calculate the reconstruction, the membership degree of $x_k$ in each $G_i$, $i = 1, 2, \ldots, c$, must be known. There are two ways to calculate the membership degree.

1) If $x_k \in G_i$, $i = 1, 2, \ldots, c$, the membership degree of $x_k$ in $G_i$ is [formula not reproduced] and its membership degree in the other traffic particles is [formula not reproduced]. This means that $G_i$ can be used to represent $x_k$. In this case, [formula not reproduced].

2) Otherwise, the invention performs a defuzzification operation by aggregating the membership degrees with the constructed network traffic information particles to calculate the reconstruction. The calculation formula is as follows:

[formula not reproduced]

where [symbol not reproduced] and [symbol not reproduced] are calculated as follows:

[formula not reproduced]

where [definition not reproduced].

(3) Optimizing network traffic particles: this step is very important because it guides the construction of the network traffic particles. The invention first introduces the principle of reasonable granularity (known in the granular computing literature as the principle of justifiable granularity), an optimization rule with two performance indicators: coverage and specificity.

Coverage: coverage requires that a traffic particle contain as many original data points as possible. The more data points a network traffic particle contains, the better its representational capability. Through the reconstruction of the original data points performed in step (2), the invention can count the original data points contained in a network traffic particle. The coverage is calculated as follows:

[formula not reproduced]

where $N$ is the number of original data points and [symbol not reproduced] is the number of data points contained in their reconstruction. Coverage increases as $\varepsilon$ increases.

Specificity: specificity directs network traffic particles to cover original data points that are as specific as possible. This means that a traffic particle should be small in order to give the traffic a clearly defined semantics: the smaller the network traffic particle, the more similar the data points it contains. Specificity therefore decreases as $\varepsilon$ increases. It is defined and calculated as follows:

[formulas not reproduced]

where $a_j$ and $b_j$ are the maximum and minimum values of the original data points in the $j$-th dimension.

From the definitions of coverage and specificity, the invention can find that a competitive relationship exists between the two performance indexes. The greater the coverage, the less specificity. Both of these metrics are affected by the size of the network traffic particles epsilon. Therefore, how to balance coverage and specificity to obtain the optimal epsilon is the main motivation for the rational granularity principle. The present invention measures this competition relationship using the following quality assessment:

QA(ε)＝Coverage*Specificity^α

where α is a non-negative parameter, specificity is more important when α > 1 and coverage is more important when α < 1. When α is 1, coverage is as important as specificity.
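The trade-off expressed by $QA(\varepsilon)$ can be explored with a simple grid search over $\varepsilon$. The coverage and specificity formulas are not reproduced in this text, so the Python sketch below substitutes simple stand-ins that are assumptions: coverage is the fraction of data points lying inside at least one particle, and specificity is the average of $1 - width_j / (a_j - b_j)$ over all particles and dimensions. The function names are hypothetical, and every dimension is assumed to have a non-zero range.

```python
def make_particles(centers, ranges, eps):
    """Hyper-boxes [v_ij - eps/2*range_j, v_ij + eps/2*range_j]."""
    return [
        [(v[j] - eps / 2 * ranges[j], v[j] + eps / 2 * ranges[j])
         for j in range(len(ranges))]
        for v in centers
    ]


def coverage(particles, data):
    """Assumed stand-in: share of points inside at least one particle."""
    hit = sum(
        any(all(lo <= xj <= hi for (lo, hi), xj in zip(g, x)) for g in particles)
        for x in data
    )
    return hit / len(data)


def specificity(particles, ranges):
    """Assumed stand-in: mean of 1 - width_j / range_j, floored at 0."""
    terms = [
        max(0.0, 1 - (hi - lo) / ranges[j])
        for g in particles
        for j, (lo, hi) in enumerate(g)
    ]
    return sum(terms) / len(terms)


def quality(centers, data, eps, alpha=1.0):
    """QA(eps) = Coverage * Specificity ** alpha."""
    q = len(data[0])
    ranges = [max(x[j] for x in data) - min(x[j] for x in data) for j in range(q)]
    particles = make_particles(centers, ranges, eps)
    return coverage(particles, data) * specificity(particles, ranges) ** alpha


def best_eps(centers, data, alpha=1.0, steps=20):
    """Pick the eps on a uniform grid in (0, 1] that maximizes QA."""
    grid = [(i + 1) / steps for i in range(steps)]
    return max(grid, key=lambda e: quality(centers, data, e, alpha))


demo_data = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [8.0, 8.0], [9.0, 9.0], [10.0, 8.0]]
demo_centers = [[1.0, 0.5], [9.0, 8.5]]
eps_star = best_eps(demo_centers, demo_data)
```

Under these stand-in definitions the search settles on the smallest grid value at which both boxes cover all of their points, since enlarging the boxes further only lowers specificity.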

The third step specifically comprises:

For each NTG, the present invention first determines its class based on the data points contained in the particle. If a traffic particle does not contain any labeled flows, the invention treats it as an unknown class and does not use it in the classification phase; other techniques, such as deep packet inspection, can be used to analyze it. For particles containing at least one labeled flow, the present invention assigns the traffic class by comparing the number of flows carrying each label.

For example, in traffic particle G_i (i = 1, 2, ..., r; r ≤ c), the labeled flows are {lf_i1, lf_i2, ..., lf_iu} and their corresponding classes are {ll_i1, ll_i2, ..., ll_iu}, where ll_io ∈ (Class₁, Class₂, ..., Class_K), o = 1, 2, ..., u. The invention can use the following formula to set the label of traffic particle G_i (denoted LG_i) to Class_p (p ∈ [1, K]):

After all traffic particles are labeled, traffic class Class_p (p ∈ [1, K]) can be represented by a set of network traffic particles, as follows:

TC_p = {G_i, i = 1, 2, ..., r | LG_i = Class_p}
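The labeling of particles and the grouping into TC_p can be sketched as follows. This is a minimal illustration that assumes the "comparison of the number of flows" is a simple majority vote; the tie-breaking behavior (the label seen first wins), the function names, and the sample labels are assumptions for illustration only.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def label_particle(flow_labels: List[str]) -> Optional[str]:
    """Return the most frequent label (LG_i), or None for an unknown particle.

    Assumption: majority vote; on a tie the label encountered first wins.
    """
    if not flow_labels:
        return None  # no labeled flows: unknown class, excluded from use
    return Counter(flow_labels).most_common(1)[0][0]


def build_traffic_classes(particle_labels: Dict[int, List[str]]) -> Dict[str, List[int]]:
    """Group particle indices by assigned class: TC_p = {G_i | LG_i = Class_p}."""
    traffic_classes = defaultdict(list)
    for index, labels in sorted(particle_labels.items()):
        assigned = label_particle(labels)
        if assigned is not None:
            traffic_classes[assigned].append(index)
    return dict(traffic_classes)


# Hypothetical labeled flows per particle index (illustration only).
labeled_flows = {1: ["web", "web", "p2p"], 2: [], 3: ["p2p"], 4: ["web"]}
traffic_classes = build_traffic_classes(labeled_flows)
```

In this example particle 2 holds no labeled flows, so it is left out of every traffic class, as the text prescribes for unknown particles.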

Next, the present invention sets up rule bases at the packet level and the flow level. A rule base contains two parts: classification rules for individual flows, extracted from each network traffic particle, and identification rules for application behavior, extracted from each traffic class. To classify a single flow, the present invention calculates the distance between the flow and each network traffic particle. The particle classifier is implemented as follows.

For a new flow y ∈ R^q, the invention classifies it according to the following steps:

If y ∈ G_i (i = 1, 2, ..., r), i.e., each y_j lies within the interval G_ij (j = 1, 2, ..., q) of the traffic particle, where G_i ∈ TC_p (p ∈ [1, K]), then the invention labels flow y as Class_p.

If y does not fall inside any traffic particle G_i (i = 1, 2, ..., r), the invention calculates the distance between flow y and each traffic particle G_i, and assigns y to the class TC_p (p ∈ [1, K]) that contains the G_i nearest to flow y.

The present invention uses dis(y, G_i) to denote the distance from y to G_i (i = 1, 2, ..., r), which is calculated as follows:

where dis(y_j, G_ij) is the distance between y and G_i (i = 1, 2, ..., r) in the j-th dimension.

Next, for each traffic class TC_p (p ∈ [1, K]), the invention first calculates the shortest distance between y and the particles G_i (i = 1, 2, ..., r) belonging to that class. Then, across all K traffic classes, the invention labels y with the class TC_p whose shortest distance to y is smallest. The formula is as follows:

p* = argmin_{p ∈ [1, K]} min_{G_i ∈ TC_p} dis(y, G_i), and flow y is labeled Class_p*.
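The two classification steps can be sketched in Python as below. The sketch rests on two labeled assumptions: the per-dimension distance dis(y_j, G_ij) is taken to be the gap between y_j and the nearest bound of the interval (zero when y_j lies inside), and the per-dimension distances are combined with a Euclidean norm. Neither choice is confirmed by the text; the type aliases, function names, and sample particles are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]   # (lower bound, upper bound) of G_ij
Particle = List[Interval]        # one interval per feature dimension


def dim_distance(y_j: float, interval: Interval) -> float:
    """Assumed dis(y_j, G_ij): gap to the nearest bound, 0 inside the interval."""
    low, high = interval
    if y_j < low:
        return low - y_j
    if y_j > high:
        return y_j - high
    return 0.0


def distance(y: Sequence[float], particle: Particle) -> float:
    """Assumed dis(y, G_i): Euclidean aggregation of per-dimension distances."""
    return math.sqrt(sum(dim_distance(y_j, iv) ** 2 for y_j, iv in zip(y, particle)))


def classify(y: Sequence[float], traffic_classes: Dict[str, List[Particle]]) -> str:
    """Label y with the class whose nearest particle is closest to y.

    A flow inside a particle has distance 0, so containment is covered too.
    Every class in traffic_classes must hold at least one particle.
    """
    return min(traffic_classes,
               key=lambda p: min(distance(y, g) for g in traffic_classes[p]))


# Hypothetical two-dimensional particles grouped by class (illustration only).
classes = {
    "web": [[(0.0, 1.0), (0.0, 1.0)]],
    "p2p": [[(2.0, 3.0), (2.0, 3.0)], [(5.0, 6.0), (0.0, 1.0)]],
}
inside = classify((0.5, 0.5), classes)    # lies inside the "web" particle
outside = classify((1.8, 2.5), classes)   # outside every particle
```

If particles of different classes overlap, a contained flow has distance 0 to more than one class, and this sketch returns whichever class appears first in the dictionary.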

The invention provides a novel constrained fuzzy C-means algorithm (CCFCM) for clustering traffic and for addressing the problem of an undersized training data set. The invention treats the correlation information present in network traffic as a must-link relation in semi-supervised learning (i.e., two related flows must be assigned to the same cluster). Therefore, when clustering network traffic, related flows should be kept as close together as possible. Using the correlation information as prior knowledge, CCFCM continuously adjusts the membership degree of a given data point by considering the cardinality of its must-link data points in each cluster. CCFCM is more efficient and faster than other constrained FCM algorithms because it updates the membership matrix by considering only the proportion of must-link data points in each cluster, rather than evaluating each relation individually.

The invention establishes a novel representation of network traffic, called network traffic particles. Each network traffic particle is a high-dimensional set that contains many data points. Because its construction is an optimization process, the underlying structure of the traffic data can be fully captured. Traffic particles provide many benefits, such as identifying incompatible data, reducing the amount of data that needs to be represented, and building rule bases at multiple levels of traffic data.

The invention establishes two rule bases for network traffic measurement based on network traffic particles: a packet-level rule base and a flow-level rule base. Using these rule bases, many network security measurement functions can be implemented, such as anomalous application detection and malicious traffic identification.

Based on the above discussion, the network traffic classification scheme of the present invention proceeds as follows. First, the present invention uses traffic correlation information to find related flows in the training data set. Then, taking this prior knowledge into account, CCFCM divides the traffic data into several clusters. After CCFCM is executed, the invention obtains a group of cluster centers expressed in numerical form. Next, guided by the reasonable granularity principle, the invention constructs network traffic particles around the numerical cluster centers. In this way, the representation of the traffic data is lifted from a numeric format to a granular format. From each traffic particle, the invention extracts packet-level and flow-level features to build two rule bases. Based on these rule bases, the granular classifier can identify new flows, which can be used to detect network anomalies for network security measurement.
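The step of building particles around numerical cluster centers can be sketched with the interval rule stated in claim 1, G_ij = [v_ij - ε/2*range_j, v_ij + ε/2*range_j]. The sketch below is only an illustration: it reads range_j as the maximum minus the minimum of the original data in dimension j, which is an interpretation, and the function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]


def feature_ranges(data: Sequence[Sequence[float]]) -> List[float]:
    """range_j, read here as max - min of the original data in dimension j."""
    return [max(column) - min(column) for column in zip(*data)]


def build_particles(centers: Sequence[Sequence[float]],
                    ranges: Sequence[float],
                    eps: float) -> List[List[Interval]]:
    """Build one hypercube-like particle per cluster center.

    G_ij = [v_ij - eps/2 * range_j, v_ij + eps/2 * range_j]
    """
    return [[(v - eps / 2 * r, v + eps / 2 * r) for v, r in zip(center, ranges)]
            for center in centers]


def contains(particle: Sequence[Interval], x: Sequence[float]) -> bool:
    """True when every coordinate of x lies inside the matching interval."""
    return all(low <= x_j <= high for x_j, (low, high) in zip(x, particle))


# Hypothetical two-feature data set and a single cluster center.
data = [(0.0, 0.0), (1.0, 4.0), (2.0, 8.0), (10.0, 10.0)]
ranges = feature_ranges(data)
particles = build_particles([(1.0, 4.0)], ranges, eps=0.5)
```

Increasing `eps` widens every interval in proportion to the spread of its feature, so features measured on different scales are treated comparably.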

In FIG. 4, in the training phase, the present invention first uses traffic correlation information to merge the labeled data set with the unlabeled data set. Thus, the present invention can not only find related flows but also extend the training data set. CCFCM operates on the merged data set and outputs a set of cluster centers in numerical format. Network traffic particles (NTGs) are then constructed around these numerical cluster centers and are continually optimized under the guidance of the reasonable granularity criterion. This step is the core of the classification method of the present invention. After the optimization process is completed, the present invention obtains the best NTGs. Using the labeled flows, each traffic particle is mapped to a corresponding traffic class. Packet-level and flow-level features are extracted from the NTGs to build a classification rule base. In the test phase, the particle classifier identifies new network flows by means of the classification rules.

In the description of the present invention, "a plurality" means two or more unless otherwise specified; the terms "upper", "lower", "left", "right", "inner", "outer", "front", "rear", "head", "tail", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing and simplifying the description, and do not indicate or imply that the device or element referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus, should not be construed as limiting the invention. Furthermore, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

The above description covers only specific embodiments of the present invention and is not to be construed as limiting its scope of protection, which is intended to cover all modifications, equivalents, and improvements that fall within the spirit and scope of the invention as defined by the appended claims.

Claims

1. A network traffic classification method based on constrained fuzzy clustering and particle computation is characterized in that the network traffic classification method based on constrained fuzzy clustering and particle computation comprises the following steps:

operating on the merged data set with the custom constrained fuzzy C-means algorithm CCFCM, and outputting a group of cluster centers in numerical format;

mapping each obtained best network traffic particle NTG to a corresponding traffic class by means of the flow with the mark;

in the testing stage, the particle classifier identifies new network flow or network abnormality by means of classification rules;

the method for constructing the network traffic particles comprises the following steps:

clustering network flow by using constraint fuzzy clustering;

classifying the network traffic based on a granularity classifier;

in the first step, the first step is carried out,

first, network traffic is collected, and the following features are extracted from each flow: flow size, flow interval, maximum, minimum, mean and standard deviation of data packet size, maximum, minimum, mean and standard deviation of data packet arrival interval, and number of bytes transmitted; given a set of data flows S = {s₁, s₂, ..., s_n} whose labels are L = {l₁, l₂, ..., l_n}, where s_i ∈ R^q, q is the feature dimension, l_i ∈ (Class₁, Class₂, ..., Class_K), i ∈ [1, n], and Class_p (p ∈ [1, K]) is a traffic class, and given another unlabeled data set T = {t₁, t₂, ...};

secondly, the network traffic is clustered, and an enhancement coefficient is used to guide the direction of membership change during clustering, wherein the enhancement coefficient is calculated only from the proportion of must-link data flows contained in each cluster, and the objective function of CCFCM is as follows:

where ||·|| is the standard Euclidean distance and β_ik is the enhancement coefficient;

β_ik is calculated as follows:

if x_k ∈ RL_l and Card(C_i) ≠ 0

where RL_l is a subset of related flows in RLS, Card(C_i) is the number of data flows in the i-th cluster that have a must-link relationship with x_k, and Card(RL_l) is the number of data flows in RL_l;

in step two, network traffic particles are constructed around the cluster centers v₁, v₂, ..., v_c represented by numerical data, and the traffic particles are expressed as NTG = {G₁, G₂, ..., G_c}, where G_i = {G_i1, G_i2, ..., G_iq}, q is the flow feature dimension, and i = 1, 2, ..., c;

the construction process of the flow particles is as follows:

(1) generating network traffic particles: network traffic particles are generated using the ε-information particle rule; a network traffic particle G_i is structurally similar to a hypercube, with each dimension calculated as G_ij = [v_ij - ε/2*range_j, v_ij + ε/2*range_j], where v_ij is the value of the numerical center v_i in the j-th dimension, i = 1, 2, ..., c, j = 1, 2, ..., q, ε is the size of G_i, and range_j is the range of variation of the original data values in the j-th dimension;

k 1,2,. N; if it is not

(3) optimizing network traffic particles;

the method for calculating the membership comprises the following steps:

(1) if x_k ∈ G_i, i = 1, 2, ..., c, then the membership degree of x_k to G_i is

the membership degree to the other traffic particles is

g = 1, 2, ..., c and g ≠ i, represented using G_i,

(2) if it is not

i = 1, 2, ..., c, calculating by performing a defuzzification operation on the membership aggregation and the constructed network traffic information particles

The calculation formula of (a) is as follows:

wherein

And

are the lower and upper bounds of G_i; for the j-th dimension,

the calculation formula of (2) is as follows:

wherein

in step three, the class is determined based on the data points contained in the particle; if a traffic particle does not contain any labeled flow, it is regarded as an unknown class and is not used in the classification stage; for a particle containing at least one labeled flow, its traffic class is assigned by comparing the number of flows carrying each label;

rule bases at the packet level and the flow level are set up, comprising classification rules for a single flow extracted from each network traffic particle and identification rules for application behavior extracted from each traffic class; to classify a single flow, the distance between the flow and each network traffic particle is calculated;

for a new flow y ∈ R^q, the classification is carried out according to the following steps:

if y ∈ G_i (i = 1, 2, ..., r), i.e., each y_j lies within the interval G_ij (j = 1, 2, ..., q) of the traffic particle, where G_i ∈ TC_p (p ∈ [1, K]), then the label of flow y is Class_p;

if y does not fall inside any traffic particle G_i (i = 1, 2, ..., r), the distance between flow y and each traffic particle G_i is calculated, and y is assigned to the class TC_p (p ∈ [1, K]) that contains the G_i nearest to flow y.

2. A network anomaly detection and prevention method for implementing the network traffic classification method based on constrained fuzzy clustering and particle computation of claim 1.

3. A big data analysis method implementing the constrained fuzzy clustering and particle computation based network traffic classification method of claim 1.