CN105471670A

CN105471670A - Flow data classification method and device

Info

Publication number: CN105471670A
Application number: CN201410462489.4A
Authority: CN
Inventors: 吴少勇; 喻敬海; 王延松; 吴春明
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2014-09-11
Filing date: 2014-09-11
Publication date: 2016-04-06
Anticipated expiration: 2034-09-11
Also published as: WO2015154484A1; CN105471670B

Abstract

The invention discloses a flow data classification method and device. The method comprises: collecting data packets and reassembling them into flows to generate flow data; marking the service type of part of the flow data to form learning samples for each service type, and setting the remaining flow data as the flow data set to be classified; extracting the common numerical attribute feature set of each piece of flow data in the flow data set, and organizing the flow data in the set into flow records composed of these feature sets; and, according to the learning samples, calculating the common numerical attribute feature set of each service type by subspace clustering, and marking the service types of the flow data in the flow data set according to the calculated feature sets and the feature sets of the flow data in the flow records.

Description

Flow data classification method and device

Technical field

The present invention relates to the field of computer technology, and in particular to a flow data classification method and device.

Background

In the prior art, classification of network traffic by service type has a wide range of application and high practical value. It allows the network flow data on high-bandwidth, high-transfer-rate ports to be classified accurately by service type in real time. Different service types place different demands on network resources, and traffic of different service types is managed differently, so efficient and accurate classification of network traffic by service is the basis for operations such as network resource management and flow control.

Traffic classification based on deep packet inspection (Deep Packet Inspection, DPI) relies on a feature library for the corresponding service types, and building the feature library itself requires a large amount of manual effort. Meanwhile, the service categories and features of traffic on live networks keep changing and being updated. As a result, current DPI traffic classification cannot keep up with the features of new services in the network, and therefore cannot identify new service traffic.

Summary of the invention

Existing DPI traffic classification cannot keep up with the features of new services in the network, which leads to low update efficiency and an accuracy that degrades easily. In view of this problem, the present invention provides a flow data classification method and device.

The invention provides a flow data classification method, comprising:

capturing data packets at a network aggregation port, reassembling the packets into flows according to a five-tuple to generate flow data, forming corresponding learning samples for each service category from the service types marked in advance on part of the flow data, and setting the remainder of the flow data as the flow data set to be classified;

extracting the common numerical attribute feature set of each piece of flow data in the flow data set, and organizing the flow data in the flow data set into flow records composed of the common numerical attribute feature sets;

according to the learning samples, calculating the common numerical attribute feature set of each service category in the flow records by subspace clustering, and marking the service type of the flow data in the flow data set according to the calculated feature set of each service category and the feature sets of the flow data in the flow records.

Preferably, the five-tuple comprises: source Internet Protocol (IP) address, destination IP address, source port, destination port and transport layer protocol.

Preferably, calculating, according to the learning samples, the common numerical attribute feature set of each service category in the flow records by subspace clustering specifically comprises:

Step 1, dividing the dimension of each common numerical attribute into the same number of region units and establishing a corresponding one-dimensional space for each common numerical attribute; sorting the region units by their coverage of the learning samples of a given service category; and using an entropy computation model to calculate the minimum coverage that a region unit must reach on the learning samples of that service category, this minimum coverage being used as the density threshold of the region units;

Step 2, according to the density threshold, deleting from each λ-dimensional subspace in the set of λ-dimensional subspaces the region units whose coverage is less than the density threshold, and summing the coverage of the remaining region units over the learning samples of the service category to obtain the coverage of the λ-dimensional subspace on the learning samples of that category, where λ ≥ 1;

Step 3, sorting the λ-dimensional subspaces in the current set by their coverage of the learning samples of the service category, and using a shortest code length computation model to determine the coverage that a λ-dimensional subspace in the current set must at least reach on those learning samples, this coverage being used as the learning sample coverage threshold of the λ-dimensional subspaces;

Step 4, deleting from the current set of λ-dimensional subspaces the subspaces whose coverage is less than the learning sample coverage threshold; for every two λ-dimensional subspaces in the current set, computing a (λ+1)-dimensional subspace only when the two differ in exactly one dimension attribute: first searching the region units each of them contains; if the region units of the two subspaces have identical unit numbers on all shared dimension attributes, intersecting the learning samples contained in the two region units; if the intersection is not empty, creating a new unit of the corresponding (λ+1)-dimensional subspace; and repeating this computation until all pairs of λ-dimensional subspaces have been processed;

Step 5, repeating step 2 to step 4 on the resulting set of (λ+1)-dimensional subspaces until a predetermined condition is met, and then performing step 6;

Step 6, selecting, from the set of subspaces with the largest number of dimensions, the subspace with the largest sample coverage; obtaining the expression of each corresponding cluster by a maximal region computation model; and representing the expressions of all clustering results in disjunctive normal form to obtain the common numerical attribute feature set of each service category.

Preferably, the predetermined condition is at least one of the following:

the current set of λ-dimensional subspaces cannot be combined into any subspace of dimension λ+1;

the new higher-dimensional subspaces obtained by combination contain no region unit whose coverage is greater than or equal to the density threshold;

the current subspaces already have the maximum number of dimensions;

in the set of λ-dimensional subspaces, no λ-dimensional subspace has a sample coverage greater than or equal to a predetermined value.

Preferably, dividing the dimension of each common numerical attribute into the same number of region units specifically comprises:

for the flow records, calculating the maximum and minimum values that each common numerical attribute feature of each service category can take, using the maximum and minimum as the value range of the common numerical attribute, and dividing the dimension of each common numerical attribute into the same number of region units according to the value range, where all region units are of equal length.

The present invention also provides a flow data classification device, comprising:

a capture and setting module, configured to capture data packets at a network aggregation port, reassemble the packets into flows according to a five-tuple to generate flow data, form corresponding learning samples for each service category from the service types marked in advance on part of the flow data, and set the remainder of the flow data as the flow data set to be classified;

an extraction and organization module, configured to extract the common numerical attribute feature set of each piece of flow data in the flow data set, and organize the flow data in the flow data set into flow records composed of the common numerical attribute feature sets;

a calculation and marking module, configured to calculate, according to the learning samples, the common numerical attribute feature set of each service category in the flow records by subspace clustering, and mark the service type of the flow data in the flow data set according to the calculated feature set of each service category and the feature sets of the flow data in the flow records.

Preferably, the calculation and marking module specifically comprises:

a first processing submodule, configured to divide the dimension of each common numerical attribute into the same number of region units and establish a corresponding one-dimensional space for each common numerical attribute; sort the region units by their coverage of the learning samples of a given service category; and use an entropy computation model to calculate the minimum coverage that a region unit must reach on the learning samples of that service category, this minimum coverage being used as the density threshold of the region units;

a second processing submodule, configured to delete, according to the density threshold, from each λ-dimensional subspace in the set of λ-dimensional subspaces the region units whose coverage is less than the density threshold, and sum the coverage of the remaining region units over the learning samples of the service category to obtain the coverage of the λ-dimensional subspace on the learning samples of that category, where λ ≥ 1;

a third processing submodule, configured to sort the λ-dimensional subspaces in the current set by their coverage of the learning samples of the service category, and use a shortest code length computation model to determine the coverage that a λ-dimensional subspace in the current set must at least reach on those learning samples, this coverage being used as the learning sample coverage threshold of the λ-dimensional subspaces;

a fourth processing submodule, configured to delete from the current set of λ-dimensional subspaces the subspaces whose coverage is less than the learning sample coverage threshold; for every two λ-dimensional subspaces in the current set, compute a (λ+1)-dimensional subspace only when the two differ in exactly one dimension attribute: first search the region units each of them contains; if the region units of the two subspaces have identical unit numbers on all shared dimension attributes, intersect the learning samples contained in the two region units; if the intersection is not empty, create a new unit of the corresponding (λ+1)-dimensional subspace; and repeat this computation until all pairs of λ-dimensional subspaces have been processed;

a fifth processing submodule, configured to call the second to fourth processing submodules on the resulting set of (λ+1)-dimensional subspaces until a predetermined condition is met, and then call the sixth processing submodule;

a sixth processing submodule, configured to select, from the set of subspaces with the largest number of dimensions, the subspace with the largest sample coverage; obtain the expression of each corresponding cluster by a maximal region computation model; and represent the expressions of all clustering results in disjunctive normal form to obtain the common numerical attribute feature set of each service category.

the current subspaces already have the maximum number of dimensions;

Preferably, the first processing submodule is specifically configured to:

The beneficial effects of the present invention are as follows:

Using a small amount of identified flow data as learning samples, flow data is classified by a subspace clustering method. This solves the prior-art problem that DPI traffic classification cannot keep up with the features of new services in the network, with the resulting low update efficiency and easily degraded accuracy. With the technical solution of the embodiments of the present invention, only the service types of a small amount of sample data need to be marked by hand; the remaining service traffic can then be classified without manual identification. This provides a sufficient number of valid samples for building the DPI feature library, greatly improves the efficiency of automatic DPI feature extraction and updating, and makes DPI more adaptable to the network environment.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:

Fig. 1 is a flow chart of the flow data classification method of an embodiment of the present invention;

Fig. 2 is a schematic diagram of the system structure of the flow data classification method in DPI mode of an embodiment of the present invention;

Fig. 3 is a flow chart of the detailed processing of the flow data classification method in DPI mode of an embodiment of the present invention;

Fig. 4 is a flow chart of calculating the density threshold in an embodiment of the present invention;

Fig. 5 is a flow chart of calculating the coverage threshold in an embodiment of the present invention;

Fig. 6 is a flow chart of obtaining the minimized-description disjunctive normal form in an embodiment of the present invention;

Fig. 7 is a schematic diagram of traffic classification applied to SDN-based security in an embodiment of the present invention;

Fig. 8 is a schematic diagram of traffic classification applied to conventional DPI-based detection in an embodiment of the present invention;

Fig. 9 is a structural schematic diagram of the flow data classification device of an embodiment of the present invention.

Detailed description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope fully conveyed to those skilled in the art.

To solve the prior-art problem that DPI traffic classification cannot keep up with the features of new services in the network, with the resulting low update efficiency and easily degraded accuracy, the present invention provides a method for improving the update efficiency of the DPI network traffic service feature library, namely a flow data classification method and device in DPI mode. With an offline classifier based on a subspace clustering method, only a small number of identified flow samples are needed to classify all network flow samples of unknown service type quickly and accurately, providing ample data samples for software products that automatically extract DPI feature library information. The embodiments of the present invention place simple requirements on equipment, achieve high precision in service traffic classification, and show good stability in application; they constitute a fast offline classifier for network traffic. The present invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.

Method embodiment

According to an embodiment of the present invention, a flow data classification method is provided. Fig. 1 is a flow chart of the flow data classification method of the embodiment. As shown in Fig. 1, the method comprises the following processing:

Step 101, capturing data packets at a network aggregation port, reassembling the packets into flows according to a five-tuple to generate flow data, forming corresponding learning samples for each service category from the service types marked in advance on part of the flow data, and setting the remainder of the flow data as the flow data set to be classified;

In step 101, the five-tuple comprises: source Internet Protocol (IP) address, destination IP address, source port, destination port and transport layer protocol.
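The reassembly in step 101 can be illustrated with a minimal Python sketch. The `Packet` record, its field names and the function `group_by_five_tuple` are hypothetical and introduced only for illustration; the sketch also assumes that each direction of a connection is treated as a separate flow, which the text does not specify.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical packet record; the field names are illustrative only.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str
    size: int  # packet length in bytes


def group_by_five_tuple(packets):
    """Group packets into flows keyed by the five-tuple."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flows[key].append(p)
    return dict(flows)


packets = [
    Packet("10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 60),
    Packet("10.0.0.1", "10.0.0.2", 5000, 80, "TCP", 1500),
    Packet("10.0.0.3", "10.0.0.2", 5001, 53, "UDP", 80),
]
flows = group_by_five_tuple(packets)
```

Per-flow numerical attributes (for example packet counts or byte totals) could then be computed from each list of packets to build the flow records.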

Step 102, extracting the common numerical attribute feature set of each piece of flow data in the flow data set, and organizing the flow data in the flow data set into flow records composed of the common numerical attribute feature sets;

Step 103, according to the learning samples, calculating the common numerical attribute feature set of each service category in the flow records by subspace clustering, and marking the service type of the flow data in the flow data set according to the calculated feature set of each service category and the feature sets of the flow data in the flow records.

In step 103, calculating, according to the learning samples, the common numerical attribute feature set of each service category in the flow records by subspace clustering specifically comprises:

In step 1, dividing the dimension of each common numerical attribute into the same number of region units specifically comprises: for the flow records, calculating the maximum and minimum values that each common numerical attribute feature of each service category can take, using the maximum and minimum as the value range of the common numerical attribute, and dividing the dimension of each common numerical attribute into the same number of region units according to the value range, where all region units are of equal length.

Step 5, repeating step 2 to step 4 on the resulting set of (λ+1)-dimensional subspaces until a predetermined condition is met, and then performing step 6; the predetermined condition is at least one of the following:

1. the current set of λ-dimensional subspaces cannot be combined into any subspace of dimension λ+1;

2. the new higher-dimensional subspaces obtained by combination contain no region unit whose coverage is greater than or equal to the density threshold;

3. the current subspaces already have the maximum number of dimensions;

4. in the set of λ-dimensional subspaces, no λ-dimensional subspace has a sample coverage greater than or equal to a predetermined value;

In the prior art, the application types of all flow sample data must be marked manually before a DPI feature library is built. This approach has two shortcomings: first, it requires a large amount of manual work and is prone to marking errors; second, the feature library cannot be updated in time. The embodiments of the present invention use an entropy model and a shortest code length computation model for pruning and propose a subspace clustering method that marks flow sample data automatically, which effectively reduces labor cost and improves update efficiency.

The technical solution of the embodiments of the present invention learns the clustering features of service traffic by subspace clustering and no longer relies on manually marking the service type of every sample, so the accuracy and efficiency of offline classification are higher. Moreover, when the feature rules that the subspace clustering algorithm obtains from a small amount of learning sample data with service labels are used to classify data without service labels, the precision can approach 100%, and service traffic data is kept from being affected by noise data. The computation uses the entropy model and the shortest code length computation model, so the parameters the algorithm depends on are interpretable and its results are stable. The technical solution helps a DPI classifier adapt to the network environment, learn new service features more promptly, and identify the corresponding service traffic.

The technical solution of the embodiments of the present invention is described in detail below with reference to the drawings.

Fig. 2 is a schematic diagram of the system structure of the flow data classification method in DPI mode of an embodiment of the present invention. It shows the input and output processing of the flow data classification method and the position within DPI of the offline classifier based on subspace clustering. Fig. 3 is a flow chart of the detailed processing of the method, Fig. 4 is a flow chart of calculating the density threshold, Fig. 5 is a flow chart of calculating the coverage threshold, and Fig. 6 is a flow chart of obtaining the minimized-description disjunctive normal form. As shown in Figs. 3 to 6, the processing specifically comprises the following:

(1) Full-packet capture of data packets is performed at a network aggregation port;

(2) The packets are reassembled into flows according to source IP, destination IP, source port, destination port and transport layer protocol to obtain flow samples, and the common numerical attribute feature set of each flow is calculated to form flow records; the attribute feature set is shown in Table 1;

Table 1

(3) From the flow records obtained in step (2), 3000 to 7000 flows are extracted by hand for each application service type (the service category mentioned above) and marked with their service type to form the corresponding learning samples; the remaining unmarked flow records are the flow records to be classified (corresponding to the flow data set to be classified mentioned above);

(4) For the common numerical attribute feature sets of the flows obtained in step (3), the maximum and minimum of each attribute are counted and used as the value range of that attribute, and the value range of every attribute is uniformly divided into equal-length regions to form the region set of the attribute. The number of region units is any value between 10000 and 15000, and in each attribute dimension the region units are numbered starting from 0;
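The equal-length division in step (4) can be sketched as follows. The function names `unit_index` and `partition` are hypothetical, and placing the attribute maximum in the last unit is an assumption, since the text does not say how the upper boundary is handled. A small unit count is used here so the result is easy to check; the text uses 10000 to 15000.

```python
def unit_index(value, lo, hi, n_units):
    """Map a value in [lo, hi] to an equal-width unit number in 0..n_units-1."""
    if hi == lo:
        return 0  # degenerate range: every value falls in unit 0
    idx = int((value - lo) / (hi - lo) * n_units)
    # Assumption: the maximum value is placed in the last unit.
    return min(max(idx, 0), n_units - 1)


def partition(records, n_units):
    """records: list of equal-length numeric tuples, one per flow record.

    Returns, for each record, the tuple of unit numbers per attribute.
    """
    dims = range(len(records[0]))
    lows = [min(r[d] for r in records) for d in dims]
    highs = [max(r[d] for r in records) for d in dims]
    return [
        tuple(unit_index(r[d], lows[d], highs[d], n_units) for d in dims)
        for r in records
    ]


records = [(0.0, 100.0), (5.0, 200.0), (10.0, 300.0)]
cells = partition(records, n_units=4)
```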

(5) Using the flow-record learning samples of one service category obtained in step (3) and the attribute region sets obtained in step (4), a corresponding one-dimensional space is established for each attribute, and the number of sample flows falling in each region of each attribute is counted as the coverage of that region on the learning samples of the category. The flow for calculating the density threshold of the region units is shown in Fig. 4 and specifically comprises the following:

Table 2

For the set of one-dimensional subspaces, the N grid units are sorted by sample coverage in descending order, and the entropy H is then calculated:

H = -\left( \sum_{n=1}^{k} p_{n} \log p_{n} + \sum_{n=k+1}^{N} p_{n} \log p_{n} \right) = -\left[ k\alpha \log \alpha + (N-k)\beta \log \beta \right] \qquad (1)

Initially all regions are regarded as dense regions and a first entropy is calculated. The units at the end of the ordering are then treated as non-dense units one at a time and the entropy is recalculated. The sample coverage p_{n} at which the entropy H reaches its minimum is the unit density threshold required by the algorithm.
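The search for the density threshold can be sketched in Python as below. This is an illustration under stated assumptions, not the patented implementation: α and β are not defined in this excerpt, so the sketch assumes α is the mean coverage of the k units treated as dense and β the mean coverage of the remaining N − k units, with coverage given as a fraction of the class's learning samples and 0·log 0 taken as 0. The function name `density_threshold` is hypothetical.

```python
import math


def _xlogx(x):
    # Convention assumed here: 0 * log(0) = 0.
    return x * math.log(x) if x > 0 else 0.0


def density_threshold(coverages):
    """Return the coverage of the last dense unit at the split minimising H.

    coverages: per-unit coverage of one class's learning samples (fractions).
    """
    p = sorted(coverages, reverse=True)
    n = len(p)
    best_h, best_k = None, n
    # k is the number of units treated as dense; start with all units dense
    # and move the trailing units to the non-dense group one at a time.
    for k in range(n, 0, -1):
        alpha = sum(p[:k]) / k                       # assumed meaning of alpha
        beta = sum(p[k:]) / (n - k) if k < n else 0.0  # assumed meaning of beta
        h = -(k * _xlogx(alpha) + (n - k) * _xlogx(beta))
        if best_h is None or h < best_h:
            best_h, best_k = h, k
    return p[best_k - 1]


threshold = density_threshold([0.4, 0.4, 0.05, 0.05, 0.05, 0.05])
```

In this example the entropy is smallest when the two units with coverage 0.4 are kept as dense, so units with coverage below 0.4 would be deleted in the next step.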

(6) Using the unit density threshold obtained in step (5), all intervals whose coverage is less than p_{n} are deleted from each current λ-dimensional subspace, and the coverage of the dense units is summed to obtain the coverage of each λ-dimensional subspace on the learning samples of the service category;

(7) All λ-dimensional subspaces whose sample coverage was calculated in step (6) are sorted in descending order of their coverage of the learning samples of the service category, and the coverage threshold of the set of λ-dimensional subspaces is then calculated with the shortest code length computation model. The flow for calculating the subspace sample coverage threshold is shown in Fig. 5 and is as follows:

CL(i) = \log_{2}(\mu_{I}(i)) + \sum_{1 \leq j \leq i} \log_{2}(|x_{sj} - \mu_{I}(i)|) + \log_{2}(\mu_{P}(i)) + \sum_{i+1 \leq j \leq m} \log_{2}(|x_{sj} - \mu_{P}(i)|) \qquad (4)

Here x_{sj} denotes the coverage of a subspace on the learning samples, m is the number of subspaces, i is the number of λ-dimensional subspaces used to compute the (λ+1)-dimensional subspaces, \mu_{I}(i) is the average sample coverage of all subspaces whose sample coverage is greater than or equal to the threshold, and \mu_{P}(i) is the average sample coverage of all subspaces whose sample coverage is less than the threshold. Initially the coverage of every subspace is taken to be above the threshold, which gives the initial code length CL(i); the subspaces with small coverage are then removed one at a time and the code length is recalculated. The coverage x_{si} at which the total code length is shortest is the subspace coverage threshold;
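A minimal Python sketch of the search over formula (4) follows. It assumes coverage is given as a sample count per subspace, that a term whose argument is zero contributes 0 bits (log2 of 0 is undefined and the text does not say how this case is treated), and that the \mu_{P} terms are omitted when no subspace is pruned. The function name `coverage_threshold` is hypothetical.

```python
import math


def _bits(v):
    # Assumption: a zero argument costs 0 bits, since log2(0) is undefined.
    return math.log2(v) if v > 0 else 0.0


def coverage_threshold(coverages):
    """Return the coverage x_si at the split i with the shortest code length.

    coverages: per-subspace coverage, given here as sample counts.
    """
    x = sorted(coverages, reverse=True)
    m = len(x)
    best_cl, best_i = None, m
    # Start with every subspace kept (i = m), then drop the smallest one by one.
    for i in range(m, 0, -1):
        kept, pruned = x[:i], x[i:]
        mu_i = sum(kept) / i
        cl = _bits(mu_i) + sum(_bits(abs(v - mu_i)) for v in kept)
        if pruned:
            mu_p = sum(pruned) / len(pruned)
            cl += _bits(mu_p) + sum(_bits(abs(v - mu_p)) for v in pruned)
        if best_cl is None or cl < best_cl:
            best_cl, best_i = cl, i
    return x[best_i - 1]


threshold = coverage_threshold([900, 880, 120, 100])
```

With these counts the code length is shortest when the two high-coverage subspaces are kept, so the threshold is the coverage of the second subspace and the two low-coverage subspaces would be deleted in step (8).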

(8) Using the coverage threshold x_{si} of the set of λ-dimensional subspaces obtained in step (7), all λ-dimensional subspaces whose coverage is less than the coverage threshold are deleted;

(9) Using all λ-dimensional subspaces that remain above the coverage threshold after step (8), construction of (λ+1)-dimensional subspaces is attempted. A (λ+1)-dimensional subspace is computed only when two λ-dimensional subspaces differ in exactly one dimension attribute. In that case the region units each of them contains are searched first; if the region units of the two subspaces have identical unit numbers on all shared dimension attributes, the learning samples contained in the two region units are intersected, and if the intersection is not empty a new unit of the corresponding (λ+1)-dimensional subspace is created;
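The join in step (9) can be sketched as below. The data layout is an assumption made for illustration: each subspace is keyed by a sorted tuple of dimension numbers and maps each region unit (a tuple of unit numbers aligned with those dimensions) to the set of learning-sample identifiers it contains. The function name `join_subspaces` is hypothetical.

```python
from itertools import combinations


def join_subspaces(subspaces):
    """Build (lambda+1)-dimensional units from pairs of lambda-dimensional subspaces.

    subspaces: {dims: {unit: set_of_sample_ids}}, where dims is a sorted tuple
    of dimension numbers and unit is a tuple of unit numbers aligned with dims.
    """
    result = {}
    for (d1, units1), (d2, units2) in combinations(subspaces.items(), 2):
        shared = set(d1) & set(d2)
        # Join only when the two subspaces differ in exactly one dimension.
        if len(d1) != len(d2) or len(shared) != len(d1) - 1:
            continue
        new_dims = tuple(sorted(set(d1) | set(d2)))
        for u1, s1 in units1.items():
            c1 = dict(zip(d1, u1))
            for u2, s2 in units2.items():
                c2 = dict(zip(d2, u2))
                # Unit numbers must agree on every shared dimension.
                if any(c1[d] != c2[d] for d in shared):
                    continue
                common = s1 & s2
                if common:  # a non-empty intersection creates a new unit
                    merged = {**c1, **c2}
                    new_unit = tuple(merged[d] for d in new_dims)
                    result.setdefault(new_dims, {}).setdefault(
                        new_unit, set()
                    ).update(common)
    return result


one_dim = {
    (0,): {(3,): {1, 2, 3}, (7,): {4}},
    (1,): {(5,): {2, 3, 4}},
}
two_dim = join_subspaces(one_dim)
```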

(10) Steps (6) to (9) are repeated, and the algorithm stops as soon as any one of the following conditions occurs:

1. the current set of λ-dimensional subspaces cannot be combined into any subspace of dimension λ+1;

2. the new higher-dimensional subspaces obtained by combination contain no dense unit (a dense unit is a region unit whose coverage is greater than or equal to the density threshold);

3. the current subspaces already have the maximum number of dimensions;

4. in the current set of λ-dimensional subspaces, no subspace has a sample coverage greater than or equal to 75%; this ensures that the higher-dimensional subspaces can reach a high sample coverage.

(11) From the set of highest-dimensional subspaces obtained in step (10), rules are extracted from the subspaces with larger coverage. The rules are expressed as minimized description expressions; the flow for computing them is shown in Fig. 6 and specifically comprises the following:

(11-1) Sets of dense units that are adjacent (sharing a face) are searched for in the λ-dimensional subspace. Each time, all face-sharing dense units are grouped into the same small set; if a pair of face-sharing dense units exists between two small sets, those sets are merged. Merging ends when the current set of adjacent dense units has no dense unit sharing a face with a dense unit of any other set. What is obtained at this point is a clustering result (cluster) in the k-dimensional subspace; one λ-dimensional subspace can contain several different clusters.
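Grouping face-sharing dense units amounts to finding connected components on the grid. The sketch below uses a depth-first search rather than the pairwise merging of small sets described above; both yield the same clusters. Representing each dense unit as a tuple of unit numbers, and the function name `find_clusters`, are assumptions made for illustration.

```python
def find_clusters(dense_units):
    """Group dense units that share a face into clusters.

    dense_units: set of tuples of unit numbers within one subspace.
    Two units share a face when they differ by exactly 1 in one dimension.
    """
    remaining = set(dense_units)
    clusters = []
    while remaining:
        seed = remaining.pop()
        component, stack = {seed}, [seed]
        while stack:
            u = stack.pop()
            for d in range(len(u)):
                for step in (-1, 1):
                    v = u[:d] + (u[d] + step,) + u[d + 1:]
                    if v in remaining:
                        remaining.remove(v)
                        component.add(v)
                        stack.append(v)
        clusters.append(component)
    return clusters


units = {(0, 0), (0, 1), (1, 1), (5, 5)}
found = find_clusters(units)
```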

(11-2) For each cluster, one unit is drawn at random each time. If this dense unit has not yet been processed, the largest rectangular region that covers it is sought starting from the unit, such that all units in the rectangular region are dense; these units are then marked as processed. Step (11-2) is repeated until all units have been processed.
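One way to realise the rectangle search is a greedy growth, one dimension at a time, sketched below. This is an assumption: the text does not state how the maximal region is found. For reproducibility the sketch picks the smallest unprocessed unit instead of a random one. The names `grow_region` and `cover` are hypothetical, and each region is returned as a (start unit per dimension, length per dimension) pair.

```python
from itertools import product


def grow_region(seed, cluster):
    """Greedily grow an axis-aligned region of dense units around seed."""
    lo, hi = list(seed), list(seed)

    def slab_is_dense(d, value):
        # Check the slab obtained by extending dimension d to `value`.
        ranges = [
            (value,) if k == d else range(lo[k], hi[k] + 1)
            for k in range(len(seed))
        ]
        return all(c in cluster for c in product(*ranges))

    for d in range(len(seed)):
        while slab_is_dense(d, lo[d] - 1):
            lo[d] -= 1
        while slab_is_dense(d, hi[d] + 1):
            hi[d] += 1
    return tuple(lo), tuple(hi)


def cover(cluster):
    """Cover a cluster with regions; returns a list of (start, lengths)."""
    todo, regions = set(cluster), []
    while todo:
        seed = min(todo)  # deterministic stand-in for the random draw
        lo, hi = grow_region(seed, cluster)
        regions.append((lo, tuple(h - l + 1 for l, h in zip(lo, hi))))
        # Mark every unit inside the region as processed.
        todo -= set(product(*(range(l, h + 1) for l, h in zip(lo, hi))))
    return regions


l_shape = {(0, 0), (1, 0), (2, 0), (0, 1)}
regions = cover(l_shape)
```

For the L-shaped cluster above, two overlapping regions are produced: a 3×1 strip and a 1×2 strip, both starting at unit (0, 0).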

(11-3) the cluster expression formula finally obtained is as follows:

\langle (ID_{a1}, u_{a1}, len_{a1}), \ldots, (ID_{aq}, u_{aq}, len_{aq}) \rangle \vee \langle (ID_{b1}, u_{b1}, len_{b1}), \ldots \rangle \vee \ldots, \quad 1 \leq q \leq \lambda

Here ID_{aq} denotes the number of the dimension of the starting dense unit, u_{aq} denotes the number of the interval to which that dense unit belongs in this dimension, len_{aq} denotes the total number of intervals that the corresponding maximal covering region contains in this dimension, and q is the index of the dimension. These expressions are the extracted service rules; they allow sample data without service labels to be assigned accurately to service classes and meet the needs of extracting DPI common field features.
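Applying such a rule to an unlabelled flow record can be sketched as follows. The sketch assumes that each triple (ID, u, len) covers unit numbers u through u + len − 1 in dimension ID, that a record has already been mapped to unit numbers per dimension, and that a record matching no rule is left as "unknown". The names `matches` and `classify` and the example rules are hypothetical.

```python
def matches(rule, cell):
    """Evaluate a disjunctive-normal-form rule on one record.

    rule: list of regions; each region is a list of (dim_id, start_unit, length).
    cell: {dim_id: unit_number} for the record being classified.
    """
    return any(
        all(
            dim in cell and start <= cell[dim] < start + length
            for dim, start, length in region
        )
        for region in rule
    )


def classify(cell, rules, default="unknown"):
    """Return the first service type whose rule covers the record."""
    for service, rule in rules.items():
        if matches(rule, cell):
            return service
    return default


# Made-up rules for two made-up service types.
rules = {
    "video": [[(0, 10, 3), (2, 5, 1)]],
    "web": [[(0, 0, 2)], [(1, 7, 2)]],
}
label = classify({0: 11, 1: 0, 2: 5}, rules)
```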

Fig. 7 is a schematic diagram of a traffic classification application based on SDN security according to an embodiment of the present invention. As shown in Fig. 7, the security detection application based on an SDN forwarding system operates as follows: the SDN controller issues a sampling instruction; the SDN switch samples traffic and sends it to the DPI detection system; and the FPCLIQUE subspace clustering service classification system of the embodiment of the present invention helps the DPI detection system perform service classification, so that traffic detection can be carried out better. For abnormal traffic, the DPI detection system reports back to the controller, and the controller issues a corresponding forwarding policy, such as dropping, monitoring, or cleaning.

Fig. 8 is a schematic diagram of a traffic classification application based on traditional DPI detection according to an embodiment of the present invention. As shown in Fig. 8, in the security detection application based on a conventional DPI system, traffic entering the DPI system passes through the DPI detection device, and the policy of the DPI device uses the FPCLIQUE subspace clustering service classification system of the embodiment of the present invention, which helps the DPI detection system perform service classification so that traffic detection can be carried out better. For abnormal traffic, the DPI device applies a corresponding forwarding policy, such as dropping, monitoring, or cleaning.

Device embodiment

According to an embodiment of the present invention, a flow data classification device is provided. Fig. 9 is a structural diagram of the flow data classification device of the embodiment of the present invention. As shown in Fig. 9, the flow data classification device according to the embodiment of the present invention comprises a collection and setting module 90, an extraction and arrangement module 92, and a calculation and marking module 94. Each module of the embodiment of the present invention is described in detail below.

The collection and setting module 90 is configured to collect data packets at a network aggregation port, reassemble the data packets into flows according to the five-tuple to generate flow data, form corresponding learning samples for each service type according to service type marks made in advance on part of the flow data, and set the remaining part of the flow data as the flow data set to be classified. The five-tuple comprises: source Internet Protocol (IP) address, destination IP address, source port, destination port, and transport layer protocol.

The extraction and arrangement module 92 is configured to extract the public numerical attribute feature set of each piece of flow data in the flow data set, and to arrange the flow data in the flow data set into flow records composed of the public numerical attribute feature sets.

The calculation and marking module 94 is configured to calculate, according to the learning samples and by means of subspace clustering, the public numerical attribute feature set of each service type in the flow records, and to mark the service type of the flow data in the flow data set according to the calculated public numerical attribute feature set of each service type and the public numerical attribute feature sets of the flow data in the flow records.
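The work of the collection and setting module 90 and the extraction and arrangement module 92 can be sketched roughly as follows. This is a minimal illustration, not the device itself: the `Packet` type, the function names, and the three features in `flow_record` (packet count, mean size, maximum size) are hypothetical stand-ins, since the concrete public numerical attributes are not listed in this passage.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str   # transport layer protocol, e.g. "TCP"
    size: int    # packet length in bytes (illustrative attribute)


def reassemble_flows(packets):
    """Group packets into flows keyed by the five-tuple."""
    flows = defaultdict(list)
    for p in packets:
        key = (p.src_ip, p.dst_ip, p.src_port, p.dst_port, p.proto)
        flows[key].append(p)
    return dict(flows)


def flow_record(flow):
    """Arrange one flow into a record of numerical attributes (hypothetical features)."""
    sizes = [p.size for p in flow]
    return {
        "pkt_count": len(sizes),
        "mean_size": sum(sizes) / len(sizes),
        "max_size": max(sizes),
    }


def split_labeled(flows, labels):
    """Pre-marked flows become learning samples per service type; the rest await classification."""
    learning, to_classify = defaultdict(list), {}
    for key, flow in flows.items():
        if key in labels:
            learning[labels[key]].append(flow_record(flow))
        else:
            to_classify[key] = flow_record(flow)
    return dict(learning), to_classify
```

Here `labels` maps the five-tuples of the manually marked flows to their service types; every flow not present in `labels` ends up in the set to be classified.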

The calculation and marking module 94 specifically comprises:

A first processing submodule, configured to divide the dimension of each public numerical attribute into the same number of region units, establish a corresponding one-dimensional space for each public numerical attribute, sort the region units according to their coverage rates of the learning samples of a given service type, calculate, by means of an entropy computation model, the minimum coverage rate that a region unit should reach for the learning samples of the given service type, and use the minimum coverage rate as the density threshold of the region units. The first processing submodule is specifically configured to:

A fifth processing submodule, configured to invoke, according to all of the obtained (λ+1)-dimensional subspace sets, the second through fourth processing submodules repeatedly until a predetermined condition is met, and then to invoke the sixth processing submodule. Preferably, the predetermined condition is that at least one of the following conditions is met:

3. The number of dimensions of the current subspace has reached the maximum;

4. For the set of λ-dimensional subspaces, no λ-dimensional subspace currently has a sample coverage rate greater than or equal to a predetermined value.
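The operations that the fifth processing submodule repeats (equal-length partitioning, pruning by the density threshold, and combining two λ-dimensional subspaces into a (λ+1)-dimensional one) can be sketched in Python as below. This is a simplified illustration under stated assumptions: the density threshold is passed in as a parameter because the entropy computation model is not reproduced here, samples are plain tuples of attribute values, and all function names are hypothetical.

```python
def bin_index(value, lo, hi, n_units):
    """Map a value in [lo, hi] to one of n_units equal-length region units."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_units)
    return min(idx, n_units - 1)  # the maximum value falls into the last unit


def one_dim_units(samples, dim, n_units):
    """One-dimensional subspace for attribute `dim`: region unit -> set of sample ids."""
    values = [s[dim] for s in samples]
    lo, hi = min(values), max(values)
    units = {}
    for sid, v in enumerate(values):
        units.setdefault((bin_index(v, lo, hi, n_units),), set()).add(sid)
    return units


def prune(units, n_samples, density_threshold):
    """Drop units below the density threshold; sum the rest into the subspace coverage rate."""
    kept = {u: ids for u, ids in units.items()
            if len(ids) / n_samples >= density_threshold}
    coverage = sum(len(ids) for ids in kept.values()) / n_samples
    return kept, coverage


def join(dims_a, units_a, dims_b, units_b):
    """Combine two subspaces that differ in exactly one dimension; None otherwise."""
    shared = [d for d in dims_a if d in dims_b]
    if len(dims_a) != len(dims_b) or len(shared) != len(dims_a) - 1:
        return None
    new_dims = tuple(sorted(set(dims_a) | set(dims_b)))
    new_units = {}
    for ua, ids_a in units_a.items():
        pos_a = dict(zip(dims_a, ua))
        for ub, ids_b in units_b.items():
            pos_b = dict(zip(dims_b, ub))
            # unit numbers must agree on every shared dimension
            if all(pos_a[d] == pos_b[d] for d in shared):
                common = ids_a & ids_b  # intersection of the learning samples
                if common:
                    merged = {**pos_a, **pos_b}
                    new_units[tuple(merged[d] for d in new_dims)] = common
    return new_dims, new_units
```

In a full loop, `prune` and `join` would be applied to every pair of surviving subspaces at dimension λ, and the loop would stop once one of the predetermined conditions holds.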

In summary, taking a small amount of identified flow data as learning samples, the embodiments classify flow data by means of subspace clustering. This solves the problem in the prior art that DPI traffic classification technology cannot be updated in time for new traffic service features in the network, which leads to low update efficiency and an accuracy rate that degrades easily. Only the service types of a small amount of sample data need to be marked manually, after which the remaining service traffic can be classified without manual identification. A sufficient number of valid samples can thus be provided for building a DPI feature library, which greatly improves the efficiency of current automatic DPI feature extraction and updating and gives it stronger adaptability to the network environment.

Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.

Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the client of an embodiment can be adaptively changed and arranged in one or more clients different from that embodiment. The modules in an embodiment may be combined into one module, and may furthermore be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or client so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Furthermore, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components in the client loaded with sequenced network addresses according to embodiments of the invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not denote any order. These words may be interpreted as names.

Claims

1. A flow data classification method, characterized by comprising:

collecting data packets at a network aggregation port, reassembling the data packets into flows according to a five-tuple to generate flow data, forming corresponding learning samples for each service type according to service type marks made in advance on part of the flow data, and setting the remaining part of the flow data as a flow data set to be classified;

extracting the public numerical attribute feature set of each piece of flow data in the flow data set, and arranging the flow data in the flow data set into flow records composed of the public numerical attribute feature sets;

calculating, according to the learning samples and by means of subspace clustering, the public numerical attribute feature set of each service type in the flow records, and marking the service type of the flow data in the flow data set according to the calculated public numerical attribute feature set of each service type and the public numerical attribute feature sets of the flow data in the flow records.

2. The method of claim 1, characterized in that the five-tuple comprises: source Internet Protocol (IP) address, destination IP address, source port, destination port, and transport layer protocol.

3. The method of claim 1, characterized in that calculating, according to the learning samples and by means of subspace clustering, the public numerical attribute feature set of each service type in the flow records specifically comprises:

Step 1: dividing the dimension of each public numerical attribute into the same number of region units, and establishing a corresponding one-dimensional space for each public numerical attribute; sorting the region units according to their coverage rates of the learning samples of a given service type; calculating, by means of an entropy computation model, the minimum coverage rate that a region unit should reach for the learning samples of the given service type; and using the minimum coverage rate as the density threshold of the region units;

Step 2: according to the density threshold, in the set of λ-dimensional subspaces, deleting from each λ-dimensional subspace the region units whose coverage rate is less than the density threshold, and summing the coverage rates, for the learning samples of a service type, of the region units remaining in the λ-dimensional subspace to obtain the coverage rate of the λ-dimensional subspace for the learning samples of that service type, where λ ≥ 1;

Step 4: in the current set of λ-dimensional subspaces, deleting the subspaces whose coverage rate is less than the learning sample coverage rate threshold; for every two λ-dimensional subspaces in the current set of λ-dimensional subspaces, performing the computation of a (λ+1)-dimensional subspace only when the two differ in the attribute of exactly one dimension; in that case, first searching the region units that each of them contains; if the region unit numbers of two region units from the two different subspaces are identical in all shared dimension attributes, taking the intersection of the learning samples contained in the two region units; if the intersection is not empty, creating a new unit of the corresponding (λ+1)-dimensional subspace; and repeatedly computing new units of (λ+1)-dimensional subspaces until all pairs of λ-dimensional subspaces have been processed;

4. The method of claim 3, characterized in that the predetermined condition is that at least one of the following conditions is met:

the new higher-dimensional subspace obtained after combination has no region unit whose coverage rate is greater than or equal to the density threshold;

the number of dimensions of the current subspace has reached the maximum;

5. The method of claim 3, characterized in that dividing the dimension of each public numerical attribute into the same number of region units specifically comprises:

for the flow records, calculating the maximum and minimum values that the public numerical attribute features of each service type can take, and using the maximum and minimum values as the value range of the public numerical attribute; and, according to the value range, dividing the dimension of each public numerical attribute into the same number of region units, wherein all region units are of equal length.

6. A flow data classification device, characterized by comprising:

a collection and setting module, configured to collect data packets at a network aggregation port, reassemble the data packets into flows according to a five-tuple to generate flow data, form corresponding learning samples for each service type according to service type marks made in advance on part of the flow data, and set the remaining part of the flow data as a flow data set to be classified;

an extraction and arrangement module, configured to extract the public numerical attribute feature set of each piece of flow data in the flow data set, and to arrange the flow data in the flow data set into flow records composed of the public numerical attribute feature sets;

a calculation and marking module, configured to calculate, according to the learning samples and by means of subspace clustering, the public numerical attribute feature set of each service type in the flow records, and to mark the service type of the flow data in the flow data set according to the calculated public numerical attribute feature set of each service type and the public numerical attribute feature sets of the flow data in the flow records.

7. The device of claim 6, characterized in that the five-tuple comprises: source Internet Protocol (IP) address, destination IP address, source port, destination port, and transport layer protocol.

8. The device of claim 6, characterized in that the calculation and marking module specifically comprises:

a first processing submodule, configured to divide the dimension of each public numerical attribute into the same number of region units, establish a corresponding one-dimensional space for each public numerical attribute, sort the region units according to their coverage rates of the learning samples of a given service type, calculate, by means of an entropy computation model, the minimum coverage rate that a region unit should reach for the learning samples of the given service type, and use the minimum coverage rate as the density threshold of the region units;

a second processing submodule, configured to delete, according to the density threshold and in the set of λ-dimensional subspaces, the region units of each λ-dimensional subspace whose coverage rate is less than the density threshold, and to sum the coverage rates, for the learning samples of a service type, of the region units remaining in the λ-dimensional subspace to obtain the coverage rate of the λ-dimensional subspace for the learning samples of that service type, where λ ≥ 1;

a fourth processing submodule, configured to: in the current set of λ-dimensional subspaces, delete the subspaces whose coverage rate is less than the learning sample coverage rate threshold; for every two λ-dimensional subspaces in the current set of λ-dimensional subspaces, perform the computation of a (λ+1)-dimensional subspace only when the two differ in the attribute of exactly one dimension; in that case, first search the region units that each of them contains; if the region unit numbers of two region units from the two different subspaces are identical in all shared dimension attributes, take the intersection of the learning samples contained in the two region units; if the intersection is not empty, create a new unit of the corresponding (λ+1)-dimensional subspace; and repeatedly compute new units of (λ+1)-dimensional subspaces until all pairs of λ-dimensional subspaces have been processed;

9. The device of claim 8, characterized in that the predetermined condition is that at least one of the following conditions is met:

the number of dimensions of the current subspace has reached the maximum;

10. The device of claim 8, characterized in that the first processing submodule is specifically configured to: