A parallel network traffic classification method based on ontology knowledge reasoning
Technical field
The present invention relates to the technical field of network management, and specifically to a parallel network traffic classification method based on ontology knowledge reasoning.
Background art
With the rapid development of Web technologies and the growing demand for enterprise informatization, many new network application models and requirements have emerged. The resulting network traffic data has grown explosively, bringing unprecedented challenges to network supervision and strengthening users' demand for fine-grained management of network traffic. As a key technology for managing and optimizing network resources, network traffic classification is widely used in network monitoring, QoS (Quality of Service) management, network security, trend analysis and other fields, and is an important link in efficiently realizing network management, traffic control and security detection.
Network traffic classification refers to classifying, in an Internet based on the TCP/IP protocol, the bidirectional TCP or UDP flows generated by network communication according to network application type (such as WWW, FTP, MAIL and P2P).
In recent years many researchers have turned to machine learning classification methods based on the statistical features of network flows. These methods classify traffic by applying machine learning to statistics of certain flow attributes (such as average packet length and average inter-packet time), and are not affected by dynamic ports, payload encryption or network address translation. The machine learning methods most widely used for network traffic classification at present are Bayesian methods, neural networks, support vector machines and decision trees.
The network traffic classification research of Moore at the University of Cambridge has mainly studied naive Bayes (NB) and its improved methods. Charalampos Rotsos, Moore et al. introduced a semi-supervised traffic classification method to train classifiers, modeling the classifier with two algorithms, NB and NB with kernel estimation; experimental results showed that this method achieves higher classification performance than conventional methods. However, such algorithms are learning methods based on probability statistics; they depend heavily on the distribution of the sample space and are potentially unstable.
Network traffic classification methods using feedforward neural networks effectively eliminate the drawbacks of port-based and payload-based classification. Tests have verified that they are more stable and robust than NB and show good performance and prospects in network traffic classification. However, even the BP algorithm, the most widely applied neural network algorithm, exposes many defects in practice: it easily falls into a local minimum and fails to reach the global optimum, and the large number of training iterations leads to low learning efficiency and slow convergence.
SVM classification algorithms that obtain network flow parameters from packet headers and then compare biased and unbiased training have high computational complexity and slow training speed when processing large sample sets. Classifying network traffic with SVM decision trees addresses the unrecognizable regions and long training time of SVM traffic classification. However, this research still cannot thoroughly solve the computational performance bottleneck, and the method is a supervised learning method that cannot readily discover new applications in network traffic.
To avoid inspecting packet payloads, Wei Li and Moore extracted 12 statistical features from the initial packets of a network flow while also taking latency and throughput into account; classification accuracy reached 99.8% with the C4.5 decision tree traffic classification method. Tomasz Bujlow et al. proposed a method using the C5.0 machine learning algorithm, and experiments verified that its average classification accuracy reached 99.3-99.9%. However, decision trees lack scalability: when processing large data sets, the overhead of the classification algorithm tends to grow and classification accuracy falls.
In high-speed, large-scale, complex network environments, each sensor network node collects packets with a different traffic acquisition system, so network traffic data differ in format, semantics and syntax. Current network traffic data are therefore multi-source, heterogeneous and massive. Most existing traffic classification technologies can only apply simple formatting to traffic data. They lack an effective way to handle heterogeneous data (heterogeneous in format, syntax and semantics), and lack description of and knowledge reasoning over traffic information (such as the acquisition environment). The collected traffic data are inconsistent, cannot be shared, and lack traffic classification knowledge, so existing traffic classification methods can hardly provide the resource information needed for network management decision analysis.
In the field of artificial intelligence, ontologies are increasingly applied to knowledge engineering, intelligent information integration, data mining, and the organization and processing of massive information. Ontologies provide an effective way to describe resources in a standardized, unambiguous and extensible manner, and offer generality, openness, intelligence, accuracy and comprehensiveness in describing resources. As a knowledge representation tool, ontologies are also used in decision support systems (DSS), where knowledge reasoning is an important function of the ontology, and they have been applied to classification problems (such as image classification).
In recent years researchers have tried to introduce ontologies into network traffic classification. Marcin Pietrzyk made the first attempt to formally define flow classes and, following classical ontology development guidelines, iteratively built a class taxonomy based on ontology instances, aiming to eliminate ambiguity in traffic class definitions. Chengjie Gu et al. proposed an online self-learning traffic classification framework based on flow profiles and ontology, which classifies traffic through the mapping between flow profiles and traffic classes. However, current ontology-based traffic classification methods cannot yet be applied to large-scale complex networks, and the application of ontologies to network traffic classification is still at an early stage.
Cloud computing is a data-centric, data-intensive supercomputing technology that processes and analyzes large data sets and provides users with efficient services; it is characterized by parallelization, virtualization and on-demand service. Its parallel processing technology MapReduce provides ample parallel computing semantics for large-scale data processing problems that can be partitioned, and is widely accepted. Cloud computing thus offers a new way to handle massive data in network traffic classification. Combining ontology with cloud computing for network traffic classification exploits the strengths of each in describing and processing massive heterogeneous data: the ontology provides consistent description and information management of network traffic information resources, while cloud computing provides storage and computing resources for building the ontology and managing information.
Summary of the invention
The purpose of the present invention is to disclose a parallel network traffic classification method based on ontology knowledge reasoning which, for the network flow instances in a large-scale network traffic ontology, classifies network traffic by combining a machine learning method with ontology knowledge reasoning.
In the parallel network traffic classification method based on ontology knowledge reasoning designed by the present invention, a multi-layer network traffic ontology is constructed from the Internet traffic collection environment and the information resources of the traffic, each network flow in the Internet corresponds to one network flow instance in the network traffic ontology, and network traffic is classified by the following steps:
I. Establish the decision tree classification model and generate the inference rule set
Network flows are chosen from the Internet as samples, and the network flow samples labeled with application types form the network flow training sample set. The training sample set is trained with a decision tree algorithm to establish the decision tree classification model of network traffic, and the decision tree classification model is converted into an inference rule set;
II. Classify network flow instances in parallel by knowledge reasoning
The inference rule set generated in step I is built into a corresponding inference engine using the Jena toolkit. For the constructed network traffic ontology, the inference engine is invoked through the MapReduce parallel computing framework to perform parallel knowledge reasoning, that is, to mine the correspondence between network flow instances and network application types in the network traffic ontology, label each network flow instance with its network application type, and complete the network traffic classification. The Jena toolkit is a toolkit for ontology construction and reasoning, an open-source Java-based Semantic Web toolkit developed by Hewlett-Packard in 2004.
Each step is described in detail below.
Step I specifically comprises the following sub-steps:
I-1. The network flow training sample set labeled with application types is trained with a decision tree algorithm to establish the decision tree classification model of network traffic. The set A = {a1, a2, ..., ai} denotes the set of i statistical feature values of the network flows in the training sample set; the set T = {t1, t2, ..., tj} denotes the set of j application types to which the network flows in the training sample set belong; the set V = {v1, v2, ..., vk} denotes the set of k decision reference values, which are computed by the decision tree algorithm from the elements of set A and serve as the basis for choosing a decision path in the decision tree;
I-2. Every path from the root node to a leaf node in the decision tree classification model of network traffic is regarded as a classification path. Based on the decision reference values, each classification path in the model is converted into an "if-then", i.e. "IF-THEN", structure, establishing a network traffic classification model with IF-THEN structure;
I-3. The IF-THEN network traffic classification model established in step I-2 is described in the inference rule syntax of the Jena toolkit, generating the inference rule set.
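The conversion in sub-step I-2 can be illustrated with a minimal sketch. It assumes a binary tree whose internal nodes each test one feature from set A against one reference value from set V; the class names, feature names, thresholds and application types below are hypothetical and are not taken from the trained model.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    app_type: str  # an element of set T


@dataclass
class Node:
    feature: str                 # an element of set A
    threshold: float             # an element of set V
    low: Union["Node", Leaf]     # branch taken when feature <= threshold
    high: Union["Node", Leaf]    # branch taken when feature > threshold


Condition = Tuple[str, str, float]   # (feature, operator, threshold)
Rule = Tuple[List[Condition], str]   # (conditions, application type)


def tree_to_rules(tree, path=()) -> List[Rule]:
    """Enumerate every root-to-leaf path as one IF-THEN rule."""
    if isinstance(tree, Leaf):
        return [(list(path), tree.app_type)]
    rules = tree_to_rules(tree.low, path + ((tree.feature, "<=", tree.threshold),))
    rules += tree_to_rules(tree.high, path + ((tree.feature, ">", tree.threshold),))
    return rules


def format_rule(conditions: List[Condition], app_type: str) -> str:
    body = " AND ".join(f"{f} {op} {v:g}" for f, op, v in conditions)
    return f"IF {body} THEN application_type = {app_type}"


# Hypothetical toy tree; the thresholds are made up for illustration.
toy_tree = Node(
    "server_port", 1024,
    low=Node("fwd_push_count", 3, low=Leaf("MAIL"), high=Leaf("WWW")),
    high=Leaf("P2P"),
)

rules = tree_to_rules(toy_tree)
for conditions, app_type in rules:
    print(format_rule(conditions, app_type))
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaf nodes in the tree.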
Step II specifically comprises the following sub-steps:
II-1. The inference rule set generated in step I is built into a corresponding inference engine using the Jena toolkit;
II-2. According to the performance of each computing node and the data scale of the network flow instances described in the network traffic ontology, the constructed network traffic ontology is split into multiple network traffic ontology fragments; the fragments are uploaded to the Hadoop distributed file system, and each fragment is assigned an identifier;
II-3. Multiple MapReduce map (Map) functions are started, and key-value pairs of the form <network traffic ontology fragment identifier, network traffic ontology fragment> are input to the map functions;
II-4. Each map function uses the inference engine built in step II-1 to perform knowledge reasoning over its network traffic ontology fragment, obtaining the network application type label of every network flow instance in the fragment;
II-5. Key-value pairs of the form <network application type label, network flow instance> are output to the reduce function;
II-6. The reduce function merges network flow instances by network application type label, forming the classified network flow instance sets;
II-7. The classified network flow instance sets are output, completing the network traffic classification.
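The data flow of sub-steps II-3 to II-7 can be sketched as follows. This is a single-process simulation only, not Hadoop code: the function `classify` is a hypothetical stand-in for the Jena inference engine of step II-1, and the fragment contents and port-based rules are invented for illustration.

```python
from collections import defaultdict


def classify(instance):
    """Hypothetical stand-in for the inference engine built in step II-1."""
    if instance["server_port"] == 80:
        return "WWW"
    if instance["server_port"] == 25:
        return "MAIL"
    return "P2P"


def map_fragment(fragment_id, fragment):
    """Map: reason over one ontology fragment and emit <label, instance> pairs."""
    for instance in fragment:
        yield classify(instance), instance


def reduce_by_label(pairs):
    """Reduce: merge network flow instances that share an application type label."""
    groups = defaultdict(list)
    for label, instance in pairs:
        groups[label].append(instance)
    return dict(groups)


# Input key-value pairs: <fragment identifier, fragment> (toy data).
fragments = {
    "O1": [{"id": "f1", "server_port": 80}, {"id": "f2", "server_port": 25}],
    "O2": [{"id": "f3", "server_port": 80}, {"id": "f4", "server_port": 6881}],
}

# In a real cluster each fragment would be handled by a separate map task.
emitted = [pair for fid, frag in fragments.items() for pair in map_fragment(fid, frag)]
classified = reduce_by_label(emitted)
```

Because each map call depends only on its own fragment, the fragments can be processed on different computing nodes without coordination until the reduce stage.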
Compared with the prior art, the advantages of the parallel network traffic classification method based on ontology knowledge reasoning of the present invention are: 1. MapReduce, a parallel processing technology for large-scale data sets, is introduced, so cloud computing can serve as the storage and computing resource for knowledge reasoning over the network traffic ontology, providing users with efficient services featuring parallelization, virtualization and on-demand service; 2. network flow instances are classified in parallel by knowledge reasoning, which effectively improves classification efficiency, and adding a suitable number of computing nodes speeds up classification; 3. by combining a machine learning method with ontology knowledge reasoning and constructing an inference rule set, the network flow instances in the network traffic ontology are classified directly and effectively.
Description of the drawings
Fig. 1 is the overall framework diagram of an embodiment of the parallel network traffic classification method based on ontology knowledge reasoning;
Fig. 2 is the architecture diagram of step II of the embodiment of the parallel network traffic classification method based on ontology knowledge reasoning;
Fig. 3 is a comparison curve of knowledge reasoning classification time in the stand-alone environment and the cluster environment for the embodiment of the parallel network traffic classification method based on ontology knowledge reasoning;
Fig. 4 is the speed-up ratio curve of the embodiment of the parallel network traffic classification method based on ontology knowledge reasoning for different data scales and cluster environments with different numbers of nodes.
Detailed description of the embodiments
This embodiment of the parallel network traffic classification method based on ontology knowledge reasoning uses the data set collected and published by the team of Professor Moore at the University of Cambridge as its network traffic information resource, referred to in this example as the Moore data set. The Moore data set used in this example contains 377526 network flow samples. Each sample is a complete bidirectional Transmission Control Protocol (TCP) flow with 248 network flow statistical features, consisting of basic attributes such as the source and destination port numbers of the flow and statistical attributes such as the mean inter-arrival time of packets; the last item is the label of the application type to which the flow belongs.
This example chooses 12 network application types in the Moore data set as classification targets: World Wide Web (WWW), games (Games), services (Service), mail (Mail), attacks (Attack), database (Database), interactive (Interactive), File Transfer Protocol control (FTP-Control), File Transfer Protocol passive connection (FTP-Pasv), File Transfer Protocol data (FTP-Data), multimedia (Multimedia) and peer-to-peer (P2P). Ten network flow statistical features are chosen as the basis for knowledge reasoning: the server port number; the client port number; the total bytes of data carried in the forwarded forward-direction packets; the total bytes of data carried in the forwarded reverse-direction packets; the total number of push (PUSH) flags in the TCP headers of all forward-direction packets; the total number of push (PUSH) flags in the TCP headers of all reverse-direction packets; the total number of finish (FIN) flags in the TCP headers of all forward-direction packets; the total number of finish (FIN) flags in the TCP headers of all reverse-direction packets; the total bytes of the initial windows of all forward-direction packets; and the total bytes of the initial windows of all reverse-direction packets.
For greater objectivity, this example splits the Moore data set into two parts, used respectively as the training sample set and the test sample set of this example; 3000 samples are randomly selected from the training sample set as training samples, and 300000 samples are randomly selected from the test sample set as test samples.
The overall framework of the embodiment of the parallel network traffic classification method based on ontology knowledge reasoning is shown in Fig. 1. This example constructs a multi-layer network traffic ontology from the Moore data set, with each network flow in the test samples of the Moore data set corresponding to one network flow instance in the network traffic ontology. The network flow training samples labeled with application types are trained with a decision tree algorithm to establish the decision tree classification model of network traffic, the model is converted into an inference rule set, and the rule set is built into a corresponding inference engine using the Jena toolkit. For the constructed network traffic ontology, the inference engine is invoked through the MapReduce parallel computing framework to perform parallel knowledge reasoning, that is, to mine the correspondence between network flow instances and network application types in the network traffic ontology, label each network flow instance with its network application type, and complete the network traffic classification.
I. Establish the decision tree classification model and generate the inference rule set
I-1. The training sample set of this example is trained with the decision tree algorithm bundled with the machine learning and data mining software Weka 3.7.10, establishing the decision tree classification model of network traffic. In this example the set A denotes the set of statistical feature values of the network flows in the training sample set: A = {server port number, client port number, total bytes of data carried in the forwarded forward-direction packets, total bytes of data carried in the forwarded reverse-direction packets, total number of push (PUSH) flags in the TCP headers of all forward-direction packets, total number of push (PUSH) flags in the TCP headers of all reverse-direction packets, total number of finish (FIN) flags in the TCP headers of all forward-direction packets, total number of finish (FIN) flags in the TCP headers of all reverse-direction packets, total bytes of the initial windows of all forward-direction packets, total bytes of the initial windows of all reverse-direction packets}. The set T denotes the set of application types to which the network flows in the training sample set belong: T = {WWW, games, services, mail, attacks, database, interactive, File Transfer Protocol control, File Transfer Protocol passive connection, File Transfer Protocol data, multimedia, peer-to-peer}. The set V = {v1, v2, ..., vk} denotes the set of k decision reference values, which are computed by the decision tree algorithm from the elements of set A and serve as the basis for choosing a decision path in the decision tree.
I-2. Every path from the root node to a leaf node in the decision tree classification model of network traffic is regarded as a classification path. Based on the decision reference values, each classification path in the model is converted into an "if-then", i.e. "IF-THEN", structure, establishing a network traffic classification model with IF-THEN structure;
I-3. The IF-THEN network traffic classification model established in step I-2 is described in the inference rule syntax of the Jena toolkit, generating the inference rule set.
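As an illustration of sub-step I-3, the sketch below renders one IF-THEN rule as a string in the rule syntax of Jena's general-purpose rule engine, which has the form `[name: body -> head]` and offers comparison builtins such as `le` and `greaterThan`. The prefix `ex:`, the property names `serverPort`, `fwdPushCount` and `hasApplicationType`, and the thresholds are hypothetical; the actual vocabulary depends on how the network traffic ontology is defined, and the prefix would have to be declared in the rule file.

```python
# Map the comparison operators of an IF-THEN rule onto Jena rule builtins.
OPERATOR_BUILTINS = {"<=": "le", ">": "greaterThan"}


def to_jena_rule(name, conditions, app_type, prefix="ex"):
    """Render (feature, operator, threshold) conditions as one Jena-style rule."""
    clauses = []
    variables = {}
    for feature, op, threshold in conditions:
        # Bind each feature of the flow instance to a variable once.
        if feature not in variables:
            variables[feature] = f"?v{len(variables)}"
            clauses.append(f"(?flow {prefix}:{feature} {variables[feature]})")
        clauses.append(f"{OPERATOR_BUILTINS[op]}({variables[feature]}, {threshold})")
    head = f"(?flow {prefix}:hasApplicationType {prefix}:{app_type})"
    return f"[{name}: {' '.join(clauses)} -> {head}]"


# Hypothetical rule: IF serverPort <= 1024 AND fwdPushCount > 3 THEN WWW
rule_text = to_jena_rule(
    "rule1",
    [("serverPort", "<=", 1024), ("fwdPushCount", ">", 3)],
    "WWW",
)
print(rule_text)
```

One such rule would be generated per classification path, and the resulting rule strings together form the inference rule set loaded by the inference engine.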
II. Classify network flow instances in parallel by knowledge reasoning
In this step the inference rule set generated in step I is built into a corresponding inference engine using the Jena toolkit. For the constructed network traffic ontology, the Jena inference engine is invoked through the MapReduce parallel computing framework to perform parallel knowledge reasoning, mining the correspondence between network flow instances and network application types in the network traffic ontology, labeling each network flow instance with its network application type, and completing the network traffic classification. As shown in Fig. 2, the step specifically comprises the following sub-steps:
II-1. The inference rule set generated in step I is built into a corresponding inference engine using the Jena toolkit;
II-2. According to the performance of each computing node and the data scale of the network flow instances described in the network traffic ontology, the constructed network traffic ontology is split into multiple network traffic ontology fragments (ontology fragments O1 to On in Fig. 2); the fragments are uploaded to the Hadoop distributed file system, and each fragment is assigned an identifier;
II-3. Multiple MapReduce map (Map) functions are started (Map 1 to Map n in Fig. 2), and key-value pairs of the form <network traffic ontology fragment identifier, network traffic ontology fragment> are input to the map functions;
II-4. Each map function uses the inference engine built in step II-1 to perform knowledge reasoning over its network traffic ontology fragment, obtaining the network application type label of every network flow instance in the fragment (types L1 to Lm in Fig. 2);
II-5. Key-value pairs of the form <network application type label, network flow instance> are output to the reduce function;
II-6. The reduce functions (Reduce 1 to Reduce m in Fig. 2) merge network flow instances by network application type label, forming the classified network flow instance sets (flow sets C1 to Cm in Fig. 2);
II-7. The classified network flow instance sets are output, completing the network traffic classification.
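The splitting in sub-step II-2 can be sketched as below. The text states only that the split takes node performance and data scale into account; the proportional rule used here, in which each node receives a share of instances proportional to an integer performance weight, is an assumption made for illustration, as are the weights themselves.

```python
def split_into_fragments(instances, node_weights):
    """Split instances into contiguous fragments sized in proportion to node weights.

    node_weights are hypothetical positive integers describing relative node
    performance; fragments are keyed O1..On as in Fig. 2.
    """
    total = sum(node_weights)
    fragments = {}
    start = 0
    cumulative = 0
    for index, weight in enumerate(node_weights, start=1):
        cumulative += weight
        # Integer arithmetic keeps the last boundary exactly at len(instances).
        end = len(instances) * cumulative // total
        fragments[f"O{index}"] = instances[start:end]
        start = end
    return fragments


# Toy example: 10 flow instances, three nodes, the third twice as fast.
flow_instances = [f"flow{i}" for i in range(10)]
fragments = split_into_fragments(flow_instances, [1, 1, 2])
sizes = {fid: len(frag) for fid, frag in fragments.items()}
print(sizes)
```

Every instance lands in exactly one fragment, so the union of the map outputs covers the whole network traffic ontology.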
To verify the effectiveness of the method of the present invention, the knowledge reasoning classification time in the stand-alone and cluster environments was compared for different scales of network traffic data; the results are shown in Fig. 3. In Fig. 3 the abscissa is the number of network flow instances, in units of 10,000, and the ordinate is the classification time in seconds. The ▽ curve represents a single machine, the ◇ curve 3 machines and the △ curve 4 machines; the remaining curve represents 2 machines. As Fig. 3 shows, when the number of network flow instances is small, the time needed for classification differs little across different numbers of computing nodes. In the small-scale classification task with only 60,000 flow samples, the classification time in the stand-alone environment is even lower than that of the cluster environment with only 2 nodes, and close to that of the cluster environment with 3 nodes. This is because, when the volume of network flow instance data is small, MapReduce task scheduling and the splitting and recombining of data still take a certain amount of time. The advantage of the method of the present invention therefore does not show for small-scale data. As the scale of network flow instance data grows, however, the gap in classification time between the stand-alone and cluster environments widens; the overhead of MapReduce gradually stabilizes, and the advantage of parallel processing becomes apparent, demonstrating the efficiency of the method's parallel processing.
To measure more precisely the performance gain that the method of the present invention obtains from parallelization, the speed-up ratio R is used as the evaluation index:
R = T_s / T_p
where T_s denotes the running time of the method in the stand-alone environment and T_p denotes the running time of the method in the parallel environment. Fig. 4 gives the speed-up ratio curves of the method when the cluster environment uses 2, 3 and 4 machines, i.e. when the number of computing nodes is 2, 3 and 4 respectively. In Fig. 4 the abscissa is the number of network flow instances, in units of 10,000, and the ordinate is the speed-up ratio of the network traffic classification time. The ▽ curve represents 2 machines and the ◇ curve 4 machines; the remaining curve represents 3 machines. As shown in Fig. 4, for a fixed number of network flow instances the speed-up ratio changes in a step-like manner as computing nodes are added; as the number of network flow instances increases, the speed-up ratio rises to a maximum, then gradually falls, and finally stabilizes. Observation and analysis of the running state of each node show that when the number of network flow instances is small, the resource utilization of the cluster is low and the resources of each computing node are not used effectively. As the number of network flow instances increases, the speed-up ratio rises to its maximum; at this point the resource utilization of the cluster is highest and the resources of each node in the cluster are scheduled well. As the number of network flow instances continues to grow, the speed-up ratio gradually falls and then levels off. This is because cluster resource utilization has reached a bottleneck when the speed-up ratio is at its maximum; the cluster scheduler then begins to adjust its scheduling strategy and finally reaches a stable state.
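The speed-up ratio R = T_s / T_p defined above can be computed as in the following sketch. The timings are hypothetical placeholders chosen for easy arithmetic; they are not the measurements plotted in Fig. 3 or Fig. 4.

```python
def speedup(t_single, t_parallel):
    """Speed-up ratio R = T_s / T_p for one data scale and one cluster size."""
    if t_parallel <= 0:
        raise ValueError("parallel running time must be positive")
    return t_single / t_parallel


# Hypothetical running times in seconds for one fixed number of flow instances.
t_single = 120.0
t_parallel_by_nodes = {2: 80.0, 3: 60.0, 4: 48.0}

ratios = {nodes: speedup(t_single, t) for nodes, t in t_parallel_by_nodes.items()}
print(ratios)
```

A ratio below 1 would correspond to the small-scale case described for Fig. 3, where the stand-alone environment finishes sooner than the cluster.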
The above experimental results show that the method effectively improves execution efficiency, and that MapReduce parallel technology effectively improves the efficiency of classifying network flow instances in a large-scale network traffic ontology.
The above embodiment is only a specific example that further illustrates the purpose, technical solution and beneficial effects of the present invention; the present invention is not limited thereto. Any modification, equivalent replacement, improvement and the like made within the scope of the disclosure of the present invention shall fall within the scope of protection of the present invention.