CN105471639A - Median-based network flow entropy evaluation method and apparatus - Google Patents

Median-based network flow entropy evaluation method and apparatus

Publication number
CN105471639A
Authority
CN
China
Prior art keywords
entropy
data flow
estimated value
network data
counter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510816499.8A
Other languages
Chinese (zh)
Other versions
CN105471639B (en)
Inventor
杨家海
王子玉
李晨曦
张世泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201510816499.8A
Publication of CN105471639A
Application granted
Publication of CN105471639B
Legal status: Active
Anticipated expiration

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a median-based network flow entropy evaluation method and apparatus. The method includes the following steps: obtaining data packets of a network data flow and sending the data packets to the slave nodes in a Storm cluster; receiving the estimates of an intermediate item S returned by the slave nodes, wherein each estimate of S is obtained by a slave node in the Storm cluster from the received data packets through a preset entropy estimation algorithm; sorting the received estimates of S by magnitude; obtaining the median of the sorted estimates of S and taking the median as the final estimate of S; and obtaining an estimate of the entropy of the network data flow from the final estimate of S. The Storm cluster includes at least one slave node. The method increases the accuracy of entropy estimation without increasing storage space or computational complexity.

Description

Median-based network flow entropy evaluation method and apparatus
Technical Field
The present invention relates to the technical field of network traffic analysis, and in particular to a median-based network traffic entropy estimation method and apparatus.
Background
In network traffic analysis, analysis based on the traffic distribution usually yields richer results than analysis based on traffic volume alone. Entropy summarizes the characteristics of the traffic distribution well and is an extremely important metric of network traffic; for example, many network anomaly detection algorithms depend on the entropy of network flows. Computing the entropy of network traffic quickly and in real time on a high-speed backbone network is therefore of great significance for anomaly detection, traffic engineering, traffic analysis and related fields. Computing the entropy requires recording the size of each network flow, but backbone traffic is very large and the number of network flows per minute easily reaches the tens of thousands. Recording the state of every flow consumes a large amount of storage space and greatly increases the computational complexity, which makes real-time computation difficult.
Prior-art entropy estimation methods tend either to have low storage and computational cost but low estimation accuracy, or to have high estimation accuracy but high storage and computational cost.
In view of this, how to improve the accuracy of entropy estimation without increasing storage space or computational complexity has become a technical problem that needs to be solved.
Summary of the Invention
To solve the above technical problem, the present invention provides a median-based network traffic entropy estimation method and apparatus that can improve the accuracy of entropy estimation without increasing storage space or computational complexity.
In a first aspect, the present invention provides a median-based network traffic entropy estimation method, comprising:
obtaining packets of a network data flow and sending them to each slave node of a Storm cluster;
receiving the estimate of an intermediate item S returned by each slave node of the Storm cluster, the estimate of S being obtained by each slave node of the Storm cluster from the received packets by means of a preset entropy estimation algorithm;
sorting the received estimates of S by magnitude;
obtaining the median of all the sorted estimates of S and taking the median as the final estimate $\tilde{S}$ of S;
obtaining, from $\tilde{S}$, the estimate $\tilde{H}$ of the entropy of the network data flow;
wherein the Storm cluster comprises at least one slave node.
Optionally, the intermediate item S is calculated by a first formula, the first formula being:
$$S = \frac{1}{m} \sum_{i} m_i \log m_i ;$$
where a packet in the network data flow is denoted by i, $m_i$ is the frequency with which packet i occurs in the network data flow, $i \in \{1, 2, \ldots, 2^{4b}\}$, and b is the byte length of a single packet in the network data flow.
Optionally, obtaining, from $\tilde{S}$, the estimate $\tilde{H}$ of the entropy of the network traffic comprises:
obtaining, from $\tilde{S}$ and by a second formula, the estimate $\tilde{H}$ of the entropy of the network traffic;
the second formula being:
$$\tilde{H} = \log(m) - \tilde{S},$$
where m is the total length of the network data flow and is calculated by a third formula;
the third formula being:
$$m = \sum_{i=1}^{n} m_i ,$$
where n is a positive integer.
Optionally, the preset entropy estimation algorithm is an entropy estimation approximation algorithm, comprising:
randomly selecting g × z positions in the network data flow to generate a random position set;
for the j-th packet $a_j = i$ in the network data flow, determining whether j belongs to the random position set; if j belongs to the random position set, adding a counter for i and incrementing every counter associated with i by 1; if j does not belong to the random position set and a counter associated with i exists, incrementing the counter associated with i by 1;
after it has been determined for every packet in the network data flow whether its position j belongs to the random position set, obtaining a g × z counter matrix $C = (c_{pq})$;
building a matrix $X = (x_{pq})$ from the counter matrix C;
obtaining the mean of the elements of each row of the matrix X;
obtaining the median of all the means and taking this median as the estimate of S;
where $x_{pq} = m \times (c_{pq} \log c_{pq} - (c_{pq} - 1) \log (c_{pq} - 1))$, ε is the relative estimation error, and 1 − δ is the estimation accuracy.
Optionally, obtaining the mean of the elements of each row of the matrix X comprises:
obtaining, by a fourth formula, the mean avg[p] of the elements of each row of the matrix X;
the fourth formula being:
$$\mathrm{avg}[p] = \frac{1}{z} \sum_{q=1}^{z} x_{pq}, \quad p = 1, 2, \ldots, g; \; q = 1, 2, \ldots, z.$$
Optionally, the preset entropy estimation algorithm is an entropy estimation filtering algorithm, comprising:
sampling the packets in the obtained network data flow at a preset sampling rate;
determining whether the j-th packet $a_j = i$ in the network data flow is selected and whether a counter associated with i exists;
if the j-th packet $a_j = i$ is selected and no counter associated with i exists, creating a counter for i and marking i as a small flow;
if the j-th packet $a_j = i$ is selected and a counter associated with i exists, marking i as a large flow;
if the j-th packet $a_j = i$ is not selected and a counter associated with i exists, incrementing the counter associated with i by 1;
after this determination has been made for the packets in the network data flow, obtaining a counter matrix E of the flows marked as large flows and a counter matrix M of the flows marked as small flows;
obtaining the contribution $S_e$ of the large flows from the counter matrix E;
building a matrix $Y = (y_{pq})$ from the counter matrix M;
obtaining the mean of the elements of each row of the matrix Y;
obtaining the median of all the means and taking this median as the contribution $S_m$ of the small flows;
obtaining the estimate of S from $S_e$ and $S_m$;
where $E = (E_t)$, $t = 1, 2, \ldots, e$, and e is a positive integer;
$M = (M_{pq})$, $p = 1, 2, \ldots, g$; $q = 1, 2, \ldots, z$;
$y_{pq} = m \times (M_{pq} \log M_{pq} - (M_{pq} - 1) \log (M_{pq} - 1))$, $p = 1, 2, \ldots, g$; $q = 1, 2, \ldots, z$.
Optionally, obtaining the contribution $S_e$ of the large flows from the counter matrix E comprises:
obtaining, from the counter matrix E and by a fifth formula, the contribution $S_e$ of the large flows;
the fifth formula being:
$$S_e = \sum_{t=1}^{e} E_t \log E_t , \quad t = 1, 2, \ldots, e;$$
and/or,
obtaining the estimate of S from $S_e$ and $S_m$ comprises:
obtaining, from $S_e$ and $S_m$ and by a sixth formula, the estimate of S;
the sixth formula being:
$$\tilde{S} = S_e + S_m ;$$
and/or,
obtaining the mean of the elements of each row of the matrix Y comprises:
obtaining, by a seventh formula, the mean avg[p] of the elements of each row of the matrix Y;
the seventh formula being:
$$\mathrm{avg}[p] = \frac{1}{z} \sum_{q=1}^{z} y_{pq}, \quad p = 1, 2, \ldots, g; \; q = 1, 2, \ldots, z.$$
In a second aspect, the present invention provides a median-based network traffic entropy estimation apparatus, comprising:
a first acquisition module, configured to obtain packets of a network data flow and send them to each slave node of a Storm cluster;
a receiving module, configured to receive the estimate of an intermediate item S returned by each slave node of the Storm cluster, the estimate of S being obtained by each slave node of the Storm cluster from the received packets by means of a preset entropy estimation algorithm;
a sorting module, configured to sort the received estimates of S by magnitude;
a second acquisition module, configured to obtain the median of all the sorted estimates of S and take the median as the final estimate $\tilde{S}$ of S;
a third acquisition module, configured to obtain, from $\tilde{S}$, the estimate $\tilde{H}$ of the entropy of the network data flow;
wherein the Storm cluster comprises at least one slave node.
Optionally, the intermediate item S is calculated by a first formula, the first formula being:
$$S = \frac{1}{m} \sum_{i} m_i \log m_i ;$$
where a packet in the network data flow is denoted by i, $m_i$ is the frequency with which packet i occurs in the network data flow, $i \in \{1, 2, \ldots, 2^{4b}\}$, and b is the byte length of a single packet in the network data flow.
Optionally, the fourth acquisition module is specifically configured to:
obtain, from $\tilde{S}$ and by a second formula, the estimate $\tilde{H}$ of the entropy of the network traffic;
the second formula being:
$$\tilde{H} = \log(m) - \tilde{S},$$
where m is the total length of the network data flow and is calculated by a third formula;
the third formula being:
$$m = \sum_{i=1}^{n} m_i ,$$
where n is a positive integer.
As can be seen from the above technical solution, the median-based network traffic entropy estimation method and apparatus of the present invention perform parallel real-time computation on the Storm platform and can improve the accuracy of entropy estimation without increasing storage space or computational complexity.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of a median-based network traffic entropy estimation method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the Storm cluster topology of the method of the embodiment shown in Fig. 1 when the preset entropy estimation algorithm is the entropy estimation approximation algorithm;
Fig. 3 is a schematic diagram of the Storm cluster topology of the method of the embodiment shown in Fig. 1 when the preset entropy estimation algorithm is the entropy estimation filtering algorithm;
Fig. 4 is a schematic diagram of the Storm cluster topology of the method of the embodiment shown in Fig. 1 when some of the slave nodes of the Storm cluster use the entropy estimation approximation algorithm and the remaining slave nodes use the entropy estimation filtering algorithm;
Fig. 5 is a schematic structural diagram of a median-based network traffic entropy estimation apparatus provided by an embodiment of the present invention.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 shows a schematic flowchart of the median-based network traffic entropy estimation method provided by an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment is as follows.
101. Obtain packets of the network data flow and send them (through the message source, the spout) to each slave node of the Storm cluster.
The Storm cluster comprises at least one slave node.
In a specific application, the network flow packets of this embodiment may carry five-tuple information consisting of the source Internet Protocol (IP) address, destination IP address, source port, destination port and protocol number.
It will be understood that step 101 continuously captures the packets of the network data flow through the network interface card, and that all packets can be assumed to come from a feature space {1, 2, 3, ..., n}.
102. Receive the estimate of the intermediate item S returned by each slave node of the Storm cluster, the estimate of S being obtained by each slave node of the Storm cluster from the received packets by means of a preset entropy estimation algorithm.
In a specific application, the intermediate item S is calculated by the first formula, the first formula being:
$$S = \frac{1}{m} \sum_{i} m_i \log m_i ;$$
where a packet in the network data flow is denoted by i, $m_i$ is the frequency with which packet i occurs in the network data flow, $i \in \{1, 2, \ldots, 2^{4b}\}$, and b is the byte length of a single packet in the network data flow.
For example, a source IP address can be represented as a 4-byte binary code; by analogy, the five-tuple can be represented as a 13-byte binary code. That is, every packet can be represented by an integer i between 1 and $2^{52}$, and different values of i correspond to different network flows.
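As an illustration of the first formula, the following minimal sketch evaluates S exactly on a small in-memory stream. The function name `intermediate_item_S`, the string flow identifiers and the use of the natural logarithm are assumptions made for this example; the text does not fix the logarithm base.

```python
import math
from collections import Counter


def intermediate_item_S(flow_ids):
    """Exact S = (1/m) * sum_i m_i * log(m_i) over a finite stream.

    flow_ids is any iterable of hashable flow identifiers (hypothetical
    stand-ins for the integer i that encodes the five-tuple).
    The natural logarithm is an assumption of this sketch.
    """
    counts = Counter(flow_ids)      # m_i for every flow i
    m = sum(counts.values())        # total stream length m
    return sum(c * math.log(c) for c in counts.values()) / m


# Three packets of flow "f1" and one packet of flow "f2".
print(intermediate_item_S(["f1", "f1", "f1", "f2"]))
```

This exact computation keeps one counter per flow, which is precisely the storage cost that the estimation algorithms described later try to avoid.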
103. Sort the received estimates of S by magnitude.
104. Obtain the median of all the sorted estimates of S and take the median as the final estimate $\tilde{S}$ of S.
105. Obtain, from $\tilde{S}$, the estimate $\tilde{H}$ of the entropy of the network data flow.
In a specific application, step 105 may comprise:
obtaining, from $\tilde{S}$ and by the second formula, the estimate $\tilde{H}$ of the entropy of the network traffic;
the second formula being:
$$\tilde{H} = \log(m) - \tilde{S},$$
where m is the total length of the network data flow and is calculated by the third formula;
the third formula being:
$$m = \sum_{i=1}^{n} m_i ,$$
where n is a positive integer.
The median-based network traffic entropy estimation method of this embodiment performs parallel real-time computation on the Storm platform and can improve the accuracy of entropy estimation without increasing storage space or computational complexity.
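Steps 103 to 105 can be sketched as a single aggregation function. This is only an illustration: the function name `final_entropy_estimate` is hypothetical, the natural logarithm is assumed, and averaging the two middle values for an even number of estimates is an assumption, since the text does not say how the median is taken in that case.

```python
import math


def final_entropy_estimate(s_estimates, m):
    """Median of the per-node estimates of S, then H = log(m) - S."""
    ordered = sorted(s_estimates)              # step 103: sort by magnitude
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        s_final = ordered[mid]                 # step 104: middle value
    else:
        # Assumption: average the two middle values when n is even.
        s_final = (ordered[mid - 1] + ordered[mid]) / 2
    return math.log(m) - s_final               # step 105: second formula


# One wildly wrong estimate (100.0) does not affect the result.
print(final_entropy_estimate([0.5, 100.0, 0.4], m=1000))
```

The usage line shows why a median is used rather than a mean: a single unacceptable estimate from one node leaves the final value unchanged.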
In one specific application, the preset entropy estimation algorithm of this embodiment may be an entropy estimation approximation algorithm, comprising steps S1 to S6 (not shown in the figures):
S1. Randomly select g × z positions in the network data flow to generate a random position set.
Here ε is the relative estimation error and 1 − δ is the estimation accuracy.
S2. For the j-th packet $a_j = i$ in the network data flow, determine whether j belongs to the random position set. If j belongs to the random position set, add a counter for i and increment every counter associated with i by 1. If j does not belong to the random position set and a counter associated with i exists, increment the counter associated with i by 1.
S3. After it has been determined for every packet in the network data flow whether its position j belongs to the random position set, obtain a g × z counter matrix $C = (c_{pq})$.
S4. Build a matrix $X = (x_{pq})$ from the counter matrix C,
where $x_{pq} = m \times (c_{pq} \log c_{pq} - (c_{pq} - 1) \log (c_{pq} - 1))$, $p = 1, 2, \ldots, g$; $q = 1, 2, \ldots, z$.
S5. Obtain the mean of the elements of each row of the matrix X.
In a specific application, step S5 obtains, by the fourth formula, the mean avg[p] of the elements of each row of the matrix X;
the fourth formula being:
$$\mathrm{avg}[p] = \frac{1}{z} \sum_{q=1}^{z} x_{pq}, \quad p = 1, 2, \ldots, g; \; q = 1, 2, \ldots, z.$$
S6. Obtain the median of all the means and take this median as the estimate of S.
This entropy estimation approximation algorithm is an (ε, δ) approximation algorithm: with probability at least 1 − δ, it estimates the entropy with a relative error of at most ε, that is:
$$\Pr\left( |S - \tilde{S}| \le S\epsilon \right) \ge 1 - \delta .$$
It will be understood that, in this algorithm, the choice of z controls the relative estimation error ε and the choice of g controls the estimation accuracy 1 − δ.
The advantage of this entropy estimation approximation algorithm is that it only needs to read the data stream once: it does not store individual packets and only maintains g × z counters, which greatly reduces the storage requirement and the computational complexity. The algorithm is therefore well suited to online estimation of network traffic entropy. However, reducing the estimation error ε and increasing the estimation accuracy 1 − δ requires larger values of z and g, so the number of counters to be maintained grows considerably.
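Steps S1 to S6 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `estimate_S_approx` is hypothetical, the stream is an in-memory list so that m is known in advance, positions are drawn with replacement, the natural logarithm is used, and the median of the row means is divided by m so that the result matches the first formula's definition of S (the text takes the median itself as the estimate).

```python
import math
import random
from statistics import median


def _xlogx(c):
    # c * log(c), with the convention 0 * log(0) = 0
    return c * math.log(c) if c > 0 else 0.0


def estimate_S_approx(stream, g, z, rng=None):
    rng = rng or random.Random()
    m = len(stream)
    # S1: draw g*z random positions (with replacement: an assumption).
    starts = {}                      # position j -> counter indexes started at j
    for k in range(g * z):
        starts.setdefault(rng.randrange(m), []).append(k)
    counts = [0] * (g * z)           # the g x z counter matrix C, flattened
    active = {}                      # flow id -> indexes of its counters
    # S2: one pass over the stream.
    for j, item in enumerate(stream):
        if j in starts:              # j is in the random position set
            active.setdefault(item, []).extend(starts[j])
        for k in active.get(item, ()):
            counts[k] += 1           # every counter associated with this flow
    # S3-S4: build X from C.
    x = [m * (_xlogx(c) - _xlogx(c - 1)) for c in counts]
    # S5: mean of each of the g rows (fourth formula).
    avgs = [sum(x[p * z:(p + 1) * z]) / z for p in range(g)]
    # S6: median of the row means; dividing by m is an assumption (see lead-in).
    return median(avgs) / m


stream = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
print(estimate_S_approx(stream, g=3, z=200, rng=random.Random(42)))
```

Only the g × z counters and the bookkeeping that links flows to their counters are kept, which is the property the paragraph above highlights.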
Fig. 2 shows a schematic diagram of the Storm cluster topology of the median-based network traffic entropy estimation method of this embodiment when the preset entropy estimation algorithm is the entropy estimation approximation algorithm. One random position generating bolt, one counter generating bolt and one S-estimate generating bolt together constitute one instance of the algorithm (a bolt is a message processor). The Storm topology contains n instances of the entropy estimation approximation algorithm; the result of each instance is sent to the rightmost bolt, which sorts the results and takes the median as the final entropy estimate.
Referring to Fig. 2, the method of this embodiment uses the Storm platform so that each slave node of the Storm cluster computes an estimate of S in parallel and in real time with the above entropy estimation approximation algorithm. The median of all the sorted estimates of S is then taken as the final estimate $\tilde{S}$ of S, and the estimate $\tilde{H}$ of the entropy of the network data flow is obtained from $\tilde{S}$. This improves the accuracy of entropy estimation without increasing storage space or computational complexity.
In another specific application, the preset entropy estimation algorithm of this embodiment may be an entropy estimation filtering algorithm, comprising steps T1 to T8 (not shown in the figures):
T1. Sample the packets in the obtained network data flow at a preset sampling rate r.
T2. Determine whether the j-th packet $a_j = i$ in the network data flow is selected and whether a counter associated with i exists. If the packet is selected and no counter associated with i exists, create a counter for i and mark i as a small flow. If the packet is selected and a counter associated with i exists, mark i as a large flow. If the packet is not selected and a counter associated with i exists, increment the counter associated with i by 1.
T3. After this determination has been made for the packets in the network data flow, obtain the counter matrix E of the flows marked as large flows and the counter matrix M of the flows marked as small flows,
where $E = (E_t)$, $t = 1, 2, \ldots, e$, and e is a positive integer; $M = (M_{pq})$, $p = 1, 2, \ldots, g$; $q = 1, 2, \ldots, z$.
T4. Obtain the contribution $S_e$ of the large flows from the counter matrix E.
In a specific application, step T4 may obtain the contribution $S_e$ of the large flows from the counter matrix E by the fifth formula;
the fifth formula being:
$$S_e = \sum_{t=1}^{e} E_t \log E_t , \quad t = 1, 2, \ldots, e.$$
T5. Build a matrix $Y = (y_{pq})$ from the counter matrix M,
where $y_{pq} = m \times (M_{pq} \log M_{pq} - (M_{pq} - 1) \log (M_{pq} - 1))$, $p = 1, 2, \ldots, g$; $q = 1, 2, \ldots, z$.
T6. Obtain the mean of the elements of each row of the matrix Y.
In a specific application, step T6 obtains, by the seventh formula, the mean avg[p] of the elements of each row of the matrix Y;
the seventh formula being:
$$\mathrm{avg}[p] = \frac{1}{z} \sum_{q=1}^{z} y_{pq}, \quad p = 1, 2, \ldots, g; \; q = 1, 2, \ldots, z.$$
T7. Obtain the median of all the means and take this median as the contribution $S_m$ of the small flows.
T8. Obtain the estimate of S from $S_e$ and $S_m$.
In a specific application, step T8 may obtain the estimate of S from $S_e$ and $S_m$ by the sixth formula;
the sixth formula being:
$$\tilde{S} = S_e + S_m .$$
The core idea of this entropy estimation filtering algorithm is to distinguish large flows (elephants) from small flows (mice) and to compute their contributions to the final entropy separately.
Because large and small flows are distinguished, estimating the network traffic entropy with this filtering algorithm can be more accurate, but the storage space and computational complexity also increase slightly. Clearly, increasing the precision and accuracy of the filtering algorithm further requires a higher sampling rate r, which inevitably increases the processing burden.
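The classification in steps T1 to T3 and the large-flow contribution of step T4 can be sketched as below. Several details are assumptions of this sketch rather than statements of the text: the names `classify_flows` and `large_flow_contribution` are hypothetical, a new counter starts at 1 (counting the sampled packet), a sampled packet of an already tracked flow also increments its counter, and the natural logarithm is used. The small-flow steps T5 to T7 are omitted because the text does not say how the small-flow counters are arranged into a g × z matrix.

```python
import math
import random


def classify_flows(stream, r, rng=None):
    """Return (E, M): counters of flows marked large and marked small."""
    rng = rng or random.Random()
    counters = {}            # flow id -> counter
    large = set()            # flow ids marked as large flows
    for item in stream:
        selected = rng.random() < r          # T1: sample at rate r
        if selected and item not in counters:
            counters[item] = 1               # assumption: count the sampled packet
        elif selected:
            large.add(item)                  # sampled again: mark as large flow
            counters[item] += 1              # assumption: still count the packet
        elif item in counters:
            counters[item] += 1              # unsampled packet of a tracked flow
    E = [c for f, c in counters.items() if f in large]
    M = [c for f, c in counters.items() if f not in large]
    return E, M


def large_flow_contribution(E):
    """Fifth formula: S_e = sum_t E_t * log(E_t)."""
    return sum(c * math.log(c) for c in E)


E, M = classify_flows(["a"] * 500 + ["b"] * 5, r=0.05, rng=random.Random(7))
print(E, M, large_flow_contribution(E))
```

Flows that are never sampled get no counter at all, which is how the filtering approach keeps its state small at low sampling rates.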
Similarly to Fig. 2, Fig. 3 shows a schematic diagram of the Storm cluster topology of the median-based network traffic entropy estimation method of this embodiment when the preset entropy estimation algorithm is the entropy estimation filtering algorithm. Referring to Fig. 3, the method of this embodiment uses the Storm platform so that each slave node of the Storm cluster computes an estimate of S in parallel and in real time with the above entropy estimation filtering algorithm. The median of all the sorted estimates of S is then taken as the final estimate $\tilde{S}$ of S, and the estimate $\tilde{H}$ of the entropy of the network data flow is obtained from $\tilde{S}$. This improves the accuracy of entropy estimation without increasing storage space or computational complexity.
In yet another specific application, referring to the Storm cluster topology shown in Fig. 4, the method of this embodiment may also use the Storm platform so that some of the slave nodes of the Storm cluster use the above entropy estimation approximation algorithm and the remaining slave nodes use the above entropy estimation filtering algorithm, all computing estimates of S in parallel and in real time. The median of all the sorted estimates of S is then taken as the final estimate $\tilde{S}$ of S, and the estimate $\tilde{H}$ of the entropy of the network data flow is obtained from $\tilde{S}$. This improves the accuracy of entropy estimation without increasing storage space or computational complexity.
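The fan-out and median structure of Figs. 2 to 4 can be imitated on a single machine. The sketch below does not use the Storm API: a thread pool stands in for the slave nodes, and `run_instances` and the lambda estimators are hypothetical names and stand-ins chosen for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def run_instances(estimators, packets):
    """Run every estimator on the same packets in parallel, return the median.

    estimators is a non-empty list of callables packets -> estimate of S;
    they may mix different algorithms, as in the topology of Fig. 4.
    """
    with ThreadPoolExecutor(max_workers=len(estimators)) as pool:
        futures = [pool.submit(est, packets) for est in estimators]
        results = [f.result() for f in futures]
    return median(results)


# Stand-in estimators: two plausible results and one outlier.
estimators = [lambda p: 1.9, lambda p: 2.1, lambda p: 50.0]
print(run_instances(estimators, packets=[]))
```

In a real deployment each estimator would be an independent instance of one of the two algorithms above, and the final bolt would play the role of the `median` call.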
It will be understood that the well-known multiplicative form of the Chernoff bound can be used to prove in theory that, with the above preset entropy estimation algorithms, the median-based network traffic entropy estimation method of this embodiment improves the accuracy of entropy estimation without increasing storage space or computational complexity.
Chernoff bound (multiplicative form): assume that $X_1, X_2, \ldots, X_n$ are independent, identically distributed random variables taking the value 0 or 1. Let X denote the sum of these random variables, $X = \sum_{i=1}^{n} X_i$, and let $\mu = E[X]$ be the expectation of X. Then for any real number $0 < \delta < 1$,
$$\Pr\left( X \le (1 - \delta)\mu \right) \le e^{-\frac{\delta^2 \mu}{2}} .$$
Using the Chernoff bound, the correctness of the median-based improved algorithm can be proved.
First, suppose that the Storm cluster comprises n slave nodes. The results of n runs of the entropy estimation approximation algorithm can be regarded as a set of independent, identically distributed Bernoulli variables $X_1, X_2, \ldots, X_n$: $X_i = 1$ means that the entropy estimate is accurate, and $X_i = 0$ means that the entropy estimate is unacceptable. Let $\Pr(X_i = 1) = p$, $i = 1, 2, \ldots, n$; then the expectation of the sum of these Bernoulli variables is $\mu = np$. By the multiplicative form of the Chernoff bound,
$$\Pr\left( \sum_{i=1}^{n} X_i \le \frac{n}{2} \right) = \Pr\left( \sum_{i=1}^{n} X_i \le \left( 1 - \left( 1 - \frac{1}{2p} \right) \right) \mu \right) \le e^{-\frac{\mu}{2} \left( 1 - \frac{1}{2p} \right)^2} = e^{-\frac{n}{2p} \left( p - \frac{1}{2} \right)^2} ;$$
Taking the complementary event gives
$$\Pr\left( \sum_{i=1}^{n} X_i > \frac{n}{2} \right) \ge 1 - e^{-\frac{1}{2p} n \left( p - \frac{1}{2} \right)^2} ;$$
By adjusting the number of random positions g × z and the sampling rate r, p > 1/2 can be ensured, so that $1 - \frac{1}{2p}$ lies in (0, 1) as the bound requires. As n tends to infinity, the probability on the left-hand side tends to 1. That is, as the number of runs n increases, the probability that more than n/2 of the results are accurate estimates approaches 1.
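The limiting behaviour stated above can also be checked numerically with the exact binomial tail instead of the bound. The function name `majority_accurate_probability` and the value p = 0.6 are assumptions chosen for illustration.

```python
import math


def majority_accurate_probability(n, p):
    """Exact Pr(sum X_i > n/2) for n independent Bernoulli(p) variables."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)    # k > n/2
    )


# With p > 1/2 the probability grows towards 1 as n increases.
for n in (1, 11, 51, 101):
    print(n, majority_accurate_probability(n, 0.6))
```

For p below 1/2 the same function decreases with n, which is why the argument needs the single-run accuracy to exceed one half.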
Next, we prove the following proposition:
Running the data stream algorithm n times yields n estimates; the probability that the median of these n estimates is an accurate estimate tends to 1 as n increases.
Proof: let the exact value of the network traffic entropy be H and let the n estimates be $H_1, H_2, \ldots, H_n$; without loss of generality, assume $H_1 < H_2 < \cdots < H_n$. Three cases are distinguished according to the position of H relative to $H_1, H_2, \ldots, H_n$.
Case 1: $H_1 < H_2 < \cdots < H_i < H < H_{i+1} < \cdots < H_n$.
Clearly, the group of more than n/2 estimates nearest to H consists of the most accurate estimates. It has already been shown that, as the number of runs n increases, the probability that more than half of the results are accurate estimates approaches 1. Therefore the probability that the estimates in this group are accurate tends to 1.
It remains to show that the median of the n estimates belongs to this group of estimates nearest to H. Clearly, the estimates in the group are adjacent in the sorted order; denote them $H_s < H_{s+1} < \cdots < H_t$, where the group contains more than n/2 estimates. Suppose, for contradiction, that the median does not belong to the group. Then the median is either smaller than $H_s$ or larger than $H_t$. If the median is smaller than $H_s$, more than n/2 estimates are larger than the median, which is a contradiction. If the median is larger than $H_t$, more than n/2 estimates are smaller than the median, which is again a contradiction. Hence the median must belong to the group of estimates nearest to H. According to formula (7), as n increases, the probability that more than n/2 of the results are accurate estimates approaches 1, which is equivalent to saying that the probability that the median is an accurate estimate approaches 1. The proposition is proved in this case.
Case 2: $H < H_1 < H_2 < \cdots < H_n$.
Clearly, the most accurate estimates (those nearest to H) are now the smallest ones, starting from $H_1$. Since this group contains more than half of the estimates, the median belongs to it. According to formula (7), as n increases, the probability that more than n/2 of the results are accurate estimates approaches 1, which is equivalent to saying that the probability that the median is an accurate estimate approaches 1. The proposition is proved in this case.
Case 3: $H_1 < H_2 < \cdots < H_n < H$.
The proof is similar to that of Case 2.
This completes the proof of the proposition.
According to above-mentioned proposition, we only need run above-mentioned default entropy estimating algorithm n time and get the median of this n estimated value, just greatly can increase and estimate probability accurately.But running n secondary data flow algorithm just needs to read n pass certificate, and namely data must be stored on hard disk, namely can not lose with crossing.The advantage of data flow algorithm will not exist.Further, the time needed for entropy estimation correspondingly increases n doubly, is difficult to accomplish real-time estimation.Method described in the present embodiment utilizes storm cluster, runs the above-mentioned default entropy estimating algorithm of n part concurrently, processing load is distributed on multiple machine, can improves the accuracy of estimation, can meet again the requirement of real-time estimation.
In addition, for the above entropy estimation approximation algorithm, the following simulation experiments were carried out to show that the median-based network traffic entropy estimation method of this embodiment can improve the accuracy of entropy estimation without increasing storage space or computational complexity.
First, about 2 hours of traffic was collected from the China Education and Research Network (CERNET) and stored as a file of about 300Gb. Replaying this file into the Storm cluster with the TCPReplay software simulates the scenario of online entropy estimation.
Then, the relative error ε=0.22 is chosen. According to the literature, the required value of z is approximately […]; taking m=13403402 (corresponding to about 5 minutes of traffic), z is approximately 10000. In practice, the relative error can be reduced further by increasing z. In addition, we take g=1, so that the probability that an estimated value meets the relative error is about 68.4%. There are two reasons for choosing g=1. First, the probability of an accurate estimate is deliberately kept low (while ensuring p>1/2), so that the improvement of the method of this embodiment over the entropy estimation approximation algorithm is shown more clearly. Second, each increase of g by 1 requires 10000 additional counters, and z, which controls the size of the estimation error, must be guaranteed; in practice, therefore, g should be kept as small as possible. The simulation shows that, with the idea of this patent, the accuracy of the estimate can be ensured even with g=1.
According to the formula, for the accuracy of the estimate to exceed 0.95, n must be at least 53. We therefore ran 70 instances of the entropy estimation approximation algorithm in parallel according to the topology of Fig. 2; the outputs of the individual instances are shown in Table 1. The exact value of the entropy, computed manually, is 4.5; hence, for a relative estimation error of 0.22, the limit for an estimated value is 5.5: an estimate below 5.5 is considered accurate and acceptable, and an estimate above 5.5 is considered unacceptable. Table 1 lists the entropy estimates of the individual instances of the entropy estimation approximation algorithm. It shows that 23 of the outputs of the 70 bolts are unacceptable; that is, if the entropy estimation approximation algorithm were run in a single pass, the probability of error would be 32.9%. If instead we take the median of these 70 entropy estimates, the value is 5.10, which is below 5.5 and can be regarded as an accurate estimate.
Table 1
No. Estimate Acceptable No. Estimate Acceptable
1 4.56 Yes 2 4.68 Yes
3 4.68 Yes 4 4.74 Yes
5 4.75 Yes 6 4.75 Yes
7 4.77 Yes 8 4.78 Yes
9 4.78 Yes 10 4.80 Yes
11 4.80 Yes 12 4.81 Yes
13 4.82 Yes 14 4.86 Yes
15 4.86 Yes 16 4.86 Yes
17 4.88 Yes 18 4.88 Yes
19 4.90 Yes 20 4.90 Yes
21 4.90 Yes 22 4.91 Yes
23 4.91 Yes 24 4.91 Yes
25 4.95 Yes 26 4.95 Yes
27 4.97 Yes 28 4.97 Yes
29 5.00 Yes 30 5.00 Yes
31 5.00 Yes 32 5.04 Yes
33 5.07 Yes 34 5.07 Yes
35 5.10 Yes 36 5.10 Yes
37 5.15 Yes 38 5.20 Yes
39 5.20 Yes 40 5.24 Yes
41 5.26 Yes 42 5.26 Yes
43 5.29 Yes 44 5.33 Yes
45 5.33 Yes 46 5.35 Yes
47 5.38 Yes 48 5.41 Yes
49 5.44 Yes 50 5.46 Yes
51 5.50 No 52 5.52 No
53 5.56 No 54 5.58 No
55 5.60 No 56 5.60 No
57 5.61 No 58 5.62 No
59 5.62 No 60 5.64 No
61 5.65 No 62 5.66 No
63 5.69 No 64 5.69 No
65 5.71 No 66 5.71 No
67 5.73 No 68 5.75 No
69 5.75 No 70 5.80 No
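The median quoted for Table 1 can be checked directly. The sketch below transcribes the 70 estimated values from the table and computes their median and its relative error against the exact entropy of 4.5 stated in the description; the names `TABLE_1`, `median_estimate` and `relative_error` are hypothetical.

```python
from statistics import median

# Estimated values transcribed from Table 1 (instances 1 to 70).
TABLE_1 = [
    4.56, 4.68, 4.68, 4.74, 4.75, 4.75, 4.77, 4.78, 4.78, 4.80,
    4.80, 4.81, 4.82, 4.86, 4.86, 4.86, 4.88, 4.88, 4.90, 4.90,
    4.90, 4.91, 4.91, 4.91, 4.95, 4.95, 4.97, 4.97, 5.00, 5.00,
    5.00, 5.04, 5.07, 5.07, 5.10, 5.10, 5.15, 5.20, 5.20, 5.24,
    5.26, 5.26, 5.29, 5.33, 5.33, 5.35, 5.38, 5.41, 5.44, 5.46,
    5.50, 5.52, 5.56, 5.58, 5.60, 5.60, 5.61, 5.62, 5.62, 5.64,
    5.65, 5.66, 5.69, 5.69, 5.71, 5.71, 5.73, 5.75, 5.75, 5.80,
]


def median_estimate(values):
    """Sort the estimates and return their median (the mean of the two
    middle values when the number of estimates is even)."""
    return median(sorted(values))


def relative_error(estimate, exact):
    """Relative deviation of an estimate from the exact value."""
    return abs(estimate - exact) / exact


if __name__ == "__main__":
    est = median_estimate(TABLE_1)
    print(est, relative_error(est, 4.5))
```

With 70 values the median is the mean of the 35th and 36th sorted values, both of which are 5.10 in the table.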
Next, we repeated the above experiment 60 times, obtaining 60 median-based entropy estimates, of which only 2 were unacceptable; that is, the accuracy rose to 96.7%. Thus the median-based improved algorithm proposed in this patent does greatly improve the accuracy of the estimate. In addition, every entropy estimate was delivered within 5 minutes, showing that the median-based data stream algorithm can indeed run in real time with the help of a Storm cluster.
For algorithm two, and for the combination of algorithm one and algorithm two, our experiments gave similar results, which are not repeated here.
Fig. 5 is a schematic structural diagram of a median-based network traffic entropy estimation apparatus provided by an embodiment of the present invention. As shown in Fig. 5, the median-based network traffic entropy estimation apparatus of this embodiment comprises: a first acquisition module 51, a receiving module 52, a sorting module 53, a second acquisition module 54 and a third acquisition module 55;
The first acquisition module 51 is configured to obtain the packets of a network data stream and send them to each worker node of a Storm cluster;
The receiving module 52 is configured to receive the estimated value of the intermediate term S returned by each worker node of the Storm cluster, the estimated value of S being obtained by each worker node of the Storm cluster from the received packets by a preset entropy estimation algorithm;
The sorting module 53 is configured to sort the received estimated values of S by size;
The second acquisition module 54 is configured to obtain the median of all the sorted estimated values of S and take the median as the final estimated value $\tilde{S}$ of S;
The third acquisition module 55 is configured to obtain, from $\tilde{S}$, the estimated value $\tilde{H}$ of the entropy of the network traffic of the network data stream;
Wherein the Storm cluster comprises at least one worker node.
In a specific application, the intermediate term S is calculated by a first formula, the first formula being:
$$S = \frac{1}{m}\sum_{i} m_i \log m_i;$$
Wherein the packets in the network data stream are denoted by i, $m_i$ is the frequency with which packet i occurs in the network data stream, $i \in \{1, 2, \dots, 2^{4b}\}$, and b is the byte length of a single packet in the network data stream.
In a specific application, the fourth acquisition module may be specifically configured to
obtain, from $\tilde{S}$ and by a second formula, the estimated value $\tilde{H}$ of the entropy of the network traffic;
The second formula is:
$$\tilde{H} = \log(m) - \tilde{S},$$
Wherein m is the total length of the network data stream, and m is calculated by a third formula;
The third formula is:
$$m = \sum_{i=1}^{n} m_i,$$
where n is a positive integer.
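As a reference point for the three formulas, the following minimal sketch computes S and the entropy exactly from a fully stored stream. It is not the streaming method of this embodiment, since it keeps one counter per distinct packet value. The function name `exact_entropy` is hypothetical, and the base-2 logarithm is an assumption, because the description does not state the base.

```python
import math
from collections import Counter


def exact_entropy(stream, log=math.log2):
    """Return (S, H) for a non-empty stream, following the first, second
    and third formulas. The logarithm base (2 here) is an assumption."""
    counts = Counter(stream)                  # m_i for each packet value i
    m = sum(counts.values())                  # third formula: m = sum of m_i
    s = sum(c * log(c) for c in counts.values()) / m   # first formula
    return s, log(m) - s                      # second formula: H = log(m) - S


if __name__ == "__main__":
    print(exact_entropy("abcd"))   # four distinct packets
    print(exact_entropy("aaaa"))   # one packet value repeated
```

A stream of m distinct packets gives S = 0 and H = log(m), while a stream consisting of a single repeated value gives H = 0.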
The median-based network traffic entropy estimation apparatus of this embodiment performs real-time parallel computation on the Storm platform and can improve the accuracy of entropy estimation without increasing storage space or computational complexity.
The median-based network traffic entropy estimation apparatus of this embodiment may be used to carry out the technical solution of the method embodiment shown in Fig. 1; its implementation principle and technical effect are similar and are not repeated here.
In the embodiments of the present invention, "first", "second" and the like do not prescribe an order; they merely distinguish names and impose no limitation.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium; when executed, it performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above embodiments merely illustrate the technical solutions of the present invention and do not limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described therein may still be modified, or some or all of their technical features replaced by equivalents, and that such modifications or replacements do not take the essence of the corresponding technical solutions outside the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A median-based network traffic entropy estimation method, characterized by comprising:
Obtaining the packets of a network data stream and sending them to each worker node of a Storm cluster;
Receiving the estimated value of the intermediate term S returned by each worker node of the Storm cluster, the estimated value of S being obtained by each worker node of the Storm cluster from the received packets by a preset entropy estimation algorithm;
Sorting the received estimated values of S by size;
Obtaining the median of all the sorted estimated values of S and taking the median as the final estimated value $\tilde{S}$ of S;
Obtaining, from $\tilde{S}$, the estimated value $\tilde{H}$ of the entropy of the network traffic of the network data stream;
Wherein the Storm cluster comprises at least one worker node.
2. The method according to claim 1, characterized in that the intermediate term S is calculated by a first formula, the first formula being:
$$S = \frac{1}{m}\sum_{i} m_i \log m_i;$$
Wherein the packets in the network data stream are denoted by i, $m_i$ is the frequency with which packet i occurs in the network data stream, $i \in \{1, 2, \dots, 2^{4b}\}$, and b is the byte length of a single packet in the network data stream.
3. The method according to claim 2, characterized in that obtaining, from $\tilde{S}$, the estimated value $\tilde{H}$ of the entropy of the network traffic comprises:
Obtaining, from $\tilde{S}$ and by a second formula, the estimated value $\tilde{H}$ of the entropy of the network traffic;
The second formula is:
$$\tilde{H} = \log(m) - \tilde{S},$$
Wherein m is the total length of the network data stream, and m is calculated by a third formula;
The third formula is:
$$m = \sum_{i=1}^{n} m_i,$$
where n is a positive integer.
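The steps of claims 1 to 3 can be sketched on a single machine as follows. This is a hedged illustration, not the Storm topology of the embodiment: a thread pool stands in for the cluster, each callable in `estimators` stands in for one worker node running the preset entropy estimation algorithm, and the base-2 logarithm is an assumption. The names `median_entropy` and `estimators` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from statistics import median


def median_entropy(packets, estimators, log=math.log2):
    """Send the same packets to every estimator, collect their estimates
    of S, sort them, take the median as the final estimate of S, and
    return the entropy estimate log(m) - median."""
    packets = list(packets)
    with ThreadPoolExecutor() as pool:
        # Each estimator plays the role of one worker node.
        s_values = list(pool.map(lambda est: est(packets), estimators))
    s_tilde = median(sorted(s_values))      # final estimate of S
    return log(len(packets)) - s_tilde      # second formula


if __name__ == "__main__":
    # Three stand-in workers returning fixed estimates of S.
    workers = [lambda p: 1.0, lambda p: 9.0, lambda p: 2.0]
    print(median_entropy(range(16), workers))
```

The outlying estimate 9.0 does not affect the result, because only the middle value of the sorted estimates is used. For an even number of estimates `statistics.median` averages the two middle values, a choice the claims do not specify.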
4. The method according to claim 3, characterized in that the preset entropy estimation algorithm is an entropy estimation approximation algorithm, comprising:
Randomly selecting g × z positions in the network data stream to generate a random position set;
For the j-th packet $a_j = i$ in the network data stream, determining whether j belongs to the random position set; if j belongs to the random position set, adding a counter for i and incrementing every counter in the set of counters associated with i by 1; if j does not belong to the random position set and a counter associated with i exists, incrementing the counter associated with i by 1;
After it has been determined for every packet in the network data stream whether its position j belongs to the random position set, obtaining a g × z counter matrix $C = (c_{pq})$;
Building a matrix $X = (x_{pq})$ from the counter matrix C;
Obtaining the mean value of the elements of each row of the matrix X;
Obtaining the median of all the mean values and taking this median as the estimated value of S;
Wherein $x_{pq} = m \times (c_{pq}\log c_{pq} - (c_{pq}-1)\log(c_{pq}-1))$, $g = 2\log\left(\frac{1}{\delta}\right)$, ε is the relative estimation error, and 1−δ is the estimation accuracy.
5. The method according to claim 4, characterized in that obtaining the mean value of the elements of each row of the matrix X comprises:
Obtaining the mean value avg[p] of the elements of each row of the matrix X by a fourth formula;
Wherein the fourth formula is:
$$avg[p] = \frac{1}{z}\sum_{q=1}^{z} x_{pq}, \quad p = 1, 2, \dots, g;\ q = 1, 2, \dots, z.$$
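The sampling estimator of claims 4 and 5 can be sketched offline as below. Several points are assumptions rather than statements of the claims: the stream is held in a list so that m is known in advance, the g × z positions are drawn with replacement, the logarithm is base 2, and the final result is divided by m. The last point reflects that $x_{pq}$ as written carries a factor m, so its expectation is $\sum_i m_i \log m_i$; dividing by m puts the result on the scale of the first formula. The names `estimate_s` and `_f` are hypothetical.

```python
import math
import random
from collections import defaultdict
from statistics import median


def _f(c, log=math.log2):
    """c * log(c), with the convention 0 * log(0) = 0."""
    return c * log(c) if c > 0 else 0.0


def estimate_s(stream, g, z, rng=None):
    """Estimate S from g*z counters started at random stream positions."""
    rng = rng or random.Random()
    stream = list(stream)
    m = len(stream)
    # Random position set: position -> counter slots started there
    # (positions drawn with replacement, an assumption).
    sampled = defaultdict(list)
    for slot in range(g * z):
        sampled[rng.randrange(m)].append(slot)
    counters = [0] * (g * z)
    tracking = defaultdict(list)   # packet value -> slots counting it
    for j, item in enumerate(stream):
        if j in sampled:
            tracking[item].extend(sampled[j])   # add counter(s) for this value
        if item in tracking:
            for slot in tracking[item]:
                counters[slot] += 1             # increment all its counters
    # x_pq = m * (c log c - (c - 1) log(c - 1)); slot p*z + q is row p.
    x = [m * (_f(c) - _f(c - 1)) for c in counters]
    row_means = [sum(x[p * z:(p + 1) * z]) / z for p in range(g)]  # avg[p]
    # Division by m is our normalization to the first formula's scale.
    return median(row_means) / m


if __name__ == "__main__":
    data = [i % 4 for i in range(400)]
    print(estimate_s(data, g=5, z=2000, rng=random.Random(1)))
```

Each counter ends up holding the number of occurrences of the sampled packet value from the sampled position onward, which is what makes the row averages unbiased under the stated assumptions.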
6. The method according to claim 3, characterized in that the preset entropy estimation algorithm is an entropy estimation filtering algorithm, comprising:
Sampling the packets in the obtained network data stream at a preset sampling rate;
Determining whether the j-th packet $a_j = i$ in the network data stream is selected and whether a counter associated with i exists;
If the j-th packet $a_j = i$ in the network data stream is selected and no counter associated with i exists, creating a counter for i and marking i as a small flow;
If the j-th packet $a_j = i$ in the network data stream is selected and a counter associated with i exists, marking i as a large flow;
If the j-th packet $a_j = i$ in the network data stream is not selected and a counter associated with i exists, incrementing the counter associated with i by 1;
After determining whether the j-th packet $a_j = i$ in the network data stream is selected and whether a counter associated with i exists, obtaining the counter matrix E marked as large flows and the counter matrix M marked as small flows;
Obtaining the contribution value $S_e$ of the large flows from the counter matrix E marked as large flows;
Building a matrix $Y = (y_{pq})$ from the counter matrix M marked as small flows;
Obtaining the mean value of the elements of each row of the matrix Y;
Obtaining the median of all the mean values and taking this median as the contribution value $S_m$ of the small flows;
Obtaining the estimated value of S from $S_e$ and $S_m$;
Wherein $E = (E_t)$, $t = 1, 2, \dots, e$; e is a positive integer;
$M = (M_{pq})$, $p = 1, 2, \dots, g$; $q = 1, 2, \dots, z$;
$y_{pq} = m \times (M_{pq}\log M_{pq} - (M_{pq}-1)\log(M_{pq}-1))$, $p = 1, 2, \dots, g$; $q = 1, 2, \dots, z$.
7. The method according to claim 6, characterized in that obtaining the contribution value $S_e$ of the large flows from the counter matrix E marked as large flows comprises:
Obtaining the contribution value $S_e$ of the large flows from the counter matrix E marked as large flows by a fifth formula;
Wherein the fifth formula is:
$$S_e = \sum_{t=1}^{e} E_t \log E_t, \quad t = 1, 2, \dots, e;$$
And/or,
Obtaining the estimated value of S from $S_e$ and $S_m$ comprises:
Obtaining the estimated value of S from $S_e$ and $S_m$ by a sixth formula;
Wherein the sixth formula is:
$$\tilde{S} = S_e + S_m;$$
And/or,
Obtaining the mean value of the elements of each row of the matrix Y comprises:
Obtaining the mean value avg[p] of the elements of each row of the matrix Y by a seventh formula;
Wherein the seventh formula is:
$$avg[p] = \frac{1}{z}\sum_{q=1}^{z} y_{pq}, \quad p = 1, 2, \dots, g;\ q = 1, 2, \dots, z.$$
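The marking pass of claim 6 and the fifth formula of claim 7 can be sketched as follows. The claims leave several details open, so the sketch rests on labeled assumptions: a new counter starts at 1, a packet that is selected again is both marked as a large flow and counted, packets are sampled independently at the given rate, and the logarithm is base 2. The arrangement of the small-flow counters into the g × z matrix M is not specified in the claims and is not attempted here. The names `sieve_pass` and `large_flow_contribution` are hypothetical.

```python
import math
import random


def sieve_pass(stream, sample_rate, rng=None):
    """One pass over the stream; returns (large, small) counter dicts."""
    rng = rng or random.Random()
    counters = {}    # packet value -> counter
    large = set()    # packet values marked as large flows
    for item in stream:
        selected = rng.random() < sample_rate
        if item not in counters:
            if selected:
                counters[item] = 1   # new counter, marked small (assumed start at 1)
            continue                 # unselected and untracked: ignored
        if selected:
            large.add(item)          # selected again: mark as large flow
        counters[item] += 1          # assumption: counted in both cases
    big = {k: v for k, v in counters.items() if k in large}
    small = {k: v for k, v in counters.items() if k not in large}
    return big, small


def large_flow_contribution(big, log=math.log2):
    """Fifth formula: S_e = sum over large flows of E_t * log(E_t)."""
    return sum(c * log(c) for c in big.values())


if __name__ == "__main__":
    big, small = sieve_pass(["a", "a", "b"], sample_rate=1.0)
    print(big, small, large_flow_contribution(big))
```

With a sampling rate of 1.0 every packet is selected, so any value seen twice is marked large; with a rate of 0.0 no counter is ever created.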
8. A median-based network traffic entropy estimation apparatus, characterized by comprising:
A first acquisition module, configured to obtain the packets of a network data stream and send them to each worker node of a Storm cluster;
A receiving module, configured to receive the estimated value of the intermediate term S returned by each worker node of the Storm cluster, the estimated value of S being obtained by each worker node of the Storm cluster from the received packets by a preset entropy estimation algorithm;
A sorting module, configured to sort the received estimated values of S by size;
A second acquisition module, configured to obtain the median of all the sorted estimated values of S and take the median as the final estimated value $\tilde{S}$ of S;
A third acquisition module, configured to obtain, from $\tilde{S}$, the estimated value $\tilde{H}$ of the entropy of the network traffic of the network data stream;
Wherein the Storm cluster comprises at least one worker node.
9. The apparatus according to claim 8, characterized in that the intermediate term S is calculated by a first formula, the first formula being:
$$S = \frac{1}{m}\sum_{i} m_i \log m_i;$$
Wherein the packets in the network data stream are denoted by i, $m_i$ is the frequency with which packet i occurs in the network data stream, $i \in \{1, 2, \dots, 2^{4b}\}$, and b is the byte length of a single packet in the network data stream.
10. The apparatus according to claim 9, characterized in that the fourth acquisition module is specifically configured to
obtain, from $\tilde{S}$ and by a second formula, the estimated value $\tilde{H}$ of the entropy of the network traffic;
The second formula is:
$$\tilde{H} = \log(m) - \tilde{S},$$
Wherein m is the total length of the network data stream, and m is calculated by a third formula;
The third formula is:
$$m = \sum_{i=1}^{n} m_i,$$
where n is a positive integer.
CN201510816499.8A 2015-11-23 2015-11-23 Network flow entropy evaluation method based on median and device Active CN105471639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510816499.8A CN105471639B (en) 2015-11-23 2015-11-23 Network flow entropy evaluation method based on median and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510816499.8A CN105471639B (en) 2015-11-23 2015-11-23 Network flow entropy evaluation method based on median and device

Publications (2)

Publication Number Publication Date
CN105471639A true CN105471639A (en) 2016-04-06
CN105471639B CN105471639B (en) 2018-07-27

Family

ID=55608953

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510816499.8A Active CN105471639B (en) 2015-11-23 2015-11-23 Network flow entropy evaluation method based on median and device

Country Status (1)

Country Link
CN (1) CN105471639B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645884A (en) * 2009-08-26 2010-02-10 西安理工大学 Multi-measure network abnormity detection method based on relative entropy theory
US7773538B2 (en) * 2008-05-16 2010-08-10 At&T Intellectual Property I, L.P. Estimating origin-destination flow entropy
CN101854404A (en) * 2010-06-04 2010-10-06 中国科学院计算机网络信息中心 Method and device for detecting anomaly of domain name system
CN102035698A (en) * 2011-01-06 2011-04-27 西北工业大学 HTTP tunnel detection method based on decision tree classification algorithm
CN102271091A (en) * 2011-09-06 2011-12-07 电子科技大学 Method for classifying network abnormal events

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7773538B2 (en) * 2008-05-16 2010-08-10 At&T Intellectual Property I, L.P. Estimating origin-destination flow entropy
CN101645884A (en) * 2009-08-26 2010-02-10 西安理工大学 Multi-measure network abnormity detection method based on relative entropy theory
CN101854404A (en) * 2010-06-04 2010-10-06 中国科学院计算机网络信息中心 Method and device for detecting anomaly of domain name system
CN102035698A (en) * 2011-01-06 2011-04-27 西北工业大学 HTTP tunnel detection method based on decision tree classification algorithm
CN102271091A (en) * 2011-09-06 2011-12-07 电子科技大学 Method for classifying network abnormal events

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZIYU WANG, JIAHAI YANG, FULIANG LI: "An On-line Anomaly Detection Method Based on A New Stationary Metric—Entropy-Ratio", 《2014 IEEE 13TH INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS》 *

Also Published As

Publication number Publication date
CN105471639B (en) 2018-07-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant