CN113949647B

CN113949647B - Host base number estimation method based on artificial neural network

Info

Publication number: CN113949647B
Application number: CN202111191513.1A
Authority: CN
Inventors: 徐杰; 鞠奥; 张玉健; 兰浩良; 印杰; 夏玲玲; 王群; 刘家银; 梁广俊; 诸葛程晨; 郭向民; 倪雪莉; 马如坡
Original assignee: JIANGSU POLICE INSTITUTE
Current assignee: JIANGSU POLICE INSTITUTE
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2023-08-29
Anticipated expiration: 2041-10-13
Also published as: CN113949647A

Abstract

The invention discloses a host cardinality ("base number") estimation method based on an artificial neural network, comprising the following steps: (1) scanning the IP addresses in all IP packets within a time window and estimating the cardinality of each intranet host; (2) sorting and grouping the hosts, and constructing a set of sampling hosts for each group; (3) constructing a training attribute set trainX and a cardinality estimation deviation set trainY for the sampling hosts; (4) training an artificial neural network with the sets trainX and trainY; (5) using the trained artificial neural network to predict the cardinality estimation error of each intranet host and adjusting the host cardinality estimate accordingly. The invention constructs a sampling IP sequence, uses the data produced during host cardinality estimation together with the cardinality estimate as the attributes for correction, learns the relationship between the estimate and the actual cardinality with an artificial neural network, and adjusts the estimate with the learned model, thereby further improving the accuracy of host cardinality estimation.

Description

Host base number estimation method based on artificial neural network

Technical Field

The invention relates to a network measurement technology, in particular to a host base number estimation method based on an artificial neural network.

Background

The host cardinality is defined as the number of distinct other hosts that communicate with an intranet host within a time window; measuring it is an important part of current network measurement. In the prior art, the host cardinality is estimated with the Double Connection Degree Sketch (DCDS) algorithm: all linear estimators (LEs) corresponding to the host are first located, the LEs are merged with a bitwise AND operation, and the host cardinality is estimated from the merged LE according to the LE estimation principle. The DCDS algorithm avoids storing in memory all peer IP information of the hosts whose cardinality is to be calculated, reduces the number of memory accesses and the memory footprint, and can be applied to super-point detection in high-speed networks. However, the estimate produced by the DCDS algorithm is affected by random factors and can exhibit large errors. Factors affecting its accuracy include the distribution characteristics of the network traffic, the hash parameters used by the algorithm, and so on. The DCDS algorithm does not account for these random factors and therefore cannot guarantee the accuracy of its estimates.

Disclosure of Invention

The invention aims to: in view of the above problems, an object of the present invention is to provide a host base estimation method based on an artificial neural network, which improves the accuracy of network host base estimation by using the artificial neural network.

The technical scheme is as follows: the invention discloses a host base number estimation method based on an artificial neural network, which comprises the following steps:

(1) Scanning the IP addresses in all IP packets within a time window, and estimating the cardinality of each intranet host;

(2) Sorting and grouping the hosts, and constructing a set of sampling hosts for each group;

(3) Constructing a training attribute set trainX and a cardinality estimation deviation set trainY for the sampling hosts;

(4) Training an artificial neural network with the sets trainX and trainY;

(5) Using the trained artificial neural network to predict the cardinality estimation error of each intranet host, and adjusting the host cardinality estimate.

Further, in step (1), the linear estimator LE is updated according to the intranet IP address and the extranet IP address, and all the IP addresses appearing in the current time window are recorded;

estimating the cardinality of each host in the intranet includes:

all LEs that record the cardinality of a host aip are merged with a bitwise AND operation, the merged linear estimator is denoted ule, the cardinality of host aip is estimated from ule according to the LE cardinality calculation formula, and both ule and the cardinality estimate are saved.
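The merge-and-estimate step can be sketched as follows. The text refers only to "an LE cardinality calculation formula", so this sketch assumes the standard linear-counting estimator, -m * ln(z / m) for an LE of m bits with z zero bits. The function names `merge_les` and `le_estimate` and the 8-bit sample vectors are hypothetical.

```python
import math

def merge_les(les):
    """Bitwise-AND a list of equal-length LE bit vectors into one vector (ule)."""
    ule = list(les[0])
    for le in les[1:]:
        ule = [a & b for a, b in zip(ule, le)]
    return ule

def le_estimate(ule):
    """Assumed linear-counting estimate: -m * ln(z / m), z = number of zero bits."""
    m = len(ule)
    z = ule.count(0)
    if z == 0:
        raise ValueError("LE is saturated; the estimate is undefined")
    return -m * math.log(z / m)

# Three hypothetical 8-bit LEs; only bits set in all of them survive the AND.
les = [
    [1, 0, 1, 0, 0, 1, 1, 0],
    [1, 1, 1, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1, 1],
]
ule = merge_les(les)          # [1, 0, 1, 0, 0, 0, 1, 0]
estimate = le_estimate(ule)   # 5 zero bits out of 8
```

Because the LEs are shared between hosts, the AND keeps only the bits set in every LE of the host, which is why bits contributed by other hosts tend to drop out.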

Further, network A and network B communicate through router R; the intranet hosts whose cardinality is to be estimated belong to network A, and their peer hosts belong to network B. A sampling host is a reference host with a known cardinality: it is a randomly generated host that does not belong to network A, and the peer IPs of each sampling host are distinct IP addresses randomly selected from network B;

in step (2), ordering and grouping each host, and constructing a sampling host set includes:

(201) Let AIP denote the set of intranet hosts and BIP the set of peer hosts. Count the number n of IP addresses in AIP, and sort the hosts in AIP by cardinality estimate in ascending order to obtain the set AIP' = {aip'_1, aip'_2, aip'_3, ..., aip'_n}. Let AIP'[i] denote the i-th host in AIP', i.e. aip'_i, and let EC(aip'_i) denote the cardinality estimate of host aip'_i, so that EC(aip'_i) ≤ EC(aip'_{i+1}). Let minEC denote the minimum cardinality estimate in AIP; then minEC = EC(AIP'[1]);

(202) Grouping the hosts in the set AIP'; each group is called a cardinality estimate group and is denoted ECG. Let ECG[j] denote the j-th group. Each group has a lower bound and an upper bound, and bt[j] and tp[j] denote the lower and upper bounds of ECG[j], respectively. The number of cardinality values covered by a group is called its length, denoted egL, and the length, upper bound and lower bound are related by: egL = tp - bt + 1;

(203) Setting the cardinality estimate group number j = 1 and the current AIP' host index i = 1; the lower and upper bounds of the current group are: bt[j] = minEC, tp[j] = bt[j] + egL - 1;

(204) Reading the cardinality estimate of the current host AIP'[i], saving it to ec_a, and comparing it with the upper bound of the current group: if ec_a > tp[j], go to step (206); otherwise, assign the current host AIP'[i] to the current cardinality estimate group ECG[j];

(205) Updating the index number of the current host AIP' [ i ], namely i=i+1; if i > n, go to step (206), otherwise go to step (204);

(206) Generating the sampling IP sequence SIP[j] of ECG[j]; if i > n, go to step (207); otherwise, updating the group number, namely j = j + 1, and setting the lower and upper bounds of the current group to: bt[j] = the cardinality estimate of AIP'[i], tp[j] = bt[j] + egL - 1; adding the current host AIP'[i] to ECG[j], and going to step (205);

(207) All ECG and SIP are returned.
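Steps (201) to (207) amount to a single pass over the sorted hosts that opens a new group whenever an estimate exceeds the current upper bound. The sketch below shows only the grouping part and omits the generation of SIP[j] in step (206). The function name `group_hosts` and the dictionary input are hypothetical; the values are those of the worked example in the detailed description, with egL = 4.

```python
def group_hosts(estimates, egL):
    """Group hosts by cardinality estimate, loosely following steps (201)-(207).

    estimates: dict mapping host name -> cardinality estimate.
    egL: group length, so each group covers the values [bt, bt + egL - 1].
    Returns a list of (bt, tp, hosts) tuples, one per ECG.
    """
    # (201) sort hosts by estimate in ascending order
    ordered = sorted(estimates.items(), key=lambda kv: kv[1])
    groups = []
    tp = None
    for host, ec in ordered:
        # (204)/(206) open a new group when the estimate exceeds the upper bound
        if tp is None or ec > tp:
            bt, tp = ec, ec + egL - 1
            groups.append((bt, tp, []))
        groups[-1][2].append(host)
    return groups

estimates = {"a1": 3, "a2": 4, "a3": 3, "a4": 5, "a5": 9,
             "a6": 10, "a7": 13, "a8": 9, "a9": 10, "a10": 15}
groups = group_hosts(estimates, egL=4)
summary = [(bt, tp, len(hosts)) for bt, tp, hosts in groups]
```

Note that the lower bound of each new group is the estimate of the first host that did not fit the previous group, so empty ranges between groups (here 7 and 8) produce no group at all.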

Further, in step (206), generating the sampled IP sequence SIP [ j ] of ECG [ j ] comprises the steps of:

(2061) Setting the actual cardinality src of the sampling host to bt[j] - SS, wherein SS represents the sampling step length;

(2062) Generating SN sampling hosts, each of which has src peer IP addresses randomly selected from the peer host set BIP, and adding the generated SN sampling hosts to SIP[j], where SN represents the number of samples;

(2063) Adding SS to the actual cardinality of the current sampling host to update it, namely src = src + SS; if the updated actual cardinality src > tp[j], return SIP[j]; otherwise, go to step (2062).
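A literal reading of steps (2061) to (2063) can be sketched as below. The function name `sample_sequence`, the representation of a sampling host as a `(src, peers)` pair, the `rng` parameter and the example address list are all hypothetical, and skipping non-positive cardinalities is an added assumption. The worked example in the detailed description also lists a value one step above tp[j] (7 for the first group), which this literal reading does not generate.

```python
import random

def sample_sequence(bt, tp, SS, SN, BIP, rng=None):
    """Build SIP[j] for one group from its bounds bt and tp.

    Each sampling host is a (src, peers) pair: its known cardinality and
    src distinct peer addresses drawn at random from BIP.
    """
    rng = rng or random.Random()
    sip = []
    src = bt - SS                   # (2061) start one step below the lower bound
    while src <= tp:                # (2063) stop once src exceeds the upper bound
        if src > 0:                 # assumption: skip non-positive cardinalities
            for _ in range(SN):     # (2062) SN sampling hosts per cardinality
                sip.append((src, rng.sample(BIP, src)))
        src += SS
    return sip

# Hypothetical peer address pool standing in for network B.
BIP = [f"198.51.100.{k}" for k in range(1, 51)]
sip = sample_sequence(bt=3, tp=6, SS=2, SN=3, BIP=BIP, rng=random.Random(0))
cardinalities = sorted({src for src, _ in sip})
```

Because every sampling host is built with a known number of distinct peers, its actual cardinality is available as ground truth when the deviation set is constructed later.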

Further, in step (3), constructing a training attribute set and a cardinal estimation deviation set for each sampling host includes:

(301) Sequentially scanning the sampling hosts in all sampling IP sequences SIP, extracting the IP address sa, the actual cardinality rc and the peer IP set OIP of the current sampling host, looking up the r LEs corresponding to sampling host sa in the matrix LEA, merging the r LEs with a bitwise AND operation, and storing the result in ule;

(302) Mapping each peer host bip in the set OIP to one bit of ule and setting that bit to 1; calculating the cardinality estimate ec of the sampling host from ule, and adding the vector [ec, ule] to the training attribute set trainX; calculating the deviation between the actual cardinality rc and the estimate ec of the sampling host, namely rc - ec, and adding the deviation [rc - ec] to the set trainY;

(303) Returning the trainX and the trainY of all the sampling hosts;

the matrix LEA is a matrix formed by r rows and c columns of LE and comprises opposite-end information of an intranet host.
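The construction of trainX and trainY in steps (301) to (303) can be sketched as follows. Several parts are assumptions made only for illustration: the hash that maps a peer address to a bit (`bit_index`, based on SHA-256), the `lookup_les` callback that stands in for the lookup in the matrix LEA, the linear-counting formula used for ec, and the flattening of the vector [ec, ule] into one list.

```python
import hashlib
import math

def bit_index(peer_ip, m):
    """Map a peer address to one of m bit positions (hypothetical hash choice)."""
    digest = hashlib.sha256(peer_ip.encode()).digest()
    return int.from_bytes(digest[:4], "big") % m

def build_training_sets(samples, lookup_les):
    """samples: iterable of (sa, rc, OIP); lookup_les(sa) returns the r LEs of sa."""
    trainX, trainY = [], []
    for sa, rc, oip in samples:
        les = lookup_les(sa)
        # (301) merge the r LEs with a bitwise AND
        ule = [int(all(bits)) for bits in zip(*les)]
        # (302) set the bit of every peer, then estimate the cardinality
        for bip in oip:
            ule[bit_index(bip, len(ule))] = 1
        m, z = len(ule), ule.count(0)
        ec = -m * math.log(z / m)   # assumed linear-counting formula; needs z > 0
        trainX.append([ec] + ule)   # attribute vector [ec, ule], flattened
        trainY.append(rc - ec)      # deviation: actual minus estimated
    return trainX, trainY

# One hypothetical sampling host with three peers and three empty 16-bit LEs.
samples = [("192.0.2.1", 3, ["198.51.100.1", "198.51.100.2", "198.51.100.3"])]
trainX, trainY = build_training_sets(samples, lambda sa: [[0] * 16] * 3)
```

In a real run the LEs returned for sa would already contain bits set by intranet traffic, which is what lets the network learn how existing occupancy distorts the estimate.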

Further, in step (4), training the neural network using the sets trainX and trainY includes:

performing a one-dimensional convolution on each ule vector in the set trainX, then inputting the set trainY and the convolved set trainX into the artificial neural network as training data, and outputting the trained artificial neural network;

the artificial neural network comprises DenN Dense layers and DroN Dropout layers, and the parameters DenN, DroN and the number of nodes in each layer are set at the beginning of the algorithm.

Further, in step (5), predicting the intranet host cardinality estimation error and adjusting the cardinality estimate includes:

performing a one-dimensional convolution on the ule of each intranet host a, inputting the cardinality estimate of host a together with the ule convolution result into the artificial neural network model trained in step (4) to obtain the predicted cardinality error y', adding y' to the cardinality estimate ec of host a, and taking the adjusted value rc' = ec + y' as the cardinality estimate of host a.
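The adjustment itself is a single addition per host, as the sketch below illustrates. The trained network is replaced by a stand-in callable, `toy_model`, that returns a fixed hypothetical correction; the function name `adjust_estimates` and the input layout are likewise assumptions.

```python
def adjust_estimates(hosts, predict_deviation):
    """hosts: dict host -> (ec, conv_ule); returns dict host -> rc' = ec + y'."""
    adjusted = {}
    for host, (ec, conv_ule) in hosts.items():
        # y' is the deviation predicted from the estimate and the convolved ule
        y_pred = predict_deviation([ec] + list(conv_ule))
        adjusted[host] = ec + y_pred
    return adjusted

def toy_model(features):
    """Stand-in for the trained network: always predicts a deviation of -0.5."""
    return -0.5

result = adjust_estimates({"a1": (3.5, [0, 1, 2])}, toy_model)
```

Because the network is trained on rc - ec, its output is added to the estimate rather than subtracted from it.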

The beneficial effects are that: compared with the prior art, the invention has the remarkable advantages that: according to the invention, a sampling IP sequence is constructed, then, data used in the process of estimating the base number of the host and the base number estimated value are used as attributes used in the process of correcting the base number, the relation between the base number estimated value and the base number is learned by combining an artificial neural network algorithm, the base number estimated value is adjusted by utilizing a learned model, and the accuracy of the base number estimated result of the host is further improved.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a network model;

FIG. 3 is a flowchart for generating a sampling IP sequence;

FIG. 4 is a flowchart for generating a sampling IP sequence;

FIG. 5 is an LE corresponding to an example sampling host sa;

FIG. 6 is a ULE after the sample host sa updates the peer IP;

FIG. 7 is a graph showing the comparison of the experimental results before and after adjustment.

Detailed Description

The host base estimation method based on the artificial neural network according to the embodiment, as shown in the flowchart of fig. 1, includes the following steps:

(1) Scanning the IP addresses in all IP packets within a time window, and estimating the cardinality of each intranet host. The IP addresses comprise intranet IP addresses and extranet IP addresses; the linear estimators LE are updated according to the intranet and extranet IP addresses, and all IP addresses appearing in the current time window are recorded. All LEs that record the cardinality of a host aip are merged with a bitwise AND operation, the merged linear estimator is denoted ule, the cardinality of host aip is estimated from ule according to the LE cardinality calculation formula, and both ule and the cardinality estimate are saved.

As shown in the network model of FIG. 2, network A and network B communicate through router R. The intranet hosts whose cardinality is to be estimated belong to network A (a1-a4 in the figure), and their peer hosts belong to network B (for example b1, b2, b3, b6 and b7). IP pairs are extracted, and the cardinality of each intranet host in every time window is obtained with the LE cardinality calculation formula. For example, the cardinalities of hosts a1 and a2 are 1 and 2 in time window 1, the cardinalities of hosts a1 and a3 are 2 and 1 in time window 2, and the cardinalities of hosts a3 and a4 are 2 and 1, respectively; the obtained cardinalities are saved. Sampling hosts are then constructed. A sampling host is a reference host with a known cardinality: it is a randomly generated host that does not belong to network A, and the peer IPs of each sampling host are distinct IP addresses randomly selected from network B.

(2) The hosts are sorted and grouped, and a set of sampling hosts is constructed for each group; the flow is shown in FIGS. 3-4:

(201) Let AIP denote the set of intranet hosts and BIP the set of peer hosts. Count the number n of IP addresses in AIP, and sort the hosts in AIP by cardinality estimate in ascending order to obtain the set AIP' = {aip'_1, aip'_2, aip'_3, ..., aip'_n}. Let AIP'[i] denote the i-th host in AIP', i.e. aip'_i, and let EC(aip'_i) denote the cardinality estimate of host aip'_i, so that EC(aip'_i) ≤ EC(aip'_{i+1}). Let minEC denote the minimum cardinality estimate in AIP; then minEC = EC(AIP'[1]);

(203) Setting the cardinality estimate packet number j=1, the current AIP' host index number i=1, and the current packet lower bound and upper bound are: bt [ j ] =minec, tp [ j ] =bt [ j ] + egL-1;

(204) Reading the cardinality estimate of the current host AIP'[i], saving it to ec_a, and comparing it with the upper bound of the current group: if ec_a > tp[j], go to step (206); otherwise, assign the current host AIP'[i] to the current cardinality estimate group ECG[j];

(207) All ECG and SIP are returned.

In step (206), generating a sampled IP sequence SIP [ j ] of ECG [ j ] comprises the steps of:

(303) Returning the trainX and the trainY of all the sampling hosts;

The implementation and effects of the method are further described by the following data. Assuming that there are 10 hosts a1-a10 of network A in a certain time window, table 1 lists each host and its cardinality estimate within the pre-ordering set AIP. After sorting from small to large, the corresponding cardinality estimates for each host in the collection AIP' are obtained, as shown in Table 2.

Table 1. Cardinality estimates of the hosts in network A

Network A host	a1	a2	a3	a4	a5	a6	a7	a8	a9	a10
Cardinality estimate	3	4	3	5	9	10	13	9	10	15

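Table 2. Hosts of network A sorted by cardinality estimate

Host in AIP'	a'1	a'2	a'3	a'4	a'5	a'6	a'7	a'8	a'9	a'10
Cardinality estimate	3	3	4	5	9	9	10	10	13	15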

The set AIP' is grouped with a group length egL of 4. The lower bound bt of group 1 is the cardinality estimate of a'1, i.e. bt[1] = 3, and the upper bound of group 1 is tp[1] = bt[1] + egL - 1 = 6. As Table 2 shows, 4 hosts have cardinality estimates between 3 and 6, so ECG[1] contains 4 hosts, namely {a'1, a'2, a'3, a'4}. The cardinality estimate 9 of a'5 is greater than tp[1], so it is taken as the lower bound of ECG[2], i.e. bt[2] = 9, which gives the upper bound tp[2] = bt[2] + egL - 1 = 12; ECG[2] contains the 4 hosts {a'5, a'6, a'7, a'8}. The cardinality estimate 13 of a'9 is taken as the lower bound of ECG[3], so bt[3] = 13 and tp[3] = 16, and ECG[3] contains the 2 hosts {a'9, a'10}. In total, the set AIP' is divided into 3 cardinality estimate groups.

After grouping is complete, a sampling IP sequence is generated for each group. Let the sampling step SS be 2 and the number of samples SN be 3. For the first group ECG[1], starting from the cardinality bt[1] - SS = 1, SIP[1] contains: 3 sampling hosts with an actual cardinality of 1, 3 sampling hosts with an actual cardinality of 3, 3 sampling hosts with an actual cardinality of 5, and 3 sampling hosts with an actual cardinality of 7.

For the second group ECG[2], starting from the cardinality bt[2] - SS = 7, SIP[2] contains: 3 sampling hosts with an actual cardinality of 7, 3 sampling hosts with an actual cardinality of 9, 3 sampling hosts with an actual cardinality of 11, and 3 sampling hosts with an actual cardinality of 13.

For the third group ECG[3], starting from the cardinality bt[3] - SS = 11, SIP[3] contains: 3 sampling hosts with an actual cardinality of 11, 3 sampling hosts with an actual cardinality of 13, 3 sampling hosts with an actual cardinality of 15, and 3 sampling hosts with an actual cardinality of 17.

After the sampling IP sequences are generated, a cardinality estimate and a merged LE are computed for each sampling host in them. Assume that the LE matrix LEA used by the algorithm consists of 3 rows and 6 columns of LEs, each LE containing 9 bits, and that the sampling host sa has an actual cardinality of 3 with 3 distinct peer IPs {b1, b2, b3}. The 3 LEs corresponding to sa in the LEA are shown in FIG. 5.

In FIG. 5, LE 1, LE 2 and LE 3 are the LEs corresponding to sa in the first, second and third rows of the LEA, respectively; ule is the LE obtained by merging LE 1, LE 2 and LE 3 with a bitwise AND. In ule one bit is 1 and all the others are 0. After the LEs are merged, ule is updated with {b1, b2, b3} according to the LE algorithm. As shown in FIG. 6, the updated ule has 6 zero bits, and according to the LE cardinality calculation formula the cardinality estimate is 3.6. The updated ule and the cardinality estimate are computed in this way for every sampling host, which yields the training data set.
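The value 3.6 is consistent with the standard linear-counting estimator, which is assumed here because the text does not spell out the LE cardinality calculation formula. With an LE of m = 9 bits, of which z = 6 are zero:

$$\hat{n} = -m \ln\frac{z}{m} = -9 \ln\frac{6}{9} \approx 3.65$$

The description reports this value as 3.6.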

After the training data set is obtained, the artificial neural network is trained: ule is first convolved, and the convolution result is then input to the neural network together with the cardinality estimate. Let the convolution kernel be a vector of three 1s and the convolution stride be 3. The ule convolution result of the sampling host sa is {0, 1, 2}. Subtracting the cardinality estimate 3.6 of sa from its actual cardinality 3 gives an estimation deviation of -0.6. The convolution result {0, 1, 2}, the cardinality estimate 3.6 and the estimation deviation -0.6 together constitute what sa contributes to the artificial neural network.
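The convolution in this example reduces a 9-bit ule to three counts, one per block of 3 bits. The sketch below uses a hypothetical bit pattern with three 1 bits, chosen only so that the output matches the result {0, 1, 2} given in the text; the actual ule of FIG. 6 may differ. The function name `conv1d` is also an assumption.

```python
def conv1d(bits, kernel, stride):
    """1-D convolution without padding, sliding the kernel by `stride` bits."""
    k = len(kernel)
    return [
        sum(b * w for b, w in zip(bits[i:i + k], kernel))
        for i in range(0, len(bits) - k + 1, stride)
    ]

# Hypothetical 9-bit ule with three 1 bits.
ule = [0, 0, 0, 0, 1, 0, 1, 0, 1]
features = conv1d(ule, kernel=[1, 1, 1], stride=3)   # one count per 3-bit block
```

With an all-ones kernel and a stride equal to the kernel length, each output is simply the number of 1 bits in a non-overlapping block, so the outputs always sum to the number of 1 bits in ule.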

After the artificial neural network is trained, ule is convolved for each host whose cardinality is to be estimated, using the one-dimensional convolution method of step (5). The convolution result and the original cardinality estimate are input together into the trained network to obtain the predicted deviation. The predicted deviation is added to the original cardinality estimate to obtain the adjusted cardinality estimate. The adjusted estimate is more accurate because the influence of random factors has been removed.

To illustrate the effect of the invention, a set of high-speed network data is used. The data comes from the WIDE project website (http://mawi.wide.ad.jp/mawi/), lasts 10 minutes, and starts at 13:00 on May 9, 2018; the data set is denoted WIDE 20180509.

The estimation deviation (the actual cardinality minus the cardinality estimate) reflects the accuracy of the estimation, and the present embodiment analyzes the results from this perspective. First, several metrics related to the estimation deviation are defined, where |AIP| denotes the number of hosts in AIP, rc_a denotes the actual cardinality of host a, and ec_a denotes the cardinality estimate of host a.

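Definition 1. Deviation rate of host a: $\dfrac{rc_a - ec_a}{rc_a}$

Definition 2. Average deviation rate: $\dfrac{1}{|AIP|} \sum_{a \in AIP} \dfrac{rc_a - ec_a}{rc_a}$

Definition 3. Average absolute deviation rate: $\dfrac{1}{|AIP|} \sum_{a \in AIP} \dfrac{\left| rc_a - ec_a \right|}{rc_a}$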

the deviation rate represents the ratio of the estimated deviation rc_a-ec_a of the host a divided by the actual radix rc_a, removes the effect of the actual radix on the estimated error, and can compare the estimated deviations of different radix values. A good radix estimate should be close to the actual radix value, i.e. the deviation rate should be close to 0, and a good radix estimation algorithm should also have the average deviation rate close to 0. However, not all radix estimation algorithms with average deviation rates close to 0 have high radix estimation accuracy. For different intranet hosts, the base estimation value may be higher than the actual base value or lower than the actual base value, so that the estimation deviation may be positive or negative. When the intranet host base estimated deviation value is averaged, the negative estimated deviation counteracts the positive estimated deviation, so that the average value of the estimated deviation is reduced. Therefore, the present embodiment uses the average absolute deviation rate, that is, the average value of the absolute values of the deviations, to compare the accuracy of the estimation results.
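The three metrics, and the cancellation effect described above, can be illustrated with a short sketch. The function names and the two (rc, ec) pairs are hypothetical and are not taken from the experiment.

```python
def deviation_rate(rc, ec):
    """Deviation rate of one host: (actual - estimate) / actual."""
    return (rc - ec) / rc

def average_deviation_rate(pairs):
    """Mean of the signed deviation rates over all (rc, ec) pairs."""
    return sum(deviation_rate(rc, ec) for rc, ec in pairs) / len(pairs)

def average_absolute_deviation_rate(pairs):
    """Mean of the absolute deviation rates over all (rc, ec) pairs."""
    return sum(abs(deviation_rate(rc, ec)) for rc, ec in pairs) / len(pairs)

# One overestimate and one underestimate of the same size.
pairs = [(10, 12), (10, 8)]
adr = average_deviation_rate(pairs)             # signed errors cancel out
aadr = average_absolute_deviation_rate(pairs)   # magnitude is preserved
```

Here both estimates are off by 20 percent, yet the average deviation rate is zero, which is why the average absolute deviation rate is the metric used for comparison.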

In this embodiment, an LE matrix of 4 rows and 2048 columns is used, each LE contains 1024 bits, the length of a cardinality estimate group is 100, the sampling step of the sampling IP sequence for each group is 5, and the number of samples per sampling point is 20. In the neural network used in the experiment, the convolution kernel applied to ule has a length of 8. The convolutional layer is followed by these hidden layers: a dense layer of 129 nodes with the Rectified Linear Unit (ReLU) activation function, a dropout layer with a drop rate of 0.1, a dense layer of 256 nodes with ReLU activation, a dropout layer with a drop rate of 0.1, a dense layer of 64 nodes with ReLU activation, and a dropout layer with a drop rate of 0.2. The artificial neural network was trained for 20 iterations, each comprising 10 sub-iterations. The experimental results are shown in FIG. 7.

Before adjustment, the average absolute deviation rate of the cardinality estimates is 0.037359. With the method of the invention, the average absolute deviation rate of the adjusted estimates is below 0.03 for every number of training iterations tested, and after 20 iterations it falls to 0.028133. The method therefore clearly improves the accuracy of host cardinality estimation.

Claims

1. The host base estimation method based on the artificial neural network is characterized by comprising the following steps of:

(5) According to the trained artificial neural network, predicting an intranet host base estimation error, and adjusting a host base estimation value;

estimating the cardinality of each host in the intranet includes:

combining all LEs recording the base number of a host aip in the linear estimator LE according to a bit AND method, recording the combined linear estimator LE as ule, estimating the base number of the host aip from ule according to an LE base number calculation formula, and storing ule and a base number estimation value;

in step (2), ordering and grouping each host, and grouping to construct a sampling host set includes:

(201) Let AIP denote the set of intranet hosts and BIP the set of peer hosts; count the number n of IP addresses in AIP, and sort the hosts in AIP by cardinality estimate in ascending order to obtain the set AIP' = {aip'_1, aip'_2, aip'_3, ..., aip'_n}; let AIP'[i] denote the i-th host in AIP', i.e. aip'_i, and let EC(aip'_i) denote the cardinality estimate of host aip'_i, so that EC(aip'_i) ≤ EC(aip'_{i+1}); let minEC denote the minimum cardinality estimate in AIP, then minEC = EC(AIP'[1]);

(202) Grouping hosts in the set AIP', wherein each group is marked as a base estimation value group and is represented by ECG; let ECG [ j ] denote the jth ECG, each ECG comprising a lower bound and an upper bound, bt [ j ] and tp [ j ] denote the ECG [ j ] lower bound and upper bound, respectively, the number of cardinal values included in the ECG being called the length of the ECG, denoted egL [ j ], the relationship between cardinal estimate packet length, upper bound, lower bound being: egL [ j ] = tp [ j ] -bt [ j ] +1;

(203) Setting the cardinality estimate packet number j=1, the current AIP' host index number i=1, and the current packet lower bound and upper bound are: bt [ j ] =minec, tp [ j ] =bt [ j ] + egL [ j ] -1;

(206) Generating the sampling IP sequence SIP[j] of ECG[j]; if i > n, go to step (207); otherwise, updating the group number, namely j = j + 1, and setting the lower and upper bounds of the current group to: bt[j] = the cardinality estimate of AIP'[i], tp[j] = bt[j] + egL[j] - 1; adding the current host AIP'[i] to ECG[j], and going to step (205);

(207) Returning all ECG and SIP;

(2061) Setting a sampling host parameter src as bt [ j ] -SS, wherein SS represents a sampling step length;

(2063) Adding SS to the current sampling host parameter src to update it, namely src = src + SS; if the updated parameter src > tp[j], return SIP[j]; otherwise, go to step (2062);

in step (3), constructing a training attribute set and a radix estimation bias set of each sampling host includes:

(301) Sequentially scanning the sampling hosts in all sampling IP sequences SIP, extracting the IP address sa, the actual cardinality rc and the peer IP set OIP of the current sampling host, looking up the r LEs corresponding to sampling host sa in the matrix LEA, merging the r LEs with a bitwise AND operation, and storing the result in ule;

(302) Mapping each peer host bip in the set OIP to one bit of ule and setting that bit to 1; calculating the cardinality estimate ec of the sampling host from ule, and adding the vector [ec, ule] to the training attribute set trainX; calculating the deviation between the actual cardinality rc and the cardinality estimate ec of the sampling host, namely rc - ec, and adding the deviation [rc - ec] to the set trainY;

(303) Returning the trainX and the trainY of all the sampling hosts;

the matrix LEA is a matrix formed by r rows and c columns of LE and comprises opposite-end information of an intranet host;

in step (4), training the artificial neural network using the sets trainX and trainY includes:

the artificial neural network comprises DenN Dense layers and DroN Dropout layers, and the parameters DenN, DroN and the number of nodes in each layer are set at the beginning of the algorithm;

in step (5), predicting the intranet IP radix estimation error and adjusting the radix estimation value includes:

performing a one-dimensional convolution on the ule of each intranet host, inputting the cardinality estimate of the intranet host together with the ule convolution result into the artificial neural network model trained in step (4) to obtain the predicted cardinality error y', adding y' to the cardinality estimate ec of the intranet host, and taking the adjusted value rc' = ec + y' as the cardinality estimate of the intranet host.

2. The host base estimation method according to claim 1, wherein in step (1), the linear estimator LE is updated based on the intranet IP address and the extranet IP address, and all IP addresses present in the current time window are recorded.

3. The host base estimation method according to claim 2, wherein network A and network B communicate through router R, the intranet hosts whose cardinality is to be estimated belong to network A, and the peer hosts of the intranet hosts belong to network B; a sampling host is a reference host with a known cardinality, is a randomly generated host that does not belong to network A, and the peer IPs of each sampling host are distinct IP addresses randomly selected from network B.