CN113949647B - Host base number estimation method based on artificial neural network - Google Patents

Host base number estimation method based on artificial neural network

Info

Publication number
CN113949647B
Authority
CN
China
Prior art keywords
host
aip
sampling
base
intranet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111191513.1A
Other languages
Chinese (zh)
Other versions
CN113949647A (en)
Inventor
徐杰
鞠奥
张玉健
兰浩良
印杰
夏玲玲
王群
刘家银
梁广俊
诸葛程晨
郭向民
倪雪莉
马如坡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU POLICE INSTITUTE
Original Assignee
JIANGSU POLICE INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU POLICE INSTITUTE
Priority to CN202111191513.1A
Publication of CN113949647A
Application granted
Publication of CN113949647B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a host cardinality estimation method based on an artificial neural network, comprising the following steps: (1) scanning the IP addresses in all IP packets within a time window and estimating the cardinality of each intranet host; (2) sorting and grouping the hosts, and constructing a set of sampling hosts group by group; (3) constructing a training attribute set trainX and a cardinality estimation deviation set trainY from the sampling hosts; (4) training an artificial neural network with the sets trainX and trainY; (5) predicting the cardinality estimation error of each intranet host with the trained artificial neural network and adjusting the host's cardinality estimate accordingly. The invention constructs sampling IP sequences and then uses the data produced while estimating host cardinality, together with the cardinality estimates themselves, as the attributes for correction. An artificial neural network learns the relationship between the cardinality estimate and the actual cardinality, and the learned model is used to adjust the estimates, which further improves the accuracy of host cardinality estimation.

Description

Host base number estimation method based on artificial neural network
Technical Field
The invention relates to network measurement technology, and in particular to a host cardinality estimation method based on an artificial neural network.
Background
The host cardinality (the "host base number" of the title) is defined as the number of distinct other hosts that communicate with an intranet host within a time window; measuring it is an important part of current network measurement. In the prior art, host cardinality is estimated with the Double Connection Degree Sketch (DCDS) algorithm: all linear estimators (LEs) corresponding to a host are located first, the LEs are then merged with a bitwise AND, and the host's cardinality is estimated from the merged LE according to the LE estimation principle. The DCDS algorithm avoids storing in memory all peer IP information of every host whose cardinality is to be computed, which reduces the number of memory accesses and the memory footprint, so it can be applied to super-point detection in high-speed networks. However, DCDS estimates are affected by random factors and can show large errors. Factors that affect the accuracy of a DCDS estimate include the distribution of the network traffic and the hash parameters used by the algorithm. DCDS does not account for these random factors and therefore cannot guarantee the accuracy of its estimates.
Disclosure of Invention
Purpose of the invention: in view of the above problems, the invention aims to provide a host cardinality estimation method based on an artificial neural network, which uses the neural network to improve the accuracy of host cardinality estimation.
Technical solution: the host cardinality estimation method based on an artificial neural network according to the invention comprises the following steps:
(1) Scanning the IP addresses in all IP packets within a time window and estimating the cardinality of each intranet host;
(2) Sorting and grouping the hosts, and constructing a set of sampling hosts group by group;
(3) Constructing a training attribute set trainX and a cardinality estimation deviation set trainY from the sampling hosts;
(4) Training an artificial neural network with the sets trainX and trainY;
(5) Predicting the cardinality estimation error of each intranet host with the trained artificial neural network and adjusting the host's cardinality estimate.
Further, in step (1), the linear estimators (LEs) are updated according to the intranet IP address and the extranet IP address of each packet, and all IP addresses appearing in the current time window are recorded;
estimating the cardinality of each intranet host includes:
merging, with a bitwise AND, all LEs that record the cardinality of host aip; denoting the merged linear estimator as ule; estimating the cardinality of host aip from ule according to the LE cardinality formula; and saving ule together with the cardinality estimate.
Further, network A and network B communicate through router R; the intranet hosts whose cardinality is to be estimated belong to network A, and their peer hosts belong to network B. A sampling host is a reference host with a known cardinality: it is a randomly generated host outside network A, and its peer IPs are distinct IP addresses randomly selected from network B;
in step (2), sorting and grouping the hosts and constructing the sampling host set includes:
(201) Let AIP denote the set of intranet hosts and BIP the set of peer hosts. Count the number n of IP addresses in AIP, and sort the hosts in AIP in ascending order of cardinality estimate to obtain the set $AIP' = \{aip'_1, aip'_2, aip'_3, \ldots, aip'_n\}$. Let AIP′[i] denote the i-th host in AIP′, i.e. $aip'_i$, and let $EC(aip'_i)$ denote the cardinality estimate of host $aip'_i$, so that $EC(aip'_i) \le EC(aip'_{i+1})$. Let minEC denote the minimum cardinality estimate in AIP; then minEC = EC(AIP′[1]);
(202) Group the hosts in the set AIP′. Each group is called a cardinality estimate group and is denoted ECG; ECG[j] denotes the j-th group. Each ECG has a lower bound and an upper bound, with bt[j] and tp[j] denoting the lower and upper bounds of ECG[j]. The number of cardinality values covered by an ECG is called its length, denoted egL, and the length, upper bound and lower bound are related by egL = tp - bt + 1;
(203) Set the group number j = 1 and the index of the current host in AIP′ to i = 1; the lower and upper bounds of the current group are bt[j] = minEC and tp[j] = bt[j] + egL - 1;
(204) Read the cardinality estimate of the current host AIP′[i] and save it to ec_a; compare it with the upper bound of the current group; if ec_a > tp[j], go to step (206); otherwise, assign the current host AIP′[i] to the current group ECG[j];
(205) Increment the index of the current host, i.e. i = i + 1; if i > n, go to step (206); otherwise, go to step (204);
(206) Generate the sampling IP sequence SIP[j] of ECG[j]; if i > n, go to step (207); otherwise, increment the group number, i.e. j = j + 1, and set the lower and upper bounds of the new current group to bt[j] = EC(AIP′[i]) and tp[j] = bt[j] + egL - 1; add the current host AIP′[i] to ECG[j] and go to step (205);
(207) Return all ECG and SIP.
Further, in step (206), generating the sampling IP sequence SIP[j] of ECG[j] comprises the following steps:
(2061) Set the actual cardinality src of the sampling hosts to bt[j] - SS, where SS denotes the sampling step;
(2062) Generate SN sampling hosts, each with src peer IP addresses randomly selected from the peer host set BIP, and add the SN generated sampling hosts to SIP[j], where SN denotes the number of samples;
(2063) Add SS to the current actual cardinality, i.e. src = src + SS; if the updated src > tp[j], return SIP[j]; otherwise, go to step (2062).
Further, in step (3), constructing the training attribute set and the cardinality estimation deviation set from the sampling hosts includes:
(301) Scan the sampling hosts in all sampling IP sequences SIP in order; for the current sampling host, extract its IP address sa, its actual cardinality rc and its peer IP set OIP; look up the r LEs corresponding to sa in the matrix LEA, merge them with a bitwise AND, and store the result in ule;
(302) Map each peer host bip in OIP to one bit of ule and set that bit to 1; compute the cardinality estimate ec of the sampling host from ule and add the vector [ec, ule] to the training attribute set trainX; compute the deviation between the actual cardinality rc and the estimate ec, i.e. rc - ec, and add [rc - ec] to the set trainY;
(303) Return trainX and trainY for all sampling hosts;
the matrix LEA is a matrix of LEs with r rows and c columns that holds the peer information of the intranet hosts.
Further, in step (4), training the neural network with the sets trainX and trainY includes:
performing a one-dimensional convolution on each ule vector in trainX, then feeding trainY and the convolved trainX to the artificial neural network as training data, and outputting the trained network;
the artificial neural network comprises DenN Dense layers and DroN Dropout layers; the parameters DenN and DroN and the number of nodes in each layer are set when the algorithm starts.
Further, in step (5), predicting the cardinality estimation error of an intranet host and adjusting its cardinality estimate includes:
performing a one-dimensional convolution on the ule of each intranet host a; feeding the cardinality estimate of host a together with the convolved ule into the network trained in step (4) to obtain the predicted cardinality estimation error y′; and adding y′ to the cardinality estimate ec of host a, so that the adjusted value rc′ = ec + y′ is taken as the cardinality estimate of host a.
Beneficial effects: compared with the prior art, the invention has the following notable advantage. It constructs sampling IP sequences and then uses the data produced while estimating host cardinality, together with the cardinality estimates themselves, as the attributes for correction. An artificial neural network learns the relationship between the cardinality estimate and the actual cardinality, and the learned model is used to adjust the estimates, which further improves the accuracy of host cardinality estimation.
Drawings
FIG. 1 is a flowchart of the method of the invention;
FIG. 2 shows the network model;
FIG. 3 is a flowchart for generating a sampling IP sequence;
FIG. 4 is a flowchart for generating a sampling IP sequence;
FIG. 5 shows the LEs corresponding to an example sampling host sa;
FIG. 6 shows ule after it is updated with the peer IPs of the sampling host sa;
FIG. 7 compares the experimental results before and after adjustment.
Detailed Description
The host cardinality estimation method based on an artificial neural network according to this embodiment, shown in the flowchart of FIG. 1, includes the following steps:
(1) Scan the IP addresses in all IP packets within a time window and estimate the cardinality of each intranet host. The IP addresses comprise intranet IP addresses and extranet IP addresses; the linear estimators (LEs) are updated according to the intranet IP address and the extranet IP address of each packet, and all IP addresses appearing in the current time window are recorded. All LEs that record the cardinality of host aip are merged with a bitwise AND, the merged linear estimator is denoted ule, the cardinality of host aip is estimated from ule according to the LE cardinality formula, and ule is saved together with the cardinality estimate.
As shown in the network model of FIG. 2, network A and network B communicate through router R. The intranet hosts whose cardinality is to be estimated belong to network A (a1-a4 in the figure), and their peer hosts belong to network B (for example b1, b2, b3, b6 and b7). IP pairs are extracted, and the cardinality of each intranet host in every time window is obtained according to the LE cardinality formula. For example, in time window 1 the cardinalities of hosts a1 and a2 are 1 and 2; in time window 2 the cardinalities of hosts a1 and a3 are 2 and 1; and the cardinalities of hosts a3 and a4 are 2 and 1, respectively. The resulting cardinalities are saved. Sampling hosts are then constructed: a sampling host is a reference host with a known cardinality, randomly generated outside network A, and its peer IPs are distinct IP addresses randomly selected from network B.
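To make the quantity being estimated concrete, the exact per-window cardinality can be computed with plain sets. This is the memory-hungry baseline that the LE-based estimate approximates, not the method of the invention. The packet list, the timestamps and the 60-second window length are hypothetical; they are chosen only so that the counts match the first two time windows described above.

```python
from collections import defaultdict

def exact_cardinalities(packets, window_len):
    """Return {window_index: {intranet_host: number_of_distinct_peers}}."""
    peers = defaultdict(lambda: defaultdict(set))
    for ts, aip, bip in packets:
        peers[int(ts // window_len)][aip].add(bip)
    return {w: {a: len(s) for a, s in hosts.items()} for w, hosts in peers.items()}

# Hypothetical packets: (timestamp in seconds, intranet host, peer host)
packets = [
    (1, "a1", "b1"), (2, "a1", "b1"),      # a repeated peer is counted once
    (3, "a2", "b2"), (4, "a2", "b3"),
    (61, "a1", "b1"), (62, "a1", "b6"), (63, "a3", "b7"),
]
result = exact_cardinalities(packets, window_len=60)
# result == {0: {"a1": 1, "a2": 2}, 1: {"a1": 2, "a3": 1}}
```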
(2) Sort and group the hosts, and construct the sampling host set group by group; the flow is shown in FIGS. 3-4:
(201) Let AIP denote the set of intranet hosts and BIP the set of peer hosts. Count the number n of IP addresses in AIP, and sort the hosts in AIP in ascending order of cardinality estimate to obtain the set $AIP' = \{aip'_1, aip'_2, aip'_3, \ldots, aip'_n\}$. Let AIP′[i] denote the i-th host in AIP′, i.e. $aip'_i$, and let $EC(aip'_i)$ denote the cardinality estimate of host $aip'_i$, so that $EC(aip'_i) \le EC(aip'_{i+1})$. Let minEC denote the minimum cardinality estimate in AIP; then minEC = EC(AIP′[1]);
(202) Group the hosts in the set AIP′. Each group is called a cardinality estimate group and is denoted ECG; ECG[j] denotes the j-th group. Each ECG has a lower bound and an upper bound, with bt[j] and tp[j] denoting the lower and upper bounds of ECG[j]. The number of cardinality values covered by an ECG is called its length, denoted egL, and the length, upper bound and lower bound are related by egL = tp - bt + 1;
(203) Set the group number j = 1 and the index of the current host in AIP′ to i = 1; the lower and upper bounds of the current group are bt[j] = minEC and tp[j] = bt[j] + egL - 1;
(204) Read the cardinality estimate of the current host AIP′[i] and save it to ec_a; compare it with the upper bound of the current group; if ec_a > tp[j], go to step (206); otherwise, assign the current host AIP′[i] to the current group ECG[j];
(205) Increment the index of the current host, i.e. i = i + 1; if i > n, go to step (206); otherwise, go to step (204);
(206) Generate the sampling IP sequence SIP[j] of ECG[j]; if i > n, go to step (207); otherwise, increment the group number, i.e. j = j + 1, and set the lower and upper bounds of the new current group to bt[j] = EC(AIP′[i]) and tp[j] = bt[j] + egL - 1; add the current host AIP′[i] to ECG[j] and go to step (205);
(207) Return all ECG and SIP.
In step (206), generating the sampling IP sequence SIP[j] of ECG[j] comprises the following steps:
(2061) Set the actual cardinality src of the sampling hosts to bt[j] - SS, where SS denotes the sampling step;
(2062) Generate SN sampling hosts, each with src peer IP addresses randomly selected from the peer host set BIP, and add the SN generated sampling hosts to SIP[j], where SN denotes the number of samples;
(2063) Add SS to the current actual cardinality, i.e. src = src + SS; if the updated src > tp[j], return SIP[j]; otherwise, go to step (2062).
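Steps (2061)-(2063) can be sketched as a short loop. The sketch follows the steps literally, so it stops before the first value above tp[j]; the worked example later in the description also lists that next value (for example 7 when tp[1] = 6), and the source does not resolve the difference. Two further points are assumptions of this sketch: sizes that cannot be sampled (src ≤ 0 or larger than BIP) are skipped, and the randomly generated address of each sampling host is omitted, so a sampling host is represented only by its actual cardinality and its peer set.

```python
import random

def sampling_sequence(bt, tp, bip, ss, sn, rng=None):
    """Return SIP[j] as a list of (actual_cardinality, peer_set) pairs."""
    rng = rng or random.Random(0)
    sip = []
    src = bt - ss                          # step (2061)
    while src <= tp:                       # step (2063) returns once src > tp
        if 0 < src <= len(bip):            # assumption: skip sizes that cannot be sampled
            for _ in range(sn):            # step (2062): SN hosts with src peers each
                sip.append((src, frozenset(rng.sample(bip, src))))
        src += ss                          # step (2063)
    return sip

# Hypothetical peer set BIP with 20 addresses.
bip = [f"b{i}" for i in range(1, 21)]
sip = sampling_sequence(bt=3, tp=6, bip=bip, ss=2, sn=3)
# actual cardinalities in sip: 1, 1, 1, 3, 3, 3, 5, 5, 5
```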
(3) Construct the training attribute set trainX and the cardinality estimation deviation set trainY from the sampling hosts;
(301) Scan the sampling hosts in all sampling IP sequences SIP in order; for the current sampling host, extract its IP address sa, its actual cardinality rc and its peer IP set OIP; look up the r LEs corresponding to sa in the matrix LEA, merge them with a bitwise AND, and store the result in ule;
(302) Map each peer host bip in OIP to one bit of ule and set that bit to 1; compute the cardinality estimate ec of the sampling host from ule and add the vector [ec, ule] to the training attribute set trainX; compute the deviation between the actual cardinality rc and the estimate ec, i.e. rc - ec, and add [rc - ec] to the set trainY;
(303) Return trainX and trainY for all sampling hosts;
the matrix LEA is a matrix of LEs with r rows and c columns that holds the peer information of the intranet hosts.
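Steps (301)-(302) for a single sampling host might look like the sketch below. It assumes that the "LE cardinality formula" is the standard linear-counting estimate, -m·ln(z/m) for an LE of m bits with z zero bits, which the source does not spell out. The peer-to-bit hash `bit_index` is hypothetical, and the AND-merged LE of the host is passed in as a list of bits rather than looked up in LEA.

```python
import hashlib
import math

def bit_index(bip: str, m: int) -> int:
    """Hypothetical peer-to-bit mapping; the patent does not specify the hash."""
    return int.from_bytes(hashlib.sha256(bip.encode()).digest()[:8], "big") % m

def le_estimate(ule) -> float:
    """Linear-counting estimate -m * ln(z / m), with z the number of zero bits."""
    m, z = len(ule), list(ule).count(0)
    if z == 0:
        raise ValueError("LE is saturated; the estimate is undefined")
    return -m * math.log(z / m)

def training_row(merged, oip, rc):
    """Return ([ec, *ule], [rc - ec]) for one sampling host."""
    ule = list(merged)                     # start from the AND-merged LE of the host
    for bip in oip:
        ule[bit_index(bip, len(ule))] = 1  # step (302): set the bit of each peer
    ec = le_estimate(ule)
    return [ec, *ule], [rc - ec]

# Hypothetical merged LE with a single 1 bit, and three peers.
x_row, y_row = training_row([0, 0, 0, 0, 1, 0, 0, 0, 0], {"b1", "b2", "b3"}, rc=3)
```

Each `x_row` would be appended to trainX and each `y_row` to trainY; starting from the merged LE rather than an empty one lets the training data carry the same background noise that real intranet hosts see.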
(4) Train an artificial neural network with the sets trainX and trainY;
a one-dimensional convolution is performed on each ule vector in trainX; trainY and the convolved trainX are then fed to the artificial neural network as training data, and the trained network is output;
the artificial neural network comprises DenN Dense layers and DroN Dropout layers; the parameters DenN and DroN and the number of nodes in each layer are set when the algorithm starts.
(5) Predict the cardinality estimation error of each intranet host with the trained artificial neural network and adjust the host's cardinality estimate.
A one-dimensional convolution is performed on the ule of each intranet host a; the cardinality estimate of host a and the convolved ule are fed into the network trained in step (4) to obtain the predicted cardinality estimation error y′; y′ is then added to the cardinality estimate ec of host a, and the adjusted value rc′ = ec + y′ is taken as the cardinality estimate of host a.
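The correction in step (5) reduces to one addition once a trained model is available. In the sketch below the model is passed in as a callable `predict`; the lambda used in the example is a stand-in that always returns -0.6, not a trained network, and the numbers 3.6 and [0, 1, 2] are taken from the worked example in the description.

```python
def adjust_estimate(ec, ule_conv, predict):
    """Return rc' = ec + y', where y' = predict([ec, *ule_conv])."""
    y_pred = predict([ec, *ule_conv])
    return ec + y_pred

# Stand-in for the trained network (hypothetical): always predicts -0.6.
adjusted = adjust_estimate(3.6, [0, 1, 2], lambda features: -0.6)
# adjusted is approximately 3.0
```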
The implementation and effect of the method are further illustrated with the following data. Suppose network A has 10 hosts a1-a10 in a given time window. Table 1 lists each host in the unsorted set AIP with its cardinality estimate. Sorting in ascending order gives the cardinality estimate of each host in the set AIP′, as shown in Table 2.
Table 1. Cardinality estimates of the hosts in network A
Network A host         a1   a2   a3   a4   a5   a6   a7   a8   a9  a10
Cardinality estimate    3    4    3    5    9   10   13    9   10   15
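Table 2. Hosts of network A sorted by cardinality estimate
Host in AIP′          a′1  a′2  a′3  a′4  a′5  a′6  a′7  a′8  a′9 a′10
Cardinality estimate    3    3    4    5    9    9   10   10   13   15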
The set AIP′ is grouped with a group length egL of 4. The lower bound of group 1 is the cardinality estimate of a′1, i.e. bt[1] = 3, and its upper bound is tp[1] = bt[1] + egL - 1 = 6. Table 2 shows 4 hosts with cardinality estimates between 3 and 6, so ECG[1] contains 4 hosts, {a′1, a′2, a′3, a′4}. The estimate of a′5, which is 9, is greater than tp[1], so it becomes the lower bound of ECG[2], i.e. bt[2] = 9, with upper bound tp[2] = bt[2] + egL - 1 = 12; ECG[2] contains 4 hosts, {a′5, a′6, a′7, a′8}. The estimate of a′9, which is 13, becomes the lower bound of ECG[3], so bt[3] = 13 and tp[3] = 16, and ECG[3] contains the hosts {a′9, a′10}. The set AIP′ is thus divided into 3 cardinality estimate groups.
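The grouping walk-through above can be reproduced with a compact sketch of steps (201)-(207). The function name and the tuple layout are illustrative choices, and the generation of the sampling IP sequences in step (206) is left out so that only the grouping logic is shown. The data are the Table 1 estimates.

```python
def group_hosts(estimates, egl):
    """Group hosts by cardinality estimate, following steps (201)-(207).

    estimates: dict mapping host -> cardinality estimate.
    Returns a list of (bt, tp, hosts) tuples, one per group ECG[j].
    """
    ordered = sorted(estimates.items(), key=lambda kv: kv[1])   # step (201)
    groups = []
    for host, ec in ordered:
        if not groups or ec > groups[-1][1]:        # ec exceeds the current upper bound
            groups.append((ec, ec + egl - 1, []))   # new group: bt = ec, tp = bt + egL - 1
        groups[-1][2].append(host)
    return groups

table1 = {"a1": 3, "a2": 4, "a3": 3, "a4": 5, "a5": 9,
          "a6": 10, "a7": 13, "a8": 9, "a9": 10, "a10": 15}
groups = group_hosts(table1, egl=4)
# bounds: (3, 6), (9, 12), (13, 16); group sizes: 4, 4, 2
```

Because a new group always starts at the estimate of the first host that does not fit, empty groups are never created, even when the estimates have large gaps (as between 5 and 9 here).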
After grouping is complete, a sampling IP sequence is generated for each group. Let the sampling step SS be 2 and the number of samples SN be 3. For the first cardinality estimate group ECG[1], starting from the cardinality bt[1] - SS, SIP[1] contains: 3 sampling hosts with an actual cardinality of 1, 3 sampling hosts with an actual cardinality of 3, 3 sampling hosts with an actual cardinality of 5, and 3 sampling hosts with an actual cardinality of 7.
For the second cardinality estimate group ECG[2], starting from the cardinality bt[2] - SS, SIP[2] contains: 3 sampling hosts with an actual cardinality of 7, 3 sampling hosts with an actual cardinality of 9, 3 sampling hosts with an actual cardinality of 11, and 3 sampling hosts with an actual cardinality of 13.
For the third cardinality estimate group ECG[3], starting from the cardinality bt[3] - SS, SIP[3] contains: 3 sampling hosts with an actual cardinality of 11, 3 sampling hosts with an actual cardinality of 13, 3 sampling hosts with an actual cardinality of 15, and 3 sampling hosts with an actual cardinality of 17.
After the sampling IP sequences are generated, a cardinality estimate and a merged LE are computed for each sampling host. Assume that the LE matrix LEA used by the algorithm has 3 rows and 6 columns of LEs and that each LE contains 9 bits. Assume further that the sampling host sa has an actual cardinality of 3, with 3 distinct peer IPs {b1, b2, b3}; its 3 corresponding LEs in LEA are shown in FIG. 5.
In FIG. 5, LE 1, LE 2 and LE 3 are the LEs corresponding to sa in the first, second and third rows of LEA, respectively, and ule is the LE obtained by merging LE 1, LE 2 and LE 3 with a bitwise AND. One bit of ule is 1 and all others are 0. After merging, ule is updated with {b1, b2, b3} according to the LE algorithm. As shown in FIG. 6, the updated ule has 6 zero bits, so the LE cardinality formula gives a cardinality estimate of 3.6. The updated ule and the cardinality estimate are computed in this way for every sampling host, which yields the training data set.
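The value 3.6 can be checked by hand if the LE cardinality formula is taken to be the standard linear-counting estimator, an assumption since the description does not state the formula. With m the number of bits in the LE and z the number of zero bits:

```latex
\hat{n} = -m \ln\frac{z}{m} = -9 \ln\frac{6}{9} \approx 3.65
```

This agrees with the value that the description reports as 3.6.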
After the training data set is obtained, the artificial neural network is trained: ule is convolved first, and the convolution result is then fed to the neural network together with the cardinality estimate. Let the convolution kernel be a vector of three 1s and the convolution stride be 3. The ule convolution result of the sampling host sa is then {0, 1, 2}. Subtracting the cardinality estimate of sa, 3.6, from its actual cardinality, 3, gives an estimation deviation of -0.6. The convolution result {0, 1, 2}, the cardinality estimate 3.6 and the estimation deviation -0.6 together form the data that sa contributes to the artificial neural network.
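The convolution in this example is small enough to write out directly. In the sketch below the 9-bit pattern of ule is hypothetical, since the real pattern is only given in FIG. 6; it was chosen to have 6 zero bits and to reproduce the stated result {0, 1, 2}. As is usual in neural network libraries, "convolution" here means a sliding dot product without kernel flipping, which makes no difference for a kernel of all 1s.

```python
def strided_conv1d(bits, kernel, stride):
    """Valid-mode 1-D sliding dot product with the given stride."""
    k = len(kernel)
    return [sum(b * w for b, w in zip(bits[s:s + k], kernel))
            for s in range(0, len(bits) - k + 1, stride)]

# Hypothetical 9-bit ule with 6 zero bits; the real bit pattern is in FIG. 6.
ule = [0, 0, 0, 0, 1, 0, 1, 0, 1]
features = strided_conv1d(ule, kernel=[1, 1, 1], stride=3)   # [0, 1, 2]
sample_x = [3.6, *features]    # cardinality estimate followed by the convolved ule
sample_y = 3 - 3.6             # estimation deviation rc - ec, about -0.6
```

With a stride equal to the kernel length, each output simply counts the set bits in one non-overlapping block, so the 9-bit ule shrinks to 3 features while the total number of set bits is preserved.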
After the artificial neural network is trained, the ule of every host whose cardinality is to be estimated is convolved in one dimension, using the convolution method of step (5). The convolution result is combined with the original cardinality estimate and fed to the trained network to obtain the predicted deviation. Adding the predicted deviation to the original cardinality estimate gives the adjusted estimate, which is more accurate because the influence of random factors is reduced.
To illustrate the effect of the invention, a set of high-speed network data is used. The data come from the WIDE website (http://mawi.wide.ad.jp/mawi/). The trace lasts 10 minutes, starting at 13:00 on May 9, 2018, and the data set is denoted WIDE20180509.
The estimation deviation (actual cardinality minus estimated cardinality) reflects the accuracy of an estimate, so this embodiment analyzes the results in terms of estimation deviation. Several metrics related to the estimation deviation are defined first, where |AIP| denotes the number of hosts in AIP, rc_a the actual cardinality of host a, and ec_a the cardinality estimate of host a.
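Definition 1. Deviation rate of host a:
$$\frac{rc_a - ec_a}{rc_a}$$
Definition 2. Average deviation rate:
$$\frac{1}{|AIP|} \sum_{a \in AIP} \frac{rc_a - ec_a}{rc_a}$$
Definition 3. Average absolute deviation rate:
$$\frac{1}{|AIP|} \sum_{a \in AIP} \frac{|rc_a - ec_a|}{rc_a}$$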
The deviation rate is the estimation deviation rc_a - ec_a of host a divided by its actual cardinality rc_a. Dividing by the actual cardinality removes its influence on the estimation error, so that estimation deviations at different cardinalities can be compared. A good cardinality estimate should be close to the actual cardinality, i.e. its deviation rate should be close to 0, and a good cardinality estimation algorithm should also have an average deviation rate close to 0. However, not every algorithm with an average deviation rate close to 0 is accurate. For different intranet hosts, the estimate may be higher or lower than the actual cardinality, so the estimation deviation may be positive or negative. When the deviations of the intranet hosts are averaged, negative deviations cancel positive ones, which pulls the average toward 0. This embodiment therefore compares the accuracy of the estimation results with the average absolute deviation rate, i.e. the mean of the absolute values of the deviation rates.
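The cancellation effect described above is easy to see numerically. The sketch below implements the three metrics as defined in the text; the two (actual, estimated) pairs are hypothetical and chosen so that one estimate is too high and the other too low by the same relative amount.

```python
def deviation_rate(rc, ec):
    """(rc - ec) / rc for one host; rc is assumed to be positive."""
    return (rc - ec) / rc

def average_deviation_rate(pairs):
    return sum(deviation_rate(rc, ec) for rc, ec in pairs) / len(pairs)

def average_absolute_deviation_rate(pairs):
    return sum(abs(deviation_rate(rc, ec)) for rc, ec in pairs) / len(pairs)

# Hypothetical (actual cardinality, estimated cardinality) pairs.
pairs = [(10, 12), (10, 8)]
adr = average_deviation_rate(pairs)             # 0.0: the two errors cancel
aadr = average_absolute_deviation_rate(pairs)   # 0.2: both errors are counted
```

Both estimates are off by 20%, yet the average deviation rate is 0, which is why the signed average alone cannot be used to rank estimators.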
This embodiment uses an LE matrix of 4 rows and 2048 columns in which each LE contains 1024 bits. The cardinality estimate group length is 100, the sampling step of the sampling IP sequence of each group is 5, and the number of samples per sampling point is 20. In the neural network used in the experiment, ule is convolved with a kernel of length 8. The convolutional layer is followed by these hidden layers: a dense layer of 129 nodes with the Rectified Linear Unit (ReLU) activation function; a dropout layer with a drop rate of 0.1; a dense layer of 256 nodes with ReLU activation, followed by dropout with a drop rate of 0.1; and a dense layer of 64 nodes with ReLU activation, followed by dropout with a drop rate of 0.2. The network was trained for 20 iterations, each comprising 10 sub-iterations. The experimental results are shown in FIG. 7.
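The layer stack described above can be sketched as a plain forward pass without any deep-learning library. Several details are assumptions and are marked as such in the code: the convolution stride is taken to equal the kernel length 8, so a 1024-bit ule yields 128 features and the input has 1 + 128 = 129 values; a single linear output node is added to produce the predicted deviation; and the weights are randomly initialised, since training (backpropagation) is not shown. The result is therefore an untrained network that only illustrates the shapes and the role of dropout.

```python
import random

def dense(x, weights, biases, relu=True):
    """Fully connected layer; weights holds one row per output node."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out

def dropout(x, rate, rng, training):
    """Inverted dropout; it is active only while training."""
    if not training:
        return x
    return [0.0 if rng.random() < rate else v / (1.0 - rate) for v in x]

def init_layer(n_in, n_out, rng):
    scale = (2.0 / n_in) ** 0.5            # He-style initialisation (assumption)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_network(n_in, rng):
    sizes = [129, 256, 64]                 # dense layer widths from the experiment
    layers, prev = [], n_in
    for n in sizes + [1]:                  # final single-node output is an assumption
        layers.append(init_layer(prev, n, rng))
        prev = n
    return layers

def forward(layers, x, rng, training=False):
    rates = [0.1, 0.1, 0.2]                # dropout rates from the experiment
    for (w, b), rate in zip(layers[:-1], rates):
        x = dropout(dense(x, w, b), rate, rng, training)
    w, b = layers[-1]
    return dense(x, w, b, relu=False)[0]   # predicted deviation y'

rng = random.Random(0)
net = build_network(n_in=1 + 1024 // 8, rng=rng)   # assumed stride of 8
y_pred = forward(net, [0.0] * 129, rng)            # 0.0 for an all-zero input
```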
Before adjustment, the average absolute deviation rate of the cardinality estimates is 0.037359. With the method of the invention, the average absolute deviation rate of the adjusted estimates stays below 0.03 for the different numbers of training iterations, and falls to 0.028133 after 20 iterations, a relative reduction of about 25%. The method therefore clearly improves the accuracy of host cardinality estimation.

Claims (3)

1. The host base estimation method based on the artificial neural network is characterized by comprising the following steps of:
(1) Scanning IP addresses in all IP messages in a time window, and estimating the base number of each host in the intranet;
(2) Sequencing and grouping each host, and grouping to construct a sampling host set;
(3) Constructing a training attribute set trapX and a base estimation deviation set trapY of each sampling host;
(4) Training an artificial neural network by utilizing the set trapnx and trapys;
(5) According to the trained artificial neural network, predicting an intranet host base estimation error, and adjusting a host base estimation value;
estimating the cardinality of each host in the intranet includes:
combining all LEs recording the base number of a host aip in the linear estimator LE according to a bit AND method, recording the combined linear estimator LE as ule, estimating the base number of the host aip from ule according to an LE base number calculation formula, and storing ule and a base number estimation value;
in step (2), ordering and grouping each host, and grouping to construct a sampling host set includes:
(201) Let the set of intranet hosts be AIP and the set of opposite-end hosts be BIP; calculating the number n of IP addresses in the set AIP, and sorting the hosts in AIP in ascending order of base estimate to obtain the set AIP' = {aip_1, aip_2, aip_3, …, aip_n}; let AIP'[i] denote the i-th host in AIP', namely aip_i; let EC(aip_i) denote the base estimate of host aip_i, so that EC(aip_i) ≤ EC(aip_{i+1}); let the minimum base estimate in the set AIP be denoted minEC, so that minEC = EC(AIP'[1]);
(202) Grouping the hosts in the set AIP', each group being called a base estimation value group and denoted ECG; let ECG[j] denote the j-th ECG; each ECG has a lower bound and an upper bound, with bt[j] and tp[j] denoting the lower and upper bounds of ECG[j]; the number of base values covered by an ECG is called its length, denoted egL[j]; the length, upper bound and lower bound are related by egL[j] = tp[j] - bt[j] + 1;
(203) Setting the base estimation value group number j = 1 and the current AIP' host index i = 1, and setting the lower and upper bounds of the current group to bt[j] = minEC, tp[j] = bt[j] + egL[j] - 1;
(204) Reading the base estimate of the current host AIP'[i] and saving it to ec_a, then comparing it with the upper bound of the current group: if ec_a > tp[j], going to step (206); otherwise, assigning the current host AIP'[i] to the current base estimation value group ECG[j];
(205) Updating the index of the current host AIP'[i], namely i = i + 1; if i > n, going to step (206); otherwise, going to step (204);
(206) Generating the sampling IP sequence SIP[j] of ECG[j]; if i > n, going to step (207); otherwise, updating the base estimation value group number, namely j = j + 1, and setting the lower and upper bounds of the current group to bt[j] = EC(AIP'[i]), tp[j] = bt[j] + egL[j] - 1; adding the current host AIP'[i] to ECG[j], and going to step (205);
(207) Returning all ECG and SIP;
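Steps (201) to (207) amount to one pass over the hosts sorted by base estimate. The Python sketch below covers only the grouping; generation of the sampling IP sequence in step (206) is left out. It assumes a single fixed group length for all groups (the embodiment uses 100) and represents each group as a (bt, tp, hosts) tuple; both choices, and the function name, are illustrative.

```python
def group_hosts(estimates, eg_len):
    """Group hosts into base estimation value groups (ECGs).

    estimates: dict mapping host -> base estimate EC(host).
    eg_len:    group length egL, assumed identical for every group.
    Returns a list of (bt, tp, hosts) tuples in ascending order of bt.
    """
    ordered = sorted(estimates.items(), key=lambda kv: kv[1])  # step (201)
    groups = []
    for host, ec in ordered:
        # Open a new group when there is none yet, or when the estimate
        # exceeds the current upper bound tp[j] (steps (203), (204), (206)).
        if not groups or ec > groups[-1][1]:
            bt = ec
            groups.append((bt, bt + eg_len - 1, []))
        groups[-1][2].append(host)
    return groups
```

Because a new lower bound is taken from the first host that falls outside the current group, ranges of base values with no hosts produce no empty groups.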
in step (206), generating the sampling IP sequence SIP[j] of ECG[j] comprises the steps of:
(2061) Setting the sampling host parameter src to bt[j] - SS, where SS denotes the sampling step;
(2062) Generating SN sampling hosts, each having src IP addresses randomly selected from the opposite-end host set BIP, and adding the generated SN sampling hosts to SIP[j], where SN denotes the number of samples;
(2063) Increasing the current sampling host parameter by the sampling step, namely src = src + SS; if the updated src > tp[j], returning SIP[j]; otherwise, going to step (2062);
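Steps (2061) to (2063) can be sketched in Python as a loop over peer counts from bt[j] - SS up to tp[j]. Several details are assumptions made for illustration: each sampling host is represented as a dictionary holding its actual base number rc and its peer set oip, non-positive values of src are skipped, and BIP is assumed to be a list containing at least tp[j] distinct addresses. The random address assigned to each sampling host is omitted here.

```python
import random


def sample_ip_sequence(bt, tp, ss, sn, bip, rng=None):
    """Build the sampling IP sequence SIP[j] for one group with bounds [bt, tp].

    ss: sampling step SS; sn: number of samples SN per sampling point;
    bip: list of opposite-end IP addresses to draw peers from.
    """
    rng = rng or random.Random()
    sip = []
    src = bt - ss                          # step (2061)
    while True:
        if src > 0:                        # assumption: skip non-positive peer counts
            for _ in range(sn):            # step (2062)
                oip = set(rng.sample(bip, src))
                sip.append({"rc": src, "oip": oip})
        src += ss                          # step (2063)
        if src > tp:
            return sip
```

With the embodiment's parameters (group length 100, SS = 5, SN = 20), each group yields 20 sampling hosts at each of the peer counts bt[j] - 5, bt[j], bt[j] + 5, and so on up to bt[j] + 95.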
in step (3), constructing a training attribute set and a radix estimation bias set of each sampling host includes:
(301) Sequentially scanning the sampling host information in all sampling IP sequences SIP, extracting the IP address sa, the actual base value rc and the opposite-end IP set OIP of the current sampling host, looking up the r LEs corresponding to sampling host sa in the matrix LEA, merging the r LEs by bitwise AND, and storing the result in ule;
(302) Mapping each opposite-end host bip in the set OIP to one bit in ule and setting that bit to 1; calculating the base estimate ec of the sampling host from ule, and adding the vector [ec, ule] to the training attribute set trainX; calculating the deviation between the actual base value rc and the base estimate ec of the sampling host, namely rc - ec, and adding the deviation [rc - ec] to the set trainY;
(303) Returning the trainX and the trainY of all the sampling hosts;
the matrix LEA is a matrix formed by r rows and c columns of LE and comprises opposite-end information of an intranet host;
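Steps (301) to (303) can be sketched in Python as below. The hash used to pick one LE per row of LEA and one bit per opposite-end host is a hypothetical SHA-256 based helper, since the claim does not name the hash functions. The linear counting formula for ec and the guard against a saturated bitmap are likewise assumptions. Each LE is modelled as a Python integer used as an m-bit bitmap.

```python
import hashlib
import math


def _h(key, salt, mod):
    """Hypothetical hash helper: maps (salt, key) to an integer in [0, mod)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod


def build_training_sets(samples, lea, m):
    """samples: iterable of (sa, rc, oip); lea: r x c matrix of m-bit int bitmaps."""
    train_x, train_y = [], []
    c = len(lea[0])
    for sa, rc, oip in samples:
        ule = (1 << m) - 1
        for row_idx, row in enumerate(lea):          # step (301): AND the r LEs of sa
            ule &= row[_h(sa, row_idx, c)]
        for bip in oip:                              # step (302): set one bit per peer
            ule |= 1 << _h(bip, "bit", m)
        zeros = max(m - bin(ule).count("1"), 1)      # assumption: avoid log(0)
        ec = -m * math.log(zeros / m)                # assumed linear counting formula
        bits = [(ule >> k) & 1 for k in range(m)]
        train_x.append([ec] + bits)                  # vector [ec, ule]
        train_y.append(rc - ec)                      # deviation rc - ec
    return train_x, train_y
```

Because the sampling hosts are overlaid on LEs that already carry real traffic, the deviation rc - ec reflects the noise that the network is later trained to predict.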
in step (4), training the artificial neural network using the sets trainX and trainY includes:
performing one-dimensional convolution on each ule vector in the set trainX, then feeding the set trainY and the convolved set trainX into the artificial neural network as training data to train the network, and outputting the trained artificial neural network;
the artificial neural network comprises DenN Dense layers and DroN Dropout layers, and the parameters DenN, DroN and the number of nodes in each layer are set at the start of the algorithm;
in step (5), predicting the intranet host base estimation error and adjusting the base estimate includes:
performing one-dimensional convolution on the ule of each intranet host, inputting the base estimate of the intranet host and the convolved ule into the artificial neural network trained in step (4) to obtain a predicted base error y', adding the predicted error y' to the base estimate ec of the intranet host, and taking the adjusted value rc' = ec + y' as the base estimate of the intranet host.
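The adjustment in step (5) can be sketched in Python without any neural network library. The trained network is stood in for by an arbitrary callable named model. The convolution uses stride 1 with no padding and no kernel flip, as most neural network libraries do. The kernel weights, the stride and all names are assumptions, since the claim gives only the operation and the embodiment gives only the kernel length of 8.

```python
def conv1d(bits, kernel):
    """Valid-mode 1-D convolution (stride 1, no kernel flip) of a bit vector."""
    k = len(kernel)
    return [
        sum(bits[i + j] * kernel[j] for j in range(k))
        for i in range(len(bits) - k + 1)
    ]


def adjust_estimate(ec, ule_bits, model, kernel):
    """Return rc' = ec + y', where y' is the error predicted by the model.

    model is a placeholder for the trained network: any callable that takes
    the feature vector [ec, convolved ule...] and returns a number.
    """
    features = [ec] + conv1d(ule_bits, kernel)
    y_pred = model(features)
    return ec + y_pred
```

For example, a model that always predicts an error of 0.5 turns an estimate of 10.0 into an adjusted estimate of 10.5.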
2. The host base estimation method according to claim 1, wherein in step (1), the linear estimator LE is updated based on the intranet IP address and the extranet IP address, and all IP addresses present in the current time window are recorded.
3. The host base estimation method according to claim 2, wherein network A and network B communicate through a router R, the intranet hosts whose base numbers are to be estimated belong to network A, and the opposite-end hosts of the intranet hosts belong to network B; each sampling host is a reference host with a known base number, is randomly generated, and is not in network A, and the opposite-end IPs of each sampling host are distinct IP addresses randomly selected from network B.
CN202111191513.1A 2021-10-13 2021-10-13 Host base number estimation method based on artificial neural network Active CN113949647B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111191513.1A CN113949647B (en) 2021-10-13 2021-10-13 Host base number estimation method based on artificial neural network

Publications (2)

Publication Number Publication Date
CN113949647A CN113949647A (en) 2022-01-18
CN113949647B true CN113949647B (en) 2023-08-29

Family

ID=79330380

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111191513.1A Active CN113949647B (en) 2021-10-13 2021-10-13 Host base number estimation method based on artificial neural network

Country Status (1)

Country Link
CN (1) CN113949647B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110888859A (en) * 2019-11-01 2020-03-17 浙江大学 Connection cardinality estimation method based on combined deep neural network
US10942923B1 (en) * 2018-12-14 2021-03-09 Teradata Us, Inc. Deep learning for optimizer cardinality estimation
CN112883066A (en) * 2021-03-29 2021-06-01 电子科技大学 Multidimensional range query cardinality estimation method on database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Analysis and Optimization of Cardinality Estimation Algorithm Parameters; 刘绍记 et al.; 《计算机科学》 (Computer Science); 2017-02-15 (No. 02); pp. 279-282, 301 *

Also Published As

Publication number Publication date
CN113949647A (en) 2022-01-18

Similar Documents

Publication Publication Date Title
CN105354277B (en) Recommendation method and system based on recurrent neural network
CN110110094A (en) Across a network personage's correlating method based on social networks knowledge mapping
CN106897404B (en) Recommendation method and system based on multi-GRU layer neural network
KR20140024252A (en) Method and system for identifying rare-event failure rates
CN113807422B (en) Weighted graph convolutional neural network scoring prediction model integrating multi-feature information
CN112488791A (en) Individualized recommendation method based on knowledge graph convolution algorithm
CN103399858A (en) Socialization collaborative filtering recommendation method based on trust
CN102945333B (en) Key protein predicating method based on prior knowledge and network topology characteristics
CN108399268B (en) Incremental heterogeneous graph clustering method based on game theory
CN111182564A (en) Wireless link quality prediction method based on LSTM neural network
CN111259098B (en) Trajectory similarity calculation method based on sparse representation and Frechet distance fusion
CN116542720B (en) Time enhancement information sequence recommendation method and system based on graph convolution network
CN113949647B (en) Host base number estimation method based on artificial neural network
CN110011838B (en) Real-time tracking method for PageRank value of dynamic network
WO2018077301A1 (en) Account screening method and apparatus
CN114124734B (en) Network traffic prediction method based on GCN-Transformer integration model
CN114186518A (en) Integrated circuit yield estimation method and memory
CN110909303A (en) Adaptive space-time heterogeneity inverse distance interpolation method
CN114372561A (en) Network traffic prediction method based on depth state space model
CN114090860A (en) Method and system for determining importance of weighted network node
CN117093830A (en) User load data restoration method considering local and global
CN108400907B (en) Link packet loss rate reasoning method under uncertain network environment
CN116467466A (en) Knowledge graph-based code recommendation method, device, equipment and medium
CN110768825A (en) Service flow prediction method based on network big data analysis
CN113962591A (en) Industrial Internet of things data space access risk assessment method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant