CN112866039A - Recursive domain name server user quantity estimation method based on passive DNS traffic - Google Patents


Publication number
CN112866039A
Authority
CN
China
Prior art keywords
domain name
request
rdns
time
cache
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110254552.5A
Other languages
Chinese (zh)
Inventor
朱宇佳
黄彩云
刘庆云
谭建龙
杨嵘
李钊
窦凤虎
杨威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Application filed by Institute of Information Engineering of CAS
Priority to CN202110254552.5A
Publication of CN112866039A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a recursive domain name server user quantity estimation method based on passive DNS traffic and relates to the technical field of network measurement. By analyzing passive DNS traffic, the method estimates the range of the number of all users of a recursive domain name server (RDNS) in a real network scenario where the user's real source IP address is unavailable after NAT address translation and the number of original user DNS request packets is reduced by RDNS cache compression. The lower bound of the number of users of the RDNS is calculated from the DNS request packets that have undergone RDNS cache compression. The cache-compressed DNS request packets and response packets are used to convert the cache-compressed DNS request packets into simulated DNS request packets initiated by internal users, and the simulated DNS request packets are then used to calculate the upper bound of the number of internal users of the RDNS. The deviation between the estimated number of users obtained by the method and the real number of users is small.

Description

Recursive domain name server user quantity estimation method based on passive DNS traffic
Technical Field
The invention relates to the technical field of network measurement, in particular to a recursive domain name server user quantity estimation method based on passive DNS traffic.
Background
Current methods for measuring the number of users of a recursive domain name server (RDNS) fall into two categories, passive measurement and active measurement:
Passive measurement directly captures the original DNS request packets that users send to the RDNS, derives DNS-based fingerprint information from them, uses the extracted fingerprints to attribute each original DNS request packet to a specific user, identifies and tracks each user, and finally counts the users. Depending on the algorithm used, these methods are either supervised or unsupervised. Supervised methods [1,2] need a large number of DNS request packets labeled by user in order to build a fingerprint-labeled data set for each user; in real networks such pre-labeled data sets are extremely hard to obtain, which fundamentally limits the practical value of these methods. Unsupervised methods [3-7] are also impractical: they either require the number of clusters, that is, the number of users, to be defined in advance [3], or can cluster only when the user's real source IP address is available as an identification label [4-7], which places higher demands on the input DNS request packets.
Active measurement uses active cache probes built for the RDNS to probe its caching behavior. It characterizes the state of the RDNS cache by building, for each domain name, a sequence of cache update interval time differences, from which it derives the sequence of request interval time differences for each domain name in the original DNS request packets that users send to the RDNS, and finally computes the number of users [9-12]. To derive the request interval sequence, active measurement must account for how the cache works: original DNS request packets sent by users and the corresponding DNS response packets populate the RDNS cache, and the cache then compresses subsequent user requests. Only an original DNS request packet that misses the cache is forwarded by the RDNS to other authoritative domain name servers; for a request that hits the cache, the RDNS builds the DNS response packet directly from the cache and returns it to the user. In addition, the methods [9-12] that obtain the cache update interval sequence by active measurement require network reachability between the active cache probe and the RDNS, and because the probe must issue requests for specific domain names, the final estimate covers only the users of a specific domain name service rather than all users of the RDNS.
In summary, existing methods do not adequately solve the problem of estimating the total number of users of an RDNS in a real network scenario where the user's real source IP address is unavailable after NAT address translation and the number of original user DNS request packets is reduced by RDNS cache compression.
Reference documents:
[1] Dominik Herrmann, Christian Banse, and Hannes Federrath. 2013. Behavior-based tracking: exploiting characteristic patterns in DNS traffic. Computers & Security 39 (2013), 17–33.
[2] Dae Wook Kim and Junjie Zhang. 2015. You are how you query: deriving behavioral fingerprints from DNS traffic. In Security and Privacy in Communication Networks, Bhavani Thuraisingham, Xiaofeng Wang, and Vinod Yegneswaran (Eds.). Springer International Publishing, 348–366.
[3] Matthias Kirchler, Dominik Herrmann, Jens Lindemann, and Marius Kloft. 2016. Tracked without a trace: linking sessions of users by unsupervised learning of patterns in their DNS traffic. In Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. 23–34.
[4] Gang Chen, Haiying Zhang, and Caiming Xiong. 2016. Maximum margin Dirichlet process mixtures for clustering. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 1491–1497.
[5] Jinjin Guo and Zhiguo Gong. 2016. A nonparametric model for event discovery in the geospatial-temporal space. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management. 499–508.
[6] Jianhua Yin and Jianyong Wang. 2014. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 233–242.
[7] M. Kirchler, D. Herrmann, J. Lindemann, and M. Kloft, “Tracked Without a Trace: Linking Sessions of Users by Unsupervised Learning of Patterns in Their DNS Traffic,” in Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security, New York, NY, USA, 2016, pp. 23–34, doi:10.1145/2996758.2996770.
[8] M. Sun, G. Xu, J. Zhang, and D. W. Kim, “Tracking You Through DNS Traffic: Linking User Sessions by Clustering with Dirichlet Mixture Model,” in Proceedings of the 20th ACM International Conference on Modelling, Analysis and Simulation of Wireless and Mobile Systems, New York, NY, USA, 2017, pp. 303–310, doi:10.1145/3127540.3127567.
[9] Akcan H, Suel T, Brönnimann H. Geographic web usage estimation by monitoring DNS caches[C]//Proceedings of the first international workshop on Location and the web. 2008: 85-92.
[10] Rajab M A, Monrose F, Terzis A, et al. Peeking through the cloud: DNS-based estimation and its applications[C]//International Conference on Applied Cryptography and Network Security. Springer, Berlin, Heidelberg, 2008: 21-38.
[11] X. Ma et al., “Accurate DNS query characteristics estimation via active probing,” Journal of Network and Computer Applications, vol. 47, pp. 72–84, Jan. 2015, doi:10.1016/j.jnca.2014.09.016.
[12] Schilling R L, Song R, Vondracek Z. Bernstein functions: theory and applications[M]. Walter de Gruyter, 2012.
Disclosure of Invention
To overcome the shortcomings of existing methods, the invention provides a method for estimating the number of users of a recursive domain name server. By analyzing passive DNS traffic, it estimates the number of all users of the RDNS in a real network scenario where the user's real source IP address is unavailable after NAT address translation and the number of original user DNS request packets is reduced by RDNS cache compression. The method helps to assess the size of an RDNS's user base, to rank the importance of different RDNSs, and to prevent private RDNSs from being used for value-added services, among other uses.
The technical solution adopted by the invention to solve the technical problem is as follows:
A recursive domain name server user quantity estimation method based on passive DNS traffic comprises the following steps:
1) acquiring the DNS request packets sent by a given recursive domain name server (RDNS) to other domain name servers and the DNS response packets sent by other domain name servers to the RDNS;
2) distinguishing, according to the destination IP address of each DNS request packet and the source IP address of each DNS response packet, the DNS request and response packets that have undergone RDNS cache compression from those that have not;
3) parsing all DNS request packets, both those that have and those that have not undergone RDNS cache compression, extracting the timestamp and requested domain name field of each DNS request packet, counting the domain names and their numbers of occurrences in each time period according to the timestamps, clustering all DNS request packets by domain name and number of occurrences to obtain a number of categories, and taking the number of categories as the lower bound of the estimated number of users of the RDNS;
4) parsing the DNS response packets that have undergone RDNS cache compression, extracting the request timestamp of each domain name and the TTL (time to live) expiration time of the corresponding resource record, computing the cache update interval time difference sequence of each domain name in the RDNS, and, on the assumption that the request interval time difference sequence and the cache update interval time difference sequence follow similar cumulative distribution functions, obtaining the cumulative distribution function and parameters of the request interval time difference sequence with which users inside the RDNS request each domain name;
5) generating the start time and end time of the simulated request packets from the DNS request packets that have undergone RDNS cache compression, randomly generating, within the range bounded by the end time, a request interval time difference sequence for the simulated request packets of each domain name according to the cumulative distribution function obtained in step 4), and deriving from that sequence the simulated DNS request packets sent to the RDNS by internal users;
6) parsing the DNS request packets that have not undergone RDNS cache compression together with the simulated DNS request packets obtained in step 5), extracting the timestamp and requested domain name field of each request packet, counting the domain names and their numbers of occurrences in each time period according to the timestamps, clustering these DNS request packets by domain name and number of occurrences to obtain a number of categories, and taking the number of categories as the upper bound of the estimated number of users of the RDNS;
7) estimating the number of internal users of the RDNS from the lower bound and the upper bound of the estimated number of users.
Further, in step 2), the DNS request and response packets that have undergone RDNS cache compression are distinguished from those that have not according to whether the destination IP address of the DNS request packet and the source IP address of the DNS response packet are on the IP address list of recursive domain name servers.
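The list check in step 2) can be sketched as follows. This is a minimal illustration only: the `DnsRecord` type, its field names, and the function `split_by_rdns_list` are hypothetical, and because the text does not state which side of the list check corresponds to cache-compressed traffic, the sketch simply returns the on-list and off-list groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsRecord:
    """Hypothetical parsed log entry; field names are assumptions."""
    timestamp: float
    qname: str
    src_ip: str
    dst_ip: str
    is_response: bool


def split_by_rdns_list(records, rdns_ips):
    """Split records by whether the remote server is on the RDNS IP list.

    For a request the remote server is the destination IP; for a response
    it is the source IP. Returns (on_list, off_list).
    """
    on_list, off_list = [], []
    for rec in records:
        peer = rec.src_ip if rec.is_response else rec.dst_ip
        (on_list if peer in rdns_ips else off_list).append(rec)
    return on_list, off_list
```

Passing the IP list as a `set` keeps each membership test constant-time, which matters when a full day of gateway traffic is filtered.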
Further, in step 4), the DNS response packets that have undergone RDNS cache compression are parsed to extract, for each domain name appearing in the requested domain name field, the start timestamp and the TTL (time to live) of the resource record carrying the resolution result for that domain name; the TTL expiration time of the resource record is the start timestamp plus the TTL.
Further, in step 4), a simulated cache of the RDNS is constructed for each domain name: each time interval from a start timestamp to the corresponding TTL expiration timestamp is taken as a cache time interval, and each simulated cache time interval has a cache start time and a cache end time. Subtracting the cache end time of the first cache time interval of the domain name from the cache start time of the next cache time interval gives the first cache update interval time difference. The cache update interval time differences of all simulated cache time intervals of the domain name are computed in the same way and ordered along the timeline to obtain the cache update interval time difference sequence.
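The simulated cache and the resulting time difference sequence can be sketched as follows, assuming each cache-compressed response has already been reduced to a `(domain name, start timestamp, TTL in seconds)` tuple. The function name and the tuple layout are illustrative assumptions, not part of the described method.

```python
from collections import defaultdict


def cache_update_gaps(responses):
    """Build per-domain cache intervals and return their update gaps.

    responses: iterable of (qname, start_ts, ttl_seconds).
    Each response opens a cache interval [start_ts, start_ts + ttl].
    The gap between two consecutive intervals is the start of the later
    interval minus the end of the earlier one.
    """
    intervals = defaultdict(list)
    for qname, start_ts, ttl in responses:
        intervals[qname].append((start_ts, start_ts + ttl))

    gaps = {}
    for qname, ivs in intervals.items():
        ivs.sort()  # order along the timeline
        gaps[qname] = [ivs[i + 1][0] - ivs[i][1] for i in range(len(ivs) - 1)]
    return gaps
```

A domain name with n cache intervals yields n - 1 gaps, so a domain name seen only once contributes an empty sequence.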
Further, in step 4), the cumulative distribution function and parameters of the cache update interval time difference sequence of each domain name are estimated from that sequence; then, on the assumption that the request interval time difference sequence and the cache update interval time difference sequence follow similar cumulative distribution functions, the cumulative distribution function and parameters of the request interval time difference sequence with which users inside the RDNS issue consecutive requests for a given domain name are calculated.
Further, the cumulative distribution function and parameters of the cache update interval time difference sequence of each domain name are estimated in step 4) as follows: for each distribution that the cumulative distribution function may be assumed to follow, the corresponding cumulative distribution function and parameters are estimated with an expectation-maximization (EM) algorithm; alternatively, on the basis that the cumulative distribution function follows a hyper-exponential distribution, an intelligent-component EM algorithm is used to dynamically adjust the hyperparameter values of the cumulative distribution function and obtain the best-fitting cumulative distribution function and parameters.
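As an illustration of the estimation step, the sketch below fits a hyper-exponential distribution (a mixture of exponentials) to a sequence of positive time differences with a plain expectation-maximization loop. It is not the intelligent-component variant named in the text: the number of phases `k` is fixed by the caller, and the initial rates, the iteration count, and the function names are assumptions made for the example.

```python
import math


def fit_hyperexponential(samples, k=2, iterations=100):
    """Plain EM for f(x) = sum_j p_j * lam_j * exp(-lam_j * x).

    samples must be positive numbers. Returns (probs, rates).
    """
    mean = sum(samples) / len(samples)
    # Assumed initialisation: rates spread around 1 / mean.
    rates = [(2.0 ** (j - (k - 1) / 2)) / mean for j in range(k)]
    probs = [1.0 / k] * k
    for _ in range(iterations):
        resp_sum = [0.0] * k    # sum of responsibilities per phase
        resp_x_sum = [0.0] * k  # responsibility-weighted sum of samples
        for x in samples:
            # E-step: posterior weight of each phase for this sample.
            dens = [probs[j] * rates[j] * math.exp(-rates[j] * x) for j in range(k)]
            total = sum(dens)
            if total == 0.0:  # numerical underflow, skip the sample
                continue
            for j in range(k):
                r = dens[j] / total
                resp_sum[j] += r
                resp_x_sum[j] += r * x
        # M-step: closed-form updates for mixing weights and rates.
        n = sum(resp_sum)
        probs = [s / n for s in resp_sum]
        rates = [resp_sum[j] / resp_x_sum[j] if resp_x_sum[j] > 0 else rates[j]
                 for j in range(k)]
    return probs, rates


def hyperexponential_cdf(x, probs, rates):
    """Cumulative distribution function of the fitted mixture."""
    return sum(p * (1.0 - math.exp(-lam * x)) for p, lam in zip(probs, rates))
```

A useful sanity check on the fit is that the mean of the fitted mixture, the sum of `p_j / lam_j`, matches the sample mean, which the M-step guarantees when no sample is skipped.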
Further, the cumulative distribution function and parameters of the request interval time difference sequence of each domain name are obtained in step 4) as follows: given the assumption that the cumulative distribution function of the cache update interval time difference sequence of each domain name and that of the corresponding request interval time difference sequence follow similar distributions, the parameters of the two cumulative distribution functions are related by an equation involving the Laplace transform; the parameters of the cache update interval distribution are substituted into this equation and the parameters of the request interval distribution are computed by inverse Laplace transform, which yields the cumulative distribution function and parameters of the request interval time difference sequence of each domain name.
Further, the start time and end time of the simulated request packets are generated in step 5) as follows: for the DNS request packets that have undergone RDNS cache compression, the timestamps of the first and last request packets of each domain name are extracted; the timestamp of the first request packet is taken as the start time for generating simulated request packets, and the timestamp of the last request packet plus the TTL of the corresponding resource record, that is, its TTL expiration time, is taken as the end time.
Further, the simulated DNS request packets sent by internal users to the RDNS are obtained in step 5) from the request interval time difference sequence of each domain name as follows: starting from the start time of the simulated request packets, the time differences in the simulated request interval sequence are added one after another; each resulting timestamp is the request timestamp of one simulated request packet, and the process stops when the resulting timestamp exceeds the end time for that domain name. The simulated request packets of all domain names are then pooled by requested domain name and request timestamp and sorted in ascending order of request timestamp to obtain the simulated DNS request packets sent to the RDNS.
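The generation of simulated request timestamps can be sketched as below, assuming the request interval distribution of each domain name is hyper-exponential with known mixing weights and rates. The function names, the `windows` and `params` dictionaries, and the choice to emit a request at the start time itself are assumptions made for the example.

```python
import random


def sample_gap(probs, rates, rng):
    """Draw one interval: pick a phase by weight, then an exponential gap."""
    lam = rng.choices(rates, weights=probs, k=1)[0]
    return rng.expovariate(lam)


def simulate_requests(qname, start, end, probs, rates, rng):
    """Accumulate sampled gaps from `start` until the timestamp exceeds `end`."""
    out = [(start, qname)]  # assumption: the first request falls on the start time
    t = start
    while True:
        t += sample_gap(probs, rates, rng)
        if t > end:
            break
        out.append((t, qname))
    return out


def simulate_log(windows, params, seed=0):
    """windows: {qname: (start, end)}; params: {qname: (probs, rates)}.

    Returns all simulated (timestamp, qname) entries in ascending time order.
    """
    rng = random.Random(seed)
    log = []
    for qname, (start, end) in windows.items():
        probs, rates = params[qname]
        log.extend(simulate_requests(qname, start, end, probs, rates, rng))
    log.sort()
    return log
```

Seeding the generator makes a simulated log reproducible, which is convenient when the upper bound is recomputed for comparison.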
Further, step 3) and step 6) use the same clustering method, which comprises:
assigning the same sip label to similar domain names and numbers of occurrences within each time period, and taking each time period as a session unit;
setting a clustering time interval and a session time interval, where each clustering time interval contains several session time intervals and each session time interval contains several session units;
randomly assigning an initial category label to the different sip labels found in each session time interval within a clustering time interval;
using Gibbs sampling to calculate the similarity probability between the current session unit and the session units already in each category, and assigning the label of the most probable category to the current session unit; the session units are processed in turn, and one pass in which every session unit has been processed once and its category label updated once counts as one iteration;
iterating until the set maximum number of iterations is reached or the number of cluster categories converges, which gives the number of categories.
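The iteration described above can be sketched as follows. The text does not give the probability formula, so this example assumes the conditional score of the Dirichlet multinomial mixture model from reference [6]; the hyperparameters `alpha` and `beta`, the maximum number of categories `k`, and the function name are also assumptions. Following the text, each session unit takes the label of the most probable category rather than a randomly sampled one, and the loop stops early when a full pass changes no label, a stricter stand-in for convergence of the category count.

```python
import math
import random
from collections import Counter


def cluster_session_units(units, k=8, alpha=0.1, beta=0.1, max_iter=30, seed=0):
    """units: list of Counter {domain name: occurrences}. Returns a label per unit."""
    rng = random.Random(seed)
    vocab_size = len({name for u in units for name in u})
    labels = [rng.randrange(k) for _ in units]   # random initial category labels
    unit_count = [0] * k                          # session units per category
    total_count = [0] * k                         # domain occurrences per category
    name_count = [Counter() for _ in range(k)]    # per-domain occurrences per category

    def update(i, z, sign):
        unit_count[z] += sign
        for name, c in units[i].items():
            name_count[z][name] += sign * c
            total_count[z] += sign * c

    for i, z in enumerate(labels):
        update(i, z, +1)

    for _ in range(max_iter):
        changed = False
        for i, unit in enumerate(units):
            update(i, labels[i], -1)              # take the unit out of its category
            size = sum(unit.values())
            best, best_score = labels[i], -math.inf
            for z in range(k):
                # Log of the assumed Dirichlet multinomial mixture score.
                score = math.log(unit_count[z] + alpha)
                for name, c in unit.items():
                    for j in range(c):
                        score += math.log(name_count[z][name] + beta + j)
                for j in range(size):
                    score -= math.log(total_count[z] + vocab_size * beta + j)
                if score > best_score:
                    best, best_score = z, score
            changed = changed or best != labels[i]
            labels[i] = best
            update(i, best, +1)
        if not changed:
            break
    return labels
```

The number of categories is then `len(set(labels))`, which under the method serves as the lower or upper bound depending on which request log was clustered.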
The invention has the following beneficial effects:
1. the number of users of the RDNS under test can be estimated in a real network scenario where the user's real source IP address is unavailable after NAT address translation and the number of original user DNS request packets is reduced by the cache compression of the RDNS under test;
2. the DNS request packets that have undergone RDNS cache compression can be converted into simulated DNS request packets initiated by internal users;
3. the lower bound of the number of users of the RDNS can be calculated from the DNS request packets that have undergone RDNS cache compression;
4. the upper bound of the number of users of the RDNS can be calculated from the simulated DNS request packets generated from the cache-compressed DNS request packets and response packets;
5. in an experimental environment, the deviation of the estimated number of users from the real number of users lies mainly between -11% and 3.15%; the deviation rate of the lower bound lies mainly between -21.2% and -7.2%, and that of the upper bound mainly between -1.2% and 3%.
Drawings
Fig. 1 is a data processing flow diagram of the method of the present invention.
Fig. 2 is a diagram of the steps for calculating the user quantity estimate.
Fig. 3 is a diagram of the steps for estimating the cumulative distribution function of a request interval time difference sequence.
Detailed Description
To make the technical solution of the present invention easier to understand, embodiments are described in detail below with reference to the figures.
In the specific embodiment, the passive DNS traffic of an RDNS under test is captured online at an ISP gateway, parsed, and stored as DNS request logs and response logs; the logs are then processed offline with the method of the invention to estimate the number of users of the recursive domain name server under test.
As shown in Fig. 1, the recursive domain name server user quantity estimation method based on passive DNS traffic includes the following steps:
Step 101: a traffic capture tool deployed at the ISP gateway captures all DNS request and response traffic flowing into and out of the specified RDNS under test, parses it, and stores it in DNS request and response log files with a fixed format.
Step 102: all DNS request and response log files obtained in step 101 are filtered with a list of public recursive domain name server IP addresses, checking whether the destination IP address of each DNS request packet and the source IP address of each response packet are on the list; this distinguishes the DNS request and response logs that have undergone RDNS cache compression, as shown in the figure, from those that have not.
Step 103: all DNS request logs from step 101, that is, the union of the DNS request logs sent by the RDNS after cache compression and the DNS request logs that have not undergone cache compression, are sorted in ascending order of request time; the timestamp and requested domain name fields are extracted to form the input data of Fig. 2, the calculation follows the steps of Fig. 2, and the output is the lower bound of the number of users of the specified RDNS under test.
Fig. 2 shows the process of calculating the lower bound of the number of users of the specified RDNS under test, with the extracted timestamps and requested domain name fields as input data. The steps are as follows:
201: extract the timestamps and requested domain names from the cache-compressed DNS request logs sent by the RDNS and from the DNS request logs that have not undergone cache compression, both obtained in step 102.
202: according to the timestamps, count similar domain names and their numbers of occurrences within each 1 min period and assign them the same sip label; each such period is a session unit.
203: take 1 h as the session time interval and 1 day as the clustering time interval, and randomly assign an initial category label to the different sip labels found in each session time interval within one clustering time interval.
204: use Gibbs sampling to calculate the similarity probability between a session unit and the session units already in each category, and assign the label of the most probable category to that session unit; apply this to each session unit in turn. One pass in which every session unit has been processed once and its category label updated once counts as 1 iteration.
205: repeat step 204 until the set maximum number of iterations is reached, so that the number of cluster categories finally converges.
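Steps 201 to 203 group the request log into nested time windows before clustering. A minimal sketch of that grouping, using the 1 min, 1 h, and 1 day values given above, might look as follows; the function name and the `(day, hour, minute)` key layout are assumptions, and the assignment of a shared sip label to similar session units is left out because the text does not specify how similarity is decided.

```python
from collections import Counter, defaultdict

SESSION_UNIT = 60            # 1 min, one session unit
SESSION_INTERVAL = 3600      # 1 h, one session time interval
CLUSTER_INTERVAL = 86400     # 1 day, one clustering time interval


def build_session_units(requests):
    """requests: iterable of (timestamp_seconds, qname).

    Returns {(day_index, hour_index, minute_index): Counter of domain names},
    where each Counter holds the domain names seen in one session unit and
    their numbers of occurrences.
    """
    units = defaultdict(Counter)
    for ts, qname in requests:
        key = (int(ts // CLUSTER_INTERVAL),
               int(ts // SESSION_INTERVAL),
               int(ts // SESSION_UNIT))
        units[key][qname] += 1
    return dict(units)
```

Each value of the returned dictionary is one session unit, in the form expected as input by a clustering step over domain name counts.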
Step 104: from the cache-compressed DNS response logs obtained in step 102, extract, for each domain name appearing in the requested domain name field of a response packet, the start timestamp and the TTL of the resource record in the response body that carries the resolution result for that domain name. The start timestamp plus the TTL is the expiration timestamp of the domain name in the RDNS cache. For each domain name, the time difference between the expiration timestamp of one cache entry and the start timestamp of the next forms the cache update interval time difference sequence, which is the input data of Fig. 3. Specifically, a simulated cache of the RDNS is constructed for each requested domain name: each time interval from a start timestamp to the corresponding TTL expiration timestamp indicates that a resource record for the requested domain name was cached on the RDNS during that interval. The more DNS request and response packets there are for a requested domain name, the more simulated cache time intervals that domain name has on the RDNS, and each simulated cache time interval has a cache start time and a cache end time. The cache update interval time difference sequence of each domain name is then computed: for each domain name, the cache end time of the previous cache time interval is subtracted from the cache start time of the current cache time interval in the simulated cache, and the resulting time difference is one cache update interval time difference. A simulated cache with several cache time intervals therefore yields one fewer cache update interval time differences than it has cache time intervals; these are ordered along the timeline to form the cache update interval time difference sequence of the domain name. The calculation then follows the steps of Fig. 3 to obtain the hyper-exponential distribution parameters of the cache update interval time difference sequence of each domain name, and finally the hyper-exponential distribution parameters of the request interval time difference sequence, which gives the cumulative distribution function of the request interval time difference sequence with which internal users send requests to the specified RDNS under test.
In this step, the algorithm that takes the cache update interval time difference sequence of a domain name as input and calculates its hyper-exponential distribution parameters is shown in Fig. 3. The process is as follows:
301: extract the timestamps and requested domain names from the cache-compressed DNS response logs of step 102.
302: compute the cache update interval time difference sequence of each domain name in the RDNS. Steps 303 to 304 operate on the sequence of a single domain name and are applied to each domain name in turn.
303: take the cache update interval time difference sequence of a domain name obtained in step 302 and estimate its hyper-exponential distribution parameters with the intelligent-component expectation-maximization (EM) algorithm.
304: based on the similarity between the cumulative distributions of the cache update interval time difference sequence and the request interval time difference sequence, calculate the hyper-exponential distribution parameters of the request interval time difference sequence from those obtained in step 303, which gives the cumulative distribution function of the request interval time difference sequence.
Step 105: counting the DNS request logs after the cache compression in step 102 by using the domain name as an identifier, extracting the start time stamp of the first request and the last request of each domain name, taking the start time stamp of the first request as the start time for generating the simulation request log, and adding the TTL survival time of the resource record of the resolution result corresponding to the domain name obtained in step 104 to the start time stamp of the last request as the end time for generating the simulation request log. Then, according to the cumulative distribution function of the request interval time difference sequences obtained in step 104, the internal user simulation request interval time difference sequences meeting the time range are randomly generated, and in combination with the request domain name, the simulation request log of the internal user for the domain name is generated. And by utilizing the operation, sequentially calculating the simulation request logs of the internal users of each domain name, summarizing the simulation request logs of all the domain names, and performing increasing sequencing according to the time stamps to finally obtain the simulation DNS request log sent to the RDNS by the internal users.
Step 106: Merge the simulated DNS request log sent by internal users to the RDNS, obtained in step 105, with the timestamps and domain names extracted from the DNS request log that did not undergo cache compression, obtained in step 102, and sort the result by timestamp to form the input data in figure 2. As in step 103, perform the calculation according to the steps in figure 2, and finally output the upper bound of the user quantity of the specified RDNS under test.
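A small sketch of how the input for step 106 could be assembled, assuming both logs are lists of (timestamp, domain) tuples already sorted by timestamp. The fixed window length and the function names are hypothetical, and the exact input format of figure 2 is not reproduced.

```python
import heapq
from collections import Counter, defaultdict


def merge_logs(simulated, uncached):
    """Merge two timestamp-sorted (timestamp, domain) logs into one sorted log."""
    return list(heapq.merge(simulated, uncached, key=lambda rec: rec[0]))


def count_per_window(log, window_seconds):
    """Count how often each domain name occurs in each fixed-length time window."""
    windows = defaultdict(Counter)
    for ts, domain in log:
        windows[int(ts // window_seconds)][domain] += 1
    return dict(windows)
```

The per-window domain counts are the kind of quantity that the clustering step consumes; the same helper could be applied to the cache-compressed log for the lower bound.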
Step 107: Combine the results of step 103 and step 106 to obtain the final estimated range of the user quantity of the specified RDNS under test.
To test the effect of the invention, three DNS traffic logs covering the same time range but processed in different ways were prepared from observations in a controlled local area network: the original DNS request log of internal users, the compressed DNS request log obtained after the internal users' requests were compressed by the RDNS cache, and the simulated DNS request log of internal users generated by the invention. On the original DNS request log of internal users, clustering was performed with the constrained Dirichlet multinomial mixture (CDMM) clustering algorithm, and the resulting number of categories was taken as the real user quantity of the recursive domain name server. On the compressed DNS request log and the simulated DNS request log, clustering was performed with the method of the invention, and the resulting numbers of categories were taken as the lower bound and the upper bound of the user quantity of the recursive domain name server, respectively. Finally, the obtained upper and lower bounds were compared with the real user quantity and the deviation ratios were calculated.
Based on the specific embodiment and experimental method above, the test experiment used DNS request and response logs obtained by parsing all DNS traffic of internal users, observed internally in a controlled local area network, on the working days of 2 weeks: 2019.12.23-2019.12.27 for week 1 and 2020.1.6-2020.1.10 for week 2.
The user quantities of the recursive domain name server obtained by clustering over these 2 weeks are compared in Table 1. Because the original CDMM clustering algorithm needs to process the users' original DNS request logs for multiple days together, it yields only one result; the method of the invention uses the DNS request log of a single day to estimate the upper and lower bounds of that day's user quantity, so it yields one result per day.
Table 1. Comparison of user quantities
[Table image: Figure BDA0002963764460000081, Figure BDA0002963764460000091]
The deviation ratio is calculated as: deviation ratio = (upper or lower bound of user quantity - real user quantity) / real user quantity × 100%. Across all 20 estimated values, the deviation of the estimated user quantity of the RDNS under test from its real user quantity lies mainly between -11% and 3.15%. Looking at the upper and lower bounds separately, the lower-bound deviation ratio lies mainly between -21.2% and -7.2%, and the upper-bound deviation ratio lies mainly between -1.2% and 3%.
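As a worked illustration, the deviation ratio can be read as the signed difference between an estimated bound and the real user quantity, divided by the real user quantity and expressed in percent. The sketch below follows that reading; the function name is hypothetical and the numbers in the usage note are made up, not taken from Table 1.

```python
def deviation_ratio(estimated, real):
    """Signed deviation of an estimated bound from the real user quantity, in percent."""
    if real <= 0:
        raise ValueError("real user quantity must be positive")
    return (estimated - real) * 100.0 / real
```

For a hypothetical real user quantity of 100, a lower bound of 90 gives `deviation_ratio(90, 100)`, which is -10.0, and an upper bound of 103 gives 3.0.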
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art may modify the technical solution of the present invention or substitute equivalents without departing from its spirit and scope, and the scope of protection of the present invention shall be determined by the claims.

Claims (10)

1. A method for estimating the user quantity of a recursive domain name server based on passive DNS traffic, characterized by comprising the following steps:
1) acquiring the DNS request packets sent by a recursive domain name server (RDNS) to other domain name servers and the DNS response packets sent by other domain name servers to the RDNS;
2) distinguishing, according to the destination IP address of each DNS request packet and the source IP address of each DNS response packet, the DNS request packets and DNS response packets that have undergone RDNS cache compression from those that have not;
3) parsing all DNS request packets, both those that have and those that have not undergone RDNS cache compression, extracting the timestamp and request domain name field of each DNS request packet, counting the domain names and their numbers of occurrences in each time period according to the timestamps, clustering all DNS request packets according to the domain names and their numbers of occurrences to obtain a number of categories, and taking the number of categories as the lower bound of the user quantity estimate of the RDNS;
4) parsing the DNS response packets that have undergone RDNS cache compression, extracting the request timestamp of each domain name and the TTL (time-to-live) expiry time of the corresponding resource record, computing the cache update interval time difference sequence of each domain name in the RDNS, and, on the assumption that the request interval time difference sequence and the cache update interval time difference sequence follow similar cumulative distribution functions, obtaining the cumulative distribution function, and its parameters, of the request interval time difference sequence of users within the RDNS for each domain name;
5) generating the start time and end time of the simulated request packets from the DNS request packets that have undergone RDNS cache compression, randomly generating, according to the cumulative distribution function of the request interval time difference sequence of each domain name obtained in step 4), a request interval time difference sequence of simulated request packets for each domain name that stays within the end time, and obtaining from these sequences the simulated DNS request packets sent by internal users to the RDNS;
6) parsing the DNS request packets that have not undergone RDNS cache compression together with the simulated DNS request packets obtained in step 5), extracting the timestamp and request domain name field of each request packet, counting the domain names and their numbers of occurrences in each time period according to the timestamps, clustering these request packets according to the domain names and their numbers of occurrences to obtain a number of categories, and taking the number of categories as the upper bound of the user quantity estimate of the RDNS;
7) estimating the quantity of internal users using the RDNS according to the lower bound and the upper bound of the user quantity estimate.
2. The method of claim 1, wherein in step 2) the DNS request packets and DNS response packets that have and have not undergone cache compression are distinguished according to whether the destination IP address of the DNS request packet and the source IP address of the DNS response packet are on the IP address list of recursive domain name servers.
3. The method of claim 1, wherein in step 4) the DNS response packets that have undergone RDNS cache compression are parsed; the start timestamp at which each domain name appears in the request domain name field of a DNS response packet and the TTL of the resource record in the resolution result for that domain name are extracted; and the TTL is added to the start timestamp to obtain the TTL expiry time of the resource record corresponding to each domain name.
4. The method of claim 1, wherein in step 4) a simulated cache of the RDNS is constructed for each domain name; each interval from a start timestamp to the corresponding TTL expiry timestamp of the domain name is taken as one simulated cache time interval, each having a cache start time and a cache end time; the cache end time of the preceding cache time interval is subtracted from the cache start time of the following cache time interval of the domain name to obtain one cache update interval time difference; and the cache update interval time differences of all simulated cache time intervals of the domain name are calculated and ordered along the timeline to obtain the cache update interval time difference sequence.
5. The method according to claim 1, wherein in step 4) the cumulative distribution function of the cache update interval time difference sequence of each domain name, and its parameters, are estimated from that sequence; and the cumulative distribution function, and its parameters, of the request interval time difference sequence of users in the RDNS issuing successive requests for a given domain name are calculated on the assumption that the request interval time difference sequence and the cache update interval time difference sequence follow similar cumulative distribution functions.
6. The method according to claim 5, wherein the cumulative distribution function of the cache update interval time difference sequence of each domain name, and its parameters, are estimated in step 4) by: for each of the different distribution assumptions that the cumulative distribution function may satisfy, estimating the corresponding cumulative distribution function and its parameters with an expectation-maximization (EM) estimation algorithm; or, on the basis that the cumulative distribution function of the cache update interval time difference sequence of each domain name follows a hyper-exponential distribution, dynamically adjusting the hyperparameter values of the cumulative distribution function with the intelligent component EM estimation algorithm to obtain the best-fitting cumulative distribution function and its parameters.
7. The method according to claim 1 or 6, wherein the cumulative distribution function of the request interval time difference sequence of each domain name, and its parameters, are obtained in step 4) as follows: on the assumption that the request interval time difference sequence and the cache update interval time difference sequence follow similar cumulative distribution functions, the parameters of the two cumulative distribution functions are related by an equation in terms of the Laplace transform; the parameters of the cumulative distribution function of the cache update interval time difference sequence are substituted into this equation, and the parameters of the cumulative distribution function of the request interval time difference sequence are calculated using the inverse Laplace transform, giving the cumulative distribution function of the request interval time difference sequence of each domain name and its parameters.
8. The method of claim 1, wherein the start time and end time of the simulated request packets in step 5) are generated as follows: for the DNS request packets that have undergone RDNS cache compression, the start timestamps of the first and last request packet of each domain name are extracted; the start timestamp of the first request packet is taken as the start time for generating the simulated request packets, and the TTL expiry time of the resource record corresponding to the last request packet, i.e. its start timestamp plus the TTL, is taken as the end time for generating the simulated request packets.
9. The method according to claim 1 or 8, wherein in step 5) the simulated DNS request packets sent by internal users to the RDNS are obtained from the request interval time difference sequence of the simulated request packets of each domain name as follows: starting from the start time of the simulated request packets, the time differences of the simulated request interval time difference sequence are added in turn, each resulting timestamp being the request timestamp of one simulated request packet, until the resulting request timestamp exceeds the end time of the simulated request packets for that domain name; the simulated request packets of all domain names are then gathered by request domain name and request timestamp and sorted in ascending order of request timestamp to obtain the simulated DNS request packets sent to the RDNS.
10. The method of claim 1, wherein the same clustering method is used in step 3) and step 6), the clustering method comprising the steps of:
assigning the domain names of the same class and their numbers of occurrences in each time period to the same class of sip label, and taking each time period as a session unit;
setting a clustering time interval and a session time interval, wherein each clustering time interval comprises a plurality of session time intervals and each session time interval comprises a plurality of session units;
within a clustering time interval, randomly assigning an initial category label to each distinct sip label found in each session time interval;
using Gibbs sampling to calculate the probability of similarity between the current session unit and the session units already in each category, and assigning the label of the category with the highest probability to the current session unit; processing every session unit in this way in turn, so that one pass over all session units, in which the category label of each session unit is updated once, counts as one iteration;
iterating until the set maximum number of iterations is reached or the number of cluster categories converges, thereby obtaining the number of categories.
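The interval computation described in claims 3 and 4 can be sketched as follows. The sketch assumes that each cache update interval time difference is the gap between the TTL expiry of one simulated cache interval and the start of the next one for the same domain name; the function name and the input layout are hypothetical.

```python
def cache_update_intervals(responses):
    """Cache update interval time differences for one domain name.

    responses: list of (start_ts, ttl) pairs taken from upstream DNS responses.
    Each pair defines a simulated cache interval [start_ts, start_ts + ttl].
    Returns, in time order, the gap between the end of each interval and the
    start of the following one (assumed reading of claim 4).
    """
    intervals = sorted((ts, ts + ttl) for ts, ttl in responses)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(intervals, intervals[1:]):
        gaps.append(next_start - prev_end)
    return gaps
```

For example, responses at 0 s and 100 s with a TTL of 60 s, followed by one at 200 s with a TTL of 30 s, give two gaps of 40 s each. A gap can come out negative if a response is observed before the previous record expires; how such cases are handled is not specified here, so the sketch leaves them as they are.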

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110254552.5A CN112866039A (en) 2021-03-05 2021-03-05 Recursive domain name server user quantity estimation method based on passive DNS traffic

Publications (1)

Publication Number Publication Date
CN112866039A true CN112866039A (en) 2021-05-28

Family

ID=75993444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110254552.5A Pending CN112866039A (en) 2021-03-05 2021-03-05 Recursive domain name server user quantity estimation method based on passive DNS traffic

Country Status (1)

Country Link
CN (1) CN112866039A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115883513A (en) * 2022-11-24 2023-03-31 中国科学院信息工程研究所 Resolver detection method based on DNS watermark technology and classification method thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130179555A1 (en) * 2012-01-10 2013-07-11 Thomson Licensing Method and device for timestamping data and method and device for verification of a timestamp
CN105376344A (en) * 2015-11-26 2016-03-02 中国互联网络信息中心 Method and system for analyzing recursive domain name server related to source address
CN105812204A (en) * 2016-03-14 2016-07-27 中国科学院信息工程研究所 Recursion domain name server online identification method based on connectivity estimation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CAIYUN HUANG等: "How Many Users Behind A Local Recursive DNS Server? Estimated by Delta-Time Cluster Model", 《2020 IEEE 22ND INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING AND COMMUNICATIONS》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210528
