CN110912749A - Method for predicting DNS data - Google Patents

Info

Publication number
CN110912749A
Authority
CN
China
Prior art keywords
data
tree
dns
prediction
regression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911197916.XA
Other languages
Chinese (zh)
Inventor
黄韬
吉星
鄂新华
潘恬
杨帆
谢人超
张娇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-29
Filing date
2019-11-29
Publication date
2020-03-24
Application filed by Beijing University of Technology
Priority to CN201911197916.XA
Publication of CN110912749A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/147 Network analysis or design for predicting network behaviour
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Algebra (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method for predicting DNS data, comprising the following steps: (1) collecting DNS log data with a collector or similar tool; (2) preprocessing the collected data according to its feature values; (3) reducing the dimensionality of the collected data according to similarity; (4) classifying the data in the low-dimensional space; (5) predicting the classified low-dimensional data by regression; (6) collecting the prediction results and analyzing trends. By extracting and processing the query log information in the DNS, network traffic and website security can be predicted.

Description

Method for predicting DNS data
Technical Field
The invention belongs to the field of computer network information, and particularly relates to a method for predicting DNS data.
Background
A DNS (Domain Name System) server is a server that translates between domain names and their corresponding IP addresses. It stores a table of domain names and their IP addresses and uses it to resolve the domain names contained in messages. After a domain name has been registered and hosting services have been purchased, the domain name must be resolved to the purchased host before the website content can be viewed. At present, DNS networks lack a way to predict network traffic and website security.
Disclosure of Invention
In view of the above technical problems, an object of the present invention is to provide a method for DNS data prediction that collects DNS data, preprocesses it, reduces its dimensionality, classifies it, applies regression to it, and analyzes the predictions. The method addresses the curse of dimensionality caused by high-dimensional data and improves the prediction accuracy of the classification and regression tree, so that aspects such as website traffic destinations and website security can be analyzed.
A method for DNS data prediction, comprising the steps of:
collecting DNS log data with a collector or similar tool;
preprocessing the collected data according to its feature values;
reducing the dimensionality of the collected data according to similarity;
classifying the data in the low-dimensional space;
predicting the classified low-dimensional data by regression;
collecting the prediction results and analyzing trends.
Preferably, the collected information is the log of the DNS server, which includes start-up, restart, shutdown, log output, and message information.
Preferably, the data preprocessing operation comprises:
the data comprise: the number of DNS requests per source IP per unit time, the peak number of DNS requests, the proportion of failed DNS requests, the information entropy of the source port, the information entropy of domain name types, the peak number of domain name types, the proportion of illegal domain names, the proportion of abnormal packets, and the server denial-of-service rate; the data preprocessing consists of standardization followed by normalization; because the actual minimum and maximum values of a feature attribute are unknown, the standard score is used for standardization; all data are then normalized.
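As an illustration only, the two scaling steps described above can be sketched in Python. The sketch assumes that the standardization step is the standard score (z-score) and that the final normalization is min-max rescaling to [0, 1]. The text does not name the second formula, so that choice, the function names, and the sample values are all assumptions.

```python
import statistics

def z_score(values):
    """Standardize a feature column to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale a feature column to [0, 1] (assumed meaning of 'normalized')."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def preprocess(column):
    """Standardization followed by normalization, in the order the text gives."""
    return min_max(z_score(column))

# Hypothetical feature column: DNS requests per source IP per unit time.
requests_per_window = [120, 95, 130, 4000, 110]
standardized = z_score(requests_per_window)
scaled = preprocess(requests_per_window)
```

Each feature column would be passed through `preprocess` separately, so that features with very different ranges (counts, proportions, entropies) end up on a common scale.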
Preferably, the dimensionality reduction operation on the data comprises: expressing the similarity used to map the high-dimensional space to the low-dimensional space as a conditional probability rather than a Euclidean distance, and making it symmetric between each pair of points; measuring the similarity between every two points with a Gaussian kernel function in the original high-dimensional space; measuring the similarity between two points with a t distribution in the mapped low-dimensional space; and finally computing the gradient of the average KL divergence and minimizing that divergence by gradient descent, thereby reducing the dimensionality of the data.
Preferably, the classification operation on the low-dimensional data comprises: drawing boundaries in the data so that data with different characteristics are separated; specifically, all the low-dimensional data and their corresponding class labels are given; if the data are linearly separable, a separating hyperplane is found directly; if the data are not linearly separable, they are mapped to an (n + 1)-dimensional space, in which a separating hyperplane is found.
Preferably, the process of predicting data by regression comprises: first, building a tree on the classified data by finding the optimal splitting feature of the data and determining whether the data can be split; if the data cannot be split, they are made a leaf node; if they can be split, the data set is divided into left and right subtrees according to the optimal splitting feature, and a tree is then built on each of the left and right subtrees;
finding the optimal splitting feature comprises computing the error of each candidate split and, if the current error is smaller than the current minimum error, setting the current split as the optimal split and updating the minimum value;
predicting with the regression tree comprises determining whether the current regression tree is a leaf node; if so, the prediction is made; if not, the test data's value for the corresponding feature is compared with the split value of the current regression tree, and the left or right subtree is selected according to whether the test value is larger; if the selected subtree is a leaf node, the prediction is made, and if not, regression prediction continues from that subtree.
According to the method for predicting DNS data, the dimensionality of the processed data is reduced, the low-dimensional data are classified, and a regression tree is built for prediction, so that the activity and security of internet access can be analyzed.
Drawings
FIG. 1 shows a flow chart of a method for DNS data prediction according to an embodiment of the present invention.
FIG. 2 shows a block diagram of a method for DNS data prediction according to an embodiment of the present invention.
FIG. 3 shows a network organization flow chart of a method for DNS data prediction according to an embodiment of the present invention.
Detailed Description
The following is a detailed description of embodiments of the invention, illustrated in the accompanying drawings in which like or similar reference numerals refer to the same or similar components or components having the same or similar functions throughout the several views. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
As shown in FIG. 1, an embodiment of the present invention is a method for DNS data prediction comprising:
101: collecting DNS log data with a collector or similar tool;
102: preprocessing the collected data according to its feature values;
103: reducing the dimensionality of the collected data according to similarity;
104: classifying the data in the low-dimensional space;
105: predicting the classified low-dimensional data by regression;
106: collecting the prediction results and analyzing trends.
In step 101, collecting log data comprises:
the collected information is the log of the DNS server, which includes information such as start-up, restart, shutdown, log output, and message output.
In step 102, the preprocessing operation procedure for the data includes:
the attributes of the DNS data include the number of DNS requests per source IP per unit time, the peak number of DNS requests, the proportion of failed DNS requests, the information entropy of the source port, the information entropy of domain name types, the peak number of domain name types, the proportion of illegal domain names, the proportion of abnormal packets, and the server denial-of-service rate.
The raw DNS data has several problems: inconsistency, duplication, noise, and high dimensionality. Preprocessing therefore comprises data cleaning, data integration, data transformation, and data reduction.
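To make a few of the listed attributes concrete, the following Python sketch derives the request count, the failure proportion, and the two information entropies from one time window of already parsed log records. The record layout `(source_ip, source_port, name_type, succeeded)` is hypothetical, since the text does not specify a log format, and the function names are illustrative.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Information entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def features_for_window(records):
    """Compute a few per-window features for a single source IP."""
    if not records:
        raise ValueError("a window needs at least one record")
    ports = [r[1] for r in records]
    name_types = [r[2] for r in records]
    failures = sum(1 for r in records if not r[3])
    return {
        "request_count": len(records),
        "failure_ratio": failures / len(records),
        "port_entropy": shannon_entropy(ports),
        "name_type_entropy": shannon_entropy(name_types),
    }

# Hypothetical parsed records: (source_ip, source_port, name_type, succeeded).
window = [
    ("10.0.0.1", 5353, "A", True),
    ("10.0.0.1", 5353, "A", True),
    ("10.0.0.1", 5354, "AAAA", False),
    ("10.0.0.1", 5355, "MX", True),
]
features = features_for_window(window)
```

A source that always uses the same port yields an entropy of zero, while a source that spreads its requests over many ports yields a higher value, which is why entropy is a useful summary of port and domain-name-type diversity.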
In step 103, the dimensionality reduction operation process on the data comprises the following steps:
the similarity used to map the high-dimensional space to the low-dimensional space is expressed as a conditional probability rather than a Euclidean distance, and is made symmetric between each pair of points; the similarity between every two points is measured with a Gaussian kernel function in the original high-dimensional space, and with a t distribution in the mapped low-dimensional space; finally, the gradient of the average KL divergence is computed and the divergence is minimized by gradient descent, thereby reducing the dimensionality of the data.
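The procedure described above matches the t-SNE algorithm, which the non-patent citations of this publication also name. A minimal pure-Python sketch follows, under assumptions the text does not state: a single fixed Gaussian bandwidth `sigma` replaces the usual per-point perplexity search, plain gradient descent is used without momentum, and the learning rate, step count, and sample points are illustrative.

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_affinities(X, sigma=1.0):
    """Symmetric similarities p_ij built from Gaussian conditional probabilities."""
    n = len(X)
    cond = []
    for i in range(n):
        row = [0.0 if i == j else math.exp(-sq_dist(X[i], X[j]) / (2 * sigma ** 2))
               for j in range(n)]
        total = sum(row)  # assumed non-zero: every point has a reachable neighbour
        cond.append([v / total for v in row])
    # Symmetrize: p_ij = (p_j|i + p_i|j) / (2n), so the matrix sums to 1.
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)] for i in range(n)]

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) similarities q_ij plus the raw kernel."""
    n = len(Y)
    kernel = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j])) for j in range(n)]
              for i in range(n)]
    total = sum(map(sum, kernel))
    return [[k / total for k in row] for row in kernel], kernel

def kl_divergence(P, Q):
    return sum(p * math.log(p / q)
               for rp, rq in zip(P, Q) for p, q in zip(rp, rq) if p > 0)

def tsne(X, dims=2, steps=300, lr=0.5, sigma=1.0, seed=0):
    rng = random.Random(seed)
    n = len(X)
    P = high_dim_affinities(X, sigma)
    Y = [[rng.gauss(0.0, 1e-2) for _ in range(dims)] for _ in range(n)]
    history = []
    for _ in range(steps):
        Q, kernel = low_dim_affinities(Y)
        history.append(kl_divergence(P, Q))
        grads = []
        for i in range(n):
            g = [0.0] * dims
            for j in range(n):
                # dC/dy_i = 4 * sum_j (p_ij - q_ij) * (y_i - y_j) / (1 + |y_i - y_j|^2)
                coeff = 4.0 * (P[i][j] - Q[i][j]) * kernel[i][j]
                for d in range(dims):
                    g[d] += coeff * (Y[i][d] - Y[j][d])
            grads.append(g)
        for i in range(n):
            for d in range(dims):
                Y[i][d] -= lr * grads[i][d]
    return Y, history

# Two hypothetical clusters of 4-dimensional feature vectors.
X = [[0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0],
     [3.0, 3.0, 3.0, 3.0], [3.1, 3.0, 3.0, 3.0], [3.0, 3.1, 3.0, 3.0]]
Y, history = tsne(X)
```

For real workloads a library implementation with perplexity calibration and momentum would normally be preferred; this version is only meant to show the Gaussian similarities, the t-distributed similarities, and the gradient step side by side.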
In step 104, the process of classifying the low dimensional data includes:
the classification operation draws boundaries in the data so that data with different characteristics are separated; specifically, all the low-dimensional data and their corresponding class labels are given; if the data are linearly separable, a separating hyperplane is found directly; if the data are not linearly separable, they are mapped to an (n + 1)-dimensional space, in which a separating hyperplane is found; the expression of that hyperplane is the classification function, with which the low-dimensional data are classified.
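The idea of lifting data into one extra dimension can be illustrated with a small Python sketch. Two things here are assumptions rather than the method of this publication: the extra coordinate is taken to be the squared norm of the point, and a simple perceptron stands in for whatever hyperplane-finding procedure is actually used (the non-patent citations mention SVM). The sample points and labels are hypothetical.

```python
def lift(point):
    """Map an n-dimensional point to n + 1 dimensions by appending its squared norm."""
    return list(point) + [sum(v * v for v in point)]

def train_perceptron(points, labels, max_epochs=1000):
    """Search for a hyperplane w.x + b = 0; the flag reports whether one was found."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the boundary: nudge the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False

def classify(w, b, x):
    """The classification function: the side of the hyperplane the point falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# One-dimensional data whose positive class lies on both sides of the negative class,
# so no single threshold separates it.
xs = [[-2.0], [-1.5], [-0.5], [0.0], [0.5], [1.5], [2.0]]
labels = [1, 1, -1, -1, -1, 1, 1]

_, _, separable_raw = train_perceptron(xs, labels)
lifted = [lift(x) for x in xs]
w, b, separable_lifted = train_perceptron(lifted, labels)
```

In the original space the search fails, while in the lifted space the points (x, x²) can be split by a straight line, which is the situation the text describes for data that are not linearly separable.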
In step 105, the process of predicting data using regression includes:
first, a tree is built on the classified data: the optimal splitting feature of the data is found and it is determined whether the data can be split; if the data cannot be split, they are made a leaf node; if they can be split, the data set is divided into left and right subtrees according to the optimal splitting feature, and a tree is then built on each of the left and right subtrees.
Finding the optimal splitting feature comprises computing the error of each candidate split and, if the current error is smaller than the current minimum error, setting the current split as the optimal split and updating the minimum value.
Second, prediction with the regression tree proceeds as follows: it is determined whether the current regression tree is a leaf node; if so, the prediction is made; if not, the test data's value for the corresponding feature is compared with the split value of the current regression tree, and the left or right subtree is selected according to whether the test value is larger; if the selected subtree is a leaf node, the prediction is made, and if not, regression prediction continues from that subtree.
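The build and predict procedures above can be sketched as a small regression tree in Python. Several details are assumptions, because the text leaves them open: the split error is the sum of squared errors around each side's mean, a leaf predicts the mean of its targets, rows with a feature value greater than the threshold go to the right subtree, and the names and sample data are illustrative.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean (assumed split error measure)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, targets):
    """Return the (feature, threshold) with the smallest error, or None."""
    best, best_err = None, sse(targets)
    for f in range(len(rows[0])):
        for threshold in sorted({r[f] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            if not left or not right:
                continue
            err = sse(left) + sse(right)
            if err < best_err:  # keep the split only if it lowers the minimum error
                best, best_err = (f, threshold), err
    return best

def build_tree(rows, targets, min_samples=2):
    split = best_split(rows, targets) if len(rows) >= min_samples else None
    if split is None:
        return {"leaf": mean(targets)}  # data cannot be split further
    f, threshold = split
    left = [i for i, r in enumerate(rows) if r[f] <= threshold]
    right = [i for i, r in enumerate(rows) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [targets[i] for i in left], min_samples),
        "right": build_tree([rows[i] for i in right], [targets[i] for i in right], min_samples),
    }

def predict(tree, row):
    """Walk from the root to a leaf, choosing a side at every internal node."""
    while "leaf" not in tree:
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]

# Hypothetical one-feature data set with two clearly different regimes.
rows = [[1], [2], [3], [10], [11], [12]]
targets = [5, 5, 5, 50, 50, 50]
tree = build_tree(rows, targets)
low = predict(tree, [2.5])
high = predict(tree, [20])
```

With this data the root splits on the single feature at threshold 3, and both children become leaves because no further split lowers the error.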
In step 106, the prediction result analysis process includes:
the prediction covers the activity of website users and the security of websites; the classified low-dimensional data are predicted by regression, and analysis of the results is valuable for advertising and security purposes.
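The text does not say how trends in the collected predictions are analyzed. One simple possibility, shown here purely as an assumption, is to fit a least-squares line through the sequence of predicted values and read its slope as the trend. The function name and the sample values are hypothetical.

```python
def linear_trend(values):
    """Least-squares slope of the values against their positions 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two predictions are needed to estimate a trend")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator

# Hypothetical predicted visit volumes for consecutive time windows.
predicted_visits = [100, 110, 120, 130]
slope = linear_trend(predicted_visits)
```

A positive slope suggests growing activity and a negative slope suggests declining activity; a sudden change in slope between windows could be flagged for a closer security review.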
FIG. 2 is a block diagram of a method for predicting DNS data according to an embodiment of the present invention. The DNS server data collected is the log of the DNS server, which includes information such as start-up, restart, shutdown, log output, and messages. Preprocessing comprises data cleaning, data integration, data transformation, and data reduction. Dimensionality reduction mainly uses conditional probability in place of Euclidean distance to express the similarity with which the high-dimensional space is mapped to the low-dimensional space; its main purpose is to eliminate redundancy and reduce the amount of data to be processed. The main purpose of the classification operation is to improve the regression results. The regression tree is used to predict the data, and the website's visit volume is then analyzed from the prediction results, which supports decisions on matters such as advertisement placement and website security.
FIG. 3 shows a network organization flow chart of a method for predicting DNS data according to an embodiment of the present invention. The method first collects data such as the log information of the DNS server with a data collector or similar tool. It then preprocesses the data by cleaning, integration, transformation, reduction, and similar methods, and reduces the dimensionality of the data to extract useful information and discard useless information. Next it classifies the low-dimensional data; classification allows a better tree to be built, which improves the accuracy of regression prediction. Finally it predicts the data by building a regression tree. The predicted data are useful for analyzing website traffic and website security.

Claims (6)

1. A method for DNS data prediction, comprising the steps of:
collecting DNS log data with a collector or similar tool;
preprocessing the collected data according to its feature values;
reducing the dimensionality of the collected data according to similarity;
classifying the data in the low-dimensional space;
predicting the classified low-dimensional data by regression;
collecting the prediction results and analyzing trends.
2. The method of claim 1, wherein the collected information is a log of the DNS server, wherein the log includes start-up, restart, shutdown, log output, and message information.
3. The method of claim 1, wherein the data preprocessing operation comprises:
the data comprise: the number of DNS requests per source IP per unit time, the peak number of DNS requests, the proportion of failed DNS requests, the information entropy of the source port, the information entropy of domain name types, the peak number of domain name types, the proportion of illegal domain names, the proportion of abnormal packets, and the server denial-of-service rate; the data preprocessing consists of standardization followed by normalization; because the actual minimum and maximum values of a feature attribute are unknown, the standard score is used for standardization; all data are then normalized.
4. The method of claim 1, wherein the dimensionality reduction operation on the data comprises: expressing the similarity used to map the high-dimensional space to the low-dimensional space as a conditional probability rather than a Euclidean distance, and making it symmetric between each pair of points; measuring the similarity between every two points with a Gaussian kernel function in the original high-dimensional space; measuring the similarity between two points with a t distribution in the mapped low-dimensional space; and finally computing the gradient of the average KL divergence and minimizing that divergence by gradient descent, thereby reducing the dimensionality of the data.
5. The method of claim 1, wherein the classification operation on the low-dimensional data comprises: drawing boundaries in the data so that data with different characteristics are separated; specifically, all the low-dimensional data and their corresponding class labels are given; if the data are linearly separable, a separating hyperplane is found directly; if the data are not linearly separable, they are mapped to an (n + 1)-dimensional space, in which a separating hyperplane is found.
6. The method of claim 1, wherein predicting the data by regression comprises: first, building a tree on the classified data by finding the optimal splitting feature of the data and determining whether the data can be split; if the data cannot be split, they are made a leaf node; if they can be split, the data set is divided into left and right subtrees according to the optimal splitting feature, and a tree is then built on each of the left and right subtrees;
finding the optimal splitting feature comprises computing the error of each candidate split and, if the current error is smaller than the current minimum error, setting the current split as the optimal split and updating the minimum value;
predicting with the regression tree comprises determining whether the current regression tree is a leaf node; if so, the prediction is made; if not, the test data's value for the corresponding feature is compared with the split value of the current regression tree, and the left or right subtree is selected according to whether the test value is larger; if the selected subtree is a leaf node, the prediction is made, and if not, regression prediction continues from that subtree.
CN201911197916.XA 2019-11-29 2019-11-29 Method for predicting DNS data Pending CN110912749A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911197916.XA CN110912749A (en) 2019-11-29 2019-11-29 Method for predicting DNS data

Publications (1)

Publication Number Publication Date
CN110912749A true CN110912749A (en) 2020-03-24

Family

ID=69820483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911197916.XA Pending CN110912749A (en) 2019-11-29 2019-11-29 Method for predicting DNS data

Country Status (1)

Country Link
CN (1) CN110912749A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034474A (en) * 2018-07-26 2018-12-18 北京航空航天大学 Subway station clustering and regression analysis method and system based on POI data and passenger flow data
CN110458425A (en) * 2019-07-25 2019-11-15 腾讯科技(深圳)有限公司 Risk analysis method, apparatus, readable medium and electronic device for risk subjects

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
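IVYYIN: "The t-SNE algorithm", 《CSDN》 *
YAOYZ105: "Machine learning algorithms (1): SVM", 《CSDN》 *
吉星,黄韬,鄂新华,孙礼: "A DNS query anomaly detection algorithm based on log information", 《北京邮电大学学报》 *
张晶: "Research on a multi-objective prediction algorithm based on AdaBoost regression trees", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》 *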

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003596A (en) * 2021-11-16 2022-02-01 国家工业信息安全发展研究中心 Multi-source heterogeneous data processing system and method based on industrial system
CN114003596B (en) * 2021-11-16 2022-07-12 国家工业信息安全发展研究中心 Multi-source heterogeneous data processing system and method based on industrial system
CN116016220A (en) * 2022-12-23 2023-04-25 天翼安全科技有限公司 Method, device and equipment for predicting service traffic based on DNS traffic

Similar Documents

Publication Publication Date Title
US11816078B2 (en) Automatic entity resolution with rules detection and generation system
WO2018176874A1 (en) Dns evaluation method and apparatus
CN108737423B (en) Phishing website discovery method and system based on webpage key content similarity analysis
US20170230393A1 (en) Systems and methods for traffic classification
CN111212053B (en) Industrial control honeypot-oriented homologous attack analysis method
US20160065534A1 (en) System for correlation of domain names
US10404731B2 (en) Method and device for detecting website attack
CN113328985B (en) Passive Internet of things equipment identification method, system, medium and equipment
CN105704259B (en) A kind of domain name authority services source IP recognition methods and system
WO2020176269A1 (en) System and method for file artifact metadata collection and analysis
CN110912749A (en) Method for predicting DNS data
KR102425525B1 (en) System and method for log anomaly detection using bayesian probability and closed pattern mining method and computer program for the same
US20200242488A1 (en) Systems and methods for crowdsourcing device recognition
GB2569678A (en) Automation of SQL tuning method and system using statistic SQL pattern analysis
KR100906454B1 (en) Database log data management apparatus and method thereof
CN116032741A (en) Equipment identification method and device, electronic equipment and computer storage medium
CN112054992B (en) Malicious traffic identification method and device, electronic equipment and storage medium
CN112199388A (en) Strange call identification method and device, electronic equipment and storage medium
CN111431884A (en) Host computer defect detection method and device based on DNS analysis
CN113688240B (en) Threat element extraction method, threat element extraction device, threat element extraction equipment and storage medium
CN115001724B (en) Network threat intelligence management method, device, computing equipment and computer readable storage medium
CN114363039A (en) Method, device, equipment and storage medium for identifying fraud websites
CN112564928B (en) Service classification method and device and Internet system
WO2023063972A1 (en) Records matching techniques for facilitating database search and fragmented record detection
WO2023063970A1 (en) Records matching techniques for facilitating database search and fragmented record detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200324