CN110912749A - Method for predicting DNS data

Info

Publication number: CN110912749A
Application number: CN201911197916.XA
Authority: CN
Inventors: 黄韬; 吉星; 鄂新华; 潘恬; 杨帆; 谢人超; 张娇
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-03-24

Abstract

The invention discloses a method for predicting DNS data, comprising the following steps: (1) collecting DNS log data using a tool such as a collector; (2) preprocessing the collected data according to its feature values; (3) reducing the dimensionality of the collected data according to similarity; (4) classifying the data in the low-dimensional space; (5) predicting from the classified low-dimensional data using regression; and (6) collecting the prediction results and analyzing trends. By extracting and processing the query log information in the DNS, network traffic and website security can be predicted.

Description

Method for predicting DNS data

Technical Field

The invention belongs to the field of computer network information, and particularly relates to a method for predicting DNS data.

Background

A DNS (Domain Name System) server is a server that translates between domain names and their corresponding IP addresses. The DNS server stores a table of domain names and their corresponding IP addresses in order to resolve the domain names carried in messages. After a domain name has been looked up and registered and a hosting service has been purchased, the domain name must be resolved to the purchased host before the website content can be viewed. At present, DNS networks offer no means of predicting network traffic and website security.

Disclosure of Invention

In view of the above technical problem, an object of the present invention is to provide a method for DNS data prediction that collects, preprocesses, reduces the dimensionality of, classifies, and performs regression on DNS data, and then analyzes the predictions. The method can alleviate the curse of dimensionality caused by high-dimensional data and improve the prediction accuracy of the classification and regression tree, so that aspects such as the destination of website traffic and website security can be analyzed.

A method for DNS data prediction, comprising the steps of:

collecting DNS log data using a tool such as a collector;

preprocessing the collected data according to its feature values;

reducing the dimensionality of the collected data according to similarity;

classifying the data in the low-dimensional space;

predicting the classified low-dimensional data by using regression;

collecting the prediction results and analyzing trends.

Preferably, the collected information is the log of the DNS server, which includes start-up, restart, shutdown, log output, and message information.

Preferably, the data preprocessing operation comprises:

the data comprise: the number of DNS requests per unit time from a source IP, the peak number of DNS requests, the proportion of failed DNS requests, the information entropy of the source port, the information entropy of the domain name types, the peak number of domain name types, the proportion of illegal domain names, the proportion of abnormal packets, and the server denial-of-service rate; the data preprocessing process comprises, in sequence, standardization and normalization; where the actual minimum and maximum values of a feature attribute are unknown, a standard score is used for standardization; all data are then normalized.
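By way of non-limiting illustration, the two-stage scaling described above might be sketched as follows. The function names `standardize`, `min_max_normalize`, and `preprocess`, the row-per-sample layout, and the sample values are assumptions made for this sketch and do not appear in the disclosure.

```python
from statistics import mean, pstdev

def standardize(column):
    """Standard score: (x - mean) / standard deviation of the column."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

def min_max_normalize(column):
    """Rescale the column linearly so that its values lie in [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def preprocess(rows):
    """Standardize and then normalize every feature column of the samples."""
    columns = [list(col) for col in zip(*rows)]
    scaled = [min_max_normalize(standardize(col)) for col in columns]
    return [list(row) for row in zip(*scaled)]

# Hypothetical samples: (DNS requests per unit time, failed-request proportion).
samples = [[120, 0.02], [80, 0.10], [400, 0.45]]
print(preprocess(samples))
```

Each feature is scaled independently, so a feature with a large numeric range (such as a request count) does not dominate one expressed as a proportion.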

Preferably, the dimensionality reduction operation on the data comprises: expressing the similarity used to map the high-dimensional space to the low-dimensional space as a conditional probability instead of a Euclidean distance, taking the symmetry between two points into account; measuring the similarity between every pair of points with a Gaussian kernel function in the original high-dimensional space; measuring the similarity between pairs of points with a t-distribution in the mapped low-dimensional space; and finally minimizing the average KL divergence by gradient descent on its gradient, so that the dimensionality of the data is reduced.
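The operation described above has the general form of t-distributed stochastic neighbor embedding (t-SNE), although the disclosure does not name the algorithm. On that assumption, the quantities involved may be written as follows, where the symbols are introduced here for illustration only: x_i are the N high-dimensional points, y_i their low-dimensional images, σ_i a per-point Gaussian bandwidth, and η a learning rate.

```latex
\begin{align}
p_{j|i} &= \frac{\exp\left(-\lVert x_i - x_j\rVert^2 / 2\sigma_i^2\right)}
                {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k\rVert^2 / 2\sigma_i^2\right)} \\
p_{ij} &= \frac{p_{j|i} + p_{i|j}}{2N} \\
q_{ij} &= \frac{\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1}}
               {\sum_{k \neq l} \left(1 + \lVert y_k - y_l\rVert^2\right)^{-1}} \\
C &= \mathrm{KL}(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \\
\frac{\partial C}{\partial y_i} &= 4 \sum_{j} \left(p_{ij} - q_{ij}\right)
   \left(y_i - y_j\right)\left(1 + \lVert y_i - y_j\rVert^2\right)^{-1} \\
y_i &\leftarrow y_i - \eta \, \frac{\partial C}{\partial y_i}
\end{align}
```

The conditional probability p_{j|i} is the Gaussian-kernel similarity, the averaging in p_{ij} is the symmetry consideration, q_{ij} is the t-distribution similarity in the low-dimensional space, and the last line is one gradient descent step on the KL divergence.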

Preferably, the classification operation on the low-dimensional data comprises: dividing the data along a boundary so that data with different characteristics are separated; specifically, all the low-dimensional data and their corresponding class labels are given; if the data are linearly separable, the separating hyperplane is found directly; and if the data are not linearly separable, the data are mapped to an (n + 1)-dimensional space in which the hyperplane is found.

Preferably, the process of predicting data using regression comprises: first, building a tree on the classified data by finding the optimal feature on which to split the data and judging whether the data can be split; if the data cannot be split, setting the data as a leaf node; if the data can be split, dividing the data set into left and right subtrees according to the optimal split feature, and then building trees on the left and right subtrees in the same way;

the process of finding the optimal split feature comprises calculating the error of each candidate split and, if the current error is smaller than the minimum error found so far, setting the current split as the optimal split and updating the minimum error;

the process of predicting with the regression tree comprises: judging whether the current regression tree is a leaf node and, if so, making the prediction; if not, comparing the test data's value for the corresponding feature with the split value of the current regression tree and selecting the left or right subtree according to which value is larger; judging whether the selected subtree is a leaf node and, if so, making the prediction; and, if not, continuing the regression prediction recursively from that subtree.

With the method for predicting DNS data according to the invention, the dimensionality of the processed data can be reduced, the low-dimensional data can be classified, and a regression model can be constructed for prediction, so that internet access activity and security can be analyzed.

Drawings

FIG. 1 shows a flow chart of a method for DNS data prediction according to an embodiment of the invention.

FIG. 2 is a block diagram illustrating a method for DNS data prediction according to an embodiment of the present invention.

FIG. 3 is a flow chart illustrating a network organization for a method of DNS data prediction according to an embodiment of the present invention.

Detailed Description

The following is a detailed description of embodiments of the invention, illustrated in the accompanying drawings in which like or similar reference numerals refer to the same or similar components or components having the same or similar functions throughout the several views. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As shown in FIG. 1, a method for DNS data prediction according to an embodiment of the present invention comprises:

101: collecting DNS log data using a tool such as a collector;

102: preprocessing the collected data according to its feature values;

103: reducing the dimensionality of the collected data according to similarity;

104: classifying the data in the low-dimensional space;

105: predicting the classified low-dimensional data by using regression;

106: collecting the prediction results and analyzing trends.

In step 101, collecting log data comprises:

The collected information is the log of the DNS server, which includes information such as start-up, restart, shutdown, log output, and message output.

In step 102, the preprocessing operation procedure for the data includes:

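The attributes of the DNS data include the number of DNS requests per unit time from a source IP, the peak number of DNS requests, the proportion of failed DNS requests, the information entropy of the source port, the information entropy of the domain name types, the peak number of domain name types, the proportion of illegal domain names, the proportion of abnormal packets, and the proportion of server denials of service.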

The raw DNS data have several problems: inconsistency, duplication, noise, and high dimensionality. Preprocessing therefore comprises data cleaning, data integration, data transformation, and data reduction.
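By way of non-limiting illustration, some of the attributes of step 102 might be derived from parsed log records as sketched below. The record fields `src_port`, `qtype`, and `rcode`, the reading of "domain name type" as the DNS query type, and the rule that any response code other than `NOERROR` counts as a failure are all assumptions made for this sketch; the disclosure does not specify a log format.

```python
import math
from collections import Counter

def information_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of values."""
    total = len(values)
    if total == 0:
        return 0.0
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def source_features(records):
    """Summarize the queries sent by one source IP within one time window."""
    total = len(records)
    # Assumption: every response code other than NOERROR is a failed request.
    failures = sum(1 for r in records if r["rcode"] != "NOERROR")
    return {
        "request_count": total,
        "failure_proportion": failures / total if total else 0.0,
        "source_port_entropy": information_entropy([r["src_port"] for r in records]),
        "query_type_entropy": information_entropy([r["qtype"] for r in records]),
    }

# Hypothetical records for a single source IP in one time window.
window = [
    {"src_port": 53211, "qtype": "A", "rcode": "NOERROR"},
    {"src_port": 53212, "qtype": "A", "rcode": "NXDOMAIN"},
    {"src_port": 53213, "qtype": "AAAA", "rcode": "NOERROR"},
    {"src_port": 53214, "qtype": "A", "rcode": "NOERROR"},
]
print(source_features(window))
```

A low source-port entropy means that most requests come from the same few ports, whereas a high entropy means that the ports are spread evenly; the same reading applies to the query types.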

In step 103, the dimensionality reduction operation process on the data comprises the following steps:

The similarity used to map the high-dimensional space to the low-dimensional space is expressed as a conditional probability instead of a Euclidean distance, taking the symmetry between two points into account. The similarity between every pair of points is measured with a Gaussian kernel function in the original high-dimensional space, and with a t-distribution in the mapped low-dimensional space. Finally, the average KL divergence is minimized by gradient descent on its gradient, so that the dimensionality of the data is reduced.
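By way of non-limiting illustration, step 103 might be sketched as below, on the assumption that the operation is a t-SNE-style embedding, which the disclosure does not state by name. The sketch simplifies in two ways that are assumptions of this example: a single fixed Gaussian bandwidth `sigma` is used for all points, and plain gradient descent is used without momentum. All function names and the sample points are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def high_dim_affinities(points, sigma=1.0):
    """Symmetrized Gaussian-kernel similarities p_ij in the original space."""
    n = len(points)
    cond = []
    for i in range(n):
        weights = [0.0 if i == j else
                   math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
                   for j in range(n)]
        total = sum(weights) or 1.0  # guard against an all-zero row
        cond.append([w / total for w in weights])
    return [[(cond[i][j] + cond[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

def low_dim_affinities(embedding):
    """t-distribution similarities q_ij and the unnormalized kernel values."""
    n = len(embedding)
    kernel = [[0.0 if i == j else
               1.0 / (1.0 + sq_dist(embedding[i], embedding[j]))
               for j in range(n)] for i in range(n)]
    total = sum(map(sum, kernel))
    return [[k / total for k in row] for row in kernel], kernel

def kl_divergence(p, q):
    """KL(P || Q), skipping entries where p_ij is zero."""
    return sum(pij * math.log(pij / qij)
               for prow, qrow in zip(p, q)
               for pij, qij in zip(prow, qrow) if pij > 0 and qij > 0)

def tsne(points, dim=2, steps=200, learning_rate=1.0, sigma=1.0, seed=0):
    """Return the embedding and the KL divergence recorded at every step."""
    rng = random.Random(seed)
    n = len(points)
    p = high_dim_affinities(points, sigma)
    y = [[rng.gauss(0.0, 1e-2) for _ in range(dim)] for _ in range(n)]
    history = []
    for _ in range(steps):
        q, kernel = low_dim_affinities(y)
        history.append(kl_divergence(p, q))
        grads = [[0.0] * dim for _ in range(n)]
        for i in range(n):
            for j in range(n):
                coeff = 4.0 * (p[i][j] - q[i][j]) * kernel[i][j]
                for d in range(dim):
                    grads[i][d] += coeff * (y[i][d] - y[j][d])
        for i in range(n):
            for d in range(dim):
                y[i][d] -= learning_rate * grads[i][d]  # gradient descent step
    history.append(kl_divergence(p, low_dim_affinities(y)[0]))
    return y, history

# Hypothetical 4-dimensional feature vectors forming two groups.
points = [[0.0, 0.1, 0.0, 0.2], [0.3, 0.0, 0.1, 0.0], [0.1, 0.2, 0.3, 0.1],
          [5.0, 5.1, 5.0, 5.2], [5.3, 5.0, 5.1, 5.0], [5.1, 5.2, 5.3, 5.1]]
embedding, history = tsne(points)
print(embedding)
print(history[0], history[-1])
```

Practical implementations usually choose the bandwidth of each point from a target perplexity rather than fixing it, which this sketch omits for brevity.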

In step 104, the process of classifying the low-dimensional data includes:

The classification operation aims to divide the data along a boundary so that data with different characteristics are separated. Specifically, all the low-dimensional data and their corresponding class labels are given. If the data are linearly separable, the separating hyperplane is found directly; if the data are not linearly separable, the data are mapped to an (n + 1)-dimensional space in which the hyperplane is found. The expression of the hyperplane, namely the classification function, is thereby obtained, and the low-dimensional data are classified.
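By way of non-limiting illustration, the mapping to an (n + 1)-dimensional space might be sketched as below. The disclosure names neither the mapping nor the training procedure, so both are assumptions of this sketch: the extra coordinate is the squared norm of the point, and a simple perceptron stands in for whatever method is used to find the hyperplane. The function names and sample data are hypothetical.

```python
def lift(point):
    """Append the squared norm as an extra coordinate: n dims -> n + 1 dims."""
    return list(point) + [sum(x * x for x in point)]

def decision_value(w, b, x):
    """Signed score of x relative to the hyperplane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_perceptron(samples, labels, epochs=1000):
    """Search for w, b such that label * (w.x + b) > 0 for every sample."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * decision_value(w, b, x) <= 0:
                # Misclassified: move the hyperplane toward this sample.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every sample is on the correct side
            break
    return w, b

def classify(w, b, x):
    """Classification function: the side of the hyperplane on which x lies."""
    return 1 if decision_value(w, b, x) > 0 else -1

# Hypothetical 1-dimensional data: class -1 lies between the two parts of
# class +1, so no single threshold on the line separates the classes.
points = [[-3.0], [-2.0], [-0.5], [0.0], [0.5], [2.0], [3.0]]
labels = [1, 1, -1, -1, -1, 1, 1]

lifted = [lift(p) for p in points]           # now 2-dimensional: (x, x^2)
w, b = train_perceptron(lifted, labels)
print([classify(w, b, x) for x in lifted])
```

In the lifted space the two classes differ in their second coordinate, so a hyperplane exists even though none exists on the original line.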

In step 105, the process of predicting data using regression includes:

First, a tree is built on the classified data: the optimal feature on which to split the data is found, and it is judged whether the data can be split. If the data cannot be split, they are set as a leaf node; if they can be split, the data set is divided into left and right subtrees according to the optimal split feature, and trees are then built on the left and right subtrees in the same way.

The process of finding the optimal split feature comprises calculating the error of each candidate split; if the current error is smaller than the minimum error found so far, the current split is set as the optimal split and the minimum error is updated.

Second, prediction with the regression tree proceeds as follows. It is judged whether the current regression tree is a leaf node; if so, the prediction is made. If not, the test data's value for the corresponding feature is compared with the split value of the current regression tree, and the left or right subtree is selected according to which value is larger. It is then judged whether the selected subtree is a leaf node; if so, the prediction is made, and if not, the regression prediction continues recursively from that subtree.
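By way of non-limiting illustration, the tree building, split search, and prediction of step 105 might be sketched as below. Several details are assumptions of this sketch, because the disclosure does not fix them: the error of a split is the sum of squared deviations from the mean on each side, a leaf predicts the mean of its targets, and samples whose feature value is less than or equal to the split value go to the left subtree. All names and sample values are hypothetical.

```python
def squared_error(targets):
    """Sum of squared deviations of the targets from their mean."""
    if not targets:
        return 0.0
    mu = sum(targets) / len(targets)
    return sum((t - mu) ** 2 for t in targets)

def best_split(rows, targets):
    """Return the (feature, value) split with the lowest error, or None."""
    best, min_error = None, squared_error(targets)
    for feature in range(len(rows[0])):
        for value in sorted({r[feature] for r in rows}):
            left = [t for r, t in zip(rows, targets) if r[feature] <= value]
            right = [t for r, t in zip(rows, targets) if r[feature] > value]
            if not left or not right:
                continue  # a split must put data on both sides
            error = squared_error(left) + squared_error(right)
            if error < min_error:  # smaller than the minimum found so far
                best, min_error = (feature, value), error
    return best

def build_tree(rows, targets):
    """Recursively build a regression tree; a leaf is a plain number."""
    split = best_split(rows, targets)
    if split is None:  # the data cannot be split further: make a leaf
        return sum(targets) / len(targets)
    feature, value = split
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] <= value]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] > value]
    return {
        "feature": feature,
        "value": value,
        "left": build_tree([r for r, _ in left], [t for _, t in left]),
        "right": build_tree([r for r, _ in right], [t for _, t in right]),
    }

def predict(tree, row):
    """Descend from the root until a leaf is reached and return its value."""
    if not isinstance(tree, dict):
        return tree
    branch = "left" if row[tree["feature"]] <= tree["value"] else "right"
    return predict(tree[branch], row)

# Hypothetical training data: one feature and a target such as a request count.
rows = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
targets = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
tree = build_tree(rows, targets)
print(predict(tree, [2.5]), predict(tree, [11.0]))
```

Because a split is accepted only when it strictly lowers the error, a group of samples with identical targets becomes a leaf instead of being split further.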

In step 106, the prediction result analysis process includes:

The prediction covers the activity of website users and the security of websites. The classified low-dimensional data are predicted by regression, and analysis of the results plays an important role in areas such as advertising and security.

FIG. 2 is a block diagram of a method for predicting DNS data according to an embodiment of the present invention. The collected DNS server data are the logs of the DNS server, which include information such as start-up, restart, shutdown, log output, and messages. The preprocessing operation comprises data cleaning, data integration, data transformation, and data reduction. The dimensionality reduction operation mainly uses conditional probability instead of Euclidean distance to represent the similarity when mapping the high-dimensional space to the low-dimensional space; its main purpose is to eliminate redundancy and reduce the quantity of data to be processed. The main purpose of the classification operation is to improve the effect of the data regression. The regression tree is mainly used to predict the data; the traffic of a website is then analyzed according to the prediction results, so that matters such as advertisement placement and website security can be assessed.

FIG. 3 shows a network organization flowchart of a method for predicting DNS data according to an embodiment of the present invention. The method first uses tools such as a data collector to collect data such as the log information of a DNS server. The data are then preprocessed by methods such as data cleaning, integration, transformation, and reduction. Next, dimensionality reduction is applied to extract the useful information and discard the useless information. The low-dimensional data are then classified; classification allows a better tree to be built, which improves the accuracy of the regression prediction. Finally, the data are predicted by constructing a regression tree, and the predicted data are useful for analyzing the traffic and security of a website.

Claims

1. A method for DNS data prediction, comprising the steps of:

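collecting DNS log data using a tool such as a collector;

preprocessing the collected data according to its feature values;

reducing the dimensionality of the collected data according to similarity;

classifying the data in the low-dimensional space;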

predicting the classified low-dimensional data by using regression;

collecting the prediction results and analyzing trends.

2. The method of claim 1, wherein the collected information is a log of the DNS server, wherein the log includes start-up, restart, shutdown, log output, and message information.

3. The method of claim 1, wherein the data preprocessing operation comprises:

4. The method of claim 1, wherein the dimensionality reduction operation on the data comprises: expressing the similarity used to map the high-dimensional space to the low-dimensional space as a conditional probability instead of a Euclidean distance, taking the symmetry between two points into account; measuring the similarity between every pair of points with a Gaussian kernel function in the original high-dimensional space; measuring the similarity between pairs of points with a t-distribution in the mapped low-dimensional space; and finally minimizing the average KL divergence by gradient descent on its gradient, so that the dimensionality of the data is reduced.

5. The method of claim 1, wherein the classification operation on the low-dimensional data comprises: dividing the data along a boundary so that data with different characteristics are separated; specifically, all the low-dimensional data and their corresponding class labels are given; if the data are linearly separable, the separating hyperplane is found directly; and if the data are not linearly separable, the data are mapped to an (n + 1)-dimensional space in which the hyperplane is found.

6. The method of claim 1, wherein the process of predicting data using regression comprises: first, building a tree on the classified data by finding the optimal feature on which to split the data and judging whether the data can be split; if the data cannot be split, setting the data as a leaf node; if the data can be split, dividing the data set into left and right subtrees according to the optimal split feature, and then building trees on the left and right subtrees in the same way;