CN114826712B - Malicious domain name detection method and device and electronic equipment - Google Patents


Info

Publication number
CN114826712B
Authority
CN
China
Prior art keywords
domain name
data
target
features
user interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210396045.XA
Other languages
Chinese (zh)
Other versions
CN114826712A (en
Inventor
迟菁华
王文旭
刘嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China filed Critical Agricultural Bank of China
Priority to CN202210396045.XA priority Critical patent/CN114826712B/en
Publication of CN114826712A publication Critical patent/CN114826712A/en
Application granted granted Critical
Publication of CN114826712B publication Critical patent/CN114826712B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a malicious domain name detection method, a malicious domain name detection device, and electronic equipment. The method comprises: acquiring domain name system data to be detected; performing data cleaning on the domain name system data to obtain target data; and performing feature extraction on the target data to generate feature vectors, where the extracted features include at least domain name statistical features and user interest access features. The feature vectors are input into a target detection model to obtain a detection result corresponding to the domain name system data. The target detection model is a neural network model trained on domain name statistical features and user interest access features, and is used to detect whether a domain name in the domain name system data is malicious. By combining domain name statistical features with user interest access features, the method and device can effectively resist an attacker's disguising of domain names and improve the accuracy of malicious domain name detection.

Description

Malicious domain name detection method and device and electronic equipment
Technical Field
The present invention relates to the field of information processing technologies, and in particular, to a method and an apparatus for detecting a malicious domain name, and an electronic device.
Background
With the development of internet technology, network security threats have become a problem that must be solved in the research and development of internet technology. Most network security threats are unrecorded, unknown threats, and detecting and identifying them is of great significance for guaranteeing the security of cyberspace.
Common malicious domain name detection methods extract domain name features in order to judge whether a domain name is malicious. However, to evade detection based on domain name statistical features, an attacker usually disguises a malicious domain name so that it resembles benign domain names, which reduces the accuracy of malicious domain name detection.
Disclosure of Invention
Aiming at the problems, the invention provides a malicious domain name detection method, a malicious domain name detection device and electronic equipment, which improve the accuracy of malicious domain name detection.
In order to achieve the above object, the present invention provides the following technical solutions:
A malicious domain name detection method, the method comprising:
acquiring domain name system data to be detected;
performing data cleaning on the domain name system data to obtain target data;
performing feature extraction on the target data to generate feature vectors, wherein the extracted features include at least domain name statistical features and user interest access features;
and inputting the feature vectors into a target detection model to obtain a detection result corresponding to the domain name system data, wherein the target detection model is a neural network model trained on domain name statistical features and user interest access features, and is used for detecting whether a domain name in the domain name system data is a malicious domain name.
Optionally, the performing data cleaning on the domain name system data to obtain target data includes:
based on the data source characteristics, carrying out data source cleaning on the domain name system data to obtain a first data set;
And carrying out data extraction on the first data set based on user access characteristics to obtain target data, wherein the user access characteristics comprise a user IP address, a user access domain name and a time stamp of each domain name accessed by the user.
Optionally, the feature extraction of the target data generates a feature vector, including:
extracting domain name statistical features of the target data, wherein the domain name statistical features include domain name character features, information entropy, and the mean of the language model (N-gram) mean sequence;
classifying unknown domain names in the target data to obtain domain name classification results;
determining user interest labels according to the domain name classification results;
determining whether each domain name in the target data is accessed by users with different user interest labels in the same time period, to obtain the user interest access features;
and generating a feature vector for each domain name in the target data based on the domain name statistical features and the user interest access features.
Optionally, the method further comprises:
Performing data cleaning on the obtained original domain name system data to obtain an effective data set;
performing feature analysis on the effective data set to obtain domain name statistical features and user interest access features;
generating a domain name feature vector according to the domain name statistical feature and the user interest access feature;
determining a domain name category label matched with each domain name feature vector, wherein the domain name category label comprises a benign domain name label and a malicious domain name label;
and training the neural network based on the effective data set marked with the domain name class labels to obtain a target detection model.
Optionally, the method further comprises:
acquiring original domain name system data, including:
acquiring data from a target domain name server to a local domain name server;
acquiring data from the local domain name server to a client;
acquiring data from the client to a local domain name server;
And acquiring data from the local domain name server to the target domain name server.
Optionally, the classifying the unknown domain name in the target data to obtain a domain name classification result includes:
Classifying unknown domain names in the target data based on a domain name classifier to obtain domain name classification results;
The domain name classifier is a neural network model obtained by performing natural language processing training on the obtained domain name data, and is used for classifying the webpage category corresponding to the domain name.
Optionally, the determining the user interest tag according to the domain name classification result includes:
determining the webpage category according to the domain name classification result;
acquiring access frequency data of each user for accessing a website of a corresponding webpage class;
and determining the user interest tag based on the access frequency data.
A malicious domain name detection apparatus, the apparatus comprising:
The data acquisition unit is used for acquiring domain name system data to be detected;
The data cleaning unit is used for cleaning the data of the domain name system to obtain target data;
The feature extraction unit is used for extracting features of the target data and generating feature vectors, wherein the features based on the feature extraction at least comprise domain name statistical features and user interest access features;
The detection unit is used for inputting the feature vector into a target detection model to detect and obtain a detection result corresponding to the domain name system data, the target detection model is a neural network model obtained by training the domain name statistical feature and the user interest access feature, and the target detection model is used for detecting whether the domain name system data is a malicious domain name or not.
A storage medium storing executable instructions which when executed by a processor implement a malicious domain name detection method as claimed in any one of the preceding claims.
An electronic device, comprising:
A memory for storing a program;
the processor is configured to execute the program, where the program is specifically configured to implement the malicious domain name detection method according to any one of the foregoing.
Compared with the prior art, the invention provides a malicious domain name detection method, a malicious domain name detection device, and electronic equipment. The method comprises: acquiring domain name system data to be detected; performing data cleaning on the domain name system data to obtain target data; and performing feature extraction on the target data to generate feature vectors, where the extracted features include at least domain name statistical features and user interest access features. The feature vectors are input into a target detection model to obtain a detection result corresponding to the domain name system data. The target detection model is a neural network model trained on domain name statistical features and user interest access features, and is used to detect whether a domain name in the domain name system data is malicious. By combining domain name statistical features with user interest access features, the method and device can effectively resist an attacker's disguising of domain names and improve the accuracy of malicious domain name detection.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic flow chart of a malicious domain name detection method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a domain name system data source according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a domain name classifier according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a feature analyzer according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a machine learning trainer according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a malicious domain name detection device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first and second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to the listed steps or elements but may include steps or elements not expressly listed.
The embodiment of the invention provides a malicious domain name detection method based on both domain name statistical features and user interest access features. This avoids problems of conventional methods, such as relying on a single detection path that an attacker can easily evade, and improves the detection of malicious domain names.
Referring to fig. 1, a flow chart of a malicious domain name detection method provided by an embodiment of the present invention may include the following steps:
S101, acquiring domain name system data to be detected.
The domain name system data to be detected generally includes multiple pieces of domain name data for which it has not yet been determined whether they are malicious domain names. The domain name system is a distributed database that maps domain names and IP addresses to each other, which lets users access the internet more conveniently.
S102, data cleaning is carried out on the domain name system data to obtain target data.
The target data is the Domain Name System (DNS) data that facilitates analysis of the domain name statistical features and the user interest access features. In one embodiment, performing data cleaning on the domain name system data to obtain target data includes: performing data source cleaning on the domain name system data based on data source characteristics to obtain a first data set; and performing data extraction on the first data set based on user access characteristics to obtain the target data, where the user access characteristics include the user IP address, the domain names accessed by the user, and the timestamp of each access. Data from different sources come in different formats or record forms, so the domain name system data must be processed uniformly according to the characteristics of each data source to obtain the first data set; that is, the first data set is the data obtained after cleaning along dimensions such as whether formats are uniform and whether data are missing. The relevant data are then extracted according to the user access characteristics.
Specifically, DNS data contains a large number of domain names in the xx.in-addr.arpa format. These belong to reverse DNS lookups, which convert a 32-bit numeric IP address back to a domain name. For example, the reverse domain name for IP address 218.30.103.170 is 170.103.30.218.in-addr.arpa. Such data does not help malicious domain name detection and needs to be removed during data cleaning, which reduces unnecessary data volume and shortens detection time.
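The reverse-lookup filtering described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `reverse_name`, `is_reverse_lookup`, and `drop_reverse_lookups` are hypothetical, and only the IPv4 `in-addr.arpa` zone mentioned in the text is handled.

```python
import ipaddress


def reverse_name(ip: str) -> str:
    """Build the reverse-lookup name for an IP address using the standard library."""
    return ipaddress.ip_address(ip).reverse_pointer


def is_reverse_lookup(domain: str) -> bool:
    """Return True for names in the IPv4 reverse-lookup zone (in-addr.arpa)."""
    name = domain.rstrip(".").lower()
    return name == "in-addr.arpa" or name.endswith(".in-addr.arpa")


def drop_reverse_lookups(domains):
    """Keep only the names that are not reverse lookups, preserving order."""
    return [d for d in domains if not is_reverse_lookup(d)]


if __name__ == "__main__":
    print(reverse_name("218.30.103.170"))
    print(drop_reverse_lookups(["example.com", "170.103.30.218.in-addr.arpa"]))
```

Matching on the suffix rather than on a full pattern keeps the filter cheap, which matters when it runs over every record of a large DNS log.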
S103, extracting features of the target data and generating feature vectors.
The extracted features include at least domain name statistical features and user interest access features. Specifically, performing feature extraction on the target data to generate feature vectors includes: extracting domain name statistical features of the target data, where the domain name statistical features include domain name character features, information entropy, and the mean of the language model (N-gram) mean sequence; classifying unknown domain names in the target data to obtain domain name classification results; determining user interest labels according to the domain name classification results; determining whether each domain name in the target data is accessed by users with different user interest labels in the same time period, to obtain the user interest access features; and generating a feature vector for each domain name in the target data based on the domain name statistical features and the user interest access features.
Extraction of the domain name statistical features focuses mainly on domain name character features, information entropy, and the mean of the N-gram mean sequence. The user IP address, the domain names accessed by the user, and the access time data can be used to analyze the user interest access features.
S104, inputting the feature vector into a target detection model to obtain a detection result corresponding to the domain name system data.
The target detection model is a neural network model obtained by training domain name statistical characteristics and user interest access characteristics, and is used for detecting whether domain name system data is a malicious domain name or not.
It should be noted that in the embodiment of the present invention, the malicious domain name classification detection is implemented through the target detection model. The generation process of the target detection model is matched with the execution flow of the malicious domain name detection method. That is, the process of generating the object detection model includes: performing data cleaning on the obtained original domain name system data to obtain an effective data set; performing feature analysis on the effective data set to obtain domain name statistical features and user interest access features; generating a domain name feature vector according to the domain name statistical feature and the user interest access feature; determining a domain name category label matched with each domain name feature vector, wherein the domain name category label comprises a benign domain name label and a malicious domain name label; and training the neural network based on the effective data set marked with the domain name class labels to obtain a target detection model.
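As a rough illustration of the final training step, the sketch below fits a very small neural network to labelled feature vectors. The source does not specify the network architecture, loss, or optimizer, so everything here is an assumption: `TinyMLP` is a hypothetical one-hidden-layer network trained with per-sample gradient descent, written in pure Python only to stay self-contained, and the toy vectors are made up for demonstration.

```python
import math
import random


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class TinyMLP:
    """Hypothetical stand-in: one hidden layer, sigmoid units, cross-entropy loss."""

    def __init__(self, n_in: int, n_hidden: int, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def _forward(self, x):
        h = [_sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        p = _sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        return h, p

    def predict_proba(self, x) -> float:
        """Probability that the feature vector belongs to a malicious domain."""
        return self._forward(x)[1]

    def loss(self, xs, ys) -> float:
        """Mean binary cross-entropy over the data set."""
        eps = 1e-12
        total = 0.0
        for x, y in zip(xs, ys):
            p = self.predict_proba(x)
            total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        return total / len(xs)

    def fit(self, xs, ys, epochs: int = 500, lr: float = 0.5):
        for _ in range(epochs):
            for x, y in zip(xs, ys):
                h, p = self._forward(x)
                d_out = p - y  # gradient of the loss w.r.t. the output pre-activation
                for j, hj in enumerate(h):
                    d_hid = d_out * self.w2[j] * hj * (1.0 - hj)
                    self.w2[j] -= lr * d_out * hj
                    self.b1[j] -= lr * d_hid
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hid * xi
                self.b2 -= lr * d_out


if __name__ == "__main__":
    # Made-up vectors: [scaled length, scaled entropy, cross-interest flag].
    xs = [[0.2, 0.3, 0], [0.3, 0.4, 0], [0.8, 0.9, 1], [0.7, 0.85, 1]]
    ys = [0, 0, 1, 1]  # 0 = benign label, 1 = malicious label
    model = TinyMLP(n_in=3, n_hidden=4)
    model.fit(xs, ys)
    print(model.predict_proba([0.75, 0.9, 1]))
```

In practice a framework-based network trained on the full labelled data set would replace this toy class; the point is only that each domain name contributes one feature vector and one benign or malicious label.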
In the process of generating the target detection model in the embodiment of the invention, a data cleaner can be used to clean the obtained original domain name system data and produce the effective data set; a domain name classifier can be used to classify domain names; a feature analyzer can be used to perform feature analysis on the effective data set and obtain the domain name statistical features and the user interest access features; and a machine learning trainer can be used to train a neural network on the effective data set to obtain the target detection model.
It should be noted that the data cleaning, feature extraction, domain name classification, and related processes used in malicious domain name detection are the same as those in the following detailed description of generating the target detection model, which may be referred to for details.
First, Fig. 2 is a schematic diagram of the domain name system data sources provided in an embodiment of the present invention. Correspondingly, acquiring the original domain name system data includes: acquiring data sent from a target domain name server to a local domain name server; acquiring data sent from the local domain name server to a client; acquiring data sent from the client to the local domain name server; and acquiring data sent from the local domain name server to the target domain name server. The target domain name server is an authoritative domain name server and can be determined according to the specific application scenario. In the embodiment of the present invention, the original domain name system data is DNS data generated in real time at a backbone network node of a service provider. Collecting data at different locations in the DNS architecture yields data of different granularity: an Internet Service Provider (ISP) or a higher-level authoritative server can provide detailed log data as well as statistically meaningful data, both of which help in detecting malicious behavior. Because malicious domain name detection here requires analyzing user interest, it is sufficient to collect only the DNS data sent from the local server to the client. Beyond collecting the domain names themselves, the detection target requires learning from the DNS data which domain names each user visits frequently, and then inferring the user's interests (for example, a preference for video websites or shopping websites) from the webpage category to which each domain name belongs. The user IP and the domain name visited by the user therefore need to be extracted from the DNS data that the resolver returns to the client.
When users with different interests access the same domain name in the same time period, that domain name is likely to be a malicious domain name that was accessed passively, so the time at which each user accessed the domain name is also needed.
In summary, the DNS data cleaning section mainly includes:
acquiring the user IP address;
acquiring the domain names accessed by the user;
acquiring the timestamp of the user's access to each domain name.
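The three fields listed above can be pulled out of a DNS response log with a small parser. The source does not give the log format, so the whitespace-separated layout `<ISO timestamp> <client IP> <queried name>` assumed below, along with the names `DnsRecord` and `parse_line`, is purely hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class DnsRecord:
    user_ip: str         # user IP address
    domain: str          # domain name accessed by the user
    timestamp: datetime  # time of the access


def parse_line(line: str) -> Optional[DnsRecord]:
    """Parse one line of the assumed log format; return None if it is malformed."""
    parts = line.split()
    if len(parts) != 3:
        return None
    ts_text, user_ip, domain = parts
    try:
        timestamp = datetime.fromisoformat(ts_text)
    except ValueError:
        return None
    # Normalize the name: drop the trailing root dot and lowercase it.
    return DnsRecord(user_ip, domain.rstrip(".").lower(), timestamp)


if __name__ == "__main__":
    print(parse_line("2022-04-15T10:00:00 10.0.0.1 Example.COM."))
```

Returning `None` for malformed lines lets the caller drop incomplete records, which corresponds to the missing-data dimension of cleaning mentioned earlier.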
DNS data contains a large number of domain names in the xx.in-addr.arpa format. These belong to reverse DNS lookups, which convert a 32-bit numeric IP address back to a domain name. For example, the reverse domain name for IP address 218.30.103.170 is 170.103.30.218.in-addr.arpa. Such data does not help malicious domain name detection and needs to be removed during data cleaning, so as to reduce unnecessary data volume and shorten detection time.
In the embodiment of the invention, unknown domain names in the target data can be classified by a domain name classifier to obtain domain name classification results. The domain name classifier is a neural network model obtained by natural language processing training on the acquired domain name data, and is used for classifying the webpage category corresponding to a domain name.
Specifically, to perform interest analysis for each user IP, it is necessary to know the types of websites that the user IP frequently visits, so websites must be classified and the websites under each type counted as comprehensively as possible. When a user accesses a domain name, the category of the website being accessed (such as shopping, sports, medical, or life services) can then be determined. Website data for the various website types can be obtained from domain name classification websites, but the number of domain names appearing in DNS request logs is very large, and the existing data cannot cover all of them. Therefore, crawlers, Natural Language Processing (NLP), and classification algorithms can be used to process the domain names that cannot otherwise be classified and to generate an optimal domain name classification model.
Fig. 3 is a schematic structural diagram of a domain name classifier according to an embodiment of the present invention. The main purpose of the domain name classifier is to determine the category of a web page. An HTML document processing element extracts the sentences in the web page that are useful for categorization; a Jieba word segmentation element splits each long sentence into meaningful words; and a word vectorization element then generates a vector space, usually with tens to hundreds of dimensions (the parameter can be set as required), in which each word is assigned a unique vector. Finally, a model trainer generates the optimal domain name classification model.
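The preprocessing stages of the classifier (HTML processing, word segmentation, vectorization) can be sketched with the standard library alone. Two techniques are swapped in here and should not be read as the source's method: a regular-expression tokenizer for ASCII words replaces Jieba segmentation, and a bag-of-words count replaces the learned word-vector space. `TextExtractor` and `page_to_bag_of_words` are hypothetical names.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def page_to_bag_of_words(html: str) -> Counter:
    """Turn a web page into word counts usable as classifier input."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    tokens = re.findall(r"[a-z]+", text)  # stand-in for Jieba segmentation
    return Counter(tokens)


if __name__ == "__main__":
    page = ("<html><head><title>Sports news</title>"
            "<script>var x = 1;</script></head>"
            "<body><p>Football sports</p></body></html>")
    print(page_to_bag_of_words(page))
```

For Chinese pages the tokenizer would have to be replaced by a real segmenter, since the regular expression above only recognizes runs of ASCII letters.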
In one embodiment, referring to Fig. 4, a domain name analyzer can be used to obtain the domain name statistical features and to perform domain name feature analysis based on user interest modeling, thereby obtaining the user interest access features. The main processing flow of the domain name analyzer includes: extracting domain name character features, information entropy, the mean of the N-gram mean sequence, and similar information; classifying unknown domain names with the domain name classification model; defining user interest labels according to the domain name classification results; analyzing whether each domain name is accessed by users with different interests in the same time period; and generating the feature vector corresponding to each domain name.
Specifically, extraction of the domain name statistical features focuses mainly on domain name character features, information entropy, and the mean of the N-gram mean sequence.
Domain name length is a primary feature distinguishing benign from malicious domain names. Domain names exist so that users can remember and access sites easily, so a normal domain name is usually not very long. A malicious domain name generated by a domain generation algorithm (DGA), however, is generally not typed in manually by users, and a large number are generated at once, so such names are often longer in order to avoid colliding with normal domain names.
Benign and malicious domain names also differ in the number, continuity, and dispersion of their characters. Because DGA domain names are mostly generated randomly from a seed, the generated strings rarely contain runs of consecutive letters or consecutive digits. Normal domain names differ: digits or letters are usually grouped together to convey some meaning, and letters and digits are rarely interleaved.
The English alphabet has 26 letters, of which only 5 are vowels. Normal domain names are usually readable, so each should contain one or more vowels, whereas the random generation of DGA domain names leads to a relatively low proportion of vowels in malicious domain names.
Information entropy is the average amount of information, or the expected information, of each event. A regular event tends to have a small entropy value and an irregular event a large one; the larger the entropy value, the more random the event, and vice versa. To make names easier to remember than IP addresses, most benign domain names are formed by joining English words and therefore follow the character patterns of English words.
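The character-level features discussed above can be computed along the following lines. The source names the features but does not define them precisely, so the definitions here are assumptions: each ratio is the share of characters covered by the matching runs, entropy is the Shannon entropy of the character distribution in bits, and the input is whatever part of the domain name the caller chooses to analyze. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())


def _covered_fraction(text: str, pattern: str) -> float:
    """Fraction of characters in `text` covered by matches of `pattern`."""
    if not text:
        return 0.0
    return sum(len(m.group(0)) for m in re.finditer(pattern, text)) / len(text)


def statistical_features(label: str) -> dict:
    """Compute assumed versions of the character features for one label."""
    text = label.lower()
    return {
        "length": len(text),
        "digit_ratio": _covered_fraction(text, r"[0-9]"),
        "consecutive_digit_ratio": _covered_fraction(text, r"[0-9]{2,}"),
        "consecutive_letter_ratio": _covered_fraction(text, r"[a-z]{2,}"),
        "repeated_char_ratio": _covered_fraction(text, r"(.)\1+"),
        "vowel_ratio": _covered_fraction(text, r"[aeiou]"),
        "entropy": shannon_entropy(text),
    }


if __name__ == "__main__":
    for name in ("google", "x7k2q9zt4b"):
        print(name, statistical_features(name))
```

Comparing a dictionary-like label with a random-looking one shows the expected direction of each feature: the random string has a lower vowel ratio and a higher digit ratio, in line with the reasoning in the text.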
The N-gram is a concept from Natural Language Processing (NLP). It is introduced here to compute the 1-gram, 2-gram, and 3-gram sequences of each domain name, that is, the sequences of all substrings of 1, 2, and 3 consecutive characters in the domain name.
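A minimal sketch of the N-gram extraction follows; `char_ngrams` and `ngram_sequences` are hypothetical helper names, and how the sequences are later reduced to a mean value is not shown because the source does not define it here.

```python
def char_ngrams(text: str, n: int) -> list:
    """Return every substring of n consecutive characters, in order."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_sequences(text: str) -> dict:
    """Map n (1, 2, 3) to the corresponding N-gram sequence of the text."""
    return {n: char_ngrams(text, n) for n in (1, 2, 3)}


if __name__ == "__main__":
    print(ngram_sequences("bank"))
```

A string shorter than n simply yields an empty sequence, so very short labels need no special handling.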
The embodiment of the invention also provides a method for determining the user interest tag according to the domain name classification result, which comprises the following steps: determining the webpage category according to the domain name classification result; acquiring access frequency data of each user for accessing a website of a corresponding webpage class; and determining the user interest tag based on the access frequency data.
The user IP address, the domain names accessed by the user, and the access time data obtained after cleaning the DNS log data can be used to analyze user interests. Domain name categories can be identified with the optimal domain name classification model generated by the domain name classifier. User interests are modeled with access frequency data when determining the user interest tags: the category of domain names that a user accesses frequently is defined as that user's interest. For example, each IP address represents one user. The cleaned DNS request log data is first grouped by IP address, and all domain names visited by each IP address are counted. The category of every domain name accessed by each user is then identified with the constructed domain name classifier. Finally, all website categories visited by each user (such as shopping, sports and fitness, or network science) are counted, together with the number of times the user visited each category. A uniform frequency threshold is specified; when a user's visits to a category exceed the threshold, the user is considered interested in that category, the category is defined as an interest of the user, and the user interest tag is thereby determined.
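The frequency-threshold tagging just described might look like the sketch below. The input shapes are assumptions: `records` is an iterable of `(user_ip, domain)` pairs and `domain_category` is a mapping produced by some classifier; the name `interest_tags` is hypothetical.

```python
from collections import Counter, defaultdict


def interest_tags(records, domain_category, threshold):
    """Return {user_ip: set of categories visited more than `threshold` times}."""
    counts = defaultdict(Counter)
    for user_ip, domain in records:
        category = domain_category.get(domain)
        if category is not None:  # skip domains with no known category
            counts[user_ip][category] += 1
    return {
        user_ip: {cat for cat, n in per_category.items() if n > threshold}
        for user_ip, per_category in counts.items()
    }


if __name__ == "__main__":
    categories = {"shop.example": "shopping", "sport.example": "sports"}
    log = [("10.0.0.1", "shop.example")] * 3 + [("10.0.0.1", "sport.example")]
    print(interest_tags(log, categories, threshold=2))
```

Using one threshold for all users follows the text; a per-user relative threshold would be an alternative design, but the source does not describe one.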
Further, after the statistics are complete, the similarity between each pair of users is calculated. The interests of each user are represented by a vector. Many similarity measures exist; because the only concern here is how much the interest vectors of two users overlap, the Jaccard similarity is selected. With X and Y denoting the interest vectors of two different users, treated as sets of interest categories, it is calculated as: J(X, Y) = |X ∩ Y| / |X ∪ Y|.
When the similarity of two users is 0, they are considered to be users with different interests. When two users are judged to have different interests, it is checked whether they accessed the same domain name within a certain time period. If a domain name was accessed by users with different interests within the same time period, the domain name is labeled 1. After all user IPs with different interests have been checked, every domain name not labeled 1 is labeled 0.
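The labeling rule above can be sketched as follows, using the standard Jaccard similarity (size of the intersection divided by size of the union) over sets of interest tags. Several points are assumptions made only for this sketch: "the same time period" is modeled as a fixed-length window obtained by integer division of the timestamp, two users with no tags at all are given similarity 0, and the names `jaccard` and `cross_interest_flags` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(x: set, y: set) -> float:
    """Jaccard similarity |x & y| / |x | y|; defined as 0.0 when both sets are empty (assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def cross_interest_flags(records, tags, window_seconds):
    """Return {domain: 1 or 0}; 1 if users with different interests hit it in the same window.

    records        : iterable of (ip, domain, timestamp_in_seconds)
    tags           : {ip: set of interest categories}
    window_seconds : length of the time period (assumed fixed-size buckets)
    """
    users_per_bucket = defaultdict(set)
    flags = {}
    for ip, domain, timestamp in records:
        flags.setdefault(domain, 0)  # every domain not labeled 1 stays 0
        users_per_bucket[(domain, int(timestamp // window_seconds))].add(ip)
    for (domain, _bucket), ips in users_per_bucket.items():
        for a, b in combinations(sorted(ips), 2):
            if jaccard(tags.get(a, set()), tags.get(b, set())) == 0.0:
                flags[domain] = 1
                break
    return flags
```

The resulting 0/1 value corresponds to feature 12 of Table 1.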
Therefore, in the embodiment of the invention, multidimensional features are obtained by analysis. To counter the flexible and changing attack methods of attackers, user interests are modeled, and a feature indicating whether a domain name is accessed by users with different interests in the same time period is added to the classification algorithm. If the domain name is accessed by users with different interests in the same time period, the value of the feature is 1; otherwise it is 0. Table 1 lists all the domain name features, exemplified by the present invention, that are used to detect malicious domain names based on user interest modeling.
TABLE 1
No.  Domain name feature
1    Whether the top-level domain is a common one
2    Domain name length
3    Proportion of digits
4    Proportion of consecutive digits
5    Proportion of consecutive characters
6    Proportion of consecutive identical characters
7    Proportion of vowels
8    Information entropy
9    Average ranking of 1-gram sequences
10   Average ranking of 2-gram sequences
11   Average ranking of 3-gram sequences
12   Whether the domain name has been accessed by users with different interests in the same time period
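A few of the statistical features in Table 1 (length, the proportions of digits and vowels, and information entropy) can be computed as in the sketch below. The text does not give exact definitions, so the sketch assumes Shannon entropy over the character distribution and simple character-count ratios; the function names and dictionary keys are hypothetical, and the remaining features of the table are omitted.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the character distribution of text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(text).values())


def basic_features(name: str) -> dict:
    """Compute a subset of the Table 1 features for one domain name string."""
    length = len(name)
    if length == 0:
        return {"length": 0, "digit_ratio": 0.0, "vowel_ratio": 0.0, "entropy": 0.0}
    return {
        "length": length,
        "digit_ratio": sum(ch.isdigit() for ch in name) / length,
        "vowel_ratio": sum(ch in VOWELS for ch in name.lower()) / length,
        "entropy": shannon_entropy(name),
    }
```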
Based on the above processing, a feature vector can be obtained for each domain name and used as input to a machine learning trainer. See fig. 5, which is a schematic structural diagram of the machine learning trainer provided by the embodiment of the invention. The machine learning trainer takes the domain name feature vectors as input, first forms a training set, feeds it to neural networks of different architectures and parameters for training and result analysis, repeats network architecture adjustment and parameter tuning in a loop, and finally outputs an optimal training model. The data is divided into a training set and a test set in a certain proportion, and network models with different architectures are designed; taking a convolutional neural network as an example, network models with different numbers of convolutional layers and convolution kernels are selected as training models. For each network architecture, the main parameters to be adjusted are the batch size and the learning rate, which are among the most important parameters in deep learning; selecting the best batch size and learning rate is key to training a good model. The optimal network architecture and training parameters are selected separately for each representation, giving the optimal training model for each case. The advantage of using different representations is that different types of malicious domain names may be detected best with different representations. Accordingly, an optimal training model is generated for each specific type of malicious domain name to be identified. Finally, the results of the optimal training model are tested and verified with the test set data.
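The tuning loop over batch size and learning rate can be illustrated with the following sketch. To stay self-contained it deliberately swaps the convolutional neural network of the embodiment for a plain logistic regression trained by mini-batch gradient descent; only the surrounding procedure (split the data, train one model per parameter combination, keep the best-scoring one) mirrors the description. All function names, the default of 50 epochs, and the 80/20 split are assumptions.

```python
import math
import random


def split(data, train_fraction, seed=0):
    """Shuffle (features, label) rows and split them into training and held-out sets."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def predict(weights, bias, x):
    """Logistic regression probability for one feature vector."""
    z = bias + sum(w * v for w, v in zip(weights, x))
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, batch_size, learning_rate, epochs=50, seed=0):
    """Mini-batch gradient descent on the logistic loss."""
    rng = random.Random(seed)
    rows = rows[:]
    dim = len(rows[0][0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        rng.shuffle(rows)
        for start in range(0, len(rows), batch_size):
            batch = rows[start:start + batch_size]
            grad_w, grad_b = [0.0] * dim, 0.0
            for x, y in batch:
                err = predict(weights, bias, x) - y
                grad_b += err
                for i, v in enumerate(x):
                    grad_w[i] += err * v
            bias -= learning_rate * grad_b / len(batch)
            weights = [w - learning_rate * g / len(batch)
                       for w, g in zip(weights, grad_w)]
    return weights, bias


def accuracy(model, rows):
    """Fraction of rows whose thresholded prediction matches the 0/1 label."""
    weights, bias = model
    hits = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in rows)
    return hits / len(rows)


def grid_search(data, batch_sizes, learning_rates, train_fraction=0.8):
    """Train one model per (batch size, learning rate) pair and keep the best one."""
    train_rows, held_out = split(data, train_fraction)
    best = None
    for batch_size in batch_sizes:
        for learning_rate in learning_rates:
            model = train(train_rows, batch_size, learning_rate)
            score = accuracy(model, held_out)
            if best is None or score > best["score"]:
                best = {"score": score, "batch_size": batch_size,
                        "learning_rate": learning_rate, "model": model}
    return best
```

In practice a separate validation split is often used for selecting parameters so that the test set remains untouched for the final check; the text mentions only a training set and a test set.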
After training by the machine learning trainer, optimal training models for the different cases are generated, and a classification detector is built on them to obtain the target detection model, which can detect real DNS data to be detected. The classification detector takes as input the feature vectors of real domain names, obtained after data cleaning and feature analysis, together with a scene requirement parameter. According to the scene requirement parameter, the classification detector selects the optimal model through its model selection element, detects and identifies the data in its data detection identifier, and finally outputs the malicious type detection result.
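The selection step can be pictured with the following sketch, in which the scene requirement parameter is a plain string key and each trained model is a callable returning a maliciousness score. The class name, the 0.5 decision threshold, and the scenario names used in the usage note are hypothetical; the text does not specify how models are stored or how scores are thresholded.

```python
from typing import Callable, Dict, Sequence

# Assumption: a trained model is any callable mapping a feature vector to a score in [0, 1].
Model = Callable[[Sequence[float]], float]


class ClassificationDetector:
    """Pick the model matching the scene requirement parameter, then classify."""

    def __init__(self, models: Dict[str, Model], threshold: float = 0.5) -> None:
        self.models = models
        self.threshold = threshold

    def detect(self, features: Sequence[float], scenario: str) -> str:
        # Model selection element: look up the model registered for this scenario.
        try:
            model = self.models[scenario]
        except KeyError:
            raise ValueError(f"no model registered for scenario {scenario!r}") from None
        # Data detection identifier: score the feature vector and apply the threshold.
        return "malicious" if model(features) >= self.threshold else "benign"
```

For example, a detector built with `{"dga": model_a, "phishing": model_b}` would route a feature vector to `model_a` when called with the scenario `"dga"`.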
Building on existing malicious domain name detection based on domain name features and machine learning, the invention adds user interest features. It combines the statistical features of traditional malicious domain name detection with user interests and analyzes DNS data generated in real time in the network. This effectively counters attackers who evade detection based on domain name statistical features by making the composition of malicious domain names differ as little as possible from that of benign domain names, thereby making up for the weakness of traditional statistical features in this respect.
Referring to fig. 6, in another embodiment of the present invention, there is further provided a malicious domain name detection apparatus, including:
A data acquisition unit 601, configured to acquire domain name system data to be detected;
A data cleaning unit 602, configured to perform data cleaning on the domain name system data to obtain target data;
A feature extraction unit 603, configured to perform feature extraction on the target data and generate a feature vector, where the extracted features include at least domain name statistical features and user interest access features;
A detection unit 604, configured to input the feature vector to a target detection model and obtain a detection result corresponding to the domain name system data, where the target detection model is a neural network model trained on domain name statistical features and user interest access features, and the target detection model is configured to detect whether the domain name system data is a malicious domain name.
Further, the data cleaning unit includes:
The cleaning subunit is used for cleaning the data sources of the domain name system data based on the data source characteristics to obtain a first data set;
And the first extraction subunit is used for carrying out data extraction on the first data set based on user access characteristics to obtain target data, wherein the user access characteristics comprise a user IP address, a user access domain name and a time stamp of each domain name accessed by the user.
Optionally, the feature extraction unit includes:
The second extraction subunit is used for extracting domain name statistical characteristics of the target data, wherein the domain name statistical characteristics comprise domain name character characteristics, information entropy and mean value information of a language model mean value sequence;
The classifying subunit is used for classifying the unknown domain name in the target data to obtain a domain name classifying result;
the first determining subunit is used for determining the user interest tag according to the domain name classification result;
The second determining subunit is used for determining whether each domain name in the target data is accessed by users with different user interest labels in the same time period to obtain user interest access characteristics;
And the generation subunit is used for generating a feature vector of each domain name corresponding to the target data based on the domain name statistical features and the user interest access features.
Optionally, the apparatus further comprises:
the data cleaning unit is also used for cleaning the data of the obtained original domain name system data to obtain an effective data set;
The feature analysis unit is used for carrying out feature analysis on the effective data set to obtain domain name statistical features and user interest access features;
The vector generation unit is used for generating a domain name feature vector according to the domain name statistical feature and the user interest access feature;
a label determining unit, configured to determine a domain name category label matched with each domain name feature vector, where the domain name category label includes a benign domain name label and a malicious domain name label;
And the training unit is used for training the neural network based on the effective data set marked with the domain name class labels to obtain a target detection model.
Optionally, the apparatus further comprises:
The data acquisition unit is used for acquiring original domain name system data, and is specifically used for:
acquiring data from a target domain name server to a local domain name server;
acquiring data from the local domain name server to a client;
acquiring data from the client to a local domain name server;
And acquiring data from the local domain name server to the target domain name server.
Further, the classifying the unknown domain name in the target data to obtain a domain name classification result includes:
Classifying unknown domain names in the target data based on a domain name classifier to obtain domain name classification results;
The domain name classifier is a neural network model obtained by performing natural language processing training on the obtained domain name data, and is used for classifying the webpage category corresponding to the domain name.
Optionally, the classifying subunit is specifically configured to:
determining the webpage category according to the domain name classification result;
acquiring access frequency data of each user for accessing a website of a corresponding webpage class;
and determining the user interest tag based on the access frequency data.
The invention provides a malicious domain name detection device in which: a data acquisition unit acquires domain name system data to be detected; a data cleaning unit performs data cleaning on the domain name system data to obtain target data; a feature extraction unit performs feature extraction on the target data and generates feature vectors, where the extracted features include at least domain name statistical features and user interest access features; and a detection unit inputs the feature vector to a target detection model to obtain a detection result corresponding to the domain name system data, the target detection model being a neural network model trained on domain name statistical features and user interest access features and used to detect whether the domain name system data is a malicious domain name. By combining domain name statistical features with user interest access features, the device can effectively resist an attacker's disguising of domain names and improves the accuracy of malicious domain name detection.
Based on the foregoing embodiments, an embodiment of the invention further provides a storage medium storing executable instructions that, when executed by a processor, implement the malicious domain name detection method described in any one of the above.
The embodiment of the invention also provides electronic equipment, which comprises: a memory for storing a program; and a processor for executing the program, where the program is specifically configured to implement the malicious domain name detection method according to any one of the foregoing.
It should be noted that, for the specific implementation of the processor, reference may be made to the description of the foregoing embodiments, which is not repeated here.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiment corresponds to the method disclosed in the embodiment, its description is relatively brief, and the relevant points can be found in the description of the method.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A method for detecting a malicious domain name, the method comprising:
Acquiring domain name system data to be detected;
Performing data cleaning on the domain name system data to obtain target data;
extracting features of the target data and generating feature vectors, wherein the features based on the feature extraction at least comprise domain name statistical features and user interest access features;
wherein extracting features of the target data and generating feature vectors comprises:
extracting domain name statistical characteristics of the target data, wherein the domain name statistical characteristics comprise domain name character characteristics, information entropy and mean value information of a language model mean value sequence;
classifying unknown domain names in the target data to obtain domain name classification results; the classifying the unknown domain name in the target data to obtain a domain name classification result comprises the following steps: classifying unknown domain names in the target data based on a domain name classifier to obtain domain name classification results; the domain name classifier is a neural network model obtained by performing natural language processing training on the obtained domain name data and is used for classifying the webpage category corresponding to the domain name;
Determining user interest labels according to domain name classification results;
Determining whether each domain name in the target data is accessed by users with different user interest labels in the same time period to obtain user interest access characteristics;
Generating a feature vector of each domain name corresponding to the target data based on the domain name statistical features and the user interest access features;
And inputting the feature vector into a target detection model to obtain a detection result corresponding to the domain name system data, wherein the target detection model is a neural network model obtained by training domain name statistical features and user interest access features, and is used for detecting whether the domain name system data is a malicious domain name.
2. The method according to claim 1, wherein the performing data cleansing on the domain name system data to obtain target data includes:
based on the data source characteristics, carrying out data source cleaning on the domain name system data to obtain a first data set;
And carrying out data extraction on the first data set based on user access characteristics to obtain target data, wherein the user access characteristics comprise a user IP address, a user access domain name and a time stamp of each domain name accessed by the user.
3. The method according to claim 1, wherein the method further comprises:
Performing data cleaning on the obtained original domain name system data to obtain an effective data set;
performing feature analysis on the effective data set to obtain domain name statistical features used in the training process and user interest access features used in the training process;
Generating domain name feature vectors used in the training process according to the domain name statistical features used in the training process and the user interest access features used in the training process;
Determining a domain name category label matched with a domain name feature vector used in each training process, wherein the domain name category label comprises a benign domain name label and a malicious domain name label;
and training the neural network based on the effective data set marked with the domain name class labels to obtain a target detection model.
4. A method according to claim 3, characterized in that the method further comprises:
acquiring original domain name system data, including:
acquiring data from a target domain name server to a local domain name server;
acquiring data from the local domain name server to a client;
acquiring data from the client to a local domain name server;
And acquiring data from the local domain name server to the target domain name server.
5. The method of claim 1, wherein determining the user interest tag based on the domain name classification result comprises:
determining the webpage category according to the domain name classification result;
acquiring access frequency data of each user for accessing a website of a corresponding webpage class;
and determining the user interest tag based on the access frequency data.
6. A malicious domain name detection apparatus, the apparatus comprising:
The data acquisition unit is used for acquiring domain name system data to be detected;
The data cleaning unit is used for cleaning the data of the domain name system to obtain target data;
The feature extraction unit is used for extracting features of the target data and generating feature vectors, wherein the features based on the feature extraction at least comprise domain name statistical features and user interest access features; the method for extracting the characteristics of the target data to generate the characteristic vector comprises the following steps: extracting domain name statistical characteristics of the target data, wherein the domain name statistical characteristics comprise domain name character characteristics, information entropy and mean value information of a language model mean value sequence; classifying unknown domain names in the target data to obtain domain name classification results; the classifying the unknown domain name in the target data to obtain a domain name classification result comprises the following steps: classifying unknown domain names in the target data based on a domain name classifier to obtain domain name classification results; the domain name classifier is a neural network model obtained by performing natural language processing training on the obtained domain name data and is used for classifying the webpage category corresponding to the domain name; determining user interest labels according to domain name classification results; determining whether each domain name in the target data is accessed by users with different user interest labels in the same time period to obtain user interest access characteristics; generating a feature vector of each domain name corresponding to the target data based on the domain name statistical features and the user interest access features;
The detection unit is used for inputting the feature vector into a target detection model to detect and obtain a detection result corresponding to the domain name system data, the target detection model is a neural network model obtained by training the domain name statistical feature and the user interest access feature, and the target detection model is used for detecting whether the domain name system data is a malicious domain name or not.
7. A storage medium storing executable instructions which when executed by a processor implement a malicious domain name detection method according to any one of claims 1 to 5.
8. An electronic device, comprising:
A memory for storing a program;
a processor for executing the program, in particular for implementing the malicious domain name detection method according to any one of claims 1-5.
CN202210396045.XA 2022-04-15 2022-04-15 Malicious domain name detection method and device and electronic equipment Active CN114826712B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210396045.XA CN114826712B (en) 2022-04-15 2022-04-15 Malicious domain name detection method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN114826712A CN114826712A (en) 2022-07-29
CN114826712B true CN114826712B (en) 2024-06-14

Family

ID=82537327

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210396045.XA Active CN114826712B (en) 2022-04-15 2022-04-15 Malicious domain name detection method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114826712B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107818334A (en) * 2017-09-29 2018-03-20 北京邮电大学 A kind of mobile Internet user access pattern characterizes and clustering method
CN110138763A (en) * 2019-05-09 2019-08-16 中国科学院信息工程研究所 A kind of inside threat detection system and method based on dynamic web browsing behavior

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2988455A1 (en) * 2014-08-22 2016-02-24 Verisign, Inc. Domain name system traffic analysis
CN105897752B (en) * 2016-06-03 2019-08-02 北京奇虎科技有限公司 The safety detection method and device of unknown domain name
US11388142B2 (en) * 2019-01-15 2022-07-12 Infoblox Inc. Detecting homographs of domain names

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant