CN114928472A

CN114928472A - Method for filtering bad site grey list based on full-volume circulation main domain name

Info

Publication number: CN114928472A
Application number: CN202210416876.9A
Authority: CN
Inventors: 张兆心; 孟月阳; 柴婷婷; 赵东; 陈俊仁
Original assignee: Harbin Institute of Technology Weihai
Current assignee: Harbin Institute of Technology Weihai
Priority date: 2022-04-20
Filing date: 2022-04-20
Publication date: 2022-08-19
Anticipated expiration: 2042-04-20
Also published as: CN114928472B

Abstract

The invention provides a method for filtering a grey list of bad sites based on the full set of circulating main domain names, comprising the following steps: step 1, constructing a name discrimination model for bad site domain names based on character similarity, and coarsely filtering suspected bad site domain names out of the full set of domain names; step 2, identifying whether each domain name can be resolved and is used for a Web service; step 3, performing coarse filtering based on IP similarity; step 4, classifying the domain names by geographic region based on IP geolocation; step 5, analyzing the accuracy of the bad site domain name grey list obtained by coarse filtering; and step 6, iteratively optimizing coarse filtering steps 1 and 3. By filtering on the similarity of domain name characters and of service IPs, the method greatly narrows the set of domain names to be examined, greatly reduces the time spent acquiring and analyzing web page text and snapshots, and achieves efficient and accurate filtering of the full set of domain names.

Description

Method for filtering bad site grey list based on full-volume circulation main domain name

Technical Field

The invention relates to the technical field of constructing domain name grey lists of bad sites, and in particular to a method for filtering a bad site grey list based on the full set of circulating main domain names.

Background

With the rapid development of computer networks, the internet has become an indispensable part of human life. The domain name system provides mapping between IP addresses and domain names for applications and services on the network, and domain names make the internet more convenient to access. Today, however, the network is flooded with bad sites for pornography, gambling, fraud, and the like. These sites are not only psychologically harmful but can also seriously endanger people's property. The identification, monitoring, and management of bad sites is therefore very important.

About 260 million main domain names are in circulation globally, with about 300,000 new domain names added and about 300,000 domain names expiring each day. At present, bad sites are mainly identified from web page text and web page snapshots, but acquiring and analyzing these is very time-consuming. A systematic method for efficiently filtering the full set of circulating main domain names is therefore lacking, so a bad site domain name grey list covering the full set cannot be constructed effectively.

Disclosure of Invention

To address the technical problem that existing methods, which filter a grey list from the full set of domain names based on web page text and web page snapshots, are time-consuming and costly, the invention provides a method for filtering a bad site grey list based on the full set of circulating main domain names.

To this end, the technical solution of the invention is a method for filtering a bad site grey list based on the full set of circulating main domain names, comprising the following steps:

step 1, extracting features from the character strings of existing bad site domain names, establishing a bad site keyword phrase library, and constructing a name discrimination model for bad site domain names based on character similarity, so as to coarsely filter suspected bad site domain names out of the full set of domain names;

step 2, constructing an IP and port fast scanning model, obtaining the service IP and port attribute information of the suspected bad site domain names, and identifying whether each domain name can be resolved and is used for a Web service;

step 3, establishing an IP mapping range model of bad site domain names from the existing group of bad site service IPs, and performing coarse filtering based on IP similarity;

step 4, classifying the domain names by geographic region based on IP geolocation;

step 5, analyzing the accuracy of the bad site domain name grey list obtained by coarse filtering, using existing bad site identification technology;

step 6, iteratively optimizing coarse filtering steps 1 and 3.

Furthermore, the structure of bad site domain names falls into two categories: in the first category, the domain name contains English words or Chinese pinyin; in the second category, the domain name consists of a random character sequence. In the method based on the character similarity model, a pornography-and-gambling keyword phrase library is constructed to match keywords in first-category domain names, and an LSTM neural network model is trained to judge whether the character sequence of a second-category domain name is randomly generated.

Furthermore, the pornography-and-gambling keyword phrase library is constructed as follows: a dictionary of 370,000 English words and 405 Chinese pinyin syllables are merged into an English and Chinese pinyin dictionary; longest-word matching is performed over a set of 390,000 pornography and gambling domain names; and the pornography and gambling English and pinyin phrases that occur with high frequency are extracted to form the keyword phrase library used for subsequent keyword-matching filtering.

Further, the LSTM neural network model is trained as follows: 700,000 Alexa domain names and 780,000 random-character-sequence domain names are used as the training and test sets, and the LSTM neural network is divided into 3 layers: 1. a preprocessing layer, which pads the domain name character sequence to a length of 75, maps each character to an integer index, and finally converts the positive integer indices into fixed-size dense vectors as character embeddings; 2. a long short-term memory layer, with the number of cells set to 128 and dropout set to 0.5 to avoid overfitting; 3. an output layer, which uses a two-class output.

Further, the suspected bad site domain names are coarsely filtered as follows: keywords are first matched against the constructed pornography-and-gambling keyword phrase library to judge whether the domain name contains a keyword phrase from the library; if it does, the domain name is considered possibly used for a bad site; if it does not, the trained LSTM neural network model judges the randomness of the character sequence, that is, whether the domain name consists of random characters; if so, the domain name is considered possibly used for a bad site.

Further, the coarse filtering based on IP similarity is performed by analyzing the similarity of the IPs stored in step 2 using the existing IP mapping range model; if an IP falls within one of the model's mapping ranges, the IP is considered to be used for serving bad content.

Further, coarse filtering steps 1 and 3 are iteratively optimized as follows:

step S1, dynamically updating the pornography-and-gambling keyword phrase library, adding newly appearing, high-frequency pornography and gambling English and pinyin phrases to the library, and deleting phrases that have not been used for a long time from the library;

step S2, dynamically updating the IP mapping range model, integrating newly appearing bad site service IPs into the model, and reducing IP ranges in the model that have not been hit for a long time.

The advantage of the method is that, when filtering the bad site grey list from the full set of circulating main domain names, filtering on the similarity of domain name characters and of service IPs reduces the range of 260 million domain names by 90%, which greatly reduces the time spent acquiring and analyzing web page text and snapshots while still filtering the full set of domain names efficiently and accurately. The method provided by the invention can thus filter the full set of domain names at high speed and with high precision.

Drawings

FIG. 1 is a schematic diagram of the construction of a keyword phrase library, an LSTM neural network model and an IP mapping range model according to the present invention;

FIG. 2 is a schematic flow chart of filtering the bad site grey list according to the present invention.

Detailed Description

The present invention will be further described with reference to the following examples.

As shown in FIG. 1, the first stage of the present invention requires two steps to construct a character similarity model and an IP mapping range model respectively. The method comprises the following specific steps:

Step (1): Analysis of bad site domain names shows that their construction falls into two categories. In the first category, the domain name contains English words or Chinese pinyin (depending on the language; the domain names of Chinese pornography and gambling sites often contain pinyin in many forms), for example: com, tiyubocai.cn, and the like. In the second category, the domain name consists of a random character sequence (possibly generated by an algorithm), for example: vdqw-96.com, 12034.cn. Therefore, in the method based on the character similarity model, a pornography-and-gambling keyword phrase library is constructed to match keywords in first-category domain names, and an LSTM neural network is trained to judge whether the character sequence of a second-category domain name is randomly generated.

(1) Constructing the pornography-and-gambling keyword phrase library: a dictionary of 370,000 English words and 405 Chinese pinyin syllables (without tone marks) are merged into an English and Chinese pinyin dictionary. Longest-word matching is performed over the set of 390,000 pornography and gambling domain names, and the pornography and gambling English and pinyin phrases that occur with high frequency are extracted to form the keyword phrase library used for subsequent keyword-matching filtering.
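The construction of the keyword phrase library can be illustrated with a minimal sketch. It assumes a greedy forward longest-match segmentation and a simple frequency threshold; the function names, the toy dictionary, the `max_len` and `min_count` parameters, and the sample domain names are all hypothetical, since the description does not specify the matching details or the frequency cutoff. The sketch also counts single dictionary words rather than multi-word phrases, which is a simplification.

```python
from collections import Counter


def segment_longest_match(label, dictionary, max_len=20):
    """Greedy forward longest-match segmentation of one domain label."""
    words = []
    i = 0
    while i < len(label):
        match = None
        # Try the longest candidate first, down to two characters.
        for j in range(min(len(label), i + max_len), i + 1, -1):
            if label[i:j] in dictionary:
                match = label[i:j]
                break
        if match:
            words.append(match)
            i += len(match)
        else:
            i += 1  # no dictionary word starts here; skip one character
    return words


def build_phrase_library(domains, dictionary, min_count=2):
    """Keep dictionary words seen at least min_count times (assumed cutoff)."""
    counts = Counter()
    for domain in domains:
        label = domain.lower().split(".")[0]  # leftmost label only
        counts.update(segment_longest_match(label, dictionary))
    return {word for word, count in counts.items() if count >= min_count}


# Toy stand-ins for the merged English and pinyin dictionary and the labeled set.
toy_dictionary = {"ti", "yu", "bo", "cai", "bet", "casino", "sport"}
toy_domains = ["tiyubocai.cn", "bocai88.com", "betcasino.net", "casino-bet.com"]
library = build_phrase_library(toy_domains, toy_dictionary)
```

With real data, the dictionary would hold the merged English words and pinyin syllables, and the cutoff would be tuned to the size of the labeled domain name set.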

(2) Training of the LSTM neural network model: the LSTM model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names (consisting of the random-character-sequence domain names and the DGA domain names in the 390,000 pornography and gambling domain names) as the training and test sets. The neural network is divided into 3 layers: 1. Preprocessing layer: pads the domain name character sequence to a length of 75, maps each character to an integer index, and finally converts the positive integer indices into fixed-size dense vectors as character embeddings. 2. Long short-term memory layer: the number of cells is set to 128 and dropout is set to 0.5 to avoid overfitting. 3. Output layer: a two-class output is used. The final accuracy was 94% on the training set and 96% on the test set.
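The integer-index part of the preprocessing layer can be sketched in plain Python. This is an illustration under stated assumptions: the alphabet, the use of index 0 for padding, left-padding, and the name `encode_domain` are hypothetical choices, as the description only fixes the sequence length of 75.

```python
import string

MAX_LEN = 75  # sequence length stated in the description

# Assumed alphabet; index 0 is reserved for padding, so characters start at 1.
ALPHABET = string.ascii_lowercase + string.digits + "-._"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}
UNKNOWN_INDEX = len(ALPHABET) + 1  # any character outside the alphabet


def encode_domain(domain, max_len=MAX_LEN):
    """Map a domain name to a fixed-length list of integer indices."""
    indices = [CHAR_TO_INDEX.get(ch, UNKNOWN_INDEX) for ch in domain.lower()[:max_len]]
    # Left-pad with zeros so every sequence has exactly max_len entries.
    return [0] * (max_len - len(indices)) + indices


encoded = encode_domain("vdqw-96.com")
```

In a deep learning framework, sequences of this form would typically feed an embedding layer, followed by the 128-cell LSTM layer with dropout 0.5 and the two-class output layer described above.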

Step (2): DNS resolution is performed on the existing 390,000 pornography and gambling domain names to obtain all of their service IP addresses. When IP addresses are applied for, a batch of consecutive addresses is usually requested so that some can serve as backups. Range mapping is therefore performed over all of the pornography and gambling IPs, and a model of their mapping ranges is constructed for subsequent filtering.
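One possible way to turn a set of known service IPs into mapping ranges is sketched below. The description does not state how ranges are derived, so the gap-based merging rule, the `max_gap` parameter, and the function names are assumptions; the addresses come from reserved documentation ranges and are not real data.

```python
import ipaddress


def build_ip_ranges(ips, max_gap=8):
    """Merge known IPv4 addresses into (start, end) integer ranges.

    Addresses whose distance to the previous range is at most max_gap
    are absorbed into it (assumed rule, reflecting batches of
    consecutive addresses).
    """
    values = sorted({int(ipaddress.IPv4Address(ip)) for ip in ips})
    ranges = []
    for value in values:
        if ranges and value - ranges[-1][1] <= max_gap:
            ranges[-1][1] = value  # extend the current range
        else:
            ranges.append([value, value])  # start a new range
    return [(start, end) for start, end in ranges]


def format_range(ip_range):
    """Render an integer range as dotted-quad text."""
    start, end = ip_range
    return f"{ipaddress.IPv4Address(start)}-{ipaddress.IPv4Address(end)}"


known_ips = ["203.0.113.10", "203.0.113.12", "203.0.113.15", "198.51.100.7"]
model_ranges = build_ip_ranges(known_ips)
```

A larger `max_gap` produces fewer, wider ranges and therefore a more aggressive filter in step 3; a smaller value keeps the model closer to the observed addresses.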

As shown in FIG. 2, the method for filtering a bad site grey list based on the full set of circulating main domain names comprises the following specific steps:

Step 1: Features are extracted from the character strings of existing bad site domain names, a bad site keyword phrase library is established, and a domain name discrimination model for bad sites is constructed based on character similarity, so as to coarsely filter suspected bad site domain names out of the full set of domain names. The full set of 260 million main domain names is taken as input and filtered by domain name character similarity. The filtering is executed in two parts. First, keywords are matched against the constructed pornography-and-gambling keyword phrase library to judge whether the domain name contains a keyword phrase from the library. If it does, the domain name is considered possibly used for a bad site. If no sensitive keyword is present, the trained LSTM model judges the randomness of the character sequence, that is, whether the domain name consists of random characters. If so, the domain name is considered possibly used for a bad site. These two parts together coarsely filter the suspected bad site domain names.
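The two-part filtering of step 1 can be sketched as a short cascade. The randomness check here is a toy vowel-ratio heuristic standing in for the trained LSTM model, purely so the example runs without a model; it is not the classifier described above. All function names and the sample inputs are hypothetical.

```python
VOWELS = set("aeiou")


def contains_keyword(domain, phrase_library):
    """Part 1: does the leftmost label contain any library phrase?"""
    label = domain.lower().split(".")[0]
    return any(phrase in label for phrase in phrase_library)


def toy_looks_random(domain):
    """Placeholder for the LSTM: very few vowels is treated as random."""
    label = domain.lower().split(".")[0]
    if not label:
        return False
    vowel_ratio = sum(ch in VOWELS for ch in label) / len(label)
    return vowel_ratio < 0.2


def coarse_filter_by_name(domains, phrase_library, looks_random):
    """Keep a domain if a keyword matches; otherwise ask the randomness check."""
    suspects = []
    for domain in domains:
        # 'or' short-circuits, so the classifier runs only when no keyword matched.
        if contains_keyword(domain, phrase_library) or looks_random(domain):
            suspects.append(domain)
    return suspects


suspects = coarse_filter_by_name(
    ["tiyubocai.cn", "vdqw-96.com", "example.org"],
    {"bocai"},
    toy_looks_random,
)
```

Running the cheap keyword match first means the more expensive model is evaluated only on domain names that the phrase library does not already flag.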

Step 2: An IP and port fast scanning model is constructed, the service IP and port attribute information of the suspected bad site domain names is obtained, and whether each domain name can be resolved and is used for a Web service is identified. The service IPs and port attributes are acquired for the domain name set obtained in the previous step. The A records are obtained through DNS resolution, and all available IPs are stored. A port scan is then performed to check whether ports such as 80, 443, and 8080 are open, thereby selecting the IPs used for Web services.
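A minimal sketch of the resolution and port check, using only the Python standard library, is given below. It assumes the system resolver is an acceptable substitute for an explicit A-record query and uses a plain TCP connect as the port probe; a scanner working at the scale of the full domain name set would need asynchronous or batched probing, which is not shown. The function names and the timeout value are hypothetical.

```python
import socket

WEB_PORTS = (80, 443, 8080)  # ports named in the description


def resolve_a_records(domain):
    """Return the IPv4 addresses the system resolver gives for a domain."""
    try:
        infos = socket.getaddrinfo(
            domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []  # the domain does not resolve
    return sorted({info[4][0] for info in infos})


def is_port_open(ip, port, timeout=1.0):
    """TCP connect probe: True if the connection is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((ip, port)) == 0


def web_service_ips(domain, ports=WEB_PORTS, timeout=1.0):
    """Keep the resolved IPs that have at least one Web port open."""
    return [
        ip
        for ip in resolve_a_records(domain)
        if any(is_port_open(ip, port, timeout) for port in ports)
    ]
```

Calling `web_service_ips` on a suspected domain name returns an empty list when the name does not resolve or when none of the listed ports accepts a connection, so such names can be dropped before step 3.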

Step 3: An IP mapping range model of bad site domain names is established from the existing group of bad site service IPs, and coarse filtering is performed based on IP similarity. The IPs stored in step 2 are analyzed for similarity using the existing IP mapping range model. If an IP falls within one of the model's mapping ranges, the IP is considered to be used for serving bad content. Bad sites are thus coarsely filtered by IP similarity.
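The range check in step 3 can be sketched as a binary search over sorted ranges. The class name `IpRangeModel`, its interface, and the sample ranges (taken from reserved documentation address blocks) are hypothetical; the sketch assumes the ranges do not overlap.

```python
import bisect
import ipaddress


class IpRangeModel:
    """Sorted, non-overlapping IPv4 ranges with a logarithmic-time lookup."""

    def __init__(self, ranges):
        parsed = sorted(
            (int(ipaddress.IPv4Address(start)), int(ipaddress.IPv4Address(end)))
            for start, end in ranges
        )
        self.starts = [start for start, _ in parsed]
        self.ends = [end for _, end in parsed]

    def contains(self, ip):
        """True if ip falls inside one of the mapped ranges."""
        value = int(ipaddress.IPv4Address(ip))
        # Find the last range whose start is not greater than the address.
        pos = bisect.bisect_right(self.starts, value) - 1
        return pos >= 0 and value <= self.ends[pos]


model = IpRangeModel(
    [("203.0.113.10", "203.0.113.15"), ("198.51.100.7", "198.51.100.7")]
)
hit = model.contains("203.0.113.12")
miss = model.contains("203.0.113.16")
```

Because each lookup is a binary search, checking the IPs stored in step 2 stays cheap even when the model holds many ranges.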

Step 4: The domain names are classified by geographic region based on IP geolocation. The physical location attribute of each service IP is obtained through IP geolocation and is classified as domestic or foreign. The IPs obtained by filtering in the above steps are associated with their domain names and stored.
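The domestic and foreign split can be sketched with a small lookup table. The table below is entirely made up for illustration: the networks are reserved documentation blocks and their country codes are arbitrary. Treating "CN" as the domestic country is also an assumption. A real deployment would query an IP geolocation database instead.

```python
import ipaddress

# Hypothetical geolocation table: (network, country code). Not real data.
GEO_TABLE = [
    (ipaddress.ip_network("203.0.113.0/24"), "CN"),
    (ipaddress.ip_network("198.51.100.0/24"), "US"),
]


def classify_region(ip, table=GEO_TABLE, home_country="CN"):
    """Return 'domestic', 'foreign', or 'unknown' for one IP address."""
    address = ipaddress.ip_address(ip)
    for network, country in table:
        if address in network:
            return "domestic" if country == home_country else "foreign"
    return "unknown"


def group_by_region(domain_to_ip):
    """Store each (domain, IP) pair under its region label."""
    groups = {"domestic": [], "foreign": [], "unknown": []}
    for domain, ip in domain_to_ip.items():
        groups[classify_region(ip)].append((domain, ip))
    return groups


groups = group_by_region(
    {"tiyubocai.cn": "203.0.113.12", "vdqw-96.com": "198.51.100.7"}
)
```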

Step 5: The accuracy of the bad site domain name grey list obtained by coarse filtering is analyzed using existing bad site identification technology. The domain names obtained from the above filtering steps are judged precisely by an existing bad site discrimination model. This discrimination model is based on web page content and snapshots, so acquiring the text content and snapshots is time-consuming. However, after the filtering in the above steps, the domain name range has been reduced by 90%, and the remaining domain names are all highly suspected of being used for bad sites. This step can therefore efficiently produce the bad site domain name grey list and evaluate the effect of the preceding filtering steps.
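The accuracy analysis and the effect of the 90% reduction can be sketched as follows. The discriminator is passed in as a callable because the content-and-snapshot model itself is outside this sketch; the function names and the sample labels are hypothetical, and the share of confirmed domain names is used here as one plausible accuracy measure.

```python
def remaining_after_reduction(total, reduction_percent):
    """Number of domain names left after a percentage reduction."""
    return total * (100 - reduction_percent) // 100


def evaluate_grey_list(grey_list, is_bad_site):
    """Return the confirmed domains and the share of the grey list they make up."""
    confirmed = [domain for domain in grey_list if is_bad_site(domain)]
    precision = len(confirmed) / len(grey_list) if grey_list else 0.0
    return confirmed, precision


# 260 million domain names reduced by 90%, as stated in the description.
to_examine = remaining_after_reduction(260_000_000, 90)

# Toy stand-in for the existing content-and-snapshot discrimination model.
known_bad = {"tiyubocai.cn", "vdqw-96.com", "bocai88.com"}
confirmed, precision = evaluate_grey_list(
    ["tiyubocai.cn", "vdqw-96.com", "bocai88.com", "k3zt9.net"],
    lambda domain: domain in known_bad,
)
```

Under these figures, the expensive content-based model would need to examine about 26 million domain names instead of 260 million.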

Step 6: Iterative optimization of coarse filtering. The grey list of pornography and gambling domain names filtered from the full set is stored, and steps 1 and 3 are iteratively optimized. The optimization proceeds as follows. Step S1: the pornography-and-gambling keyword phrase library is dynamically updated; newly appearing, high-frequency pornography and gambling English and pinyin phrases are added to the library, and phrases that have not been used for a long time are deleted from it. Step S2: the IP mapping range model is dynamically updated; newly appearing bad site service IPs are integrated into the model, and IP ranges in the model that have not been hit for a long time are reduced.
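The update in step S1 can be sketched by tracking when each phrase last matched. The thresholds `min_count` and `max_idle_days`, the function name, and the sample data are hypothetical, since the description does not quantify "high frequency" or "a long time". Step S2 could follow the same pattern with IP ranges as keys.

```python
from datetime import date


def update_phrase_library(last_hit, new_phrase_counts, today,
                          min_count=100, max_idle_days=180):
    """Return an updated mapping of phrase -> date of last match.

    Phrases idle for more than max_idle_days are dropped; newly observed
    phrases with at least min_count occurrences are added.
    """
    updated = {
        phrase: hit_date
        for phrase, hit_date in last_hit.items()
        if (today - hit_date).days <= max_idle_days
    }
    for phrase, count in new_phrase_counts.items():
        if count >= min_count:
            updated[phrase] = today
    return updated


today = date(2022, 4, 20)
library = update_phrase_library(
    {"bocai": date(2022, 4, 1), "oldword": date(2021, 1, 1)},
    {"newbet": 150, "rare": 3},
    today,
)
```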

When the method filters the bad site grey list from the full set of circulating main domain names, filtering on the similarity of domain name characters and of service IPs reduces the range of 260 million domain names by 90%, which greatly reduces the time spent acquiring and analyzing web page text and snapshots while still filtering the full set of domain names efficiently and accurately. The method provided by the invention can thus filter the full set of domain names at high speed and with high precision.

The above description is only an embodiment of the present invention and does not limit its scope; equivalent substitutions of elements, and equivalent changes and modifications made according to the claims, shall all fall within the scope of the present invention.

Claims

1. A method for filtering a bad site grey list based on the full set of circulating main domain names, characterized by comprising the following steps:

step 2, constructing an IP and port fast scanning model, obtaining the service IP and port attribute information of the suspected bad site domain names, and identifying whether each domain name can be resolved and is used for a Web service;

2. The method for filtering a bad site grey list based on the full set of circulating main domain names according to claim 1, characterized in that: in the method based on the character similarity model, a pornography-and-gambling keyword phrase library is constructed to match keywords in first-category domain names, and an LSTM neural network model is trained to judge whether the character sequence of a second-category domain name is randomly generated.

3. The method for filtering a bad site grey list based on the full set of circulating main domain names according to claim 2, characterized in that: the pornography-and-gambling keyword phrase library is constructed by merging a dictionary of 370,000 English words and 405 Chinese pinyin syllables into an English and Chinese pinyin dictionary, performing longest-word matching over a set of 390,000 pornography and gambling domain names, and extracting the pornography and gambling English and pinyin phrases that occur with high frequency to form the keyword phrase library used for subsequent keyword-matching filtering.

4. The method for filtering a bad site grey list based on the full set of circulating main domain names according to claim 3, characterized in that: the LSTM neural network model is trained using 700,000 Alexa domain names and 780,000 random-character-sequence domain names as the training and test sets, and the LSTM neural network is divided into 3 layers: 1. a preprocessing layer, which pads the domain name character sequence to a length of 75, maps each character to an integer index, and finally converts the positive integer indices into fixed-size dense vectors as character embeddings; 2. a long short-term memory layer, with the number of cells set to 128 and dropout set to 0.5 to avoid overfitting; 3. an output layer, which uses a two-class output.

5. The method for filtering a bad site grey list based on the full set of circulating main domain names according to claim 4, characterized in that: the suspected bad site domain names are coarsely filtered by first matching keywords against the constructed pornography-and-gambling keyword phrase library to judge whether the domain name contains a keyword phrase from the library; if it does, the domain name is considered possibly used for a bad site; if it does not, the trained LSTM neural network model judges the randomness of the character sequence, that is, whether the domain name consists of random characters; if so, the domain name is considered possibly used for a bad site.

6. The method for filtering a bad site grey list based on the full set of circulating main domain names according to claim 5, characterized in that: the coarse filtering based on IP similarity is performed by analyzing the similarity of the IPs stored in step 2 using the existing IP mapping range model, and if an IP falls within one of the model's mapping ranges, the IP is considered to be used for serving bad content.

7. The method for filtering a bad site grey list based on the full set of circulating main domain names according to claim 6, characterized in that: coarse filtering steps 1 and 3 are iteratively optimized as follows:

step S1, dynamically updating the pornography-and-gambling keyword phrase library, adding newly appearing, high-frequency pornography and gambling English and pinyin phrases to the library, and deleting phrases that have not been used for a long time from the library;

step S2, dynamically updating the IP mapping range model, integrating newly appearing bad site service IPs into the model, and reducing IP ranges in the model that have not been hit for a long time.