CN114266251A - Malicious domain name detection method and device, electronic equipment and storage medium

Malicious domain name detection method and device, electronic equipment and storage medium

Info

Publication number
CN114266251A
Authority
CN
China
Prior art keywords
domain name
detected
participles
malicious
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111616632.7A
Other languages
Chinese (zh)
Inventor
李金辉
崔元浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Original Assignee
Beijing Topsec Technology Co Ltd
Beijing Topsec Network Security Technology Co Ltd
Beijing Topsec Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Topsec Technology Co Ltd, Beijing Topsec Network Security Technology Co Ltd, Beijing Topsec Software Co Ltd filed Critical Beijing Topsec Technology Co Ltd
Priority to CN202111616632.7A priority Critical patent/CN114266251A/en
Publication of CN114266251A publication Critical patent/CN114266251A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a malicious domain name detection method and device, electronic equipment and a storage medium, and relates to the technical field of security. The method obtains a plurality of participles of a domain name to be detected, inputs the participles into a fastText model, and detects through the fastText model whether the domain name to be detected is a malicious domain name, so as to obtain a detection result. The fastText model implements text classification with a shallow neural network, and in text classification tasks a shallow neural network can often achieve precision comparable to that of a deep neural network. Detecting malicious domain names with the fastText model can therefore reach the detection precision achievable by a deep neural network, and the detection precision is higher than that of existing detection modes based on regular expressions or blacklists and whitelists.

Description

Malicious domain name detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of security technologies, and in particular, to a malicious domain name detection method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of internet technology, network security problems are increasingly prominent. At present, a large number of domain names (DGA domain names) can be generated quickly through a Domain Generation Algorithm (DGA), and a robust botnet can be constructed with them. Using a botnet, an attacker may launch network attacks on devices in the network.
In order to improve network security, DGA domain names need to be detected. The currently common detection techniques are based on regular expressions or on blacklists and whitelists. However, because DGA domain names are easy to generate and change rapidly, detecting them with preset regular expressions or blacklists and whitelists has a high false alarm rate.
Disclosure of Invention
An object of the embodiments of the present application is to provide a malicious domain name detection method, apparatus, electronic device, and storage medium, so as to solve the problems of a domain name detection method in the prior art, such as high false alarm rate and low accuracy.
In a first aspect, an embodiment of the present application provides a malicious domain name detection method, where the method includes:
acquiring a domain name to be detected;
performing word segmentation on the domain name to be detected to obtain a plurality of word segments;
and inputting the plurality of participles into a fastText model, and detecting whether the domain name to be detected is a malicious domain name or not through the fastText model to obtain a detection result.
In the implementation process, the method includes the steps that a plurality of participles of a domain name to be detected are obtained, the participles are input into a fastText model, whether the domain name to be detected is a malicious domain name or not is detected through the fastText model, and a detection result is obtained.
Optionally, the fastText model includes an input layer, a hidden layer, and an output layer, where the multiple participles are used as input of the input layer, the input layer is configured to convert the multiple participles into corresponding word vectors, the hidden layer is configured to perform superposition average processing on the word vectors, and the output layer is configured to output a detection result of the domain name to be detected based on a processing result of the hidden layer. The fastText model combines ideas from natural language processing with the layered structure of a neural network, so that the association relationships among the participles can be better extracted and the prediction precision of a neural network can be achieved.
Optionally, the segmenting the domain name to be detected to obtain a plurality of segments includes:
and performing word segmentation on the domain name to be detected by adopting an N-gram model to obtain a plurality of word segments. In this way, the domain name to be detected can be segmented more effectively, which helps the fastText model analyze more accurately the rationality of each participle sequence and the constraint information it carries about the next participle. The participles therefore have greater discriminating power, which improves the detection precision for malicious domain names.
Optionally, the segmenting the domain name to be detected to obtain a plurality of segments includes:
performing word segmentation on the domain name to be detected by adopting an N-gram model to obtain a plurality of initial word segmentations;
analyzing the association relation among the initial participles by adopting a data association analysis method;
and screening out a plurality of participles with strong association relation according to the association relation among the initial participles.
In the implementation process, a data association analysis method is used to mine the association relationships between the initial participles, and the participles with strong association relationships are screened out. The fastText model can then better extract the relationships between these strongly associated participles, which improves the detection precision.
Optionally, the segmenting the domain name to be detected to obtain a plurality of segments includes:
performing word segmentation on the domain name to be detected by adopting an N-gram model to obtain a plurality of initial word segmentations;
combining the plurality of initial participles according to a set combination mode to obtain a plurality of combined participles;
wherein the input data of the fastText model includes the plurality of initial participles and the combined plurality of participles.
In the implementation process, the initial participles are combined, so that the data volume of the input fastText model can be increased, the fastText model can detect malicious domain names by using more data, and the detection precision is improved.
Optionally, after obtaining the domain name to be detected, performing word segmentation on the domain name to be detected, and before obtaining a plurality of word segments, the method further includes:
similarity calculation is carried out on the domain name to be detected and a plurality of prestored malicious domain names, and the similarity between the domain name to be detected and each malicious domain name is obtained;
if the proportion of similarities that are greater than a set similarity exceeds a set proportion, determining that the domain name to be detected is a suspicious domain name, and executing the step of performing word segmentation on the domain name to be detected.
In the implementation process, the similarity calculation is carried out on the domain name to be detected and a plurality of malicious domain names in advance, and the subsequent detection is carried out when the domain name to be detected is a suspicious domain name, so that the detection amount of a subsequent fastText model can be reduced, and the detection efficiency can be improved when a large number of domain names need to be detected in the network.
Optionally, the calculating similarity between the domain name to be detected and a plurality of pre-stored malicious domain names to obtain the similarity between the domain name to be detected and each malicious domain name includes:
calculating the fuzzy hash value of the domain name to be detected, and calculating the fuzzy hash value of each prestored malicious domain name;
and comparing the fuzzy hash value of the domain name to be detected with the fuzzy hash value of each malicious domain name to obtain the similarity between the domain name to be detected and each malicious domain name.
In the implementation process, fuzzy hash values allow the similarity between two character strings to be compared more accurately. The hash of a given fragment of a domain name is affected only by additions, modifications or deletions within that fragment, so a local change has little influence on the whole and therefore little influence on the final similarity. Even if consecutive characters are changed, or several changes are made, the fuzzy hash algorithm can still make an effective judgment. Comparing the fuzzy hash values of two domain names therefore gives a more accurate measure of their similarity.
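The pre-detection step described above can be sketched as follows. This is a minimal illustration only: it uses Python's standard difflib.SequenceMatcher as a stand-in similarity measure rather than a fuzzy hash, and the function names, the similarity threshold of 0.6 and the proportion of 0.5 are hypothetical values chosen for the example, not values given in the application.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity score in [0, 1]; this is NOT a fuzzy hash."""
    return SequenceMatcher(None, a, b).ratio()


def is_suspicious(domain, known_malicious, sim_threshold=0.6, min_proportion=0.5):
    """Return True when the share of stored malicious domains that are
    more similar than sim_threshold exceeds min_proportion.

    Both thresholds are hypothetical example values.
    """
    if not known_malicious:
        return False
    hits = sum(1 for m in known_malicious if similarity(domain, m) > sim_threshold)
    return hits / len(known_malicious) > min_proportion
```

Only domain names for which is_suspicious returns True would be passed on to word segmentation and the fastText model; a real fuzzy hash comparison could replace the similarity function without changing the surrounding logic.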
In a second aspect, an embodiment of the present application provides a malicious domain name detection apparatus, where the apparatus includes:
the domain name acquisition module is used for acquiring a domain name to be detected;
the word segmentation module is used for segmenting the domain name to be detected to obtain a plurality of segmented words;
and the detection module is used for inputting the plurality of participles into a fastText model, detecting whether the domain name to be detected is a malicious domain name or not through the fastText model, and obtaining a detection result.
Optionally, the fastText model includes an input layer, a hidden layer, and an output layer, where the multiple participles are used as input of the input layer, the input layer is configured to convert the multiple participles into corresponding word vectors, the hidden layer is configured to perform superposition average processing on the word vectors, and the output layer is configured to output a detection result of the domain name to be detected based on a processing result of the hidden layer.
Optionally, the word segmentation module is configured to perform word segmentation on the domain name to be detected by using an N-gram model to obtain a plurality of words.
Optionally, the word segmentation module is configured to perform word segmentation on the domain name to be detected by using an N-gram model to obtain a plurality of initial word segments; analyzing the association relation among the initial participles by adopting a data association analysis method; and screening out a plurality of participles with strong association relation according to the association relation among the initial participles.
Optionally, the word segmentation module is configured to perform word segmentation on the domain name to be detected by using an N-gram model to obtain a plurality of initial word segments; combining the plurality of initial participles according to a set combination mode to obtain a plurality of combined participles; wherein the input data of the fastText model includes the plurality of initial participles and the combined plurality of participles.
Optionally, the apparatus further comprises:
the pre-detection module is used for calculating the similarity between the domain name to be detected and a plurality of prestored malicious domain names to obtain the similarity between the domain name to be detected and each malicious domain name; and if the proportion of similarities that are greater than a set similarity exceeds a set proportion, determining the domain name to be detected as a suspicious domain name.
Optionally, the pre-detection module is configured to calculate the fuzzy hash value of the domain name to be detected, and calculate the fuzzy hash value of each prestored malicious domain name; and compare the fuzzy hash value of the domain name to be detected with the fuzzy hash value of each malicious domain name to obtain the similarity between the domain name to be detected and each malicious domain name.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor and a memory, where the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the steps in the method as provided in the first aspect are executed.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, performs the steps in the method as provided in the first aspect above.
Additional features and advantages of the present application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the embodiments of the present application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a flowchart of a malicious domain name detection method according to an embodiment of the present disclosure;
Fig. 2 is a schematic structural diagram of a fastText model provided in an embodiment of the present application;
Fig. 3 is a block diagram illustrating a structure of a malicious domain name detection apparatus according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device for executing a malicious domain name detection method according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
It should be noted that the terms "system" and "network" in the embodiments of the present application may be used interchangeably. "Plurality" means two or more; in view of this, "plurality" may also be understood as "at least two" in the embodiments of the present application. "And/or" describes the association relationship of the associated objects and means that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, unless otherwise specified, the character "/" generally indicates that the preceding and following related objects are in an "or" relationship.
The embodiment of the application provides a malicious domain name detection method. The method obtains a plurality of participles of a domain name to be detected, inputs the participles into a fastText model, and detects through the fastText model whether the domain name to be detected is a malicious domain name, so as to obtain a detection result. Moreover, the fastText model is not as deep as a deep learning model, so its detection process is simpler and takes less time. Compared with a deep learning model, using the fastText model to detect malicious domain names in this application therefore shortens the detection time while maintaining precision.
Referring to fig. 1, fig. 1 is a flowchart of a malicious domain name detection method according to an embodiment of the present disclosure, where the method includes the following steps:
Step S110: acquiring the domain name to be detected.
The domain name to be detected may be any domain name that needs to undergo malicious domain name detection. For example, any domain name received by a security protection device may be used as the domain name to be detected and then checked according to the subsequent steps, so that received malicious domain names are identified. Threats posed by malicious domain names to network security can thus be prevented in advance, for example by intercepting the malicious domain name or by handling it with other protective measures.
Step S120: performing word segmentation on the domain name to be detected to obtain a plurality of word segments.
The method for segmenting the domain name to be detected can adopt any segmentation method, such as a word2vec segmentation method, or adopt an N-gram model to segment the domain name to be detected to obtain a plurality of segmented words.
Because the difference between a normal domain name and a malicious domain name is mainly reflected in the character composition of the domain name, the embodiment of the application may segment the domain name to be detected with an N-gram model. Depending on the segmentation granularity, the domain name can be segmented into N-gram features of word granularity or N-gram features of character granularity. The N-gram is an algorithm based on a statistical language model; its basic idea is to slide a window of size N over the content of the text, byte by byte, to form a sequence of byte fragments of length N. Each byte fragment is called a gram, and the grams form a list, i.e., the vector feature space of the text, in which each gram is one feature vector dimension. In the embodiment of the application, word segmentation may be performed at character granularity with unigrams (each single character of the domain name to be detected, from head to tail, forms a participle), bigrams (every two adjacent characters form a participle) and trigrams (every three adjacent characters form a participle). The character fragments obtained from the domain name are treated as independent words, and all fragments of one domain name are treated as a sentence, so that the rationality of the character order of the domain name to be detected can be analyzed. Longer fragments carry more constraint information about the next participle and therefore have greater discriminating power.
For example, if the domain name to be detected is ***, the unigram, bigram and trigram features of the domain name are extracted, that is, the domain name is segmented with window sizes n = 1, n = 2 and n = 3, and the participle set ["g", "o", "o", "g", "l", "e", "$g", "go", "oo", "og", "gl", "le", "e$", "$go", "goo", "oog", "ogl", "gle", "le$"] is obtained, where the character $ marks the beginning and the end of the domain name.
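The character-granularity segmentation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the applicant's implementation: the function name char_ngrams and its parameters are hypothetical, and it assumes, as the participle set above suggests, that the $ boundary marker is added only for window sizes greater than 1.

```python
def char_ngrams(label, sizes=(1, 2, 3), marker="$"):
    """Split a domain label into character n-grams for each window size.

    Assumption: for n > 1 the label is wrapped in boundary markers, while
    unigrams are taken from the bare label (this matches the example set).
    """
    tokens = []
    for n in sizes:
        text = label if n == 1 else f"{marker}{label}{marker}"
        # Slide a window of width n over the text, one character at a time.
        tokens.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return tokens


print(char_ngrams("google"))
```

Called with the label "google", the function yields the six unigrams, seven bigrams and six trigrams listed in the participle set above, in the same order.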
Step S130: inputting the plurality of participles into a fastText model, and detecting whether the domain name to be detected is a malicious domain name or not through the fastText model to obtain a detection result.
After the domain name to be detected is segmented according to the corresponding segmentation method, the obtained participles are input into a fastText model, and the fastText model detects, based on the participles, whether the domain name to be detected is a malicious domain name. In some embodiments, as shown in FIG. 2, the fastText model includes an input layer, a hidden layer, and an output layer. The participles (x_i represents one participle) serve as the input of the input layer, and the input layer converts the participles into corresponding word vectors. The hidden layer performs superposition average processing on the word vectors. The output layer outputs the detection result of the domain name to be detected based on the processing result of the hidden layer; the detection result may be the category to which the domain name to be detected belongs, such as malicious domain name or normal domain name.
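The three-layer structure described above can be illustrated with a small forward pass written in plain Python. This is a simplified sketch under stated assumptions, not fastText's actual implementation: the function name, the dictionary of embeddings, the weight matrix and the bias are hypothetical stand-ins for trained parameters, and a plain softmax is used in the output layer.

```python
import math


def forward(tokens, embeddings, weights, bias):
    """Average the token vectors, apply a linear layer, then softmax.

    embeddings: dict mapping participle -> vector (hypothetical trained values)
    weights, bias: output-layer parameters, one row/entry per class
    """
    dim = len(next(iter(embeddings.values())))
    # Input layer: look up a word vector for every known participle.
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    # Hidden layer: superposition (sum) followed by averaging.
    if vectors:
        hidden = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    else:
        hidden = [0.0] * dim
    # Output layer: linear scores turned into class probabilities.
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The class with the largest returned probability would be reported as the detection result, for example malicious or normal.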
In some text classification tasks there are many categories, and the complexity of computing a linear classifier is high. To improve running time, the fastText model uses the hierarchical softmax technique. Hierarchical softmax is built on Huffman coding: the labels are encoded, which greatly reduces the number of outputs the model has to evaluate for a prediction. The fastText model also exploits the fact that the classes are unbalanced (some classes occur more often than others) by using the Huffman method to build a tree structure that represents the classes. As a result, a frequently occurring category sits at a smaller depth in the tree than an infrequently occurring category, which makes the model more computationally efficient.
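The effect of Huffman coding on tree depth can be illustrated with a short sketch. The code below only builds Huffman depths from class frequencies; it does not implement hierarchical softmax itself, and the class names and counts in the usage lines are hypothetical.

```python
import heapq
from itertools import count


def huffman_depths(class_counts):
    """Return the depth of each class in a Huffman tree built from counts.

    Assumes class_counts is a non-empty dict of label -> frequency.
    """
    tie = count()  # unique tie-breaker so dicts are never compared
    heap = [(freq, next(tie), {label: 0}) for label, freq in class_counts.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {label: depth + 1 for label, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


depths = huffman_depths({"family_a": 50, "family_b": 30, "family_c": 15, "family_d": 5})
print(depths)
```

In this example the most frequent class ends up closest to the root, so fewer binary decisions are needed to reach it than to reach the rare classes.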
The training process of the fastText model is briefly described below.
In the training process, a large number of malicious domain names and normal domain names can be obtained from the network as training samples. For example, the top 1,000,000 ranked domain names can be selected from the Alexa website and marked as positive samples; the other part of the data is a DGA seed file obtained from a security laboratory, which contains nearly 1.2 million pieces of data from about 40 DGA malicious domain name families and is marked as negative samples. The positive samples and the negative samples are segmented by the N-gram model, and the segmented samples are divided into a training set and a test set at a ratio of 6:4. Domain name family information also needs to be considered when dividing the negative samples, to ensure that the characteristics of every type of domain name appear in the training set. Input data is then constructed according to the requirements of the fastText model: the training set and the test set from the previous step are stored in the files fasttext_train.txt and fasttext_test.txt respectively, in the format "__label__ + tag + \t + segmented fragment list". The following is part of the data in the fasttext_train.txt file:
__label__abnormal c h k r f m w s i h a i t l j $c ch hk kr rf fm mw ws si ih ha ai it tl lj j$ $ch chk hkr krf rfm fmw mws wsi sih iha hai ait itl tlj lj$;
__label__normal g o o g l e $g go oo og gl le e$ $go goo oog ogl gle le$;
where abnormal is the label of the negative samples and normal is the label of the positive samples.
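The construction of the input data can be sketched as follows. This is an illustrative sketch only: the helper names format_line and split_per_family are hypothetical, the split is deterministic rather than shuffled, and keeping at least one sample of every family in the training set is one possible reading of the requirement that the characteristics of all domain name families appear in the training set.

```python
from collections import defaultdict


def format_line(tag, tokens):
    """Build one line in the "__label__ + tag + tab + fragments" format."""
    return "__label__" + tag + "\t" + " ".join(tokens)


def split_per_family(samples, train_ratio=0.6):
    """Split (family, line) pairs 6:4 within each family.

    Every family contributes at least one line to the training set.
    """
    by_family = defaultdict(list)
    for family, line in samples:
        by_family[family].append(line)
    train, test = [], []
    for lines in by_family.values():
        cut = max(1, round(len(lines) * train_ratio))
        train.extend(lines[:cut])
        test.extend(lines[cut:])
    return train, test


print(format_line("normal", ["g", "o", "$g", "go", "o$"]))
```

The resulting training and test lines would then be written, one per line, to fasttext_train.txt and fasttext_test.txt.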
The model may then be trained using the model training function train_supervised provided by fastText, as follows:
classifier = fasttext.train_supervised("fasttext_train.txt")
The function returns the model object trained on the training set.
The model object may then be saved to a specified file using the model save function, such as classifier.save_model("fasttext.model.bin"), where the input parameter fasttext.model.bin is the name of the saved file. A save path may also be specified; by default the file is saved in the current directory.
The most notable characteristic of the fastText model is therefore its simplicity: beyond the input it has only a hidden layer and an output layer, so training is very fast. Training within minutes can be achieved on an ordinary CPU (central processing unit), which is several orders of magnitude faster than training a deep learning model. The fastText model combines ideas from natural language processing with the layered structure of a neural network to build a fast text classifier, so that prediction efficiency can be improved while detection precision is maintained.
After the fastText model is trained, the following process can be used to test the performance of the trained fastText model.
First, the model is loaded: classifier = fasttext.load_model("fasttext.model.bin"). The test data is fasttext_test.txt, whose format is consistent with that of fasttext_train.txt. The test set is input into the fastText model to obtain the precision and recall of the model on the test set, as follows:
result = classifier.test("fasttext_test.txt")
where result contains the precision and the recall of the model on the test set.
If the precision or the recall of the model does not meet the set requirements, the fastText model is trained again with a new training set until the precision and the recall of the trained fastText model meet the set requirements.
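The acceptance check described above can be sketched as follows. This is an illustrative sketch, not the evaluation that fastText's own test function performs: it computes precision and recall for a single class of interest from lists of true and predicted labels, and the function names and the 0.95 thresholds are hypothetical example values.

```python
def precision_recall(true_labels, predicted_labels, positive="abnormal"):
    """Precision and recall for one class of interest (here the malicious label)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def meets_requirements(precision, recall, min_precision=0.95, min_recall=0.95):
    """Hypothetical thresholds; retraining is needed when this returns False."""
    return precision >= min_precision and recall >= min_recall
```

A training loop would call meets_requirements after each evaluation and retrain with a new training set whenever it returns False.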
The difference in training time between the fastText model and other models is compared through experimental data. For example, models for malicious domain name detection are trained with a Support Vector Machine (SVC) model and a Long Short-Term Memory (LSTM) model and compared side by side with the fastText model, where the SVC model is a machine learning model and the LSTM model is a deep learning model. The following table shows the preliminary classification results of the models trained in the three ways under the same data set and machine conditions. The evaluation indexes in the table show that the classification performance of the fastText model is comparable to that of the deep learning model, and that both are better than the machine learning model.
Model      Accuracy   Precision   Recall
SVC        0.9444     0.9166      0.9583
LSTM       0.9724     0.9708      0.9891
fastText   0.9745     0.9622      0.9965
In addition, experiments show that training the machine learning model and the deep learning model takes more than 1 hour, whereas training the fastText model takes only a few minutes, so the time required to train the model is greatly shortened while the detection effect is maintained. Therefore, using the fastText model to predict malicious domain names can provide both high prediction precision and short processing time.
When a fastText model is used to detect a domain name to be detected, the prediction function of the model may be used to perform classification prediction on the domain name. If the predicted classification label is normal, the domain name to be detected is a normal domain name; if the predicted classification label is abnormal, the domain name to be detected is a malicious domain name.
The prediction process is, for example: pre = classifier.
Here pre represents the detection result. If the obtained detection result is (('__label__abnormal',), array([0.99966717])), the detection result is abnormal, indicating that the domain name to be detected, "jmowysfox.org", is a malicious domain name, and 0.99966717 is the probability that the domain name to be detected is a malicious domain name.
In some other embodiments, in order to further improve the detection accuracy, the hidden layer in the fastText model may also be set to multiple layers, such as two or three layers, so that hidden features in more input participles can be extracted through the hidden layer, thereby improving the prediction accuracy. It can be understood that, when the hidden layer is a multilayer, the training process and the prediction process of the model are similar to those described above, and will not be described in detail herein.
In the implementation process, a plurality of participles of the domain name to be detected are obtained and input into the fastText model, and the fastText model detects whether the domain name to be detected is a malicious domain name, so as to obtain a detection result. The fastText model implements text classification with a shallow neural network, and in text classification tasks a shallow neural network can often achieve precision comparable to that of a deep network. Detecting malicious domain names with the fastText model can therefore reach the detection precision achievable by a deep neural network, and the detection precision is higher than that of existing detection modes based on regular expressions or blacklists and whitelists.
On the basis of the above embodiment, in order to further mine the association relationships among the participles and further improve the prediction accuracy of the fastText model, the N-gram model may be used to segment the domain name to be detected into a plurality of initial participles. The association relationships among the initial participles may then be analyzed by a data association analysis method, and a plurality of participles with strong association relationships are screened out according to those relationships.
The method for performing word segmentation on the domain name to be detected by using the N-gram model may refer to the related description in the above embodiments, and will not be described in detail herein. The association relationship between the initial participles can be understood as the probability that the second initial participle appears when the first initial participle appears in the two initial participles. The probability can be obtained through a transaction set counted in advance, the transaction set refers to a combination formed by a large number of malicious domain names, each domain name can be called a transaction, one transaction comprises a plurality of items, and one item can refer to one participle.
Several concepts arising from data association analysis are introduced below:
Association rule (i.e., association relationship): used for finding the connection between the initial participles. An association rule is an implication of the form X → Y, where X and Y are called the antecedent and the consequent of the association rule, respectively.
Support: the support of the association rule X → Y is denoted as Supp(X → Y) and is the proportion of transactions in the transaction set that contain both the item set X and the item set Y.
Confidence: the confidence of the association rule X → Y is denoted as Conf(X → Y) and is the ratio of the number of transactions in the transaction set that contain both X and Y to the number of transactions that contain X. It represents the probability that a transaction containing the item set X also contains the item set Y.
Promotion degree (lift): the ratio of the confidence of the association rule X → Y to the proportion of transactions in the transaction set that contain the item set Y. It reflects the correlation between the item set X and the item set Y in the association rule; a value greater than 1 indicates that the presence of X makes Y more likely.
Strong association rule (i.e., strong association relationship): an association rule that is considered valuable.
For example, the implied meaning of the association rule {go} → {o} is the probability that the participle "o" occurs given that the participle "go" occurs. If the probability is large, the association rule is called a strong association rule, meaning that the participle "o" is very likely to occur once the participle "go" has occurred.
Therefore, after the domain name to be detected is segmented into a plurality of initial participles, data such as the support, confidence or lift (promotion degree) between every two adjacent initial participles can be calculated from the transaction set, and these values are used to judge whether a strong association rule (i.e., a strong association relationship) exists between the two. For example, two adjacent initial participles are considered to have a strong association relationship if their support is greater than a set support threshold, or if their confidence is greater than a set confidence threshold, or if their lift is greater than a set lift threshold. In other words, it is sufficient for at least one of the support, the confidence and the lift to satisfy its condition.
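The threshold test above can be sketched as follows. This is an illustration under stated assumptions: transactions are sets of participles, lift is computed with the standard formula conf(X → Y) / supp(Y), and the function names and default thresholds are hypothetical rather than values given in the application.

```python
from typing import Iterable, List, Set


def support(transactions: List[Set[str]], items: Iterable[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    wanted = set(items)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if wanted <= t) / len(transactions)


def confidence(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """supp(X and Y together) / supp(X); 0.0 when X never occurs."""
    sx = support(transactions, x)
    return support(transactions, x | y) / sx if sx else 0.0


def lift(transactions: List[Set[str]], x: Set[str], y: Set[str]) -> float:
    """conf(X -> Y) / supp(Y); 0.0 when Y never occurs."""
    sy = support(transactions, y)
    return confidence(transactions, x, y) / sy if sy else 0.0


def is_strong(transactions, x, y, min_support=0.5, min_confidence=0.8, min_lift=1.0) -> bool:
    """Strong if at least one measure exceeds its (assumed) threshold."""
    return (
        support(transactions, x | y) > min_support
        or confidence(transactions, x, y) > min_confidence
        or lift(transactions, x, y) > min_lift
    )


T = [{"go", "oo", "og"}, {"go", "oo"}, {"go", "le"}, {"ab", "le"}]
strong = is_strong(T, {"go"}, {"oo"})  # lift is 4/3 here, so the rule counts as strong
```

In this toy transaction set the rule {go} → {oo} has support 0.5 and confidence 2/3, both below the assumed thresholds, and is accepted only because its lift exceeds 1.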
Alternatively, the association analysis is not limited to two adjacent initial participles; it can also be performed on any two initial participles in a similar way. The initial participles found to have strong association relationships are screened out and used as the plurality of participles input into the fastText model. In this way, the association relationships among the participles are first analyzed by the data association analysis method, and the fastText model then further analyzes whether the strongly associated participles are related, which further improves the accuracy of malicious domain name detection.
In other embodiments, in order to increase the amount of data input into the fastText model so that the fastText model can detect malicious domain names on the basis of more data, the N-gram model may be used to segment the domain name to be detected into a plurality of initial participles, which are then combined in a set combination mode to obtain a plurality of combined participles.
The combination mode can be set as required. For example, at least every two adjacent initial participles are combined, or at least any two of the initial participles are combined. Other combination modes are also possible; for example, the initial participles with strong association relationships screened out by the data association analysis method may be combined in adjacent pairs or in arbitrary pairs. A large number of combined participles is obtained in this way, and both the combined participles and the initial participles can be input into the fastText model. With more input data, the fastText model can analyze the implicit relationships in the data to judge whether the order of the participles in the domain name is reasonable, and can thus use more data to detect malicious domain names and improve the detection accuracy.
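The two combination modes can be sketched as follows. The application does not say how two participles are combined, so plain string concatenation is an assumption here, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def combine_adjacent(tokens: Sequence[str]) -> List[str]:
    """Concatenate every two adjacent initial participles (assumed combination rule)."""
    return [a + b for a, b in zip(tokens, tokens[1:])]


def combine_any_pair(tokens: Sequence[str]) -> List[str]:
    """Concatenate any two initial participles, keeping their original order."""
    return [a + b for a, b in combinations(tokens, 2)]


def build_model_input(tokens: Sequence[str], mode: str = "adjacent") -> List[str]:
    """Initial participles followed by the combined participles."""
    if mode == "adjacent":
        combined = combine_adjacent(tokens)
    elif mode == "any":
        combined = combine_any_pair(tokens)
    else:
        raise ValueError("mode must be 'adjacent' or 'any'")
    return list(tokens) + combined


model_input = build_model_input(["go", "og", "le"])
# ['go', 'og', 'le', 'goog', 'ogle']
```

The "any" mode grows quadratically with the number of initial participles, which is one reason an implementation might restrict it to the strongly associated participles.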
On the basis of the above embodiment, in order to reduce the prediction workload of the fastText model, the domain name to be detected may be preliminarily screened after it is obtained. For example, similarity calculation is performed between the domain name to be detected and a plurality of pre-stored malicious domain names to obtain the similarity between the domain name to be detected and each malicious domain name. If the proportion of similarities greater than the set similarity exceeds the set proportion, the domain name to be detected is determined to be a suspicious domain name, and the subsequent word segmentation is then performed on it.
Of course, the domain name to be detected can also first be compared with each normal domain name in a pre-constructed normal domain name library. If it matches a normal domain name, the domain name to be detected is a normal domain name and no subsequent processing is needed. A domain name that is already known to be normal therefore does not need to be detected by the fastText model, which reduces the number of detections performed by the fastText model and improves detection efficiency. If the domain name to be detected matches none of the normal domain names in the library, its similarity to each malicious domain name is then compared.
A large number of malicious domain names are collected in advance. For example, if the database contains 1000 malicious domain names, the similarity between the domain name to be detected and each of them is calculated. Suppose the set similarity is 80% and the set proportion is 60% (both can be set flexibly according to actual requirements). If the similarities greater than 80% account for more than 60% of the total, that is, more than 600 malicious domain names each have a similarity greater than 80% with the domain name to be detected, the domain name to be detected is considered very likely to be malicious, i.e., a suspicious domain name.
In the implementation process, similarity calculation between the domain name to be detected and a plurality of malicious domain names is performed first, and the subsequent detection is performed only when the domain name to be detected is a suspicious domain name. This reduces the amount of detection required of the subsequent fastText model and improves detection efficiency when a large number of domain names in the network need to be detected.
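The pre-detection flow above (normal domain name library first, then the similarity proportion test) can be sketched as follows. This is only an illustration: `difflib.SequenceMatcher` from the Python standard library is used as a stand-in similarity measure, and the function name `pre_detect` and the returned labels are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Collection, Sequence


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; not the measure used in the application."""
    return SequenceMatcher(None, a, b).ratio()


def pre_detect(
    domain: str,
    normal_domains: Collection[str],
    malicious_domains: Sequence[str],
    min_similarity: float = 0.8,   # the "set similarity" (80% in the example)
    min_proportion: float = 0.6,   # the "set proportion" (60% in the example)
) -> str:
    """Return 'normal', 'suspicious' or 'not_suspicious' (hypothetical labels)."""
    if domain in normal_domains:
        return "normal"  # no further processing is needed
    if not malicious_domains:
        return "not_suspicious"
    hits = sum(1 for m in malicious_domains if similarity(domain, m) > min_similarity)
    if hits / len(malicious_domains) > min_proportion:
        return "suspicious"  # continue with word segmentation and fastText
    return "not_suspicious"


status = pre_detect("abcdefghij", {"example.com"}, ["abcdefghik", "abcdefghix", "zzzzzzzzzz"])
```

Only domain names labelled "suspicious" would be passed on to word segmentation and the fastText model, which is where the reduction in detection workload comes from.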
In the above embodiment, when the similarity between the domain name to be detected and a malicious domain name is calculated, the two domain names may be converted into word vectors, and the cosine similarity or Euclidean distance between the two vectors is then calculated to obtain the similarity between the two domain names.
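A minimal sketch of the vector comparison follows. The application does not specify how a domain name is converted into a vector, so a bag of character bigram counts is assumed here purely for illustration; the function names are hypothetical.

```python
import math
from collections import Counter


def bigram_vector(domain: str) -> Counter:
    """Sparse count vector of character bigrams (assumed vectorization)."""
    text = domain.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean_distance(u: Counter, v: Counter) -> float:
    """Straight-line distance between two sparse vectors."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in u.keys() | v.keys()))


sim = cosine_similarity(bigram_vector("abcd"), bigram_vector("abce"))  # 2/3
```

A larger cosine similarity means more similar domain names, whereas a larger Euclidean distance means less similar ones, so a threshold has to be applied in the matching direction.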
Alternatively, the fuzzy hash value of the domain name to be detected and the fuzzy hash value of each malicious domain name can be calculated and then compared to obtain the similarity between the domain name to be detected and each malicious domain name.
The fuzzy hash values of the domain name to be detected and of each malicious domain name can be calculated with the mature context triggered piecewise hashing (CTPH) algorithm, which splits the input into slices and hashes it slice by slice.
The algorithm is as follows:
1. Slicing: read part of the content of the domain name and compute a hash value over it with a weak hash algorithm. Typically a fixed-length window slides over the domain name byte by byte, like the sliding window in network protocols, and the hash is computed over the content in the window each time. For convenience, a rolling hash algorithm is therefore generally used. With a rolling hash, if the hash value h1 of "abcdef" has already been calculated, the hash value of "bcdefg" does not need to be recalculated from scratch; it is obtained as h1 - X(a) + Y(g), where X and Y are two functions. That is, only the effect on the hash value of the removed character and of the added character needs to be computed. Such a hash greatly speeds up the slicing decision.
The commonly used Adler-32 [4] and CRC32 algorithms can serve as the weak hash algorithm for slicing. In addition to the weak hash algorithm, a slicing trigger value n is required, which controls the slicing condition.
To avoid producing too few slices, for example when the slicing condition is triggered only once or not at all, how to slice the domain name can be determined according to the length of the domain name and its actual content.
2. Calculating a hash value for each slice
For each slice, the hash value may be calculated using a conventional hash algorithm such as MD5, or a hash algorithm such as the Fowler-Noll-Vo (FNV) hash [5].
3. Concatenating the hash values
The hash values of the slices are concatenated to obtain the fuzzy hash value of the whole domain name. If the slicing trigger value n differs between domain names, n also needs to be included in the fuzzy hash value.
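The three steps above can be sketched as follows. This is a simplified illustration, not the algorithm as implemented in any particular tool: the additive rolling hash, the window size, the trigger rule `rolling % n == n - 1` and the encoding of each slice hash as one character are assumptions borrowed from common CTPH implementations. The slice hash uses 32-bit FNV-1a.

```python
import string

# 64-character alphabet used to encode each slice hash as one character (assumption).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def roll(h: int, outgoing: int, incoming: int) -> int:
    """Rolling update h - X(outgoing) + Y(incoming), with X and Y as identity functions."""
    return h - outgoing + incoming


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash of a slice."""
    h = 0x811C9DC5
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h


def fuzzy_hash(domain: str, window: int = 3, trigger: int = 4) -> str:
    """Return '<trigger>:<one character per slice>' for a domain name."""
    data = domain.lower().encode("utf-8")
    rolling = 0
    start = 0
    digest = []
    for i, byte in enumerate(data):
        if i >= window:
            rolling = roll(rolling, data[i - window], byte)  # slide the window
        else:
            rolling += byte                                  # window still filling
        if rolling % trigger == trigger - 1:                 # step 1: slicing condition
            digest.append(ALPHABET[fnv1a_32(data[start:i + 1]) % 64])  # step 2
            start = i + 1
    if start < len(data):                                    # trailing slice
        digest.append(ALPHABET[fnv1a_32(data[start:]) % 64])
    return f"{trigger}:{''.join(digest)}"                    # step 3: concatenate


signature = fuzzy_hash("example.com")
```

The trigger value is written into the result so that two fuzzy hash values are only compared when they were produced with the same slicing condition.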
Therefore, the fuzzy hash values of the domain name to be detected and of each malicious domain name can be calculated according to the above method, and the fuzzy hash value of the domain name to be detected is then compared with the fuzzy hash value of each malicious domain name. The comparison may use an algorithm such as the Hamming distance or the minimum edit distance to calculate a distance value, which is then used to evaluate the similarity.
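The two distance measures named above can be sketched as follows. The conversion from a distance to a similarity score in [0, 1] (one minus the edit distance divided by the longer length) is an assumption for illustration; the application only says that the distance value is used to evaluate the similarity.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions (Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def hash_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


score = hash_similarity("abcd", "abcf")  # 0.75
```

The edit distance is the more general choice here, because two fuzzy hash values can have different lengths when the domain names are sliced into different numbers of slices.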
In the implementation process, fuzzy hash values allow the similarity between two character strings to be compared more accurately. The hash of a given slice of a domain name is affected only by additions, modifications or deletions of data within that slice, so a local change has little effect on the fuzzy hash value as a whole and therefore little effect on the final similarity. Even if several consecutive characters are changed, or changes are made in several places, the fuzzy hash algorithm can still make an effective judgment, so the similarity between two domain names can be judged more accurately by comparing their fuzzy hash values.
Referring to fig. 3, fig. 3 is a block diagram of a malicious domain name detection apparatus 200 according to an embodiment of the present disclosure, where the apparatus 200 may be a module, a program segment, or a code on an electronic device. It should be understood that the apparatus 200 corresponds to the above-mentioned embodiment of the method of fig. 1, and can perform various steps related to the embodiment of the method of fig. 1, and the specific functions of the apparatus 200 can be referred to the above description, and the detailed description is appropriately omitted here to avoid redundancy.
Optionally, the apparatus 200 comprises:
a domain name obtaining module 210, configured to obtain a domain name to be detected;
a word segmentation module 220, configured to segment words of the domain name to be detected to obtain multiple segments;
the detecting module 230 is configured to input the multiple segments into a fastText model, and detect whether the domain name to be detected is a malicious domain name through the fastText model to obtain a detection result.
Optionally, the fastText model includes an input layer, a hidden layer, and an output layer, where the multiple participles are used as the input of the input layer, the input layer is configured to convert the multiple participles into corresponding word vectors, the hidden layer is configured to sum and average the word vectors, and the output layer is configured to output a detection result of the domain name to be detected based on the processing result of the hidden layer.
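The three layers described above can be sketched as a forward pass in plain Python. This is a toy illustration, not the fastText library and not a trained detector: the class name `TinyFastText`, the random untrained weights, the vector dimension and the softmax output are assumptions.

```python
import math
import random
from typing import List, Sequence


class TinyFastText:
    """Toy forward pass: lookup (input), average (hidden), softmax (output)."""

    def __init__(self, vocab: Sequence[str], dim: int = 8, n_classes: int = 2, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        self.index = {tok: i for i, tok in enumerate(vocab)}
        # Input layer: one vector per known participle (random, untrained).
        self.embeddings = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
        # Output layer: one weight row per class, e.g. normal and malicious.
        self.weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_classes)]

    def hidden(self, tokens: Sequence[str]) -> List[float]:
        """Hidden layer: element-wise average of the participle vectors."""
        rows = [self.embeddings[self.index[t]] for t in tokens if t in self.index]
        if not rows:
            return [0.0] * self.dim
        return [sum(col) / len(rows) for col in zip(*rows)]

    def predict_proba(self, tokens: Sequence[str]) -> List[float]:
        """Output layer: linear map followed by a numerically stable softmax."""
        h = self.hidden(tokens)
        logits = [sum(w * x for w, x in zip(row, h)) for row in self.weights]
        top = max(logits)
        exps = [math.exp(z - top) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = TinyFastText(["go", "oo", "og", "gl", "le"])
probs = model.predict_proba(["go", "oo", "og", "gl", "le"])  # class probabilities
```

Because the hidden layer is a plain average, the cost of one prediction grows only linearly with the number of participles, which is the property that makes a shallow model of this kind fast.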
Optionally, the word segmentation module 220 is configured to perform word segmentation on the domain name to be detected by using an N-gram model to obtain a plurality of words.
Optionally, the word segmentation module 220 is configured to perform word segmentation on the domain name to be detected by using an N-gram model to obtain a plurality of initial word segmentations; analyzing the association relation among the initial participles by adopting a data association analysis method; and screening out a plurality of participles with strong association relation according to the association relation among the initial participles.
Optionally, the word segmentation module 220 is configured to perform word segmentation on the domain name to be detected by using an N-gram model to obtain a plurality of initial word segmentations; combining the plurality of initial participles according to a set combination mode to obtain a plurality of combined participles; wherein the input data of the fastText model includes the plurality of initial participles and the combined plurality of participles.
Optionally, the apparatus 200 further comprises:
the pre-detection module is used for calculating the similarity between the domain name to be detected and a plurality of prestored malicious domain names to obtain the similarity between the domain name to be detected and each malicious domain name; and if the proportion of similarities greater than the set similarity exceeds a set proportion, determining the domain name to be detected as a suspicious domain name.
Optionally, the pre-detection module is configured to calculate a fuzzy hash value of the domain name to be detected and a fuzzy hash value of each prestored malicious domain name; and to compare the fuzzy hash value of the domain name to be detected with the fuzzy hash value of each malicious domain name to obtain the similarity between the domain name to be detected and each malicious domain name.
It should be noted that, for the convenience and brevity of description, the specific working procedure of the above-described apparatus may refer to the corresponding procedure in the foregoing method embodiment, and the description is not repeated herein.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device for executing a malicious domain name detection method according to an embodiment of the present disclosure, where the electronic device may include: at least one processor 310, such as a CPU, at least one communication interface 320, at least one memory 330, and at least one communication bus 340. Wherein the communication bus 340 is used for realizing direct connection communication of these components. The communication interface 320 of the device in the embodiment of the present application is used for performing signaling or data communication with other node devices. The memory 330 may be a high-speed RAM memory or a non-volatile memory (e.g., at least one disk memory). The memory 330 may optionally be at least one memory device located remotely from the aforementioned processor. The memory 330 stores computer readable instructions, which when executed by the processor 310, cause the electronic device to perform the method processes described above with reference to fig. 1.
It will be appreciated that the configuration shown in fig. 4 is merely illustrative and that the electronic device may include more or fewer components than shown in fig. 4 or may have a different configuration than shown in fig. 4. The components shown in fig. 4 may be implemented in hardware, software, or a combination thereof.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the method processes performed by an electronic device in the method embodiment shown in fig. 1.
The present embodiments disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example, comprising: acquiring a domain name to be detected; performing word segmentation on the domain name to be detected to obtain a plurality of word segments; and inputting the plurality of participles into a fastText model, and detecting whether the domain name to be detected is a malicious domain name or not through the fastText model to obtain a detection result.
In summary, the embodiments of the present application provide a malicious domain name detection method, apparatus, electronic device, and storage medium. The method obtains a plurality of participles of a domain name to be detected, inputs the participles into a fastText model, and detects through the fastText model whether the domain name to be detected is a malicious domain name, so as to obtain a detection result. The fastText model performs text classification with a shallow neural network, and in text classification tasks a shallow neural network can often reach accuracy comparable to that of a deep neural network, so detecting malicious domain names with the fastText model can achieve detection accuracy comparable to that of a deep neural network.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A malicious domain name detection method, characterized in that the method comprises:
acquiring a domain name to be detected;
performing word segmentation on the domain name to be detected to obtain a plurality of word segments;
and inputting the plurality of participles into a fastText model, and detecting whether the domain name to be detected is a malicious domain name or not through the fastText model to obtain a detection result.
2. The method according to claim 1, wherein the fastText model includes an input layer, a hidden layer and an output layer, the multiple participles are used as input of the input layer, the input layer is used for converting the multiple participles into corresponding word vectors, the hidden layer is used for performing superposition average processing on the word vectors, and the output layer is used for outputting a detection result of the domain name to be detected based on a processing result of the hidden layer.
3. The method according to claim 1, wherein the segmenting the domain name to be detected to obtain a plurality of segments comprises:
and performing word segmentation on the domain name to be detected by adopting an N-gram model to obtain a plurality of word segmentations.
4. The method according to claim 1, wherein the segmenting the domain name to be detected to obtain a plurality of segments comprises:
performing word segmentation on the domain name to be detected by adopting an N-gram model to obtain a plurality of initial word segmentations;
analyzing the association relation among the initial participles by adopting a data association analysis method;
and screening out a plurality of participles with strong association relation according to the association relation among the initial participles.
5. The method according to claim 1, wherein the segmenting the domain name to be detected to obtain a plurality of segments comprises:
performing word segmentation on the domain name to be detected by adopting an N-gram model to obtain a plurality of initial word segmentations;
combining the plurality of initial participles according to a set combination mode to obtain a plurality of combined participles;
wherein the input data of the fastText model includes the plurality of initial participles and the combined plurality of participles.
6. The method according to claim 1, wherein after acquiring the domain name to be detected and before performing word segmentation on the domain name to be detected to obtain a plurality of word segments, the method further comprises:
performing similarity calculation on the domain name to be detected and a plurality of prestored malicious domain names to obtain the similarity between the domain name to be detected and each malicious domain name;
if the proportion of similarities greater than the set similarity exceeds a set proportion, determining that the domain name to be detected is a suspicious domain name, and executing the step of performing word segmentation on the domain name to be detected.
7. The method according to claim 6, wherein the calculating the similarity between the domain name to be detected and a plurality of pre-stored malicious domain names to obtain the similarity between the domain name to be detected and each malicious domain name comprises:
calculating a fuzzy hash value of the domain name to be detected, and calculating a fuzzy hash value of each prestored malicious domain name;
and comparing the fuzzy hash value of the domain name to be detected with the fuzzy hash value of each malicious domain name to obtain the similarity between the domain name to be detected and each malicious domain name.
8. An apparatus for malicious domain name detection, the apparatus comprising:
the domain name acquisition module is used for acquiring a domain name to be detected;
the word segmentation module is used for segmenting the domain name to be detected to obtain a plurality of segmented words;
and the detection module is used for inputting the plurality of participles into a fastText model, detecting whether the domain name to be detected is a malicious domain name or not through the fastText model, and obtaining a detection result.
9. An electronic device comprising a processor and a memory, the memory storing computer readable instructions that, when executed by the processor, perform the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202111616632.7A 2021-12-27 2021-12-27 Malicious domain name detection method and device, electronic equipment and storage medium Pending CN114266251A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111616632.7A CN114266251A (en) 2021-12-27 2021-12-27 Malicious domain name detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111616632.7A CN114266251A (en) 2021-12-27 2021-12-27 Malicious domain name detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114266251A true CN114266251A (en) 2022-04-01

Family

ID=80830626

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111616632.7A Pending CN114266251A (en) 2021-12-27 2021-12-27 Malicious domain name detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114266251A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928472A (en) * 2022-04-20 2022-08-19 哈尔滨工业大学(威海) Method for filtering bad site grey list based on full-volume circulation main domain name
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114928472A (en) * 2022-04-20 2022-08-19 哈尔滨工业大学(威海) Method for filtering bad site grey list based on full-volume circulation main domain name
CN114928472B (en) * 2022-04-20 2023-07-18 哈尔滨工业大学(威海) Bad site gray list filtering method based on full circulation main domain name
CN116186698A (en) * 2022-12-16 2023-05-30 广东技术师范大学 Machine learning-based secure data processing method, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination