CN111209497B - DGA domain name detection method based on GAN and Char-CNN - Google Patents


Info

Publication number
CN111209497B
Authority
CN
China
Prior art keywords
domain name
layer
char
cnn
equal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010007697.0A
Other languages
Chinese (zh)
Other versions
CN111209497A (en)
Inventor
杨超
杨延洲
苏锐丹
郑昱
尤伟
陈明哲
王潇皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010007697.0A
Publication of CN111209497A
Application granted
Publication of CN111209497B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection

Abstract

The invention provides a DGA domain name detection method based on a GAN and a Char-CNN, which addresses the low detection recall rate for low-randomness DGA domain names in the prior art. The method comprises the following steps: acquiring a training sample set and a verification sample set; constructing a generative adversarial network (GAN) and a character-level convolutional neural network (Char-CNN); iteratively training the GAN; acquiring an augmented training set; iteratively training the Char-CNN; and detecting domain names with the trained character-level convolutional neural network Char-CNN'. The method uses the GAN to generate adversarial domain names that augment the data set, which enriches the training sample set. The residual block structure reduces the error rate of the detection model, which improves the detection recall rate for low-randomness DGA domain names. In addition, the Char-CNN has few hyper-parameters to compute, which shortens the training time of the detection model.

Description

DGA domain name detection method based on GAN and Char-CNN
Technical Field
The invention belongs to the technical field of network security and relates to a DGA domain name detection method, in particular to a DGA domain name detection method based on a GAN and a Char-CNN, which can be used for locating infected hosts, shutting down botnets, and defending against network attacks.
Background
A DGA domain name is a domain name generated periodically by a domain generation algorithm (DGA) from a random seed such as a number, a date, or a trending Twitter topic. Network attackers register DGA domain names as the channel through which bots communicate with command and control servers, and the large number of potential DGA domain names makes it difficult for law enforcement to shut down a botnet effectively. DGA domain names seriously threaten the security of network hosts. The emerging low-randomness DGA domain names in particular are well concealed and pose a greater threat, so detecting DGA domain names effectively is of great significance. The DGA domain name detection task extracts features from a domain name, performs computation on the extracted features, and outputs a prediction probability that determines whether the domain name is a DGA domain name. Many metrics are used to evaluate DGA domain name detection, such as the receiver operating characteristic (ROC) curve, the F1 score, and the detection recall rate. The detection recall rate is the ratio of detected DGA domain names to all DGA domain names, which makes it an important evaluation metric.
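The recall definition above can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `recall_rate` and the toy label lists are assumptions and do not come from the patent.

```python
def recall_rate(true_labels, predicted_labels):
    """Fraction of true DGA domain names (label 1) that were flagged as DGA."""
    detected = sum(
        1 for y, p in zip(true_labels, predicted_labels) if y == 1 and p == 1
    )
    total_dga = sum(1 for y in true_labels if y == 1)
    return detected / total_dga if total_dga else 0.0


# Toy data: four DGA names (label 1), three of which are detected.
truth = [1, 1, 1, 1, 0, 0]
preds = [1, 1, 1, 0, 1, 0]
print(recall_rate(truth, preds))
```

Note that the false positive in the toy data (a benign name flagged as DGA) does not affect recall; it would only show up in precision or accuracy.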
DGA domain name detection methods can be classified into blacklist-based methods, machine-learning-based methods, and deep-learning-based methods. A blacklist-based method decides whether a domain name is a DGA domain name by checking whether it appears in a preset blacklist; because the blacklist must be updated continuously, the method has poor real-time performance. A machine-learning-based method manually extracts features of a domain name, such as its length, information entropy, proportion of vowel characters, and number of repeated characters, and then detects DGA domain names with machine learning algorithms such as support vector machines and random forests, which allows real-time detection. A deep-learning-based method automatically extracts latent features of a domain name with a neural network model and outputs a prediction probability after neuron computation, thereby detecting whether the domain name is a DGA domain name.
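As an illustration of the hand-crafted features mentioned for the machine-learning-based methods, the following sketch computes length, Shannon entropy, vowel proportion, and a repeated-character count. The function name and the exact definition of "repeated characters" (distinct characters that occur more than once) are assumptions, not taken from the patent or the prior art it cites.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def handcrafted_features(domain):
    """Return (length, Shannon entropy in bits, vowel ratio, repeated-character count)."""
    label = domain.lower()
    counts = Counter(label)
    length = len(label)
    entropy = (
        -sum((c / length) * math.log2(c / length) for c in counts.values())
        if length
        else 0.0
    )
    vowel_ratio = sum(1 for ch in label if ch in VOWELS) / length if length else 0.0
    # Assumption: count distinct characters that appear more than once.
    repeated = sum(1 for c in counts.values() if c > 1)
    return length, entropy, vowel_ratio, repeated


print(handcrafted_features("google"))
print(handcrafted_features("xk2q9vz7"))
```

A feature tuple like this would then be passed to a classifier such as a support vector machine or a random forest; that step is omitted here.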
To solve this problem, methods that extract multidimensional features of domain names with an integrated neural network and then detect DGA domain names have been proposed in recent years. For example, ralla Yun et al. of China Electronics Great Wall Internet System Application Co., Ltd. published the article "Integrated DGA domain name detection method based on deep learning" in "Information Technology and Network Security", 2018, volume 37, issue 10, which proposes an integrated DGA domain name detection method based on deep learning. The method combines a recurrent neural network (RNN) and a convolutional neural network (CNN) and constructs an integrated detection model consisting of a character embedding layer, a feature extraction layer, and a classification layer. The feature extraction layer uses a CNN model and an RNN model to automatically extract features of the input characters along the spatial and temporal dimensions respectively, which effectively improves the detection recall rate for DGA domain names. However, the method still has shortcomings. The training sample set contains too few low-randomness DGA domain names and lacks richness, and vanishing gradients occur when the network is too deep, which raises the error rate; together these lead to a low detection recall rate for low-randomness DGA domain names. In addition, the computation at each time step of the RNN depends on the computation and output of the previous time step, so more hyper-parameters must be computed and the training time of the detection model increases.
Disclosure of Invention
In view of the above shortcomings of the prior art, the invention aims to provide a DGA domain name detection method based on a GAN and a Char-CNN that solves the problem of the low detection recall rate for low-randomness DGA domain names.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set and a verification sample set:
(1a) selecting, in order, the first L popular domain names from the popular domain name set Alexa to form training sample set A, where $L \geq 600000$;
(1b) randomly selecting M benign domain names of class 0 from the benign domain name set TRANCO and labeling the class of each benign domain name; randomly selecting N DGA domain names of class 1 from the DGA domain name set DGArchive and labeling the class of each DGA domain name; then combining $\alpha M$ benign domain names, $\alpha N$ DGA domain names, and their corresponding labels into training sample set B, and combining the remaining $M - \alpha M$ benign domain names, the remaining $N - \alpha N$ DGA domain names, and their corresponding labels into the verification sample set, where $M \geq 100000$, $N \geq 100000$, and $0.6 \leq \alpha \leq 0.8$;
(2) constructing a generative adversarial network GAN and a character-level convolutional neural network Char-CNN:
constructing a generative adversarial network GAN comprising a generator network and a discriminator network, wherein the generator network comprises a fully-connected layer, a plurality of residual blocks, a one-dimensional convolutional layer, and an activation layer, and the discriminator network comprises a one-dimensional convolutional layer, a plurality of residual blocks, and a fully-connected layer;
constructing a character-level convolutional neural network Char-CNN comprising an embedding layer, a plurality of one-dimensional convolutional layers, a plurality of activation layers, a plurality of one-dimensional max-pooling layers, a plurality of residual blocks, a Dropout layer, and a plurality of fully-connected layers;
(3) iteratively training the generative adversarial network GAN:
(3a) letting the iteration counter be $q_1$ and the maximum number of iterations be $Q_1$, where $Q_1 \geq 2000$, and initializing $q_1 = 0$;
(3b) feeding random noise $noise_1$ into the generator network to compute m adversarial domain name vectors, and at the same time encoding m popular domain names randomly selected from training sample set A to obtain m popular domain name vectors, where $64 \leq m \leq L$;
(3c) feeding the m adversarial domain name vectors and the m popular domain name vectors into the discriminator network to obtain the probability set $\{\hat{d}_1, \ldots, \hat{d}_i, \ldots, \hat{d}_m, d_1, \ldots, d_j, \ldots, d_m\}$, where $\hat{d}_i$ is the probability that the i-th adversarial domain name vector originates from training sample set A, $d_j$ is the probability that the j-th popular domain name vector originates from training sample set A, $1 \leq i \leq m$, and $1 \leq j \leq m$;
(3d) computing, from the probability set, the loss $loss_g$ of the generator network and the loss $loss_d$ of the discriminator network;
(3e) training the generative adversarial network GAN with the Adam algorithm using $loss_g$ and $loss_d$, and checking whether $q_1 = Q_1$ holds; if so, obtaining the trained generative adversarial network GAN'; otherwise letting $q_1 = q_1 + 1$ and returning to step (3b);
(4) obtaining an augmented training set:
(4a) feeding random noise $noise_2$ into the trained generative adversarial network GAN' to compute P adversarial domain name vectors, and decoding each adversarial domain name vector to obtain P adversarial domain names of class 1, where $20000 \leq P \leq L$;
(4b) labeling the class of each adversarial domain name, and adding the P adversarial domain names and their labels to training sample set B to obtain the augmented training set;
(5) iteratively training the character-level convolutional neural network Char-CNN:
(5a) letting the iteration counter be $q_2$ and the maximum number of iterations be $Q_2$, where $Q_2 \geq 1000$, and initializing $q_2 = 0$;
(5b) encoding n domain names randomly selected from the augmented training set to obtain n domain name vectors, and feeding the n domain name vectors into the character-level convolutional neural network Char-CNN to obtain the probability set $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, where $p_k$ is the probability that the class of the k-th domain name is 1, $1 \leq k \leq n$, and $32 \leq n \leq \alpha M + \alpha N + P$;
(5c) computing the loss $loss$ of the character-level convolutional neural network Char-CNN from $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$;
(5d) training the character-level convolutional neural network Char-CNN with the RMSprop algorithm using $loss$ to obtain the trained model $\text{Char-CNN}_{q_2}$;
(5e) encoding c verification domain names randomly selected from the verification sample set to obtain c verification domain name vectors, and feeding the c verification domain name vectors into $\text{Char-CNN}_{q_2}$ to obtain the probability set $\{\hat{p}_1, \ldots, \hat{p}_v, \ldots, \hat{p}_c\}$, where $\hat{p}_v$ is the probability that the class of the v-th verification domain name is 1, $1 \leq v \leq c$, and $32 \leq c \leq M - \alpha M + N - \alpha N$;
(5f) computing the detection accuracy Accuracy of the c verification samples from $\{\hat{p}_1, \ldots, \hat{p}_v, \ldots, \hat{p}_c\}$;
(5g) checking whether $q_2 = Q_2$ holds or Accuracy no longer increases; if so, obtaining the trained character-level convolutional neural network Char-CNN'; otherwise letting $q_2 = q_2 + 1$ and returning to step (5b);
(6) detecting domain names with the trained character-level convolutional neural network Char-CNN':
(6a) letting the number of domain names to be detected be t, and encoding each domain name to be detected to obtain t domain name vectors to be detected, where $t \geq 1$;
(6b) feeding the t domain name vectors to be detected into the trained character-level convolutional neural network Char-CNN' to obtain the probability set $\{\tilde{p}_1, \ldots, \tilde{p}_u, \ldots, \tilde{p}_t\}$, and checking whether $\tilde{p}_u > 0.5$ holds; if so, the u-th domain name to be detected is a DGA domain name; otherwise it is a non-DGA domain name, where $\tilde{p}_u$ is the probability that the class of the u-th domain name to be detected is 1, and $1 \leq u \leq t$.
Compared with the prior art, the invention has the following advantages:
First, the invention generates adversarial domain names with the generative adversarial network GAN. The generator network and the discriminator network in the GAN are trained together and compete against each other, so the generated adversarial domain names imitate low-randomness popular domain names well. At the same time, the residual block alleviates the vanishing-gradient problem of deep networks by transforming the learned objective function and reduces the error rate of the detection model, which further improves the detection recall rate for low-randomness DGA domain names. Simulation results show that the detection recall rate is improved by 28.3 percent compared with the prior art.
Second, the invention detects DGA domain names with the character-level convolutional neural network Char-CNN, which learns local features through convolution and then aggregates them into overall features. Compared with the recurrent neural network RNN, fewer hyper-parameters need to be computed. In addition, the residual block in the Char-CNN has a simple structure and learns quickly, which shortens the training time of the detection model.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a structural diagram of the residual block used in the generative adversarial network GAN and the character-level convolutional neural network Char-CNN of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the specific embodiments.
Referring to fig. 1, the present invention includes the steps of:
(1) acquiring a training sample set and a verification sample set:
(1a) selecting, in order, the first L popular domain names from the popular domain name set Alexa to form training sample set A, where $L \geq 600000$;
(1b) randomly selecting M benign domain names of class 0 from the benign domain name set TRANCO and labeling the class of each benign domain name; randomly selecting N DGA domain names of class 1 from the DGA domain name set DGArchive and labeling the class of each DGA domain name; then combining $\alpha M$ benign domain names, $\alpha N$ DGA domain names, and their corresponding labels into training sample set B, and combining the remaining $M - \alpha M$ benign domain names, the remaining $N - \alpha N$ DGA domain names, and their corresponding labels into the verification sample set, where $M \geq 100000$, $N \geq 100000$, and $0.6 \leq \alpha \leq 0.8$;
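The split in step (1b) can be sketched as follows. This is a minimal illustration under assumptions: the function name `build_sample_sets`, the fixed random seed, and the tiny toy lists are hypothetical, and real inputs would be the TRANCO and DGArchive lists.

```python
import random


def build_sample_sets(benign, dga, alpha=0.8, seed=0):
    """Split labelled benign (0) and DGA (1) names into training set B and a verification set."""
    if not 0.6 <= alpha <= 0.8:
        raise ValueError("alpha is expected to lie in [0.6, 0.8]")
    rng = random.Random(seed)
    benign = [(name, 0) for name in benign]
    dga = [(name, 1) for name in dga]
    rng.shuffle(benign)
    rng.shuffle(dga)
    cut_benign = round(alpha * len(benign))  # alpha * M benign names for training
    cut_dga = round(alpha * len(dga))        # alpha * N DGA names for training
    training_b = benign[:cut_benign] + dga[:cut_dga]
    verification = benign[cut_benign:] + dga[cut_dga:]
    return training_b, verification


benign_names = [f"benign{i}.com" for i in range(10)]
dga_names = [f"dga{i}.net" for i in range(10)]
training_b, verification = build_sample_sets(benign_names, dga_names, alpha=0.8)
print(len(training_b), len(verification))
```

With alpha = 0.8 and M = N = 100000, the same split yields the 80000 + 80000 training names and 20000 + 20000 verification names used in the simulation experiments.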
(2) constructing a generative adversarial network GAN and a character-level convolutional neural network Char-CNN:
constructing a generative adversarial network GAN comprising a generator network and a discriminator network, wherein the generator network comprises a fully-connected layer, a plurality of residual blocks, a one-dimensional convolutional layer, and an activation layer, and the discriminator network comprises a one-dimensional convolutional layer, a plurality of residual blocks, and a fully-connected layer;
constructing a character-level convolutional neural network Char-CNN comprising an embedding layer, a plurality of one-dimensional convolutional layers, a plurality of activation layers, a plurality of one-dimensional max-pooling layers, a plurality of residual blocks, a Dropout layer, and a plurality of fully-connected layers;
Referring to FIG. 2, the residual blocks contained in the generator network, the discriminator network, and the character-level convolutional neural network Char-CNN each include 2 activation layers and 2 one-dimensional convolutional layers, arranged as: first activation layer → first one-dimensional convolutional layer → second activation layer → second one-dimensional convolutional layer. The activation function of the activation layers is ReLU; the output space dimension of the one-dimensional convolutional layers is 128, the convolution kernel size is 5, and the convolution stride is 1 character. The input x of the first activation layer is added to the output f(x) of the second one-dimensional convolutional layer through a skip connection, so the objective function finally learned by the residual block is $h(x) = \eta f(x) + x$, where $\eta$ is a weight coefficient and $0 \leq \eta \leq 1$.
The objective function learned by an ordinary deep network is $f(x) = x$, whose derivative is constantly 1, and this can lead to vanishing gradients during back-propagation. By transforming the learned objective function, the residual block alleviates the vanishing-gradient problem of deep networks and reduces the error rate of the detection model, which improves the detection recall rate for low-randomness DGA domain names. The residual block is also structurally simple and learns quickly, which shortens the training time of the detection model.
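The residual block described above (activation → convolution → activation → convolution, followed by a weighted skip connection) can be sketched in plain Python. This is a simplified single-channel illustration, not the patent's implementation: the real block uses 128 channels, and the zero "same" padding used here is an assumption made so that f(x) and x have equal length and can be added.

```python
def relu(seq):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in seq]


def conv1d_same(seq, kernel):
    """Single-channel 1-D convolution (cross-correlation), stride 1, zero 'same' padding.

    Assumes an odd kernel length, such as the size-5 kernel of the residual block.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(seq) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(seq))
    ]


def residual_block(x, kernel1, kernel2, eta=0.3):
    """Compute h(x) = eta * f(x) + x with f = conv2(relu(conv1(relu(x))))."""
    f = conv1d_same(relu(x), kernel1)
    f = conv1d_same(relu(f), kernel2)
    return [eta * fv + xv for fv, xv in zip(f, x)]


identity_kernel = [0.0, 0.0, 1.0, 0.0, 0.0]  # size-5 kernel that copies its input
x = [1.0, 2.0, 3.0]
print(residual_block(x, identity_kernel, identity_kernel))
```

With all-zero kernels the block returns x unchanged, which shows why the skip connection keeps a direct path for the signal (and its gradient) even when the convolutional branch contributes nothing.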
The generator network and the discriminator network in the generative adversarial network GAN each contain 5 residual blocks, where:
the specific structure of the generator network is: fully-connected layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → one-dimensional convolutional layer → activation layer, wherein the fully-connected layer has an input space dimension of 128 and an output space dimension of 128 × 63; the weight coefficients of all residual blocks are 0.3; the one-dimensional convolutional layer has an output space dimension of 38, a convolution kernel size of 1, and a convolution stride of 1 character; and the activation function of the activation layer is Softmax;
the specific structure of the discriminator network is: one-dimensional convolutional layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → fully-connected layer, wherein the one-dimensional convolutional layer has an input space dimension of 38, an output space dimension of 128, a convolution kernel size of 1, and a convolution stride of 1 character; the weight coefficients of all residual blocks are 0.3; and the output space dimension of the fully-connected layer is 1;
the character-level convolutional neural network Char-CNN contains 2 one-dimensional convolutional layers, 4 activation layers, 2 one-dimensional max-pooling layers, 3 residual blocks, and 2 fully-connected layers, and its specific structure is: embedding layer → first one-dimensional convolutional layer → first activation layer → first one-dimensional max-pooling layer → second one-dimensional convolutional layer → second activation layer → second one-dimensional max-pooling layer → first fully-connected layer → first residual block → second residual block → third residual block → third activation layer → Dropout layer → second fully-connected layer → fourth activation layer, wherein the embedding layer has an input space dimension of 38, an output space dimension of 128, and a sequence length of 63; all one-dimensional convolutional layers have an output space dimension of 128 and a convolution stride of 1 character, the convolution kernel size of the first one-dimensional convolutional layer is 3, and the convolution kernel size of the second one-dimensional convolutional layer is 2; the activation functions of the first, second, and third activation layers are all ThresholdedReLU, and the activation function of the fourth activation layer is Sigmoid; all one-dimensional max-pooling layers use "same" padding with a pooling window size of 2; the weight coefficients of all residual blocks are 0.3; the drop rate of the Dropout layer is 0.5; and the output space dimension of the first fully-connected layer is 64 and that of the second fully-connected layer is 1.
(3) iteratively training the generative adversarial network GAN:
(3a) letting the iteration counter be $q_1$ and the maximum number of iterations be $Q_1$, where $Q_1 \geq 2000$, and initializing $q_1 = 0$;
(3b) generating random noise $noise_1$ with the random.normal function of the third-party Python library NumPy, feeding $noise_1$ into the generator network to compute m adversarial domain name vectors, and at the same time encoding m popular domain names randomly selected from training sample set A to obtain m popular domain name vectors, where $64 \leq m \leq L$;
(3c) feeding the m adversarial domain name vectors and the m popular domain name vectors into the discriminator network to obtain the probability set $\{\hat{d}_1, \ldots, \hat{d}_i, \ldots, \hat{d}_m, d_1, \ldots, d_j, \ldots, d_m\}$, where $\hat{d}_i$ is the probability that the i-th adversarial domain name vector originates from training sample set A, $d_j$ is the probability that the j-th popular domain name vector originates from training sample set A, $1 \leq i \leq m$, and $1 \leq j \leq m$;
(3d) computing, from the probability set, the loss $loss_g$ of the generator network and the loss $loss_d$ of the discriminator network, the calculation formulas being respectively:
[Formula images for $loss_g$ and $loss_d$: not reproduced in the source text]
(3e) training the generative adversarial network GAN with the Adam algorithm using $loss_g$ and $loss_d$, and checking whether $q_1 = Q_1$ holds; if so, obtaining the trained generative adversarial network GAN'; otherwise letting $q_1 = q_1 + 1$ and returning to step (3b);
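The formulas for the generator and discriminator losses appear only as images in the source and are not recoverable from the text. For orientation only, the following sketch computes the standard cross-entropy GAN losses from the discriminator's two groups of probabilities. This choice is an assumption; the patent may use a different formulation (for example a Wasserstein-style loss), and the function and parameter names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def gan_losses(fake_probs, real_probs):
    """Return (loss_g, loss_d) for one batch, assuming the standard cross-entropy GAN losses.

    fake_probs: discriminator probabilities that each adversarial vector comes from set A.
    real_probs: discriminator probabilities that each real domain vector comes from set A.
    """
    m = len(fake_probs)
    # Discriminator: push real probabilities towards 1 and fake probabilities towards 0.
    loss_d = (
        -sum(math.log(d + EPS) for d in real_probs) / m
        - sum(math.log(1.0 - g + EPS) for g in fake_probs) / m
    )
    # Generator (non-saturating form): push fake probabilities towards 1.
    loss_g = -sum(math.log(g + EPS) for g in fake_probs) / m
    return loss_g, loss_d


loss_g, loss_d = gan_losses(fake_probs=[0.5, 0.5], real_probs=[0.5, 0.5])
print(loss_g, loss_d)
```

When the discriminator outputs 0.5 for every input it cannot tell the two groups apart, and the losses take the values ln 2 and 2 ln 2 respectively, which is the usual equilibrium reference point for this formulation.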
(4) obtaining an augmented training set:
(4a) generating random noise $noise_2$ with the random.normal function of the third-party Python library NumPy, feeding $noise_2$ into the trained generative adversarial network GAN' to compute P adversarial domain name vectors, and decoding each adversarial domain name vector to obtain P adversarial domain names of class 1, where $20000 \leq P \leq L$;
(4b) labeling the class of each adversarial domain name, and adding the P adversarial domain names and their labels to training sample set B to obtain the augmented training set;
The adversarial domain names produced by the mutual game between the generator network and the discriminator network in the GAN imitate low-randomness popular domain names well. Because they are generated by an algorithm and have low randomness, they can be regarded as low-randomness DGA domain names. Adding them to the training sample set enriches it and effectively improves the detection recall rate for low-randomness DGA domain names.
(5) iteratively training the character-level convolutional neural network Char-CNN:
(5a) letting the iteration counter be $q_2$ and the maximum number of iterations be $Q_2$, where $Q_2 \geq 1000$, and initializing $q_2 = 0$;
(5b) encoding n domain names randomly selected from the augmented training set to obtain n domain name vectors, and feeding the n domain name vectors into the character-level convolutional neural network Char-CNN to obtain the probability set $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, where $p_k$ is the probability that the class of the k-th domain name is 1, $1 \leq k \leq n$, and $32 \leq n \leq \alpha M + \alpha N + P$;
(5c) computing the loss $loss$ of the character-level convolutional neural network Char-CNN from $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, the calculation formula being:
[Formula image for $loss$: not reproduced in the source text]
where $y_k$ is the true class of the k-th domain name;
(5d) training the character-level convolutional neural network Char-CNN with the RMSprop algorithm using $loss$ to obtain the trained model $\text{Char-CNN}_{q_2}$;
(5e) encoding c verification domain names randomly selected from the verification sample set to obtain c verification domain name vectors, and feeding the c verification domain name vectors into $\text{Char-CNN}_{q_2}$ to obtain the probability set $\{\hat{p}_1, \ldots, \hat{p}_v, \ldots, \hat{p}_c\}$, where $\hat{p}_v$ is the probability that the class of the v-th verification domain name is 1, $1 \leq v \leq c$, and $32 \leq c \leq M - \alpha M + N - \alpha N$;
(5f) computing the detection accuracy Accuracy of the c verification samples from $\{\hat{p}_1, \ldots, \hat{p}_v, \ldots, \hat{p}_c\}$, the calculation formula being:
$$\text{Accuracy} = \frac{tp + tn}{c}$$
where tp is the number of the c verification samples whose true class is 1 and whose predicted probability of class 1 is greater than 0.5, and tn is the number of the c verification samples whose true class is 0 and whose predicted probability of class 1 is not greater than 0.5;
(5g) checking whether $q_2 = Q_2$ holds or Accuracy no longer increases; if so, obtaining the trained character-level convolutional neural network Char-CNN'; otherwise letting $q_2 = q_2 + 1$ and returning to step (5b);
The character-level convolutional neural network Char-CNN is a deep feedforward neural network that includes convolution computation. It learns local features and then aggregates them into overall features, so latent features can be fully extracted. Compared with the recurrent neural network RNN, fewer hyper-parameters need to be computed. In addition, the residual block in the Char-CNN has a simple structure and learns quickly, which shortens the training time of the detection model.
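The bookkeeping in step (5) can be sketched without any deep-learning library. Two points are assumptions rather than statements from the patent: the loss is taken to be binary cross-entropy (a common choice for a Sigmoid output with 0/1 labels, but the patent's formula is not reproduced in the text), and "Accuracy no longer increases" is implemented with a hypothetical `patience` window. The accuracy function follows the tp and tn definitions given in step (5f).

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy (assumed form of the Char-CNN loss)."""
    n = len(probs)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / n


def accuracy(probs, labels):
    """(tp + tn) / c, where class 1 is predicted when the probability exceeds 0.5."""
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p > 0.5)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p <= 0.5)
    return (tp + tn) / len(probs)


def should_stop(q2, max_iters, accuracy_history, patience=3):
    """Stop at Q2 iterations, or when accuracy has not improved for `patience` checks."""
    if q2 >= max_iters:
        return True
    if len(accuracy_history) > patience:
        best_before = max(accuracy_history[:-patience])
        return max(accuracy_history[-patience:]) <= best_before
    return False


print(accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 1]))
print(should_stop(5, 1000, [0.8, 0.9, 0.9, 0.9, 0.9]))
```

In a real training loop, `accuracy` would be evaluated on the c verification samples after each update and appended to `accuracy_history` before calling `should_stop`.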
(6) detecting domain names with the trained character-level convolutional neural network Char-CNN':
(6a) letting the number of domain names to be detected be t, and encoding each domain name to be detected to obtain t domain name vectors to be detected, where $t \geq 1$;
(6b) feeding the t domain name vectors to be detected into the trained character-level convolutional neural network Char-CNN' to obtain the probability set $\{\tilde{p}_1, \ldots, \tilde{p}_u, \ldots, \tilde{p}_t\}$, and checking whether $\tilde{p}_u > 0.5$ holds; if so, the u-th domain name to be detected is a DGA domain name; otherwise it is a non-DGA domain name, where $\tilde{p}_u$ is the probability that the class of the u-th domain name to be detected is 1, and $1 \leq u \leq t$.
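The decision rule of step (6b) amounts to thresholding each predicted probability. In the sketch below the threshold of 0.5 is an assumption, chosen to match the tp and tn definitions used for the verification accuracy; the function name and the sample probabilities are hypothetical.

```python
def classify_domains(domains, probabilities, threshold=0.5):
    """Label a domain as DGA when its predicted class-1 probability exceeds the threshold."""
    return [
        (domain, "DGA" if p > threshold else "non-DGA")
        for domain, p in zip(domains, probabilities)
    ]


# Hypothetical model outputs for three domain names.
results = classify_domains(
    ["example", "qzxkvbnrtw", "weatherupdate"],
    [0.03, 0.97, 0.50],
)
print(results)
```

A probability exactly equal to the threshold is treated as non-DGA, mirroring the "not greater than 0.5" wording used for true negatives.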
The domain name coding involved in the above steps proceeds as follows: first, a mapping from characters to numbers is established according to the valid character set of domain names; then the characters of the domain name are traversed in order and converted one by one into the corresponding numbers; finally, the result is padded with 0 so that all domain name vectors have the same length. Domain name decoding proceeds as follows: first, a mapping from numbers to characters is established according to the valid character set of domain names; then the numbers in the vector are traversed in order and the non-zero numbers are converted one by one into the corresponding characters, which yields the domain name.
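The coding and decoding procedure can be sketched as follows. The character set is an assumption: lowercase letters, digits and the hyphen give 37 symbols, and reserving index 0 for padding yields 38 indices, which would agree with the embedding input dimension of 38 and the sequence length of 63 stated in claim 3. The names `encode_domain` and `decode_domain` are hypothetical, and the sketch handles a single domain label without dots.

```python
import string

# Assumed valid character set; index 0 is reserved for padding.
VALID_CHARS = string.ascii_lowercase + string.digits + "-"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(VALID_CHARS)}
INDEX_TO_CHAR = {i: ch for ch, i in CHAR_TO_INDEX.items()}
MAX_LENGTH = 63  # assumed fixed vector length


def encode_domain(domain, max_length=MAX_LENGTH):
    """Map each character to its number, then pad with 0 up to max_length."""
    indices = [CHAR_TO_INDEX[ch] for ch in domain.lower()]  # KeyError on invalid characters
    if len(indices) > max_length:
        raise ValueError("domain name longer than max_length")
    return indices + [0] * (max_length - len(indices))


def decode_domain(vector):
    """Map every non-zero number back to its character and join the result."""
    return "".join(INDEX_TO_CHAR[i] for i in vector if i != 0)
```

For example, `decode_domain(encode_domain("example"))` returns the original string, since padding zeros are skipped during decoding.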
The technical effects of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions and contents:
During the simulation experiments, the training sample set A consists of the first 600000 popular domain names sequentially selected from the popular domain name set Alexa; the training sample set B consists of 80000 benign domain names randomly selected from the benign domain name set TRANCO, 80000 DGA domain names randomly selected from the DGA domain name set DGArchive, and the labels corresponding to these domain names; the verification sample set consists of 20000 benign domain names randomly selected from the benign domain name set TRANCO, 20000 DGA domain names randomly selected from the DGA domain name set DGArchive, and the labels corresponding to these domain names; the number of training iterations is 2000; the domain names to be detected comprise 1000 low-randomness DGA domain names and 1000 high-randomness DGA domain names. The hardware platform is an Intel Core i7-7700K @4.50GHz CPU, 8GB of RAM and an NVIDIA GeForce GTX2080 GPU, and the operating system is Ubuntu 16.04 LTS; the simulation software platform is Python 3.6.5, TensorFlow 1.3 and Keras 2.2.1.
Simulation 1: the detection recall on low-randomness DGA domain names of the present invention is compared with that of the existing integrated DGA domain name detection method based on deep learning; the results are shown in Table 1;
Simulation 2: the training time of the detection model of the present invention is compared with that of the existing integrated DGA domain name detection method based on deep learning; the results are shown in Table 2;
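The sample counts above (80000 training and 20000 verification domain names per class) are consistent with M = N = 100000 and α = 0.8 in step (1b); these values are inferred from the counts and are not stated explicitly. A minimal sketch of such a split, with the hypothetical helper name `split_samples`:

```python
import random


def split_samples(samples, alpha, seed=0):
    """Shuffle the samples and split them into (training, verification) parts with ratio alpha."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(alpha * len(shuffled))
    return shuffled[:cut], shuffled[cut:]
```

Applying the helper once to the benign domain names and once to the DGA domain names, then merging the two training parts with their labels, would give a training sample set B of 160000 labeled domain names.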
2. Simulation result analysis:
TABLE 1
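[Table 1 appears as an image in the original publication: detection recall on low-randomness and high-randomness DGA domain names, prior art versus the present invention]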
TABLE 2
Method | Training time of the detection model
Prior art | 724 min
Present invention | 482 min
As can be seen from Table 1, compared with the existing integrated DGA domain name detection method based on deep learning, the DGA domain name detection method based on GAN and Char-CNN provided by the invention improves the detection recall on low-randomness DGA domain names by 28.3% while maintaining the detection recall on conventional high-randomness DGA domain names. This shows that the method extracts features well, enriches the training sample set and reduces the error rate of the detection model, thereby improving the detection recall on low-randomness DGA domain names, which is of practical significance.
As can be seen from Table 2, compared with the existing integrated DGA domain name detection method based on deep learning, the DGA domain name detection method based on GAN and Char-CNN provided by the invention shortens the training time of the detection model by 242 minutes (from 724 min to 482 min). This shows that the method has fewer hyper-parameters to compute and that the residual blocks in Char-CNN are simple in structure and learn quickly, which shortens the training time of the detection model.
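For reference, the two training times in Table 2 give the following absolute and relative reductions; the percentage is derived here from the reported times and is not stated in the source.

```latex
724\,\text{min} - 482\,\text{min} = 242\,\text{min}, \qquad \frac{242}{724} \approx 33.4\%
```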
The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way, and it will be apparent to those skilled in the art that various changes and modifications in form and detail may be made therein without departing from the principles and arrangements of the invention, but such changes and modifications are within the scope of the invention as defined by the appended claims.

Claims (7)

1. A DGA domain name detection method based on GAN and Char-CNN is characterized by comprising the following steps:
(1) acquiring a training sample set and a verification sample set:
(1a) sequentially selecting the first L popular domain names from the popular domain name set Alexa to form a training sample set A, wherein L ≥ 600000;
(1b) randomly selecting M benign domain names of category 0 from the benign domain name set TRANCO and labeling the category of each benign domain name, and randomly selecting N DGA domain names of category 1 from the DGA domain name set DGArchive and labeling the category of each DGA domain name; then combining αM benign domain names, αN DGA domain names and the labels corresponding to these domain names into a training sample set B, and combining the remaining M − αM benign domain names, the remaining N − αN DGA domain names and the labels corresponding to these domain names into a verification sample set, wherein M ≥ 100000, N ≥ 100000, and 0.6 ≤ α ≤ 0.8;
(2) constructing a generative adversarial network GAN and a character-level convolutional neural network Char-CNN:
constructing a generative adversarial network GAN comprising a generator network and a discriminator network, wherein the generator network comprises a fully connected layer, a plurality of residual blocks, a one-dimensional convolutional layer and an activation layer; the discriminator network comprises a one-dimensional convolutional layer, a plurality of residual blocks and a fully connected layer;
constructing a character-level convolutional neural network Char-CNN comprising an embedding layer, a plurality of one-dimensional convolutional layers, a plurality of activation layers, a plurality of one-dimensional max-pooling layers, a plurality of residual blocks, a Dropout layer and a plurality of fully connected layers;
(3) performing iterative training on the generative adversarial network GAN:
(3a) letting the number of iterations be $q_1$ and the maximum number of iterations be $Q_1$, wherein $Q_1 \ge 2000$, and letting $q_1 = 0$;
(3b) taking the random noise $\mathrm{noise}_1$ as the input of the generator network and calculating to obtain m adversarial domain name vectors, and at the same time coding m popular domain names randomly selected from the training sample set A to obtain m popular domain name vectors, wherein 64 ≤ m ≤ L;
(3c) taking the m adversarial domain name vectors and the m popular domain name vectors as the input of the discriminator network and predicting to obtain a probability set $\{\hat{d}_1, \ldots, \hat{d}_i, \ldots, \hat{d}_m, d_1, \ldots, d_j, \ldots, d_m\}$, wherein $\hat{d}_i$ is the probability that the i-th adversarial domain name vector originates from the training sample set A, $d_j$ is the probability that the j-th popular domain name vector originates from the training sample set A, $1 \le i \le m$, and $1 \le j \le m$;
(3d) calculating, according to the probability set obtained in step (3c), the loss $loss_g$ of the generator network and the loss $loss_d$ of the discriminator network;
(3e) training the generative adversarial network GAN using the Adam algorithm with $loss_g$ and $loss_d$, and judging whether $q_1 = Q_1$ holds; if so, obtaining the trained generative adversarial network GAN'; otherwise, letting $q_1 = q_1 + 1$ and returning to step (3b);
(4) obtaining an augmented training set:
(4a) taking the random noise $\mathrm{noise}_2$ as the input of the trained generative adversarial network GAN' and calculating to obtain P adversarial domain name vectors, and decoding each adversarial domain name vector to obtain P adversarial domain names of category 1, wherein 20000 ≤ P ≤ L;
(4b) labeling the category of each adversarial domain name, and adding the P adversarial domain names and the label of each adversarial domain name to the training sample set B to obtain an augmented training set;
(5) performing iterative training on a character-level convolutional neural network Char-CNN:
(5a) letting the number of iterations be $q_2$ and the maximum number of iterations be $Q_2$, wherein $Q_2 \ge 1000$, and letting $q_2 = 0$;
(5b) coding n domain names randomly selected from the augmented training set to obtain n domain name vectors, and taking the n domain name vectors as the input of the character-level convolutional neural network Char-CNN and predicting to obtain a probability set $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$, wherein $p_k$ is the probability that the category of the k-th domain name is 1, $1 \le k \le n$, and $32 \le n \le (\alpha M + \alpha N + P)$;
(5c) calculating the loss of the character-level convolutional neural network Char-CNN according to $\{p_1, p_2, \ldots, p_k, \ldots, p_n\}$;
(5d) training the character-level convolutional neural network Char-CNN using the RMSprop algorithm with the loss value loss to obtain the Char-CNN model trained in this iteration, $\text{Char-CNN}_{q_2}$;
(5e) coding c verification domain names randomly selected from the verification sample set to obtain c verification domain name vectors, and taking the c verification domain name vectors as the input of $\text{Char-CNN}_{q_2}$ and predicting to obtain a probability set $\{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_v, \ldots, \hat{p}_c\}$, wherein $\hat{p}_v$ is the probability that the category of the v-th verification domain name is 1, $1 \le v \le c$, and $32 \le c \le (M - \alpha M + N - \alpha N)$;
(5f) calculating, according to the probability set obtained in step (5e), the detection accuracy Accuracy of the c verification samples;
(5g) judging whether $q_2 = Q_2$ holds or whether Accuracy no longer increases; if so, obtaining the trained character-level convolutional neural network Char-CNN'; otherwise, letting $q_2 = q_2 + 1$ and returning to step (5b);
(6) detecting the domain name based on the trained character-level convolutional neural network Char-CNN':
(6a) setting the number of the domain names to be detected as t, and coding each domain name to be detected to obtain t domain name vectors to be detected, wherein t is more than or equal to 1;
(6b) taking the t domain name vectors to be detected as the input of the trained character-level convolutional neural network Char-CNN' and predicting to obtain a probability set $\{\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_u, \ldots, \tilde{p}_t\}$, and judging whether $\tilde{p}_u > 0.5$ holds; if so, the u-th domain name to be detected is a DGA domain name; otherwise, the u-th domain name to be detected is a non-DGA domain name, wherein $\tilde{p}_u$ is the probability that the category of the u-th domain name to be detected is 1, and $1 \le u \le t$.
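Step (4) turns generator outputs into labeled domain names. A minimal sketch of one way to do this is given below, assuming that the generator emits, for each of the 63 positions, a Softmax distribution over 38 symbols, that the most probable symbol is taken at each position (argmax decoding is an assumption, not stated in the claims), and that index 0 is padding. The character set and the names `decode_generator_output` and `augment` are hypothetical.

```python
import string

# Assumed character set; index 0 is padding, indices 1..37 map to these characters.
VALID_CHARS = string.ascii_lowercase + string.digits + "-"


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def decode_generator_output(position_probs):
    """position_probs: one probability list (over 38 symbols) per character position."""
    indices = [argmax(probs) for probs in position_probs]
    # Non-zero numbers are converted to characters; padding zeros are dropped.
    return "".join(VALID_CHARS[i - 1] for i in indices if i != 0)


def augment(training_set_b, generator_outputs):
    """Append each decoded adversarial domain name with category 1 to training set B."""
    adversarial = [(decode_generator_output(out), 1) for out in generator_outputs]
    return training_set_b + adversarial
```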
2. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the generator network, the discriminator network and the character-level convolutional neural network Char-CNN in step (2) comprise residual blocks, each residual block comprising 2 activation layers and 2 one-dimensional convolutional layers arranged as: first activation layer → first one-dimensional convolutional layer → second activation layer → second one-dimensional convolutional layer, wherein the activation function of the activation layers is ReLU; the output spatial dimension of the one-dimensional convolutional layers is 128, the convolution kernel size is 5, and the stride of the convolution kernel is 1 character; the input x of the first activation layer and the output f(x) of the second one-dimensional convolutional layer are added through a skip connection, and the target function finally learned by the residual block is h(x), with h(x) = η·f(x) + x, wherein η is a weight coefficient and 0 ≤ η ≤ 1.
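The residual block of claim 2 can be illustrated with a small dependency-free sketch that computes h(x) = η·f(x) + x, where f is ReLU → one-dimensional convolution → ReLU → one-dimensional convolution. Zero ("same") padding is assumed so that f(x) and x have the same length and can be added; the claim does not state the padding mode. The function names and the nested-list tensor layout are hypothetical, and a real model would use a deep learning framework instead.

```python
def relu(sequence):
    """Element-wise ReLU on a [T][C] nested list."""
    return [[max(0.0, v) for v in step] for step in sequence]


def conv1d_same(sequence, kernel, bias):
    """1-D convolution with stride 1 and zero padding (output length equals input length).

    sequence: [T][C_in], kernel: [K][C_in][C_out], bias: [C_out].
    """
    length, k = len(sequence), len(kernel)
    c_in, c_out = len(kernel[0]), len(bias)
    left = (k - 1) // 2  # zero padding on the left side
    output = []
    for t in range(length):
        step = list(bias)
        for j in range(k):
            s = t + j - left
            if 0 <= s < length:  # positions outside the sequence contribute zero
                for a in range(c_in):
                    for b in range(c_out):
                        step[b] += sequence[s][a] * kernel[j][a][b]
        output.append(step)
    return output


def residual_block(x, kernel1, bias1, kernel2, bias2, eta=0.3):
    """h(x) = eta * f(x) + x, with f = conv(relu(conv(relu(x))))."""
    f = conv1d_same(relu(x), kernel1, bias1)
    f = conv1d_same(relu(f), kernel2, bias2)
    return [[xi + eta * fi for xi, fi in zip(xs, fs)] for xs, fs in zip(x, f)]
```

The default `eta=0.3` mirrors the weight coefficient given for all residual blocks in claim 3; with all-zero kernels and biases the block reduces to the identity mapping.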
3. The DGA domain name detection method based on GAN and Char-CNN as claimed in claim 1, wherein the generative adversarial network GAN and the character-level convolutional neural network Char-CNN in step (2) have the following specific structures:
in the generative adversarial network GAN, the generator network and the discriminator network each include 5 residual blocks, wherein:
the specific structure of the generator network is: fully connected layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → one-dimensional convolutional layer → activation layer, wherein the fully connected layer has an input spatial dimension of 128 and an output spatial dimension of 128 × 63; the weight coefficients of all the residual blocks are 0.3; the output spatial dimension of the one-dimensional convolutional layer is 38, the convolution kernel size is 1, and the stride of the convolution kernel is 1 character; the activation function of the activation layer is Softmax;
the specific structure of the discriminator network is: one-dimensional convolutional layer → first residual block → second residual block → third residual block → fourth residual block → fifth residual block → fully connected layer, wherein the input spatial dimension of the one-dimensional convolutional layer is 38, the output spatial dimension is 128, the convolution kernel size is 1, and the stride of the convolution kernel is 1 character; the weight coefficients of all the residual blocks are 0.3; the output spatial dimension of the fully connected layer is 1;
the character-level convolutional neural network Char-CNN contains 2 one-dimensional convolutional layers, 4 activation layers, 2 one-dimensional max-pooling layers, 3 residual blocks and 2 fully connected layers, and its specific structure is: embedding layer → first one-dimensional convolutional layer → first activation layer → first one-dimensional max-pooling layer → second one-dimensional convolutional layer → second activation layer → second one-dimensional max-pooling layer → first fully connected layer → first residual block → second residual block → third activation layer → Dropout layer → second fully connected layer → fourth activation layer, wherein the embedding layer has an input spatial dimension of 38, an output spatial dimension of 128 and a sequence length of 63; all the one-dimensional convolutional layers have an output spatial dimension of 128 and a convolution kernel stride of 1 character, the convolution kernel size of the first one-dimensional convolutional layer is 3, and the convolution kernel size of the second one-dimensional convolutional layer is 2; the activation functions of the first, second and third activation layers are all ThresholdReLU, and the activation function of the fourth activation layer is Sigmoid; all the one-dimensional max-pooling layers use "same" padding with a pooling window size of 2; the weight coefficients of all the residual blocks are 0.3; the drop rate of the Dropout layer is 0.5; the output spatial dimension of the first fully connected layer is 64, and the output spatial dimension of the second fully connected layer is 1.
4. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the loss $loss_g$ of the generator network and the loss $loss_d$ of the discriminator network in step (3d) are calculated by the following formulas, respectively:
Figure FDA0002355922610000041
Figure FDA0002355922610000051
5. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the loss of the character-level convolutional neural network Char-CNN in step (5c) is calculated by the following formula:
Figure FDA0002355922610000052
wherein $y_k$ is the true category of the k-th domain name.
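The loss formula of claim 5 is an image that is not reproduced in this text. As an illustration only, the sketch below assumes the binary cross-entropy that is conventionally paired with a Sigmoid output and labels $y_k \in \{0, 1\}$; the formula actually claimed may differ. The function name `binary_cross_entropy` and the clipping constant `eps` are hypothetical.

```python
import math


def binary_cross_entropy(true_labels, probabilities, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples (assumed loss form)."""
    total = 0.0
    for y, p in zip(true_labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(true_labels)
```

Under this assumption, confident correct predictions give a smaller loss than hesitant ones, which is the behavior the training step (5d) relies on.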
6. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the detection Accuracy of the c verification samples in step (5f) is calculated by the following formula:
$$\mathrm{Accuracy} = \frac{tp + tn}{c}$$
wherein tp is the number of samples of which the real category is 1 and the probability of predicting the category to be 1 is greater than 0.5 in the c verification samples; tn is the number of samples in which the true class is 0 and the probability of predicting class to be 1 is not more than 0.5 in the verification samples.
7. The GAN and Char-CNN based DGA domain name detection method of claim 1, wherein the random noise $\mathrm{noise}_1$ in step (3b) and the random noise $\mathrm{noise}_2$ in step (4a) are both generated using the random_normal function contained in the third-party library NumPy of the Python language.
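Generating the Gaussian noise fed to the generator can be sketched as follows. To keep the example free of dependencies it uses the Python standard library `random.gauss` in place of NumPy; with NumPy, the equivalent sampling call would be `numpy.random.normal(size=(m, 128))`. The zero mean, unit standard deviation and the helper name `make_noise` are assumptions, and the noise dimension of 128 is taken from the input dimension of the generator's fully connected layer in claim 3.

```python
import random


def make_noise(batch_size, noise_dim=128, seed=None):
    """Return batch_size noise vectors of noise_dim samples drawn from N(0, 1)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(noise_dim)] for _ in range(batch_size)]
```

Passing a fixed `seed` makes the noise reproducible, which can help when comparing training runs.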
CN202010007697.0A 2020-01-05 2020-01-05 DGA domain name detection method based on GAN and Char-CNN Active CN111209497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010007697.0A CN111209497B (en) 2020-01-05 2020-01-05 DGA domain name detection method based on GAN and Char-CNN

Publications (2)

Publication Number Publication Date
CN111209497A CN111209497A (en) 2020-05-29
CN111209497B true CN111209497B (en) 2022-03-04

Family

ID=70788417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010007697.0A Active CN111209497B (en) 2020-01-05 2020-01-05 DGA domain name detection method based on GAN and Char-CNN

Country Status (1)

Country Link
CN (1) CN111209497B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109391602A (en) * 2017-08-11 2019-02-26 北京金睛云华科技有限公司 A kind of zombie host detection method
CN110113327A (en) * 2019-04-26 2019-08-09 北京奇安信科技有限公司 A kind of method and device detecting DGA domain name

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1471583B1 (en) * 2002-01-28 2009-10-07 Nichia Corporation Nitride semiconductor device having support substrate and its manufacturing method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
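MaskDGA: a black-box evasion technique against DGA classifiers and adversarial defenses; Lior Sidi et al.; arXiv preprint arXiv; 2019-12-31; full text *
Malicious domain name training data generation based on generative adversarial networks; Yuan Chen et al.; Application Research of Computers; 2019-12-31; full text *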

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant