CN102411687B

CN102411687B - Deep learning detection method of unknown malicious codes

Info

Publication number: CN102411687B
Application number: CN201110373558.0A
Authority: CN
Inventors: 李元诚; 樊庆君
Original assignee: North China Electric Power University
Current assignee: North China Electric Power University
Priority date: 2011-11-22
Filing date: 2011-11-22
Publication date: 2014-04-23
Anticipated expiration: 2031-11-22
Also published as: CN102411687A

Abstract

The invention discloses a deep learning detection method for unknown malicious code, belonging to the technical field of information security. The method comprises the following steps: first, extracting feature vectors of the files in a training set using byte-level n-grams; second, constructing an HTM (Hierarchical Temporal Memory) network structure and determining the input data length of each bottom-layer node of the HTM structure; third, performing sequence-pattern training and classification inference with the HTM algorithm, using the feature vectors as input; fourth, extracting feature vectors of the files in a test set using byte-level n-grams; fifth, feeding these feature vectors into the trained HTM network for sequence recognition, so as to determine whether the files in the test set contain malicious code. The invention offers relatively strong noise resistance, fault tolerance, and adaptability. It also improves the recognition capability and recognition rate of malicious code detection and enables accurate detection of newly emerging malicious code.

Description

Deep learning detection method for unknown malicious code

Technical field

The invention belongs to the field of information security technology, and in particular relates to a deep learning detection method for unknown malicious code.

Background art

With the development of computer and network technology, computers have become indispensable tools in daily life. To obtain economic or political benefit, or to take personal revenge, large numbers of organizations and individuals use malicious code to carry out illegal activities. As a result, new malicious code emerges continually, the techniques it employs grow more advanced, and its ability to propagate, cause harm, and hide keeps increasing. Although malicious code detection techniques are also developing, detection techniques and capabilities still lag behind the development of malicious code; the detection of unknown malicious code in particular poses a major challenge to current detection techniques.

At present there are two main techniques for detecting malicious code: pattern matching based on signatures, and detection based on malicious code behavior rules.

Signature-based pattern matching compares the signature of the file under inspection with the malicious code signature strings in a signature database. A successful match means the file contains malicious code; otherwise the file is considered free of malicious code. This technique requires analysts to discover and obtain a malicious code sample as early as possible and to extract a signature that uniquely identifies it. The signature must then be added to the signature database promptly, so that the malicious code can be detected before it spreads widely and breaks out. The technique is ill-suited to malicious code that uses polymorphic or metamorphic techniques, and to malicious code that spreads rapidly, is highly destructive, and is highly targeted.

Detection based on behavior rules detects malicious code according to common behavior rules predefined by experts. Its main principle is that running malicious code is usually accompanied by behaviors such as changes to user privileges, registry modifications, opening of network ports, and abnormal network communication, or by a particular sequence of system operations. This technique suffers from a serious lag: particularly as computers run ever faster, by the time the malicious behavior is detected it has often already caused irreparable damage to the system.

Both techniques are after-the-fact detection: they can only detect known malicious code, or can detect malicious code only after it has been executed, by which time it has already caused damage.

Summary of the invention

To address the above defects, the present invention discloses a deep learning detection method for unknown malicious code, comprising the following steps:

1) extract the feature vectors of the files in the training set using byte-level n-grams;

2) build a multi-layer HTM network model, and determine the input data length of each bottom-layer node of the HTM structure by segmenting the file feature vector;

3) use the segmented file feature vectors as the input of the HTM network; after all bottom-layer nodes have finished learning, their inference-stage outputs are concatenated layer by layer, completing the training of the nodes in every layer of the HTM network;

4) extract the feature vectors of the files in the test set using byte-level n-grams;

5) feed the feature vectors into the trained HTM network for sequence recognition, to determine whether the files in the test set contain malicious code.

Step 2) specifically comprises the following steps:

21) select an HTM network model with F layers, and specify that every node outside the bottom layer has M child nodes;

22) segment the file feature vector using the formula $l = L / M^{F-1}$, and use the resulting segments in turn as the input samples of the bottom-layer nodes of the HTM network, where F is the number of layers of the HTM network structure, M is the number of child nodes of each node outside the bottom layer, L is the length of the file feature vector extracted by the byte-level n-gram method, and l is the input sample length of each bottom-layer node of the HTM network.

Step 3) specifically comprises the following steps:

31) use the segmented file feature vectors as the input of the HTM network; the bottom-layer nodes enter the learning stage, which lasts until the spatial pools of all bottom-layer nodes have finished learning the spatial patterns and the temporal pools have finished learning the temporal groups;

32) once the learning stage of step 31) is finished, the bottom-layer nodes enter the inference stage: each new input sample is passed through inference at the bottom-layer node and output to its parent node in the next layer up of the HTM network; the outputs of the lower-layer nodes that share a parent node are concatenated and become the input of the learning stage of the next layer's node, which enters the learning stage and repeats the node learning process of step 31);

33) repeat the process of step 32) until the nodes in all layers of the HTM network have completed the learning of sequence patterns.

Step 31) specifically comprises the following steps:

311) the binary sequences fed to a node are passed to its spatial pool, which learns clusters of these sequences using a maximum-distance parameter D; using the maximum distance D, the spatial pool stores a subset of the input patterns, called cluster centers; as time goes on, the number of new sequence patterns the spatial pool produces per unit time decreases, and when the number of new cluster centers in a time period falls below a set threshold, the clustering process stops;

312) the sequence patterns learned by the spatial pool are output to the temporal pool, which groups them according to their temporal adjacency; once all sequence patterns have been grouped, the grouping computation ends.

Step 32) specifically comprises the following steps:

321) use the formula

$$P(e^- \mid c_i) = \gamma \prod_{k=1}^{M} \mathrm{input}(m_k^i)$$

to compute the probability distribution of the input sequence $e^-$ over the spatial patterns $c_i$ of the node's spatial pool, and take the normalized result as the output of the spatial pool, where $m_k^i$ denotes a non-zero spatial pattern, M is the number of child nodes of this node, and $e^-$ is the input sequence to be recognized, coming from the bottom layer;

322) based on the output y of the spatial pool, use the formula

$$\lambda(i) = \sum_{j=1}^{N_c} P(c_j \mid g_i)\, y(j)$$

to compute the output of the temporal pool, where $N_c$ is the length of the vector y, equal to the number of spatial patterns in the spatial pool, and λ has length $N_g$.

The beneficial effects of the present invention are as follows. It introduces the HTM algorithm, a novel artificial-intelligence deep learning algorithm that imitates the structure and working principles of the human neocortex. The algorithm adopts a hierarchical tree network structure and applies the principles of continuous information sharing between nodes and belief propagation from Bayesian networks, converting complex problems into pattern matching and prediction. Moreover, the input data need no complicated preprocessing, and the method has strong noise resistance, fault tolerance, and adaptability. At the same time, new input patterns can be learned while inference is performed with the existing model, which improves the recognition capability and recognition rate of malicious code detection and achieves the goal of accurately detecting newly emerging malicious code.

Brief description of the drawings

Fig. 1 is a schematic diagram of the process of the detection method for unknown malicious code;

Fig. 2a is a schematic diagram of the HTM hierarchical tree structure model;

Fig. 2b is a schematic diagram of the spatial pool and temporal pool of node k;

Fig. 3 is a schematic diagram of the learning process of one node during HTM training;

Fig. 4 is a schematic diagram of the details of the belief propagation computation at node k during HTM inference.

Embodiment

The preferred embodiment is described in detail below with reference to the accompanying drawings. It should be emphasized that the following description is merely exemplary and is not intended to limit the scope of the invention or its application.

The idea behind the invention is as follows: take a set of files containing malicious code as training samples, and perform feature selection on the training-set files using byte-level n-grams, so that each file corresponds to one feature vector; the feature vectors are used as the input of the HTM algorithm to train the HTM network. Finally, feature selection is performed on an unknown file to produce its feature vector, which is fed into the trained HTM network for pattern recognition, thereby determining whether the file contains malicious code.

As shown in Fig. 1, the intelligent detection method for unknown malicious code comprises the following steps:

1) Extract the feature vectors of the files in the training set using byte-level n-grams.

Standard data sets intended specifically for malicious code detection can be downloaded from the Internet. Files are selected from such a standard data set according to specific rules to construct the training set, for example by malicious code type.

A byte-level n-gram uses a sliding window of n bytes to extract words from a binary byte stream or a text; each word is n bytes long. For example, if the content of a text is "abcdef", its 2-gram sequence is: ab, bc, cd, de, ef, and its 3-gram sequence is: abc, bcd, cde, def.
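The sliding-window extraction described above can be sketched in Python as follows. This is only an illustration, not the patent's implementation; the function name `byte_ngrams` is hypothetical.

```python
def byte_ngrams(data: bytes, n: int) -> list[bytes]:
    """Slide an n-byte window over data, advancing one byte at a time."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A file shorter than n bytes yields no n-grams (empty list).
    return [data[i:i + n] for i in range(len(data) - n + 1)]


two_grams = byte_ngrams(b"abcdef", 2)
three_grams = byte_ngrams(b"abcdef", 3)
print([g.decode() for g in two_grams])    # ['ab', 'bc', 'cd', 'de', 'ef']
print([g.decode() for g in three_grams])  # ['abc', 'bcd', 'cde', 'def']
```

Working on `bytes` rather than `str` keeps the same function usable for binary files as well as text.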

Taking a file with content "abcd" as an example, its 2-gram sequence is: ab, bc, cd. The file is thus said to have three attributes, and the vector formed by these three attributes can represent the file: {ab, bc, cd}.

Quantizing each attribute yields the feature vector of the file. Taking the above vector {ab, bc, cd} as an example, let each letter take its position in the alphabet as its value (a is 1, b is 2, c is 3, d is 4) and quantize each attribute as the sum of the positions of its letters. The quantized result of ab is then 3, that of bc is 5, and that of cd is 7, so the vector {3, 5, 7} is the feature vector of the file.
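The quantization example can be reproduced with a short Python sketch. It assumes lowercase ASCII letters only, as in the example; the function names are hypothetical, and the text does not specify how arbitrary binary n-grams would be quantized.

```python
def quantize_letter_gram(gram: str) -> int:
    """Sum the 1-based alphabet positions of the letters in gram (a=1, b=2, ...)."""
    return sum(ord(ch) - ord("a") + 1 for ch in gram)


def feature_vector(text: str, n: int) -> list[int]:
    """Extract the n-grams of text and quantize each one by the position-sum rule."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return [quantize_letter_gram(g) for g in grams]


print(feature_vector("abcd", 2))  # [3, 5, 7]
```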

2) Build a multi-layer HTM network model, and determine the input data length of each bottom-layer node of the HTM structure by segmenting the file feature vector. Fig. 2a shows a three-layer tree structure model in which every node except the bottom-layer nodes has two child nodes. Fig. 2b shows the internal structure of a single node k in the HTM structure, which consists of a spatial pool and a temporal pool. Step 2) specifically comprises the following steps:

21) Select an HTM network structure with F layers, and specify that every node outside the bottom layer has M child nodes.

As shown in Fig. 2a, with F = 3 and M = 2, layers L3, L2, and L1 of this HTM network have 1, 2, and 4 nodes respectively.

Suppose the file feature vector is {1, 2, 3, 4, 5, 6, 7, 8}, whose length L is 8. Then $l = L / M^{F-1} = 8 / 2^{2} = 2$, and the input samples of the bottom-layer nodes of the HTM network are {1, 2}, {3, 4}, {5, 6}, and {7, 8} respectively.
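The segmentation rule $l = L / M^{F-1}$ and the node counts per layer can be checked with the sketch below. The function names are hypothetical, and rejecting a length L that is not divisible by the number of bottom-layer nodes is an assumption; the text does not say how such a case is handled.

```python
def nodes_per_layer(M: int, F: int) -> list[int]:
    """Node counts from the top layer down to the bottom layer of an F-layer M-ary tree."""
    return [M ** depth for depth in range(F)]


def bottom_input_length(L: int, M: int, F: int) -> int:
    """Input length l = L / M^(F-1) for each bottom-layer node."""
    bottom_nodes = M ** (F - 1)
    if L % bottom_nodes != 0:
        # Assumption: the feature vector must split evenly across the bottom nodes.
        raise ValueError("L must be divisible by M ** (F - 1)")
    return L // bottom_nodes


def split_for_bottom_nodes(vec: list[int], M: int, F: int) -> list[list[int]]:
    """Cut the feature vector into consecutive segments, one per bottom-layer node."""
    l = bottom_input_length(len(vec), M, F)
    return [vec[i:i + l] for i in range(0, len(vec), l)]


print(nodes_per_layer(2, 3))                                   # [1, 2, 4]
print(split_for_bottom_nodes([1, 2, 3, 4, 5, 6, 7, 8], 2, 3))  # [[1, 2], [3, 4], [5, 6], [7, 8]]
```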

3) Use the segmented file feature vectors as the input of the HTM network; after all bottom-layer nodes have finished learning, their inference-stage outputs are concatenated layer by layer, completing the training of the nodes in every layer of the HTM network.

Training in this stage proceeds layer by layer: after the bottom layer has finished learning, its nodes enter the inference stage when new input arrives, and the inference output serves as the input of the learning stage of the next layer's nodes. The same holds within a single node: the temporal pool starts temporal grouping only after the node's spatial pool has finished learning the sequence patterns.

Step 3) specifically comprises the following steps:

31) As shown in Fig. 3, use the segmented file feature vectors as the input of the HTM network; the bottom-layer nodes enter the learning stage, which lasts until the spatial pools of all bottom-layer nodes have finished learning the spatial patterns and the temporal pools have finished learning the temporal groups.

Specifically, step 31) in turn comprises the following steps:

311) The binary sequence patterns of the segmented file feature vectors are fed into the spatial pools of the bottom-layer nodes, and each spatial pool learns clusters of these sequences using the maximum distance D. Using the maximum distance D, the spatial pool stores a subset of the input binary sequence patterns, called cluster centers. As time goes on, the number of new sequence patterns the spatial pool produces per unit time decreases; when the number of new cluster centers within one time period T falls below a set threshold, the clustering process stops. The time period T may be any non-zero value, and the threshold is a non-zero integer. To improve learning efficiency, T and the threshold are generally given small values (for example, T = 5 s and a threshold of 1).

D is the minimum Euclidean distance at which a binary sequence pattern is considered distinct from the existing cluster centers. For each input binary sequence pattern, the spatial pool checks whether a cluster center already exists within Euclidean distance D. There are two cases: if one exists, nothing changes; if none exists, the new binary sequence pattern is added to the list of cluster centers.

The Euclidean distance is computed as follows: for $x, y \in \mathbb{R}^n$, the Euclidean distance between x and y is:

$$\left( \sum_{i=1}^{n} (x_i - y_i)^2 \right)^{1/2}$$
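A minimal Python sketch of this distance-based clustering in the spatial pool is given below. The function names are hypothetical, and one point is an assumption: the time period is counted in input samples rather than seconds, so that the stopping rule (fewer new cluster centers in a period than the threshold) can be shown without a clock.

```python
import math


def euclidean(x, y) -> float:
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def learn_cluster_centers(patterns, D: float, period: int = 5, threshold: int = 1):
    """Store a pattern as a new cluster center only if no center lies within distance D.

    Learning stops once a whole period passes with fewer than `threshold`
    new centers (assumption: a period is `period` consecutive samples).
    """
    centers = []
    new_in_period = 0
    for t, pattern in enumerate(patterns, start=1):
        if not any(euclidean(pattern, c) <= D for c in centers):
            centers.append(tuple(pattern))
            new_in_period += 1
        if t % period == 0:
            if new_in_period < threshold:
                break  # too few new centers in this period: clustering stops
            new_in_period = 0
    return centers


samples = [(0, 0), (0, 1), (5, 5), (5, 6), (0, 0),
           (0, 1), (5, 5), (5, 6), (0, 0), (5, 5)]
print(learn_cluster_centers(samples, D=1.5))  # [(0, 0), (5, 5)]
```

In this run the first period adds two centers, the second adds none, and learning stops with the two centers shown.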

312) The binary sequence patterns learned by the spatial pool are output to the temporal pool, which groups them according to their temporal adjacency; once all sequence patterns have been grouped, the grouping computation ends.

Step 312) specifically comprises the following steps:

3121) When the temporal pool receives input from the spatial pool, it builds a temporal adjacency matrix from the time correlation of the binary sequence patterns. Once the temporal adjacency matrix has been formed, it must be partitioned into groups. HTM uses a greedy algorithm to perform temporal grouping.

3122) Find the most-connected cluster point that is not yet included in any group. The most-connected cluster point is simply the one whose row in the temporal adjacency matrix has the largest sum.

3123) Select the $N_{top}$ most-connected neighbor cluster points of the most-connected cluster point from step 3122) ($N_{top}$ is a specified parameter); the temporal pool adds the selected cluster points to the current group.

3124) For each cluster point X newly added to the group, repeat step 3123). Once all of the $N_{top}$ closest neighbor cluster points of every such X have joined the group, the grouping process stops automatically. If the group count approaches a certain value (the maximum group count) and the grouping process has still not stopped automatically, the grouping process is terminated.

3125) The resulting set of cluster points is added to the temporal pool as a new group. Then return to step 3122) until all cluster points have been grouped.
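Steps 3121) to 3125) can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the adjacency counts are symmetric, the termination cap is interpreted as a maximum size per group (`max_group_size`), ties are broken by the smaller index, and all function and parameter names are hypothetical.

```python
def temporal_adjacency(sequence, n_centers):
    """3121) Count how often two cluster centers occur at consecutive time steps."""
    A = [[0] * n_centers for _ in range(n_centers)]
    for prev, cur in zip(sequence, sequence[1:]):
        if prev != cur:
            A[prev][cur] += 1
            A[cur][prev] += 1  # assumption: adjacency is treated as symmetric
    return A


def greedy_groups(A, n_top=1, max_group_size=10):
    """Greedily partition cluster points into temporal groups."""
    n = len(A)
    ungrouped = set(range(n))
    groups = []
    while ungrouped:
        # 3122) most-connected ungrouped point: largest row sum
        seed = max(sorted(ungrouped), key=lambda i: sum(A[i]))
        group, frontier = [seed], [seed]
        ungrouped.discard(seed)
        while frontier and len(group) < max_group_size:
            x = frontier.pop(0)
            # 3123) the n_top most-connected neighbors of x
            ranked = sorted(
                (j for j in range(n) if j != x and A[x][j] > 0),
                key=lambda j: (-A[x][j], j),
            )[:n_top]
            for j in ranked:
                if j in ungrouped and len(group) < max_group_size:
                    group.append(j)
                    ungrouped.discard(j)
                    frontier.append(j)  # 3124) repeat for each newly added point
        groups.append(sorted(group))  # 3125) store the finished group
    return groups


seq = [0, 1, 0, 1, 0, 1, 2, 3, 2, 3, 2, 3]
A = temporal_adjacency(seq, 4)
print(greedy_groups(A, n_top=1))  # [[0, 1], [2, 3]]
```

In the example, centers 0 and 1 alternate in time, as do 2 and 3, so the greedy pass recovers those two groups.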

32) After the learning stage of step 31) is finished, the bottom-layer nodes enter the inference stage. Each new input sample (a binary sequence pattern) is passed through inference at the bottom-layer node and output to its parent node in the next layer up of the HTM network (the bottom-layer node being the child node). The outputs of the child nodes that share a parent node are concatenated and become the input of the parent node's learning stage; the parent node then enters the learning stage and repeats the node learning process of step 31).
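The concatenation of child outputs into parent inputs can be illustrated with a small Python sketch. It assumes that the nodes of a layer are listed so that the M children of each parent are adjacent; the function name and the sample output values are hypothetical.

```python
def parent_inputs(child_outputs, M):
    """Concatenate the outputs of every M sibling nodes into one parent input."""
    if len(child_outputs) % M != 0:
        raise ValueError("number of child nodes must be a multiple of M")
    parents = []
    for start in range(0, len(child_outputs), M):
        merged = []
        for output in child_outputs[start:start + M]:
            merged.extend(output)  # siblings are joined in node order
        parents.append(merged)
    return parents


# Four bottom-layer nodes (M = 2), each emitting a two-element output vector.
layer1 = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [1.0, 0.0]]
layer2_in = parent_inputs(layer1, 2)
print(layer2_in)  # [[0.9, 0.1, 0.2, 0.8], [0.5, 0.5, 1.0, 0.0]]
```

Applying the same function to the outputs of the middle layer would produce the single input of the top node.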

Fig. 4 is a schematic diagram of the inference stage carried out at a node for an input binary sequence pattern. Step 32) specifically comprises the following steps:

321) Compute the probability distribution $P(e^- \mid c_i)$ of the input binary sequence pattern over the spatial patterns of the spatial pool, and take the normalized result as the output vector y of the spatial pool.

The spatial pattern generated in the spatial pool from the input binary sequence patterns during the learning stage is the i-th cluster center $c_i$. For a node in the inference stage, the probability distribution $P(e^- \mid c_i)$ of the bottom-layer input sequence $e^-$ with respect to the i-th cluster center is a variable that can be computed by the following formula:

$$P(e^- \mid c_i) = \gamma \prod_{k=1}^{M} \mathrm{input}(m_k^i) \qquad (1)$$

In formula (1), γ is a proportionality constant; the i-th cluster center $c_i$ is made up of the components $m_k^i$ ($k = 1, \dots, M$), where $m_k^i$ denotes a non-zero spatial pattern, M is the number of child nodes of this node, and $e^-$ is the input sequence to be recognized, coming from the bottom layer. $\mathrm{input}(m_k^i)$ denotes the input: if this node is a bottom-layer node, $\mathrm{input}(m_k^i)$ is the input binary sequence pattern of this node; if it is not a bottom-layer node, $\mathrm{input}(m_k^i)$ is the temporal pool output probability distribution from a child node of this node (for the computation of $P(e^- \mid g_i)$, see formula (4)).

The probability distribution for every cluster center $c_i$ can be computed as $P(e^- \mid c_i)$ and then normalized to y(i); hence y(i) is proportional to $P(e^- \mid c_i)$, written $y(i) \propto P(e^- \mid c_i)$. All the y(i) together form the output vector y of the node's spatial pool, written $y = [y(1), y(2), \dots, y(N_c)]$, where $N_c$ is the number of spatial patterns in the spatial pool. All the $P(e^- \mid c_i)$ together form $P(e^- \mid C) = [P(e^- \mid c_1), P(e^- \mid c_2), \dots, P(e^- \mid c_{N_c})]$; hence y is proportional to $P(e^- \mid C)$, written $y \propto P(e^- \mid C)$.

322) Based on the output vector y of the spatial pool, compute the output of the temporal pool.

The temporal pool performs inference using the belief propagation principle. As shown in Fig. 4, the output of the spatial pool is the vector y, whose length is $N_c$ (the number of spatial patterns in the spatial pool); the i-th element of the vector corresponds to the i-th cluster center $c_i$. Each cluster center is stored as a vector $c_i = [r_i^{m_1}, \dots, r_i^{m_M}]$ of length M, where r denotes the subgroup index of the cluster center. The i-th element of y is computed as:

$$y(i) = \alpha_1 \prod_{j=1}^{M} \lambda^{m_j}(r_i^{m_j}) \qquad (2)$$

In formula (2), $\alpha_1$ is an arbitrary scaling constant, usually set to a fixed value to avoid numerical underflow; M is the number of child nodes; $\lambda^{m_j}$ denotes the binary sequence pattern coming from child node $m_j$; and $r_i^{m_j}$ denotes the subgroup index of the i-th cluster center with respect to child node $m_j$.
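Formula (2) multiplies, for each cluster center, the entries that the child nodes' outputs assign to the subgroup indices stored in that center. A minimal Python sketch follows; the function name and sample values are hypothetical, and choosing $\alpha_1$ so that y sums to 1 is an assumption made here for readability.

```python
def spatial_pool_output(centers, child_messages):
    """Compute y(i) as the product over child nodes j of child_messages[j][centers[i][j]].

    centers[i][j]     : subgroup index of cluster center i with respect to child j
    child_messages[j] : output vector (lambda) received from child node j
    """
    y = []
    for center in centers:
        product = 1.0
        for j, subgroup in enumerate(center):
            product *= child_messages[j][subgroup]
        y.append(product)
    total = sum(y)
    # Assumption: the scaling constant is chosen so that y sums to 1.
    return [value / total for value in y] if total > 0 else y


centers = [(0, 0), (0, 1), (1, 0), (1, 1)]   # M = 2 child nodes
messages = [[0.8, 0.2], [0.25, 0.75]]        # one vector per child node
y = spatial_pool_output(centers, messages)
print(y)
```

With these values the second center, which pairs subgroup 0 of the first child with subgroup 1 of the second, receives the largest weight.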

According to formula (1) and the description of step 321), at node k, $y^k$ is proportional to $P(e^- \mid C^k)$, i.e. $y^k \propto P(e^- \mid C^k)$, where $y^k$ and $P(e^- \mid C^k)$ are the instances of y and $P(e^- \mid C)$ at node k.

The temporal pool computes its output from the output of the spatial pool. Its output is λ, of length $N_g$ (the number of temporal groups in the temporal pool), $\lambda = [\lambda(1), \lambda(2), \dots, \lambda(N_g)]$, whose i-th element is computed as follows:

$$\lambda(i) = \sum_{j=1}^{N_c} P(c_j \mid g_i)\, y(j) \qquad (3)$$

In formula (3), $P(c_j \mid g_i)$ denotes the conditional probability of spatial pattern $c_j$ given temporal group $g_i$ of the temporal pool, and y(j) denotes the j-th element of the vector y from the spatial pool, with j ranging from 1 to $N_c$. Since $y(j) \propto P(e^- \mid c_j)$, and

$$P(e^- \mid g_i) = \sum_{j=1}^{N_c} P(c_j \mid g_i)\, P(e^- \mid c_j) \qquad (4)$$

where $P(e^- \mid g_i)$ denotes the probability distribution of the bottom-layer input sequence $e^-$ with respect to temporal group $g_i$, and $P(e^- \mid G^k)$ is the vector formed by all the $P(e^- \mid g_i)$ at node k,

it follows that $\lambda(i) \propto P(e^- \mid g_i)$ holds for all i; at node k, $\lambda^k$ is proportional to $P(e^- \mid G^k)$, i.e. $\lambda^k \propto P(e^- \mid G^k)$. The output of the temporal pool is the output of the node.
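Formula (3) is a matrix-vector product between the table of conditional probabilities $P(c_j \mid g_i)$ and the spatial pool output y. The sketch below shows this in plain Python; the function name and the numeric values are hypothetical.

```python
def temporal_pool_output(p_c_given_g, y):
    """Compute lambda(i) = sum over j of P(c_j | g_i) * y(j).

    p_c_given_g[i][j] : P(c_j | g_i), one row per temporal group g_i
    y                 : spatial pool output, one entry per cluster center c_j
    """
    for row in p_c_given_g:
        if len(row) != len(y):
            raise ValueError("each row must have one entry per cluster center")
    return [sum(p * value for p, value in zip(row, y)) for row in p_c_given_g]


# Two temporal groups over four cluster centers (N_g = 2, N_c = 4).
p_c_given_g = [
    [0.5, 0.5, 0.0, 0.0],  # group g_1 contains centers c_1 and c_2
    [0.0, 0.0, 0.5, 0.5],  # group g_2 contains centers c_3 and c_4
]
y = [0.2, 0.6, 0.05, 0.15]
lam = temporal_pool_output(p_c_given_g, y)
print(lam)  # approximately [0.4, 0.1]
```

The resulting vector has one entry per temporal group and is what the node passes to its parent.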

33) Repeat the process of step 32) until the nodes in all layers of the HTM network have completed the learning of binary sequence patterns.

As in step 32), once a layer has been trained, its nodes switch to the inference stage, and the nodes of the next layer (the parent nodes) use the outputs of the previous layer's nodes (the child nodes) as input to learn sequence patterns.

4) Extract the feature vectors of the files in the test set using byte-level n-grams.

As in step 1), the feature vectors of the files in the test set are extracted using byte-level n-grams. The test set can be chosen from malicious code test data sets available on the Internet.

As in the inference of step 322), all layers of the whole HTM structure are now in the inference stage. The feature vectors extracted in step 4) are passed through spatial pool sequence-pattern inference and temporal pool group inference. The output vector λ of the top node is the output pattern vector of the whole HTM network, and the output probability $P(e^- \mid G^k)$ of the top node's temporal pool is the malicious code match rate. When this output probability is large enough, for example above a chosen threshold of 85%, the input file is considered to contain malicious code; otherwise it is considered free of malicious code.
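The final decision is a simple threshold test on the match rate produced by the top node. A minimal sketch follows, in which the function name is hypothetical and the 85% default is the example value given in the text:

```python
def contains_malicious_code(match_rate: float, threshold: float = 0.85) -> bool:
    """Flag a file when the top node's match rate exceeds the threshold."""
    if not 0.0 <= match_rate <= 1.0:
        raise ValueError("match_rate must be a probability in [0, 1]")
    return match_rate > threshold


print(contains_malicious_code(0.91))  # True
print(contains_malicious_code(0.40))  # False
```

Raising the threshold trades missed detections for fewer false alarms, so it would normally be tuned on held-out data.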

The present invention takes a sample set of malicious files as the training set, trains the HTM network with the HTM pattern recognition algorithm, and then uses the HTM network to perform pattern recognition and classification inference on unknown files to determine whether they are malicious. During feature extraction, the byte-level n-gram algorithm is used to extract a large number of file feature attributes. For pattern recognition and classification learning, the HTM algorithm is introduced. This algorithm is a novel artificial intelligence algorithm that imitates the structure and working principles of the human neocortex. It applies the principles of continuous information sharing between nodes and belief propagation from Bayesian networks, converting complex problems into pattern matching and prediction. Through training it extracts the spatial sequence patterns and temporal pattern groups of the samples, and uses belief propagation to aggregate and classify the local pattern groups of each layer, finally obtaining the overall pattern groups. In the recognition stage, malicious code samples are identified by matching against the sequence patterns learned at each layer. Thanks to its good noise resistance, fault tolerance, adaptability, and self-learning capability, the HTM algorithm can effectively improve the recognition rate.

The above is merely a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A deep learning detection method for unknown malicious code, characterized in that it comprises the following steps:

2. The deep learning detection method for unknown malicious code according to claim 1, characterized in that step 2) specifically comprises the following steps:

3. The deep learning detection method for unknown malicious code according to claim 1, characterized in that step 3) specifically comprises the following steps:

33) repeat the process of step 32) until the nodes in all layers of the HTM network have completed the learning of sequence patterns.

4. The deep learning detection method for unknown malicious code according to claim 3, characterized in that step 31) specifically comprises the following steps:

311) the binary sequences fed to a node are passed to its spatial pool, which learns clusters of these sequences using the maximum distance D; using the maximum distance D, the spatial pool stores a subset of the input patterns, called cluster centers; as time goes on, the number of new sequence patterns the spatial pool produces per unit time decreases, and when the number of new cluster centers in a time period falls below a set threshold, the clustering process stops;

5. The deep learning detection method for unknown malicious code according to claim 3, characterized in that step 32) specifically comprises the following steps:

321) use the formula

$$P(e^- \mid c_i) = \gamma \prod_{k=1}^{M} \mathrm{input}(m_k^i)$$

to compute the probability distribution of the input sequence $e^-$ over the spatial patterns $c_i$ of the node's spatial pool, and take the normalized result as the output of the spatial pool, where $m_k^i$ denotes a non-zero spatial pattern, M is the number of child nodes of this node, and $e^-$ is the input sequence to be recognized, coming from the bottom layer;

322) based on the output y of the spatial pool, use the formula