CN109886021A

CN109886021A - A malicious code detection method based on API global word vectors and a hierarchical recurrent neural network

Info

Publication number: CN109886021A
Application number: CN201910123187.7A
Authority: CN
Inventors: 高雅琪; 詹静; 樊旭东; 范雪; 刘一帆
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-02-19
Filing date: 2019-02-19
Publication date: 2019-06-14

Abstract

The invention discloses a malicious code detection method based on API global word vectors and a hierarchical recurrent neural network. The method comprises two stages: (S1) a training stage on known samples, whose main purpose is to obtain a network model trained on known samples; and (S2) a prediction stage on unknown samples, whose main purpose is to use the network model from (S1) to predict whether unknown code is malicious. Because malicious code triggers a series of system API calls when it carries out a remote attack, certain call sequences tend to recur frequently. Recurrent neural networks have a distinct advantage in processing temporal information. Combining this advantage with the temporal order of API calls, the invention proposes a malicious code detection method based on API time series that automates malicious code detection, improves detection accuracy and detection speed, and can identify more unknown malicious code.

Description

A malicious code detection method based on API global word vectors and a hierarchical recurrent neural network

Technical field

The present invention relates to the field of malicious code detection, and in particular to a malicious code detection method based on time series. It belongs to the field of computer technology.

Background art

The rapid development of computers and networks has brought people great convenience, but also certain threats. Network hackers exploit network vulnerabilities to launch a variety of malicious attacks. The spread of malicious code not only interferes with the normal use of networks and software but also destroys important data, causing heavy losses to individuals and enterprises.

At present, the more mature approaches to malicious code detection mainly detect malicious code by matching features against a feature library (for example, signature-based detection). Such methods are highly accurate for features already in the database but cannot identify obfuscated or unknown malicious code. Behavior-based detection methods monitor program activity: they capture behavioral information by executing the code in question, are unaffected by obfuscation, and can identify unknown malicious code to some extent. However, both kinds of method require extensive heuristic knowledge from domain experts and cannot achieve automated detection.

Deep learning is one of the fastest-growing technologies in artificial intelligence in recent years. Through recurrent neural networks and related models, it has made great progress in fields that involve temporal information, such as natural language processing (e.g., named entity recognition, Chinese text sentiment analysis, article classification, part-of-speech tagging, machine translation, and dialogue systems). Executing malicious code triggers a sequence of API behaviors that carries temporal information, and a recurrent neural network can detect malicious code by learning the temporal information in that behavior. It therefore has good application prospects for detecting unknown malicious code.

Summary of the invention

Drawing on deep learning and using dynamic behavior analysis, the present invention proposes a malicious code detection method based on API time series. Because malicious code triggers a series of system API calls when it carries out a remote attack, certain call sequences tend to recur frequently. Recurrent neural networks have a distinct advantage in processing temporal information. Combining this advantage with the temporal order of API calls, the invention proposes a malicious code detection method based on API time series that automates malicious code detection, improves detection accuracy and detection speed, and can identify more unknown malicious code.

The technical solution adopted by the present invention is a malicious code detection method based on global word vectors and a hierarchical recurrent neural network (Slice-Long Short-Term Memory Networks, S-LSTM). The method comprises two stages: (S1) a training stage on known samples, whose main purpose is to obtain a network model trained on known samples; and (S2) a prediction stage on unknown samples, whose main purpose is to use the network model from (S1) to predict whether unknown code is malicious.

The training stage (S1) on known samples comprises three modules: (S1-1) a feature representation module, (S1-2) a global word vector generation module, and (S1-3) an S-LSTM network training module.

The prediction stage (S2) on unknown samples comprises two modules: (S2-1) a feature representation module, which operates in the same way as (S1-1), and (S2-2) an S-LSTM network prediction module.

The modules are described below.

First, the modules of the training stage (S1) on known samples are described:

(S1-1) The feature representation module comprises the following steps:

Step 1: collect samples. Collect malicious code, benign code, and the corresponding code labels to form the training sample set.

Step 2: obtain the API sequence of each sample. Execute the code collected in Step 1 in a virtual machine, capture the APIs called during execution using API hooking, and form an API sequence in the order in which the calls were made.
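As an illustration of Step 2, the sketch below orders captured calls into an API sequence. It assumes a hypothetical hook record of the form (timestamp, API name); the source does not specify the log format, and the API names other than NetShareAdd are only examples.

```python
from typing import List, Tuple

# Hypothetical hook record: (timestamp in milliseconds, API name).
HookEvent = Tuple[int, str]


def build_api_sequence(events: List[HookEvent]) -> List[str]:
    """Order captured API calls by call time and return the API names."""
    # sorted() is stable, so calls with equal timestamps keep capture order.
    ordered = sorted(events, key=lambda event: event[0])
    return [name for _, name in ordered]


events = [(30, "WriteFile"), (10, "CreateFileW"), (20, "NetShareAdd")]
print(build_api_sequence(events))  # ['CreateFileW', 'NetShareAdd', 'WriteFile']
```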

(S1-2) The global word vector generation module comprises the following steps:

Step 1: generate the sample vocabulary C. Count the APIs in the API sequences generated in (S1-1) to form the API vocabulary C = {api₁, api₂, ..., api_n}, where n is the number of APIs in vocabulary C.

Step 2: generate a semantic word vector for every API in vocabulary C. Train the CBOW model of the word2vec method on the API sequences generated in (S1-1) to obtain, for each API in C, a word vector that carries semantic information.

Step 3: compute the information gain for every API in vocabulary C, using the information gain method.

Step 4: generate a global word vector for every API in vocabulary C. For each API in C, multiply the word vector obtained in Step 2 by the corresponding information gain from Step 3. This gives the global word vector representation of each API; together these form the global word vector vocabulary.
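Step 1 of this module (building vocabulary C) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name and the frequency-based index order are assumptions, since the source only states that the APIs are counted.

```python
from collections import Counter
from typing import Dict, List


def build_vocabulary(sequences: List[List[str]]) -> Dict[str, int]:
    """Count every API across all sequences and assign each one an index."""
    counts = Counter(api for sequence in sequences for api in sequence)
    # Most frequent APIs first; ties are broken alphabetically so the
    # resulting index is deterministic.
    ordered = sorted(counts, key=lambda api: (-counts[api], api))
    return {api: index for index, api in enumerate(ordered)}


sequences = [["A", "B", "A"], ["B", "A", "C"]]
print(build_vocabulary(sequences))  # {'A': 0, 'B': 1, 'C': 2}
```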

(S1-3) The S-LSTM network training module comprises the following steps:

Step 1: slice the network input sequence. Truncate and pad the API sequences obtained in (S1-1) to a uniform length, then slice the resulting sequences so that the subsequences have a suitable length and meet the input requirements of the S-LSTM network.

Step 2: set the network hyperparameters. Configure the hyperparameters of the S-LSTM network, such as the number of passes over the training data set (epochs), the number of samples per training batch (batch_size), and the learning rate α.

Step 3: train the S-LSTM network model. Represent the API sequences generated in (S1-1) with the global word vectors generated in (S1-2) and use them as the input of the S-LSTM network; training yields the S-LSTM network model.

Step 4: evaluate the network model. Training uses 5-fold cross-validation, in which 4 parts serve as the training set and the remaining part as the test set. The accuracy reported for the invention is the average accuracy over the 5 folds. If the average accuracy is below 98%, return to Step 2 and adjust the network hyperparameters until the average accuracy exceeds 98%.
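The evaluation rule in Step 4 can be sketched as below. The split is a simple interleaved partition chosen for illustration; the source does not say how samples are assigned to folds, and a real experiment would normally shuffle or stratify first. Both function names are hypothetical.

```python
from typing import Iterator, List, Tuple


def five_fold_splits(num_samples: int, folds: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices): 4 parts train, 1 part test."""
    indices = list(range(num_samples))
    for fold in range(folds):
        test = indices[fold::folds]  # every folds-th sample, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def needs_retuning(fold_accuracies: List[float], threshold: float = 0.98) -> bool:
    """Return True when the average accuracy is below the 98% threshold."""
    return sum(fold_accuracies) / len(fold_accuracies) < threshold


for train, test in five_fold_splits(10):
    print(len(train), len(test))  # 8 2 on every fold
```

When needs_retuning returns True, the procedure goes back to Step 2 and the hyperparameters are adjusted before training again.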

Second, the modules of the prediction stage (S2) on unknown samples are described:

(S2-1) The feature representation module follows the same steps as (S1-1) and yields the API sequence of the sample to be predicted.

(S2-2) The S-LSTM network prediction module comprises the following steps:

Step 1: using the global word vector vocabulary generated in (S1-2), represent the API sequence from (S2-1) with global word vectors.

Step 2: feed the word vectors from Step 1 into the S-LSTM network trained in (S1-3) to obtain the detection result for the unknown sample.
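Step 1 of the prediction module amounts to a table lookup, sketched below. The source does not say how an API that is absent from the training vocabulary is handled; mapping it to a zero vector is an assumption made only for this example, as are the function name and the toy vectors.

```python
from typing import Dict, List


def encode_sequence(api_sequence: List[str],
                    global_vectors: Dict[str, List[float]],
                    dim: int) -> List[List[float]]:
    """Replace each API name with its global word vector."""
    # Assumption: APIs absent from the training vocabulary map to zeros.
    unknown = [0.0] * dim
    return [global_vectors.get(api, unknown) for api in api_sequence]


vocabulary = {"NetShareAdd": [0.5, 1.0], "TerminateProcess": [1.0, -0.5]}
encoded = encode_sequence(["NetShareAdd", "UnseenApi", "TerminateProcess"], vocabulary, dim=2)
print(encoded)  # [[0.5, 1.0], [0.0, 0.0], [1.0, -0.5]]
```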

The method detects malicious code using deep learning. Compared with other detection methods, it has the following advantages:

1. The invention proposes a global word vector method based on detection importance. The traditional word2vec word vector method captures only the correlations between context words, whereas the API global word vector method proposed here integrates each API's importance to detection into the traditional context-dependency information, improving the accuracy of malicious code detection. Using the same data set (2000 malicious samples and 910 non-malicious samples) and an LSTM network, 5-fold cross-validation shows that, compared with feeding the LSTM the word vectors output by classical word2vec (average detection accuracy over the 5 folds: 98.69%), feeding it the word vectors output by the proposed global word vector method (average detection accuracy over the 5 folds: 98.8%) gives a consistent improvement in detection accuracy (per-fold accuracy improves by between 0.09% and 0.14%, and average accuracy improves by 0.11%).

2. The invention proposes a fast detection method, based on a hierarchical recurrent neural network, suited to malicious code detection. Running code can trigger a large number of API calls; for example, the API sequences triggered by the data samples used in the invention have an average length of 19000. The extracted API sequence features can therefore be very large, which makes detection take too long. The invention applies the S-LSTM network to malicious code detection: an overlong API sequence is divided into multiple subsequences, and a multi-layer network processes the subsequences in parallel. Using the same data set, and with traditional word2vec word vectors as network input, the proposed method reduces detection time from 750 minutes with a traditional LSTM network to 99 minutes, a reduction of 86.8%.

3. The proposed malicious code detection method based on API global word vectors and a hierarchical recurrent neural network offers a high degree of detection automation and accurately identifies the behavior of unknown malicious code. Regarding automation, the method requires only manual labeling of whether existing samples are malicious and, unlike existing machine learning algorithms, needs no separate selection of API behavior features, which raises the degree of automation. Regarding accurate identification of unknown malicious behavior, the method identifies malicious code mainly through the temporal relationships among API behaviors that the recurrent neural network discovers, so it can identify malicious code that is unknown but behaves similarly. Machine learning algorithms for malicious behavior recognition usually do not identify these temporal relationships directly; instead they perform detection based on a set of selected feature APIs (such as NetShareAdd, which sets up a shared folder, or TerminateProcess, which forcibly terminates a process), and therefore depend more heavily on sample quality. Compared with common machine learning algorithms such as k-nearest neighbors (accuracy 97.66%), support vector machines (accuracy 96.49%), and decision trees (accuracy 97.94%), this method achieves a detection accuracy of 98.86%, a clear improvement (of 1.2%, 2.37%, and 0.92%, respectively).

Brief description of the drawings

Fig. 1: Overall framework of the invention

Fig. 2: Structure of the global word vector model

Fig. 3: Structure of the S-LSTM network

Detailed description of the embodiments

The invention is further described below with reference to the drawings and specific embodiments.

The overall framework of the invention is shown in Fig. 1. The malicious code detection method comprises two stages: (S1) a training stage on known samples, whose main purpose is to obtain a network model trained on known samples; and (S2) a prediction stage on unknown samples, whose main purpose is to use the network model from (S1) to predict whether unknown code is malicious.

The training stage (S1) on known samples comprises 3 modules: (S1-1) a feature representation module, (S1-2) a global word vector generation module, and (S1-3) an S-LSTM network training module.

The prediction stage (S2) on unknown samples comprises 2 modules: (S2-1) a feature representation module, which operates in the same way as (S1-1), and (S2-2) an S-LSTM network prediction module.

(S1-1) The feature representation module comprises the following steps:

Step 1: obtain samples. Collect malicious code, benign code, and the corresponding code labels to form the training sample set. The malicious samples come from http://academictorrents.com/, and the benign samples come from system files and http://xiazai.zol.com.cn/.

(S1-2) The global word vector generation module, shown in Fig. 2, comprises the following steps:

Step 2: generate a semantic word vector v(w) for every API in vocabulary C. Train the CBOW model of the classical word2vec method on the API sequences generated in (S1-1) to obtain, for each API in C, a word vector v(w) that carries semantic information.

The CBOW model, shown on the left of Fig. 2, consists of an input layer, a projection layer, and an output layer. CBOW uses the surrounding words $Context(w) = w_{-c}, \ldots, w_{-1}, w_{1}, \ldots, w_{c}$ to predict the center word $w$; in the present invention $w$ is an API and $c$ is the window size. The conditional probability $p(w \mid Context(w))$ denotes the probability that the center word $w$ occurs given a context of window size $c$, and the optimization objective of the CBOW model is denoted $G$. To find a local maximum of $G$, that is, to maximize the conditional probability of every API in the vocabulary, a negative sample set for $w$ is first constructed by random negative sampling: APIs in the vocabulary other than $w$ are called negative samples, and the negative sample set is denoted $NEG(w)$. Second, $G$ is optimized by stochastic gradient ascent; when the maximum number of iterations is reached, $G$ reaches a local maximum.
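To make the notation Context(w) concrete, the sketch below lists, for each center API w in a sequence, the up to 2c surrounding APIs that CBOW would use to predict it. It only illustrates how the training pairs are formed; it does not train a model, and the function name is hypothetical.

```python
from typing import List, Tuple


def context_windows(sequence: List[str], c: int) -> List[Tuple[str, List[str]]]:
    """Pair each center API w with Context(w) for window size c."""
    pairs = []
    for i, center in enumerate(sequence):
        left = sequence[max(0, i - c):i]      # w_{-c}, ..., w_{-1}
        right = sequence[i + 1:i + 1 + c]     # w_{1}, ..., w_{c}
        pairs.append((center, left + right))
    return pairs


for center, context in context_windows(["a", "b", "c", "d"], c=1):
    print(center, context)
# a ['b']
# b ['a', 'c']
# c ['b', 'd']
# d ['c']
```

At the ends of a sequence the window is simply shorter, which is one common convention and an assumption here.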

Step 3: compute the information gain IG(w) for every API in vocabulary C, using the information gain method. The information gain measures how much information an API contributes to classification: the more information it contributes, the more important the API.
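One common way to compute information gain is sketched below. It assumes the feature for an API is simply whether the API appears in a sample's sequence, and that labels are 1 for malicious and 0 for benign; the source names the information gain method but does not spell out these details.

```python
import math
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for value in set(labels):
        p = sum(1 for label in labels if label == value) / total
        result -= p * math.log2(p)
    return result


def information_gain(api: str, sequences: List[List[str]], labels: List[int]) -> float:
    """IG = H(labels) - H(labels | API present or absent)."""
    present = [y for seq, y in zip(sequences, labels) if api in seq]
    absent = [y for seq, y in zip(sequences, labels) if api not in seq]
    total = len(labels)
    conditional = (len(present) / total) * entropy(present) \
        + (len(absent) / total) * entropy(absent)
    return entropy(labels) - conditional


sequences = [["NetShareAdd", "X"], ["NetShareAdd"], ["X"], ["X", "Y"]]
labels = [1, 1, 0, 0]
print(information_gain("NetShareAdd", sequences, labels))  # 1.0
```

In this toy data NetShareAdd appears only in malicious samples, so it separates the classes perfectly and receives the maximum gain of 1 bit.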

Step 4: generate the global word vector V(w) for every API in vocabulary C. For each API w in C, multiply the word vector v(w) obtained in Step 2 by the corresponding information gain IG(w) from Step 3, i.e. V(w) = v(w) * IG(w). This gives the global word vector V(w) of each API; together these form the global word vector vocabulary, which is saved in the file G_CBOW_File.
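The scaling V(w) = v(w) * IG(w) is a scalar-times-vector product, as the short sketch below shows. The dictionaries stand in for the outputs of Steps 2 and 3, and their values are made up for the example.

```python
from typing import Dict, List


def global_word_vectors(word_vectors: Dict[str, List[float]],
                        info_gain: Dict[str, float]) -> Dict[str, List[float]]:
    """Scale each API's word vector v(w) by its information gain IG(w)."""
    return {
        api: [component * info_gain[api] for component in vector]
        for api, vector in word_vectors.items()
    }


v = {"A": [1.0, 2.0], "B": [4.0, -2.0]}   # illustrative v(w)
ig = {"A": 0.5, "B": 0.25}                # illustrative IG(w)
print(global_word_vectors(v, ig))  # {'A': [0.5, 1.0], 'B': [1.0, -0.5]}
```

An API with a low information gain is shrunk toward the zero vector, so it contributes less to the network input than an API that discriminates well between classes.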

(S1-3) The S-LSTM network training module comprises the following steps:

Step 1: slice the input sequence and construct the S-LSTM network structure. Truncate and pad the API sequences obtained in (S1-1) to a uniform length, then slice the resulting sequences so that the subsequences have a suitable length. Then construct an S-LSTM network suited to the invention; its structure comprises an input layer, hidden layers, and an output layer. This step describes the input layer and hidden layers of the S-LSTM network; the output layer is described in Step 3.

Let the input sequence be $X = [x_1, x_2, \ldots, x_T]$, where $x$ is the input at each time step and $T$ is the sequence length. The sequence $X$ is sliced into $n$ subsequences $N$, each of length $t = T/n$, so the input sequence can be written $X = [N_1, N_2, \ldots, N_n]$, where a given subsequence is $N_p = [x_{(p-1)t+1}, x_{(p-1)t+2}, \ldots, x_{pt}]$. In the same way, each subsequence $N$ is in turn sliced into $n$ subsequences of equal length, and the operation is repeated until the subsequences at the bottom layer have a suitable length. After $k$ slicing operations, a network of $k+1$ layers is obtained. The minimum subsequence length at layer 0 is $l_0 = T/n^k$, and the number of minimum subsequences at layer 0 is $s_0 = n^k$. For the remaining layers, the subsequence length is $l_p = n$ and the number of subsequences is $s_p = n^{k-p}$, where $p$ is the layer index.

The API sequences extracted in the invention have an average length of 19000, and the network model performs best at k = 2. The invention therefore uses T = 19683 and k = 2, with n set to 27 so that the sequence divides evenly. The S-LSTM network of the invention is shown in Fig. 3: the network input layer has length T = 19683, and 2 slicing operations produce 3 hidden layers. Consistent with s₀ = n^k, hidden layer 0 has 729 subsequences of length 27; hidden layer 1 has 27 subsequences of length 27; hidden layer 2 has 1 subsequence of length 27. Passing through the 3 hidden layers yields the final hidden state F.
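The slicing arithmetic can be checked with a short sketch. It computes the number and length of subsequences per layer from s₀ = n^k and s_p = n^(k-p), with the layer-0 length following as T divided by s₀, and shows truncation and padding to a uniform length. Padding with zeros at the end of the sequence is an assumption, as are the function names.

```python
from typing import List, Tuple


def pad_or_truncate(seq: List[int], length: int, pad_value: int = 0) -> List[int]:
    """Cut the sequence to `length`, or pad its end with pad_value."""
    return seq[:length] + [pad_value] * max(0, length - len(seq))


def layer_shapes(T: int, n: int, k: int) -> List[Tuple[int, int]]:
    """Return (subsequence count, subsequence length) for layers 0..k."""
    if T % n ** k != 0:
        raise ValueError("T must be divisible by n**k")
    shapes = [(n ** k, T // n ** k)]       # layer 0
    for p in range(1, k + 1):
        shapes.append((n ** (k - p), n))   # layers 1..k
    return shapes


def bottom_slices(seq: List[int], n: int, k: int) -> List[List[int]]:
    """Split an already padded sequence into its n**k layer-0 subsequences."""
    size = len(seq) // n ** k
    return [seq[i:i + size] for i in range(0, len(seq), size)]


print(layer_shapes(19683, 27, 2))  # [(729, 27), (27, 27), (1, 27)]
print(bottom_slices(pad_or_truncate([1, 2, 3], 8), n=2, k=2))
# [[1, 2], [3, 0], [0, 0], [0, 0]]
```

With T = 19683 = 27³, every layer works on subsequences of length 27, so no recurrent unit has to process more than 27 steps in a row.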

Step 2: set the network hyperparameters. Based on experience, the hyperparameters of the S-LSTM network are set as follows: number of passes over the training data set epoch = 15, number of samples per training batch batch_size = 30, and learning rate α = 0.01.

Step 3: train the S-LSTM network model. Represent the API sequences generated in (S1-1) with the global word vectors generated in (S1-2) and use them as the input of the S-LSTM network. After the three hidden layers, the final hidden state F is obtained, and the softmax function produces the network output $\hat{y}$, shown as the output layer in Fig. 3. During training, the network loss is computed with the binary_crossentropy loss function, $L = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$, where $y$ is the actual value and $\hat{y}$ is the output value. The network is optimized with the Adam algorithm and stops optimizing when the maximum number of iterations is reached.
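The binary_crossentropy loss named in Step 3 is conventionally the mean over samples of -[y log ŷ + (1 - y) log(1 - ŷ)]. The sketch below is a plain-Python illustration of that standard definition, not the training code of the invention; the clipping constant eps is an assumption added to avoid log(0).

```python
import math
from typing import Sequence


def binary_crossentropy(y_true: Sequence[float], y_pred: Sequence[float],
                        eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(y_true)


# An uninformative prediction of 0.5 costs log(2), about 0.693, per sample.
print(binary_crossentropy([1, 0], [0.5, 0.5]))
```

The loss approaches 0 as the predicted probability approaches the true label and grows without bound as it approaches the opposite label.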

Step 4: evaluate the network model. Training uses 5-fold cross-validation, in which 4 parts serve as the training set and the remaining part as the test set. The accuracy reported for the invention is the average accuracy over the 5 folds. If the average accuracy is below 98%, return to Step 2 and adjust the network hyperparameters until the average accuracy exceeds 98%.

(S2-2) The S-LSTM network prediction module comprises the following steps:

Claims

1. A malicious code detection method based on API global word vectors and a hierarchical recurrent neural network, characterized in that the method comprises two stages: (S1) a training stage on known samples, whose main purpose is to obtain a network model trained on known samples; and (S2) a prediction stage on unknown samples, whose main purpose is to use the network model from (S1) to predict whether unknown code is malicious;

wherein the training stage (S1) on known samples comprises three modules: (S1-1) a feature representation module, (S1-2) a global word vector generation module, and (S1-3) an S-LSTM network training module;

and the prediction stage (S2) on unknown samples comprises two modules: (S2-1) a feature representation module, which operates in the same way as (S1-1), and (S2-2) an S-LSTM network prediction module.

2. The malicious code detection method based on API global word vectors and a hierarchical recurrent neural network according to claim 1, characterized in that the (S1-1) feature representation module comprises the following steps:

Step 1: collect samples; collect malicious code, benign code, and the corresponding code labels to form the training sample set;

Step 2: obtain the API sequence of each sample; execute the code collected in Step 1 in a virtual machine, capture the APIs called during execution using API hooking, and form an API sequence in the order in which the calls were made.

3. The malicious code detection method based on API global word vectors and a hierarchical recurrent neural network according to claim 1, characterized in that the (S1-2) global word vector generation module comprises the following steps:

Step 1: generate the sample vocabulary C; count the APIs in the API sequences generated in (S1-1) to form the API vocabulary C = {api₁, api₂, ..., api_n}, where n is the number of APIs in vocabulary C;

Step 2: generate a semantic word vector for every API in vocabulary C; train the CBOW model of the word2vec method on the API sequences generated in (S1-1) to obtain, for each API in C, a word vector that carries semantic information;

Step 3: compute the information gain for every API in vocabulary C, using the information gain method;

Step 4: generate a global word vector for every API in vocabulary C; for each API in C, multiply the word vector obtained in Step 2 by the corresponding information gain from Step 3 to obtain the global word vector representation of each API, forming the global word vector vocabulary.

4. The malicious code detection method based on API global word vectors and a hierarchical recurrent neural network according to claim 1, characterized in that the (S1-3) S-LSTM network training module comprises the following steps:

Step 1: slice the network input sequence; truncate and pad the API sequences obtained in (S1-1) to a uniform length, then slice the resulting sequences so that the subsequences have a suitable length and meet the input requirements of the S-LSTM network;

Step 2: set the network hyperparameters; configure the hyperparameters of the S-LSTM network, such as the number of passes over the training data set (epochs), the number of samples per training batch (batch_size), and the learning rate α;

Step 3: train the S-LSTM network model; represent the API sequences generated in (S1-1) with the global word vectors generated in (S1-2) and use them as the input of the S-LSTM network, obtaining the S-LSTM network model after training;

Step 4: evaluate the network model; training uses 5-fold cross-validation, in which 4 parts serve as the training set and the remaining part as the test set; the accuracy reported for the invention is the average accuracy over the 5 folds; if the average accuracy is below 98%, return to Step 2 and adjust the network hyperparameters until the average accuracy exceeds 98%.

5. The malicious code detection method based on API global word vectors and a hierarchical recurrent neural network according to claim 1, characterized in that the (S2-1) feature representation module follows the same steps as (S1-1) and yields the API sequence of the sample to be predicted.

6. The malicious code detection method based on API global word vectors and a hierarchical recurrent neural network according to claim 1, characterized in that the (S2-2) S-LSTM network prediction module comprises the following steps:

Step 1: using the global word vector vocabulary generated in (S1-2), represent the API sequence from (S2-1) with global word vectors;

Step 2: feed the word vectors from Step 1 into the S-LSTM network trained in (S1-3) to obtain the detection result for the unknown sample.