CN109753798A

CN109753798A - A Webshell detection model based on random forest and FastText

Info

Publication number: CN109753798A
Application number: CN201811507276.3A
Authority: CN
Inventors: 方勇; 黄诚; 张磊; 邱瑶瑶; 苏瑜
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2018-12-11
Filing date: 2018-12-11
Publication date: 2019-05-14

Abstract

A Web-based remote access Trojan (Webshell) is a network intrusion tool that an attacker uploads to a website to gain administrative access to the Web service. A successful injection can cause severe damage, so detecting Webshells effectively is essential. Obfuscation makes Webshells flexible and variable, which increases the difficulty of detection. The invention proposes a PHP Webshell detection model, called FRF-WD, that is based on the FastText algorithm and the random forest algorithm and uses PHP opcode sequences as an important feature for Webshell detection. The model designed in the invention has a higher detection rate and a lower false positive rate.

Description

A Webshell detection model based on random forest and FastText

Technical field

The present invention relates to a PHP Webshell detection model based on random forest and the FastText algorithm. The model extracts opcode sequences produced by the Zend engine, runs a text classification model over the opcode sequences and extracts the resulting labels, and then uses a classification algorithm based on random forest (Random Forest) to detect malicious Webshell scripts written in PHP accurately and effectively.

Background art

With the development of Web applications, the Web-based remote access Trojan (Webshell) has become a tool for network intrusion: an attacker uploads it to a Web server to obtain management access to the service. Once the attacker succeeds in injecting it and exploits the server's vulnerabilities, the losses can be huge, so detecting Webshells effectively is vital. By using obfuscation, Webshells become flexible and variable, which increases the difficulty of detection. This document presents a Webshell detection model for the PHP language that combines FastText and the random forest algorithm, called FRF-WD, in which PHP opcode sequences serve as an important feature for detecting Webshells. Experimental results show that the model has a higher detection rate and a lower false positive rate, which demonstrates its feasibility and validity.

Webshell detection is an essential part of malicious web page detection, and a substantial body of research on it already exists. The two most important factors are feature extraction and the detection model. Webshell features are generally extracted from five perspectives: the longest string length in the file, information entropy, index of coincidence, characteristic functions, and blacklisted keywords.

An optimal-threshold recognition method based on malicious functions and malicious feature samples can be used for Webshell detection, but it may misjudge legitimate files that contain a small number of malicious functions as suspicious files.

Another approach detects PHP malware by analyzing the similarity of a sample set with similarity matrices. It performs similarity analysis on the PHP malware sample set using four different similarity measures: the content of the decoded sample, the extracted bodies of user-defined functions, the names of user-defined functions, and the fuzzy hash of the file.

Webshell detection models based on machine learning have been studied extensively. One example is a Webshell detection method based on matrix decomposition. However, this method does not determine whether to classify according to page attributes.

Another method is based on support vector machines (SVM). It analyzes the HTML features of web pages and compares two SVM kernel functions, the linear kernel and the Gaussian radial basis function kernel; the former has the higher recall rate.

Webshells can also be detected from Web logs. Because a Webshell is usually a standalone file, this approach mainly analyzes the file's access path and parameters, its access frequency, and its page relevance in order to compare Webshells with normal Web files. However, a method that uses only these features may have a very high false positive rate, which can reduce its accuracy.

In addition to the methods above, some approaches combine static and dynamic techniques to reveal Webshell features, and honeypots have been used to study how attackers use Webshells.

Detection technology has moved from traditional pattern matching to today's widely adopted machine learning, and Webshell detection is developing toward greater automation and intelligence. Detection results are now expected not only to identify known attack types accurately and completely, but also to withstand various obfuscation techniques.

Feature extraction and detection for Webshells must mainly solve the following problems.

(1) How to extract opcode sequences, based on the Zend engine, from Webshells written as PHP files.

(2) How to select and optimize the parameters of the FastText algorithm so that it produces a text classification model suited to the current opcode sequences.

(3) How to construct a suitable machine learning algorithm and test its detection performance on PHP Webshells.

This system focuses on solving the three problems above and implements an opcode-based Webshell detection model.

Summary of the invention

The invention is an advanced model that combines several techniques: opcode parsing with the open-source script engine Zend, text classification model training based on the word n-gram concept in the FastText algorithm, extraction of multiple statistical static features, and classification with the machine learning algorithm RF. The sample data is preprocessed, static features and opcode sequences are extracted from the Webshell code it contains, and an RF classification model is used to detect Webshells among PHP files.

The invention pursues the following goals.

(1) The model generates the opcodes of a compiled PHP file with the VLD extension, uses a FastText model to extract features from the opcode sequence and classify it, and detects whether the current input PHP file is a Webshell.

(2) The model can preprocess the collected sample code, extract the opcodes of the PHP files, and classify them as text, which serves as the preliminary detection stage. In the training stage, the multiple static features and the preliminary detection result are combined into a feature set that is used to train the final classification and detection model.

The model parses opcodes with the open-source script engine Zend and extracts the opcode sequence from the generated opcode file. It also has feature extraction capability: it extracts multiple static statistical features from the data samples and puts them into the feature set. The RF detection model is then trained on the static features of the data samples together with the result of passing the opcode sequences through the text classification model.
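The opcode extraction described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation. It assumes PHP with the VLD extension installed, that the settings `vld.active=1` and `vld.execute=0` dump opcodes without running the script, and that VLD writes the dump to stderr. The helper names `dump_opcodes` and `parse_opcode_sequence` are hypothetical, the parser assumes opcode names appear as upper-case tokens on rows that start with a number, and `SAMPLE_DUMP` is hand-written in the general shape of VLD output rather than captured from a real run.

```python
import re
import subprocess
from typing import List

# Assumption: opcode names look like ECHO, ASSIGN, INIT_FCALL, i.e. at least
# two upper-case letters, optionally joined by underscores.
OPCODE_TOKEN = re.compile(r"^[A-Z][A-Z_]*[A-Z]$")


def dump_opcodes(php_file: str) -> str:
    """Ask PHP (with VLD loaded) to dump opcodes without executing the file."""
    result = subprocess.run(
        ["php", "-d", "vld.active=1", "-d", "vld.execute=0", "-f", php_file],
        capture_output=True, text=True, check=False,
    )
    # Assumption: the dump arrives on stderr; stdout is appended as a fallback.
    return result.stderr + result.stdout


def parse_opcode_sequence(dump: str) -> List[str]:
    """Return the first opcode-like token of every row that starts with a number."""
    opcodes = []
    for line in dump.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].isdigit():
            continue  # skip headers, separators, and summary lines
        for token in tokens[1:]:
            if OPCODE_TOKEN.match(token):
                opcodes.append(token)
                break
    return opcodes


# Hand-written, abbreviated sample for illustration only.
SAMPLE_DUMP = """\
line     #* E I O op             fetch  ext  return  operands
---------------------------------------------------------------
   2     0  E >   ASSIGN                             !0, 'id'
   3     1        INIT_FCALL                         'system'
         2        SEND_VAR                           !0
         3        DO_ICALL
   4     4      > RETURN                             1
"""

print(parse_opcode_sequence(SAMPLE_DUMP))
```

The column layout of real VLD output varies between versions, so the row filter and token pattern would need checking against the installed extension.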

To achieve the above goals, the invention adopts the following technical solution: the Webshell detection model based on random forest and FastText consists mainly of three parts, namely data preparation, feature extraction, and Webshell detection.

Data preparation covers the work that precedes opcode and statistical feature extraction, including collecting positive and negative samples, filtering out duplicate sample files, and labeling the positive and negative samples.
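The duplicate filtering in data preparation can be illustrated with a short Python sketch. It assumes that samples arrive as (file content, label) pairs and that two files count as duplicates when their SHA-256 digests match. The function name `deduplicate_samples`, the label encoding, and the choice of SHA-256 are assumptions made for illustration; the text does not specify them.

```python
import hashlib
from typing import Iterable, List, Tuple

# Assumed label encoding: 1 for Webshell samples, 0 for normal PHP files.
WEBSHELL, NORMAL = 1, 0

Sample = Tuple[bytes, int]


def deduplicate_samples(samples: Iterable[Sample]) -> List[Sample]:
    """Keep the first occurrence of each distinct file content."""
    seen = set()
    unique = []
    for content, label in samples:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            continue  # identical bytes were already collected
        seen.add(digest)
        unique.append((content, label))
    return unique


collected = [
    (b"<?php echo 'hello'; ?>", NORMAL),
    (b"<?php system($_GET['c']); ?>", WEBSHELL),
    (b"<?php echo 'hello'; ?>", NORMAL),  # exact duplicate of the first sample
]
print(len(deduplicate_samples(collected)))
```

Exact hashing only removes byte-identical files; near-duplicates that differ in whitespace or comments would survive this filter.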

Static statistical feature extraction covers the longest string length, information entropy, index of coincidence, characteristic functions, and blacklisted keywords. Opcode extraction covers generating the opcodes and extracting and saving the opcode sequence.
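The five static statistical features can be sketched in Python as below. The definitions are common textbook ones and are assumptions here, since the text does not give formulas: the longest string is taken as the longest whitespace-delimited token, entropy is Shannon entropy per character, and the index of coincidence is computed over all characters. The lists `FEATURE_FUNCTIONS` and `BLACKLIST_KEYWORDS` are hypothetical placeholders.

```python
import math
import re
from collections import Counter
from typing import List, Sequence

# Hypothetical lists for illustration; the text does not enumerate them.
FEATURE_FUNCTIONS = ("eval", "assert", "system", "exec", "base64_decode")
BLACKLIST_KEYWORDS = ("c99shell", "r57shell", "phpspy")


def longest_string_length(source: str) -> int:
    """Length of the longest whitespace-delimited token."""
    return max((len(token) for token in source.split()), default=0)


def information_entropy(source: str) -> float:
    """Shannon entropy in bits per character."""
    if not source:
        return 0.0
    total = len(source)
    return -sum((c / total) * math.log2(c / total) for c in Counter(source).values())


def index_of_coincidence(source: str) -> float:
    """Probability that two characters drawn without replacement are equal."""
    total = len(source)
    if total < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(source).values()) / (total * (total - 1))


def count_calls(source: str, names: Sequence[str]) -> int:
    """Count occurrences of the given function names followed by '('."""
    pattern = r"\b(" + "|".join(map(re.escape, names)) + r")\s*\("
    return len(re.findall(pattern, source, flags=re.IGNORECASE))


def count_keywords(source: str, keywords: Sequence[str]) -> int:
    """Count case-insensitive occurrences of blacklisted keywords."""
    lowered = source.lower()
    return sum(lowered.count(keyword) for keyword in keywords)


def static_features(source: str) -> List[float]:
    """Return the five static features in a fixed order."""
    return [
        float(longest_string_length(source)),
        information_entropy(source),
        index_of_coincidence(source),
        float(count_calls(source, FEATURE_FUNCTIONS)),
        float(count_keywords(source, BLACKLIST_KEYWORDS)),
    ]


print(static_features("<?php eval($_POST['x']); ?>"))
```

Obfuscated Webshells often contain long encoded strings, which is why string length and entropy are commonly used as signals, although neither is conclusive on its own.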

Webshell detection uses an RF algorithm composed of 100 decision trees to train the classification model and to classify samples of unknown type. In the training stage, the hyperparameter configuration of the classification model must be tuned in order to train the optimal classification model.
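The idea of voting over 100 trees can be shown with a deliberately simplified Python sketch. It is not a full random forest: each tree is reduced to a one-split decision stump trained on a bootstrap sample with one randomly chosen feature, and the ensemble predicts by majority vote. The function names, the toy data, and the stump simplification are all assumptions for illustration; a practical system would more likely rely on an established library implementation.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

# A stump is (feature index, threshold, label if value <= threshold, label otherwise).
Stump = Tuple[int, float, int, int]


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def majority(labels: Sequence[int], default: int) -> int:
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(X: Sequence[Sequence[float]], y: Sequence[int], feature: int) -> Stump:
    """Choose the midpoint threshold with the lowest weighted Gini impurity."""
    overall = majority(y, 0)
    values = sorted({row[feature] for row in X})
    best_score = None
    best_stump = (feature, values[0], overall, overall)  # constant fallback
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best_score is None or score < best_score:
            best_score = score
            best_stump = (feature, threshold,
                          majority(left, overall), majority(right, overall))
    return best_stump


def fit_forest(X, y, n_trees: int = 100, seed: int = 0) -> List[Stump]:
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # sample rows with replacement
        feature = rng.randrange(len(X[0]))
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest: Sequence[Stump], row: Sequence[float]) -> int:
    """Majority vote over all stumps."""
    votes = [lo if row[f] <= t else hi for f, t, lo, hi in forest]
    return majority(votes, 0)


# Toy data: two made-up features per file, label 1 = Webshell, 0 = normal.
X_train = [[1.0, 10.0], [2.0, 11.0], [3.0, 12.0], [7.0, 20.0], [8.0, 21.0], [9.0, 22.0]]
y_train = [0, 0, 0, 1, 1, 1]
model = fit_forest(X_train, y_train, n_trees=100, seed=0)
print(predict(model, [8.5, 21.5]))
```

Bootstrap sampling and random feature choice decorrelate the individual trees, so the vote is more stable than any single tree.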

Description of the drawings

Fig. 1 is the model training and detection structure diagram of the invention.

Specific embodiment

The Webshell detection model based on random forest and FastText comprises three modules: a data preparation module, a feature extraction module, and a Webshell detection module.

Data source.

Figure 1 shows the model training and detection structure diagram, which illustrates in detail the training stage and the detection stage of the Webshell detection model. The sample files are preprocessed and their static statistical features are extracted; VLD generates the PHP opcodes; FastText trains a text classification model on the opcodes to obtain the corresponding label feature; and the static statistical features combined with the opcode label feature form the feature input of the whole detection model. In the training stage, the classification model is trained on the feature set formed by the static statistical features and the label feature. In the detection stage, the same procedure extracts the static features of the PHP file under test, the text classification model generated in the training stage then examines the opcodes of that file, and finally the random forest detection model generated in the training stage classifies the feature set of the PHP file under test.
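The way the two kinds of features come together can be sketched in Python. The sketch assumes the fastText supervised-training text format, in which each line starts with a label carrying the default `__label__` prefix, and it assumes the label names `webshell` and `normal`, which the text does not specify. The helper names are hypothetical.

```python
from typing import List, Sequence

LABEL_PREFIX = "__label__"  # default label prefix in fastText training files


def to_fasttext_line(opcodes: Sequence[str], label: str) -> str:
    """Format one labeled opcode sequence as a fastText training line."""
    return LABEL_PREFIX + label + " " + " ".join(opcodes)


def label_to_feature(predicted_label: str) -> float:
    """Map the text classifier's predicted label to a numeric feature."""
    return 1.0 if predicted_label == LABEL_PREFIX + "webshell" else 0.0


def build_feature_vector(static_features: Sequence[float], predicted_label: str) -> List[float]:
    """Append the opcode label feature to the five static features."""
    return list(static_features) + [label_to_feature(predicted_label)]


line = to_fasttext_line(["ASSIGN", "INIT_FCALL", "SEND_VAR", "DO_ICALL", "RETURN"], "webshell")
vector = build_feature_vector([18.0, 3.9, 0.06, 1.0, 0.0], "__label__webshell")
print(line)
print(vector)
```

The resulting six-element vector is what the random forest stage would consume for each file, in both training and detection.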

FastText has an important parameter, wordNgrams, which in natural language processing is the word-level N-gram setting. Here it lets each opcode be combined with its neighboring opcodes into a group sequence that is fed to the text classification model as a single unit for training. This achieves performance comparable to methods based on deep learning, while FastText is faster. The FastText model consists of two matrices, A and B: matrix A is the word lookup table for the text, and matrix B is used by the classifier. The word representations are averaged into a text representation, which is then fed to a linear classifier. The softmax function f gives the probability distribution over the predefined classes. For a set of N documents, where x_n is the n-th document and y_n is its label, the model minimizes the negative log-likelihood:

$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\left(f(BAx_n)\right)$$
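The word-level N-gram grouping and the averaged-embedding classifier can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: n-grams are kept as plain strings, whereas fastText hashes them into buckets, and the matrices `A` and `B` are tiny hand-made lists rather than trained weights. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def word_ngrams(tokens: Sequence[str], max_n: int) -> List[str]:
    """All contiguous groups of 1..max_n tokens, shortest groups first."""
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(A: Sequence[Sequence[float]], B: Sequence[Sequence[float]],
                  feature_ids: Sequence[int]) -> List[float]:
    """Average the rows of A selected by feature_ids, apply B, then softmax."""
    dim = len(A[0])
    hidden = [sum(A[i][d] for i in feature_ids) / len(feature_ids) for d in range(dim)]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in B]
    return softmax(scores)


opcodes = ["ASSIGN", "INIT_FCALL", "DO_ICALL"]
print(word_ngrams(opcodes, 2))

A = [[1.0, 0.0], [0.0, 1.0]]  # toy lookup table: one row per feature
B = [[1.0, 0.0], [0.0, 1.0]]  # toy classifier weights: one row per class
print(predict_proba(A, B, [0, 1]))
```

Grouping neighboring opcodes preserves some local ordering information that a plain bag of single opcodes would lose.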

In the classification model, the feature set is trained with the random forest (RF) algorithm. Tuning parameters such as the objective function, the optimization function, batch-size, and epochs yields the optimal classification model for the current environment. When a new test sample is input, the trained classification model classifies it.

Claims

1. A Webshell detection model based on random forest and FastText, characterized by the following steps:

A. Preprocess the data and extract five static features from the PHP file training set; as elements of the feature set, they are used in the subsequent steps to train the random forest model;

B. Parse the opcode sequences of the PHP files with the Vulcan Logic Disassembler (VLD) extension, and process the labeled opcode sequences with the FastText algorithm to generate a FastText text classifier model;

C. Input the opcode sequences extracted in step B into the text classifier to predict their labels, and add the labels to the feature set containing the static features from step A;

D. Train the random-forest-based classification algorithm on the feature set from step C to generate a binary classification model;

E. Apply the same processing steps to the PHP file test set, using the text prediction model and the binary classification model generated in the preceding four steps, to obtain the final predicted value.

2. In FastText, the wordNgrams parameter sets n=4 for the N-gram, and the probability distribution over the predefined classes is obtained with the softmax function, which optimizes the scoring speed in training and prediction.

3. The classification model according to claim 1, built from opcode-based classification label feature extraction and the machine learning algorithm random forest, characterized in that: static features are extracted from the PHP files; opcodes are generated with the Zend parsing engine; a text classifier model is generated with the optimized FastText and the classification labels are extracted; and the classification model is determined with random forest (RF), whose hyperparameters are tuned to train the optimal Webshell PHP code classification model.

4. The classifier model according to claim 1, obtained after optimization with the FastText algorithm, characterized in that: FastText makes full use of the classification capability of the softmax function in natural language processing and traverses all leaf nodes of the classification tree to find the label with the highest probability; the method is suited to cases where the data set is sufficiently large, and it learns quickly on supervised text classification tasks.