CN109753798A - A kind of Webshell detection model based on random forest and FastText - Google Patents

Info

Publication number
CN109753798A
Authority
CN
China
Prior art keywords
model
fasttext
webshell
random forest
php
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811507276.3A
Other languages
Chinese (zh)
Inventor
方勇
黄诚
张磊
邱瑶瑶
苏瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201811507276.3A
Publication of CN109753798A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A Web-based remote access Trojan (Webshell) is a tool for network intrusion that an attacker uploads to a website to gain administrative access to the Web service. A successful injection can cause serious damage, so detecting Webshells effectively is essential. Obfuscation techniques make Webshells flexible and variable, which makes detection harder. The invention proposes a PHP Webshell detection model, called FRF-WD, that is based on the FastText algorithm and the random forest algorithm and uses PHP opcode sequences as an important feature for Webshell detection. The model has a high detection rate and a low false alarm rate.

Description

A Webshell detection model based on random forest and FastText
Technical field
The present invention concerns a PHP Webshell detection model based on random forest and the FastText algorithm. The model extracts opcode sequences produced by the Zend engine, passes them through a text classification model to obtain a label, and then uses a random forest (Random Forest) classification algorithm to detect malicious Webshell scripts written in PHP accurately and effectively.
Background
With the development of Web applications, the Web-based remote access Trojan (Webshell) has become a tool for network intrusion: an attacker uploads it to a Web server to obtain permission to access and manage the service. Once an attacker succeeds in injecting one by exploiting server vulnerabilities, the losses can be severe, so detecting Webshells effectively is vital. Through obfuscation techniques, Webshells become flexible and variable, which increases the difficulty of detection. This document proposes a Webshell detection model for the PHP language that combines FastText with the random forest algorithm, referred to as FRF-WD, with PHP opcode sequences as an important feature for detecting Webshells. Experimental results show that the model has a high detection rate and a low false alarm rate, demonstrating its feasibility and effectiveness.
Webshell detection is an essential part of malicious web page detection, and there is already a substantial body of research. The two most important factors are feature extraction and the detection model. Viewed from different angles, there are roughly five kinds of Webshell features: the length of the longest string in the file, information entropy, index of coincidence, feature functions, and blacklist keywords.
An optimal-threshold identification method based on malicious functions and malicious feature samples can be used for Webshell detection, but it may misjudge legitimate files that contain a small number of malicious functions as suspicious files.
Another approach detects PHP malware by analyzing the similarity of a sample set with a similarity matrix. It applies four different similarity measures to the PHP malware sample set: the content of the decoded samples, the bodies of the extracted user-defined functions, the names of the user-defined functions, and fuzzy hashes of the files.
Machine-learning-based Webshell detection models have been studied extensively, for example a Webshell detection method based on matrix decomposition. However, that method does not determine whether to classify page attributes.
Another method is based on the support vector machine (SVM). It analyzes the HTML features of web pages and compares two SVM kernel functions, the linear kernel and the Gaussian radial basis function kernel; the former has the higher recall.
Webshells can also be detected from Web logs. Because a Webshell is usually a single standalone file, this approach mainly analyzes the file's access path and parameters, its access frequency, and its page relevance, and compares the differences between Webshells and normal Web files. However, a method that relies on these features alone may have a very high false positive rate, which can lower accuracy.
Beyond the methods above, there are approaches that combine static and dynamic techniques to reveal Webshell features, and honeypots have been used to study how attackers make use of Webshells.
Detection techniques have moved from traditional pattern matching to today's widely used machine learning, and Webshell detection is becoming more automated and intelligent. Detection results are expected not only to identify known attack types accurately and exhaustively, but also to withstand a variety of obfuscation techniques.
Feature extraction and detection for Webshells must mainly solve the following problems.
(1) How to extract opcode sequences, based on the Zend engine, from Webshells written as PHP files.
(2) How to select and optimize the parameters of the FastText algorithm so that it produces a text classification model suited to the opcode sequences at hand.
(3) How to build a suitable machine learning algorithm and test its detection performance on PHP Webshells.
This system focuses on these three problems and implements an opcode-based Webshell detection model.
Summary of the invention
The invention is an advanced model that brings together several techniques: opcode parsing with the open-source script engine Zend, text classification model training based on the word n-gram concept in the FastText algorithm, extraction of several statistics-based static features, and classification with the machine learning algorithm RF. Sample data are preprocessed, static features and opcode sequences are extracted from the Webshell code they contain, and an RF classification model is used to detect Webshells among PHP files.
The invention pursues the following goals.
(1) The model takes the opcodes generated by compiling a PHP file with the VLD extension, uses a FastText model to extract features from the opcode sequence and classify it, and detects whether the input PHP file is a Webshell.
(2) The model can preprocess the collected sample code, extract the opcodes of PHP files, and classify them as text; this is the preliminary detection stage. In the training stage, the static features and the preliminary detection result are combined into a set that is used to train the final classification and detection model.
The model uses the open-source script engine Zend to parse opcodes and extracts the opcode sequence from the generated opcode file. It can also extract several static statistical features from the data samples and place them in a feature set. The RF detection model is then trained on the static features of the samples together with the result of passing the opcode sequence through the text classification model.
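As a rough illustration of the opcode extraction step, the sketch below parses VLD-style text output into an opcode sequence. The regular expression, the function name `extract_opcodes`, and the sample dump are assumptions made for this example and are not taken from the patent. Real VLD output differs between PHP versions; it would normally be captured by running PHP with the VLD extension enabled (for example with `-d vld.active=1 -d vld.execute=0`).

```python
import re

# Matches one VLD opcode row: optional source line number, opcode index,
# optional entry/exit markers ("E", ">"), then the opcode mnemonic.
# The pattern is an assumption based on typical VLD dumps.
_OPCODE_ROW = re.compile(
    r"^\s*(?:\d+\s+)?\d+\s+(?:E\s+)?(?:>\s+)*([A-Z][A-Z0-9_]+)\b"
)


def extract_opcodes(vld_dump: str) -> list[str]:
    """Return the opcode mnemonics found in a VLD text dump, in order."""
    opcodes = []
    for line in vld_dump.splitlines():
        match = _OPCODE_ROW.match(line)
        if match:
            opcodes.append(match.group(1))
    return opcodes


# Hypothetical dump for a one-line script that evaluates a POST parameter.
SAMPLE_DUMP = """\
line     #* E I O op                           fetch          ext  return  operands
-------------------------------------------------------------------------------------
   1     0  E >   FETCH_R                      global              ~0      '_POST'
         1        FETCH_DIM_R                                      ~1      ~0, 'cmd'
         2        INCLUDE_OR_EVAL                                          ~1, EVAL
         3      > RETURN                                                   1
"""

if __name__ == "__main__":
    # The space-separated sequence is the "text" later fed to the classifier.
    print(" ".join(extract_opcodes(SAMPLE_DUMP)))
```

Only the mnemonic column is kept; operands such as variable names are dropped, which is one plausible way to make the sequence less sensitive to identifier obfuscation.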
To achieve the above goals, the invention adopts the following technical scheme: the Webshell detection model based on random forest and FastText consists of three main parts, namely data preparation, feature extraction, and Webshell detection.
Data preparation covers the work needed before opcode and statistical feature extraction, including collecting positive and negative samples, filtering out duplicate sample files, and labeling positive and negative samples.
Static statistical feature extraction covers the longest string length, information entropy, index of coincidence, feature functions, and blacklist keywords. Opcode extraction covers generating the opcodes and extracting and saving the opcode sequence.
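The five static statistical features can be sketched as follows. This is a minimal illustration under stated assumptions: the patent does not give formulas or keyword lists, so the definitions below (whitespace-delimited tokens for the longest string, character-level Shannon entropy, the usual index-of-coincidence formula) and the contents of `FEATURE_FUNCTIONS` and `BLACKLIST_KEYWORDS` are hypothetical choices.

```python
import math
import re
from collections import Counter

# Hypothetical keyword lists; the source does not enumerate them.
FEATURE_FUNCTIONS = ("eval", "assert", "system", "exec", "base64_decode")
BLACKLIST_KEYWORDS = ("c99shell", "r57shell", "phpspy")


def longest_string_length(text: str) -> int:
    """Length of the longest whitespace-delimited token."""
    return max((len(tok) for tok in text.split()), default=0)


def information_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in Counter(text).values())


def index_of_coincidence(text: str) -> float:
    """Probability that two characters drawn without replacement are equal."""
    total = len(text)
    if total < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(text).values()) / (total * (total - 1))


def count_occurrences(text: str, words) -> int:
    """Count whole-word, case-insensitive occurrences of the given words."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def static_features(php_source: str) -> list[float]:
    """Five-element static feature vector for one PHP file."""
    return [
        float(longest_string_length(php_source)),
        information_entropy(php_source),
        index_of_coincidence(php_source),
        float(count_occurrences(php_source, FEATURE_FUNCTIONS)),
        float(count_occurrences(php_source, BLACKLIST_KEYWORDS)),
    ]
```

Encoded or packed payloads tend to produce very long tokens and a flatter character distribution, which is why length, entropy, and index of coincidence are commonly used together.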
Webshell detection uses an RF algorithm made up of 100 decision trees to train the classification model and to classify samples of unknown type. In the training stage, the hyperparameter configuration of the classification model must be tuned to obtain the best classification model.
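A random forest of 100 trees could be trained along the following lines. The sketch assumes scikit-learn, which the patent does not name, and every number in the toy feature rows is invented for illustration; only the tree count of 100 comes from the text.

```python
from sklearn.ensemble import RandomForestClassifier

# Toy feature rows: [longest string length, entropy, index of coincidence,
# feature-function count, blacklist-keyword count, FastText label (1 = Webshell)].
# All values are invented for illustration.
X_train = [
    [40, 4.2, 0.070, 0, 0, 0],
    [35, 4.0, 0.072, 0, 0, 0],
    [52, 4.4, 0.068, 0, 0, 0],
    [47, 4.1, 0.071, 0, 0, 0],
    [900, 5.8, 0.030, 3, 1, 1],
    [1200, 5.9, 0.028, 4, 2, 1],
    [750, 5.6, 0.033, 2, 1, 1],
    [1500, 6.0, 0.025, 5, 1, 1],
]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = benign, 1 = Webshell

# 100 trees, as stated in the text; random_state only makes the sketch repeatable.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Classify two unseen feature rows: one benign-looking, one Webshell-looking.
X_new = [[45, 4.1, 0.070, 0, 0, 0], [1000, 5.8, 0.029, 3, 1, 1]]
predictions = model.predict(X_new).tolist()
```

In practice the hyperparameters mentioned in the text (for scikit-learn, values such as `max_depth` or `max_features`) would be tuned on held-out data rather than left at their defaults.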
Description of the drawings
Fig. 1 is the model training and detection structure diagram of the invention.
Detailed description of embodiments
The Webshell detection model based on random forest and FastText comprises three modules: a data preparation module, a feature extraction module, and a Webshell detection module.
Data source.
Fig. 1 shows the model training and detection structure in detail, covering the two stages of the Webshell detection model: training and detection. Sample files are preprocessed and their static statistical features are extracted; VLD generates the PHP opcodes; FastText trains a text classification model on the opcodes and yields the corresponding label feature. The static statistical features and the opcode label feature are combined as the feature input of the whole detection model. In the training stage, a classification model is trained on the feature set formed by the static statistical features and the label feature. In the detection stage, the same procedure extracts the static features of the PHP file under test, the text classification model produced in the training stage classifies the opcodes of that file, and the random forest detection model produced in the training stage then classifies the file's feature set.
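The two-stage structure described above, in which the text classifier's label becomes one more feature for the random forest, can be sketched as a small assembly function. The function name `build_feature_vector` and the three stand-in callables are hypothetical; in a real system they would be replaced by the static feature extractor, the VLD-based opcode extractor, and the trained FastText model.

```python
from typing import Callable, Sequence


def build_feature_vector(
    php_source: str,
    static_extractor: Callable[[str], Sequence[float]],
    opcode_extractor: Callable[[str], Sequence[str]],
    opcode_labeler: Callable[[Sequence[str]], int],
) -> list[float]:
    """Static features followed by the label predicted from the opcodes."""
    features = list(static_extractor(php_source))
    opcodes = opcode_extractor(php_source)
    features.append(float(opcode_labeler(opcodes)))
    return features


# Stand-ins for illustration only; they do not reflect the real components.
def demo_static(src: str) -> list[float]:
    return [float(len(src)), float(src.count("eval"))]


def demo_opcodes(src: str) -> list[str]:
    return ["INCLUDE_OR_EVAL"] if "eval" in src else ["ECHO"]


def demo_labeler(opcodes: Sequence[str]) -> int:
    return 1 if "INCLUDE_OR_EVAL" in opcodes else 0


suspicious = build_feature_vector(
    "<?php eval($x);", demo_static, demo_opcodes, demo_labeler
)
benign = build_feature_vector(
    "<?php echo 1;", demo_static, demo_opcodes, demo_labeler
)
```

The same function serves both stages: during training its output rows form the feature set, and during detection it produces the single row passed to the trained forest.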
FastText has an important parameter, wordNgrams, which in natural language processing sets the word-level n-gram. Here it lets each opcode be grouped with its neighboring opcodes into a sequence that is fed to the text classification model as a whole for training. This achieves performance comparable to methods based on deep learning, but FastText is faster. The FastText model consists of two matrices, A and B: matrix A is a look-up table over the words of the text, and matrix B is used by the classifier. The word representations are averaged into a text representation, which is in turn fed to a linear classifier. The softmax function f gives the probability distribution over the predefined classes. For a set of N documents, where the n-th document is x_n and its label is y_n, the standard FastText objective, which matches these definitions, minimizes the negative log-likelihood over the classes:

$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\bigl(f(BAx_n)\bigr)$$
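To make the word-level n-gram idea concrete, the sketch below builds 4-grams over an opcode sequence and formats a training line with the `__label__` prefix that fastText uses by default for supervised data. The helper names and the sample sequence are assumptions; the value n = 4 is the one given in claim 2.

```python
def word_ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Contiguous word-level n-grams of exactly length n."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def to_fasttext_line(label: str, opcodes: list[str]) -> str:
    """One supervised training line: label prefix, then the opcode 'words'."""
    return f"__label__{label} " + " ".join(opcodes)


# Hypothetical opcode sequence for a short script.
opcodes = ["FETCH_R", "FETCH_DIM_R", "INCLUDE_OR_EVAL", "RETURN", "ECHO"]

four_grams = word_ngrams(opcodes, 4)
training_line = to_fasttext_line("webshell", opcodes)

# A file of such lines could then be passed to the fasttext Python package,
# e.g. fasttext.train_supervised(input="train.txt", wordNgrams=4), assuming
# that package is installed; fastText builds the n-grams internally.
```

The explicit `word_ngrams` helper is only there to show what the parameter means; when training with fastText itself, the plain space-separated opcode line is sufficient.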
In the classification model, the feature set is trained with the random forest (RF) algorithm. By tuning parameters such as the objective function, the optimization function, batch-size, and epochs, the best classification model for the current environment is obtained. When a new test sample is input, the trained classification model classifies it.

Claims (4)

1. A Webshell detection model based on random forest and FastText, characterized by the following steps:
A. Preprocess the data and extract five static features from the PHP file training set; as elements of the feature set, they are used in subsequent steps to train the random forest model;
B. Parse the opcode sequences of the PHP files with the Vulcan Logic Disassembler (VLD) extension, and process the labeled opcode sequences with the FastText algorithm to generate a FastText text classifier model;
C. Input the opcode sequences extracted in step B into the text classifier to predict their labels, and add the labels to the feature set containing the static features from step A;
D. Train on the feature set from step C with the random-forest-based classification algorithm to generate a binary classification model;
E. Apply the same steps to the PHP file test set using the text prediction model and the binary classification model generated in the preceding four steps, to obtain the final prediction.
2. The wordNgrams parameter in FastText sets n = 4 for the N-gram, and the softmax function is used to obtain the probability distribution over the predefined classes, which optimizes scoring speed in training and prediction.
3. The classification model according to claim 1, built from opcode-based classification label feature extraction and the machine learning algorithm random forest, characterized by: static feature extraction from PHP files; opcode generation based on the Zend parsing engine; generation of the optimized FastText text classifier model and extraction of classification labels; and determination of the classification model based on random forest (RF), in which the model hyperparameters are tuned to train the best Webshell PHP code classification model.
4. The classifier model according to claim 1, optimized with the FastText-based algorithm, characterized in that: FastText makes full use of the classification capability of the softmax function in natural language processing and traverses all leaf nodes of the classification tree to find the label with the highest probability; the method suits cases where the data set is sufficiently large, and it learns comparatively fast on supervised text classification tasks.
CN201811507276.3A 2018-12-11 2018-12-11 A kind of Webshell detection model based on random forest and FastText Pending CN109753798A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811507276.3A CN109753798A (en) 2018-12-11 2018-12-11 A kind of Webshell detection model based on random forest and FastText

Publications (1)

Publication Number Publication Date
CN109753798A true CN109753798A (en) 2019-05-14

Family

ID=66403665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811507276.3A Pending CN109753798A (en) 2018-12-11 2018-12-11 A kind of Webshell detection model based on random forest and FastText

Country Status (1)

Country Link
CN (1) CN109753798A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210225A (en) * 2019-05-27 2019-09-06 四川大学 A kind of intelligentized Docker container malicious file detection method and device
CN112367336A (en) * 2020-11-26 2021-02-12 杭州安恒信息技术股份有限公司 Webshell interception detection method, device, equipment and readable storage medium
CN113051559A (en) * 2021-03-22 2021-06-29 山西三友和智慧信息技术股份有限公司 Edge device web attack detection system and method based on distributed deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
方勇等: "《PROCEEDINGS OF 2018 INTERNATIONAL CONFERENCE ON COMPUTING AND ARTIFICIAL INTELLIGENCE (ICCAI 2018)》", 14 March 2018 *

Similar Documents

Publication Publication Date Title
CN110351301B (en) HTTP request double-layer progressive anomaly detection method
CN109547423B (en) WEB malicious request deep detection system and method based on machine learning
CN106357618B (en) Web anomaly detection method and device
CN109190372B (en) JavaScript malicious code detection method based on bytecode
CN109784056B (en) Malicious software detection method based on deep learning
CN108875366A (en) A kind of SQL injection behavioral value system towards PHP program
CN102411563A (en) Method, device and system for identifying target words
CN112307473A (en) Malicious JavaScript code detection model based on Bi-LSTM network and attention mechanism
CN103577755A (en) Malicious script static detection method based on SVM (support vector machine)
CN109753798A (en) A kind of Webshell detection model based on random forest and FastText
CN117081858B (en) Intrusion behavior detection method, system, equipment and medium based on multi-decision tree
CN110191096A (en) A kind of term vector homepage invasion detection method based on semantic analysis
CN107239694A (en) A kind of Android application permissions inference method and device based on user comment
CN115361176B (en) SQL injection attack detection method based on FlexUDA model
Mimura et al. Using LSI to detect unknown malicious VBA macros
CN114398891B (en) Method for generating KPI curve and marking wave band characteristics based on log keywords
CN115277180A (en) Block chain log anomaly detection and tracing system
CN115758183A (en) Training method and device for log anomaly detection model
CN108647497A (en) A kind of API key automatic recognition systems of feature based extraction
CN112257425A (en) Power data analysis method and system based on data classification model
CN112257076A (en) Vulnerability detection method based on random detection algorithm and information aggregation
CN109918638B (en) Network data monitoring method
CN108717637B (en) Automatic mining method and system for E-commerce safety related entities
CN113657443B (en) On-line Internet of things equipment identification method based on SOINN network
CN113688240A (en) Threat element extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190514