CN109753798A - A kind of Webshell detection model based on random forest and FastText - Google Patents
- Publication number
- CN109753798A
- Application number
- CN201811507276.3A
- Authority
- CN
- China
- Prior art keywords
- model
- fasttext
- webshell
- random forest
- php
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A Web-based remote access Trojan (Webshell) is a tool for network intrusion that an attacker uploads to a website to gain administrative access to the Web service. A successful injection can cause severe damage, so detecting Webshells effectively is essential. Obfuscation makes Webshells flexible and variable, which makes detection harder. The invention proposes a PHP Webshell detection model, called FRF-WD, that is based on the FastText algorithm and the random forest algorithm and uses PHP opcode sequences as an important feature for Webshell detection. The model achieves a relatively high detection rate and a relatively low false positive rate.
Description
Technical field
The present invention relates to a PHP Webshell detection model based on random forest and the FastText algorithm. The model extracts opcode sequences generated by the Zend engine, passes them through a text classification model to obtain a predicted label, and then uses a random forest (Random Forest) classification algorithm to detect malicious Webshell scripts written in PHP accurately and effectively.
Background art
With the development of Web applications, the Web-based remote access Trojan (Webshell) has become a tool for network intrusion: an attacker uploads it to a Web server to obtain access to and administrative control over the service. Once the attacker injects it successfully and exploits the server's vulnerabilities, the losses can be huge, so detecting Webshells effectively is vital. Through obfuscation, Webshells become flexible and variable, which increases the difficulty of detection. The invention proposes a detection model for Webshells written in PHP that combines FastText and the random forest algorithm, called FRF-WD, in which PHP opcode sequences serve as an important feature for detecting Webshells. Experimental results show that the model has a relatively high detection rate and a relatively low false positive rate, demonstrating its feasibility and effectiveness.
Webshell detection is an essential part of malicious web page detection, and many research results already exist. The two most important factors are feature extraction and the detection model. Webshell features extracted from different perspectives fall roughly into five kinds: the longest string length in the file, information entropy, index of coincidence, characteristic functions, and blacklist keywords.
An optimal-threshold identification method based on malicious functions and malicious feature samples can be used for Webshell detection, but it may misjudge legitimate files that contain a small number of malicious functions as suspicious files.
Another approach detects PHP malware by analysing the degree of similarity within a sample set using a similarity matrix. It performs similarity analysis on a PHP malware sample set with four different similarity measures: the content of the decoded sample, the extracted bodies of user-defined functions, the names of user-defined functions, and the fuzzy hash of the file.
Machine-learning-based Webshell detection models have been studied extensively, for example a Webshell detection method based on matrix decomposition. However, that method does not determine whether to classify page attributes.
There is also a method based on the support vector machine (SVM). It analyses the HTML features of web pages and compares two SVM kernel functions, the linear kernel and the Gaussian radial basis function kernel; the former achieves a higher recall rate.
Webshells can also be detected from Web logs. Because a Webshell is usually a single standalone file, this approach mainly analyses the file's access path and parameters, its access frequency, and page relevance to compare Webshells with normal Web documents. However, methods that rely on these features alone may have a very high false positive rate, which can reduce accuracy.
Beyond the methods above, some approaches combine static and dynamic techniques to reveal Webshell features, and honeypots have been used to study how attackers use Webshells.
Detection techniques have evolved from traditional pattern matching to today's flourishing machine learning, and Webshell detection is moving toward greater automation and intelligence. Detection results are now expected not only to identify known attack types accurately and exhaustively, but also to withstand various obfuscation techniques.
The main problems to be solved in Webshell feature extraction and detection are:
(1) how to extract opcode sequences, based on the Zend engine, from Webshells written as PHP files;
(2) how to select and optimize the parameters of the FastText algorithm so that it produces a text classification model suited to the opcode sequences at hand;
(3) how to construct a suitable machine learning algorithm and test its effectiveness at detecting PHP Webshells.
The present system focuses on solving these three problems and implements an opcode-based Webshell detection model.
Summary of the invention
The invention is an advanced model developed from several techniques: opcode parsing with the open-source script engine Zend, text classification model training based on the word-Ngram concept in the FastText algorithm, extraction of multiple statistics-based static features, and classification with the machine learning algorithm RF. Sample data is preprocessed, static features and opcode sequences are extracted from the Webshell code in it, and an RF classification model detects Webshells among PHP files.
The invention aims to achieve the following goals.
(1) The model takes the opcodes produced by compiling a PHP file, as dumped by the VLD extension, uses a FastText model to extract features from the opcode sequence and classify it, and thereby detects whether the input PHP file is a Webshell.
(2) The model can preprocess the collected sample code, extract the opcodes of PHP files, and run text classification on them as a preliminary detection stage. In the training stage, the static features and the preliminary detection result are also combined into a feature set that is used to train the final classification and detection model.
The model uses the open-source script engine Zend for opcode parsing and extracts the opcode sequence from the generated opcode file. It also has a feature extraction capability: it extracts multiple static statistical features from each data sample and puts them into the feature set. The RF detection model is then trained on the static features of the data samples together with the result obtained by passing the opcode sequence through the text classification model.
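The opcode extraction step can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: `vld.active` and `vld.execute` are VLD's own ini settings, but the function names and the row layout assumed by the regular expression are assumptions based on typical VLD dumps and may need adjusting for a particular VLD version.

```python
import re
import subprocess
from typing import List

# One instruction row of a VLD dump: optional source line number, opcode
# index, optional flag columns (E, >, *), then the opcode mnemonic.
# The layout is an assumption; check it against the VLD version in use.
_ROW = re.compile(
    r"^\s*(?:\d+\s+)?\d+\*?\s+(?:(?:E|>|\*)\s+)*([A-Z][A-Z0-9_]+)(?=\s|$)"
)


def parse_vld_dump(dump: str) -> List[str]:
    """Return the opcode mnemonics in the order they appear in a VLD dump."""
    opcodes = []
    for line in dump.splitlines():
        match = _ROW.match(line)
        if match:
            opcodes.append(match.group(1))
    return opcodes


def dump_opcodes(php_file: str, php_bin: str = "php") -> List[str]:
    """Compile php_file without executing it and return its opcode sequence.

    Assumes the VLD extension is installed. VLD commonly writes its dump to
    stderr, so both output streams are scanned here.
    """
    proc = subprocess.run(
        [php_bin, "-d", "vld.active=1", "-d", "vld.execute=0", php_file],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_vld_dump(proc.stderr + "\n" + proc.stdout)
```

Setting `vld.execute=0` matters for this use case: the sample is only compiled to opcodes, so a suspected Webshell is never run during feature extraction.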
To achieve the above objectives, the invention adopts the following technical scheme: the Webshell detection model based on random forest and FastText consists of three main parts, namely data preparation, feature extraction, and Webshell detection.
Data preparation covers the work needed before opcode and statistical feature extraction, including collecting positive and negative samples, filtering out duplicate sample files, and labelling the positive and negative samples.
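A minimal Python sketch of the duplicate filtering and labelling step follows. The function name, the use of SHA-256 to detect byte-identical files, and the label convention (1 for Webshell, 0 for benign) are assumptions made for illustration; the patent does not specify how duplicates are identified.

```python
import hashlib
from typing import Iterable, List, Tuple


def dedupe_and_label(
    webshell_files: Iterable[bytes], benign_files: Iterable[bytes]
) -> List[Tuple[bytes, int]]:
    """Drop byte-identical duplicates and attach labels (1 = Webshell, 0 = benign).

    If the same content appears in both groups, the first occurrence wins,
    which here means the Webshell label is kept.
    """
    seen = set()
    samples = []
    for files, label in ((webshell_files, 1), (benign_files, 0)):
        for content in files:
            digest = hashlib.sha256(content).hexdigest()
            if digest in seen:
                continue  # already collected an identical file
            seen.add(digest)
            samples.append((content, label))
    return samples
```

Exact hashing only removes verbatim copies; near-duplicates that differ by whitespace or comments would need a normalization pass or a fuzzy hash, which is outside this sketch.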
Static statistical feature extraction covers the longest string length, information entropy, index of coincidence, characteristic functions, and blacklist keywords. Opcode extraction covers generating the opcodes and extracting and saving the opcode sequence.
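The five static statistical features can be computed roughly as in the Python sketch below. The function and keyword lists are hypothetical placeholders, since the patent does not enumerate them, and treating the "longest string" as the longest whitespace-delimited token is likewise an assumption.

```python
import math
import re
from collections import Counter
from typing import Dict, Sequence

# Hypothetical example lists; the patent does not enumerate them.
FEATURE_FUNCTIONS = ("eval", "assert", "system", "exec", "shell_exec",
                     "passthru", "base64_decode")
BLACKLIST_KEYWORDS = ("c99shell", "r57shell", "b374k", "phpspy")


def longest_string_length(text: str) -> int:
    """Length of the longest whitespace-delimited token."""
    return max((len(token) for token in text.split()), default=0)


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in Counter(text).values())


def index_of_coincidence(text: str) -> float:
    """Probability that two characters drawn without replacement are equal."""
    n = len(text)
    if n < 2:
        return 0.0
    return sum(c * (c - 1) for c in Counter(text).values()) / (n * (n - 1))


def count_calls(text: str, names: Sequence[str]) -> int:
    """Count call sites such as `eval(` for each listed function name."""
    return sum(
        len(re.findall(r"\b" + re.escape(name) + r"\s*\(", text, flags=re.I))
        for name in names
    )


def count_keywords(text: str, keywords: Sequence[str]) -> int:
    """Count case-insensitive occurrences of blacklist keywords."""
    lowered = text.lower()
    return sum(lowered.count(keyword) for keyword in keywords)


def static_features(text: str) -> Dict[str, float]:
    """Collect the five static features for one PHP source file."""
    return {
        "longest_string": longest_string_length(text),
        "entropy": shannon_entropy(text),
        "index_of_coincidence": index_of_coincidence(text),
        "feature_function_hits": count_calls(text, FEATURE_FUNCTIONS),
        "blacklist_hits": count_keywords(text, BLACKLIST_KEYWORDS),
    }
```

Encoded or encrypted payloads tend to push entropy up and the index of coincidence down, while long single tokens often indicate packed data, which is why these measures are commonly used together.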
Webshell detection uses an RF algorithm composed of 100 decision trees to train the classification model and to classify samples of unknown type. In the training stage, the hyperparameter configuration of the classification model must be tuned in order to train the optimal classification model.
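As a rough illustration of this stage, the sketch below trains a 100-tree random forest with scikit-learn, which is an assumed library choice rather than one named in the patent. The feature matrix is synthetic stand-in data generated only to make the example runnable; it is not data or a result from the invention. Each row holds the five static features plus the label predicted by the text classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def train_detector(X, y, n_trees=100, seed=0):
    """Fit a random forest with n_trees decision trees on the feature set."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    clf.fit(X, y)
    return clf


def synthetic_rows(rng, n, webshell):
    """Made-up feature rows, for illustration only."""
    return np.column_stack([
        rng.normal(400 if webshell else 40, 10, n),        # longest string length
        rng.normal(5.5 if webshell else 4.5, 0.3, n),      # information entropy
        rng.normal(0.03 if webshell else 0.06, 0.01, n),   # index of coincidence
        rng.poisson(3.0 if webshell else 0.2, n),          # characteristic function hits
        rng.poisson(0.5 if webshell else 0.01, n),         # blacklist keyword hits
        rng.binomial(1, 0.95 if webshell else 0.05, n),    # text classifier label
    ])


rng = np.random.default_rng(0)
n = 200
X = np.vstack([synthetic_rows(rng, n, False), synthetic_rows(rng, n, True)])
y = np.array([0] * n + [1] * n)  # 0 = benign, 1 = Webshell

clf = train_detector(X, y)
predictions = clf.predict(X[:5])  # classify samples with the trained model
```

Hyperparameters such as tree depth or the number of features per split could be tuned with scikit-learn's `GridSearchCV`; which parameters the inventors actually tuned is not stated.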
Brief description of the drawings
Fig. 1 is a structural diagram of the model's training and detection process.
Detailed description of the embodiments
The Webshell detection model based on random forest and FastText comprises three modules: a data preparation module, a feature extraction module, and a Webshell detection module.
Data source.
Fig. 1 shows the structure of model training and detection and illustrates in detail the flow of the two stages, training and detection, of the Webshell detection model. Sample files are preprocessed and their static statistical features are extracted; VLD generates the PHP opcodes; FastText trains a text classification model on the opcodes and yields the corresponding label feature; and the static statistical features combined with the opcode label feature form the feature input of the whole detection model. In the training stage, the classification model is trained on the feature set formed by the static statistical features and the label feature. In the detection stage, the same procedure extracts the static features of the PHP file under test, the text classification model produced in the training stage classifies the file's opcodes, and the random forest detection model produced in the training stage then classifies the feature set of the file under test.
FastText has an important parameter, wordNgrams, which in natural language processing denotes word-level N-grams. Here it lets each opcode be grouped with its neighbouring opcodes into a single unit before the sequence is fed to the text classification model for training. FastText can reach performance comparable to deep-learning-based methods while running faster. The FastText model consists of two matrices, A and B: A is the word look-up table and B is the classifier matrix. The word representations are averaged into a text representation, which is then fed to a linear classifier, and the softmax function f yields the probability distribution over the predefined classes. For a set of N documents, where x_n is the normalized bag of features of the n-th document and y_n is its label, the model minimizes the negative log-likelihood:

$$-\frac{1}{N}\sum_{n=1}^{N} y_n \log\bigl(f(BAx_n)\bigr)$$
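The text classification stage described above can be illustrated with a small standard-library Python sketch. It shows how word-level N-grams group neighbouring opcodes, how a labelled opcode sequence might be written in fastText's `__label__` training format, and how the negative log-likelihood of softmax(B·A·x) is evaluated for given matrices. All function names are hypothetical, and real FastText hashes N-grams into buckets internally, which this sketch does not reproduce.

```python
import math
from typing import List, Sequence


def word_ngrams(tokens: Sequence[str], max_n: int = 4) -> List[str]:
    """All contiguous word n-grams of length 1..max_n, shortest first."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def to_fasttext_line(opcodes: Sequence[str], label: str) -> str:
    """One training line in fastText's supervised text format."""
    return "__label__" + label + " " + " ".join(opcodes)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def negative_log_likelihood(A, B, docs, labels) -> float:
    """Mean of -log f(B A x_n)[y_n] over all documents.

    A: hidden x vocabulary look-up matrix, B: classes x hidden classifier,
    docs: normalized bag-of-features vectors x_n, labels: class indices y_n.
    """
    total = 0.0
    for x, y in zip(docs, labels):
        probabilities = softmax(matvec(B, matvec(A, x)))
        total += math.log(probabilities[y])
    return -total / len(docs)
```

With the `fasttext` Python package, a file of such lines would typically be passed to `fasttext.train_supervised(input=path, wordNgrams=4)`; the remaining training settings used by the inventors are not given in the text.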
The classification model is trained on the feature set with the random forest (RF) algorithm. By tuning parameters such as the objective function, optimization function, batch-size, and epochs, the classification model that is optimal under the current environment is trained. When a new test sample is input, the trained classification model classifies it.
Claims (4)
1. A Webshell detection model based on random forest and FastText, characterized by comprising the following steps:
A. preprocessing the data and extracting five static features from the PHP file training set, as elements of a feature set that is used in a later step to train the random forest model;
B. parsing the opcode sequences of the PHP files with the Vulcan Logic Disassembler (VLD) extension, and processing the labelled opcode sequences with the FastText algorithm to generate a FastText text classifier model;
C. inputting the opcode sequences extracted in step B into the text classifier to predict their labels, and adding the labels to the feature set containing the static features from step A;
D. training a random-forest-based classification algorithm on the feature set from step C to generate a binary classification model;
E. applying the same processing steps to the PHP file test set with the text prediction model and binary classification model generated in the preceding four steps, to obtain the final predicted values.
2. In FastText, the wordNgrams parameter sets n=4 for the N-grams, the softmax function is used to obtain the probability distribution over the predefined classes, and the scoring speed in training and prediction is optimized.
3. The classification model according to claim 1, built from opcode-based classification label feature extraction and the machine learning algorithm random forest, characterized in that: static features are extracted from PHP files, including opcode generation based on the Zend parsing engine, generation of the optimized FastText-based text classifier model, and extraction of classification labels; and classification is performed with the random forest (RF) model, whose hyperparameters are tuned to train the optimal classification model for Webshell PHP code.
4. The FastText-based classifier model after algorithm optimization according to claim 1, characterized in that: FastText makes full use of the classification capability of the softmax function in natural language processing, traversing all leaf nodes of the classification tree to find the label with the highest probability; the method suits cases where the data set is sufficiently large, and learns relatively quickly on supervised text classification tasks.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811507276.3A CN109753798A (en) | 2018-12-11 | 2018-12-11 | A kind of Webshell detection model based on random forest and FastText |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811507276.3A CN109753798A (en) | 2018-12-11 | 2018-12-11 | A kind of Webshell detection model based on random forest and FastText |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109753798A true CN109753798A (en) | 2019-05-14 |
Family
ID=66403665
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811507276.3A Pending CN109753798A (en) | 2018-12-11 | 2018-12-11 | A kind of Webshell detection model based on random forest and FastText |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109753798A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210225A (en) * | 2019-05-27 | 2019-09-06 | 四川大学 | A kind of intelligentized Docker container malicious file detection method and device |
CN112367336A (en) * | 2020-11-26 | 2021-02-12 | 杭州安恒信息技术股份有限公司 | Webshell interception detection method, device, equipment and readable storage medium |
CN113051559A (en) * | 2021-03-22 | 2021-06-29 | 山西三友和智慧信息技术股份有限公司 | Edge device web attack detection system and method based on distributed deep learning |
Non-Patent Citations (1)
Title |
---|
Fang Yong (方勇) et al.: "《PROCEEDINGS OF 2018 INTERNATIONAL CONFERENCE ON COMPUTING AND ARTIFICIAL INTELLIGENCE (ICCAI 2018)》", 14 March 2018 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110351301B (en) | HTTP request double-layer progressive anomaly detection method | |
CN109547423B (en) | WEB malicious request deep detection system and method based on machine learning | |
CN106357618B (en) | Web anomaly detection method and device | |
CN109190372B (en) | JavaScript malicious code detection method based on bytecode | |
CN109784056B (en) | Malicious software detection method based on deep learning | |
CN108875366A (en) | A kind of SQL injection behavioral value system towards PHP program | |
CN102411563A (en) | Method, device and system for identifying target words | |
CN112307473A (en) | Malicious JavaScript code detection model based on Bi-LSTM network and attention mechanism | |
CN103577755A (en) | Malicious script static detection method based on SVM (support vector machine) | |
CN109753798A (en) | A kind of Webshell detection model based on random forest and FastText | |
CN117081858B (en) | Intrusion behavior detection method, system, equipment and medium based on multi-decision tree | |
CN110191096A (en) | A kind of term vector homepage invasion detection method based on semantic analysis | |
CN107239694A (en) | A kind of Android application permissions inference method and device based on user comment | |
CN115361176B (en) | SQL injection attack detection method based on FlexUDA model | |
Mimura et al. | Using LSI to detect unknown malicious VBA macros | |
CN114398891B (en) | Method for generating KPI curve and marking wave band characteristics based on log keywords | |
CN115277180A (en) | Block chain log anomaly detection and tracing system | |
CN115758183A (en) | Training method and device for log anomaly detection model | |
CN108647497A (en) | A kind of API key automatic recognition systems of feature based extraction | |
CN112257425A (en) | Power data analysis method and system based on data classification model | |
CN112257076A (en) | Vulnerability detection method based on random detection algorithm and information aggregation | |
CN109918638B (en) | Network data monitoring method | |
CN108717637B (en) | Automatic mining method and system for E-commerce safety related entities | |
CN113657443B (en) | On-line Internet of things equipment identification method based on SOINN network | |
CN113688240A (en) | Threat element extraction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190514 |