CN107888616A - Method for constructing a URI-based classification model and method for detecting Webshell attacks on websites - Google Patents
- Publication number
- CN107888616A (CN201711276201.4A)
- Authority
- CN
- China
- Prior art keywords
- uri
- access
- data
- website
- webshell
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computer And Data Communications (AREA)
Abstract
The invention discloses a method for constructing a URI-based classification model, executed in a computing device, comprising: obtaining a plurality of access logs confirmed to be normal website accesses as positive sample data and a plurality of access logs confirmed to be Webshell attacks on websites as negative sample data, wherein each access log includes the URI of the requested resource and access data associated with that URI; extracting, from the positive sample data and the negative sample data respectively, the access logs directed at the same URI, calculating a plurality of URI feature values for that URI from the access data of those access logs, and constructing the URI feature values into a URI feature vector; generating a first positive sample set and a first negative sample set from the URI feature vector of each URI in the positive and negative sample data together with its corresponding positive or negative sample identifier, and generating a first training set from the two sample sets; and, taking the URI feature vector of each sample in the first training set as input and its sample identifier as output, training on the first training set with a predetermined algorithm to obtain the URI-based classification model.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method for constructing a URI-based classification model, a method for detecting Webshell attacks on websites, and a computing device.
Background
A Webshell is a command execution environment that exists in the form of a web page file such as asp, php, jsp, or cgi, and may also be called a web page backdoor. After compromising a website, an intruder often places a Webshell backdoor file in the web directory of the web server, mixed in with the normal files in that directory, where it is not easily discovered. The intruder can then access the Webshell over the web to obtain a command execution environment and thereby control the website or the web server. The operations that can be carried out include uploading and downloading files, viewing databases, and executing arbitrary program commands.
The data exchanged with the remote host is transmitted through port 80 and is therefore not intercepted by the firewall. Moreover, using a Webshell generally leaves no record in the system log; only some data submission records are left in the web server log, so an inexperienced administrator will find it difficult to spot traces of the intrusion.
Most existing WebShell detection methods for access logs rely on rules and a feature library. For example, publicly available WebShells are collected from the network and their features analyzed, or some sensitive functions are added, to build a WebShell feature library; these features or sensitive functions are then matched against the website's access logs, and any match is manually confirmed as a WebShell or not. Such detection depends on the accumulation of existing WebShell attacks and can only detect known attacks; unknown WebShells are difficult to find.
Accordingly, it is desirable to provide a more effective and comprehensive WebShell detection method.
Summary of the invention
To this end, the present invention provides a method for constructing a URI-based classification model, a method for detecting Webshell attacks on websites, and a computing device, in an effort to solve or at least alleviate the problems described above.
According to one aspect of the invention, there is provided a method for constructing a URI-based classification model, executed in a computing device and adapted to distinguish URIs of normal website accesses from URIs suspected of Webshell attacks on a website. The method includes: obtaining a plurality of access logs confirmed to be normal website accesses as positive sample data, and a plurality of access logs confirmed to be Webshell attacks on websites as negative sample data, wherein each access log includes the URI of the requested resource and access data associated with that URI; extracting, from the positive sample data and the negative sample data respectively, the access logs directed at the same URI, calculating a plurality of URI feature values for that URI from the access data of those access logs, and constructing the URI feature values into a URI feature vector; generating a first positive sample set from the URI feature vector of each URI in the positive sample data and its corresponding positive sample identifier, and generating a first negative sample set from the URI feature vector of each URI in the negative sample data and its corresponding negative sample identifier; and generating a first training set from the first positive sample set and the first negative sample set and, taking the URI feature vector of each sample in the first training set as input and its sample identifier as output, training on the first training set with a predetermined algorithm to obtain the URI-based classification model.
Optionally, in the method for constructing a URI-based classification model according to the invention, the access data of an access log includes one or more of the following parameters: the IP of the requesting user, the request method, the status code returned for the request, the CDN hit status, the attack type detected by the firewall, the request parameters, the request start time, and the request message length.
Optionally, in the method for constructing a URI-based classification model according to the invention, the URI feature values include one or more of the following: the number of client IPs accessing the URI, the total number of accesses to the URI, the proportion of accesses to the URI that return a failure, the proportion of requests for the URI that are intercepted by the WAF, whether accesses to the URI hit the CDN, and the number of request parameter changes in accesses to the URI.
Optionally, in the method for constructing a URI-based classification model according to the invention, the number of client IPs accessing the URI is calculated from the IP of the requesting user; the total number of accesses to the URI is calculated from the status code returned for the request or the attack type detected by the firewall; the proportion of accesses that return a failure is calculated from the status code returned for the request; the proportion of requests intercepted by the firewall is calculated from the attack type detected by the firewall; whether accesses to the URI hit the CDN is determined from the CDN hit status; and the number of request parameter changes is calculated from the request parameters.
Optionally, in the method for constructing a URI-based classification model according to the invention, the step of calculating the URI feature values of the URI from the access data of the access logs includes: converting the positive sample data and the negative sample data into data frames according to the meaning of each field; and aggregating each data frame by URI to obtain a data column for each item of access data, and extracting the corresponding URI feature value from each data column; wherein the data columns for the IP of the requesting user, the CDN hit status, and the request parameters are generated with the collect_set method, and the data columns for the status code returned for the request and the attack type detected by the firewall are generated with the collect_list method.
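The difference between the two aggregation methods can be sketched in plain Python. The names collect_set and collect_list match Spark SQL aggregate functions, but the source does not name a framework, so the sketch below only emulates their semantics; the record layout and the field names (taken from Table 1) are assumptions made for illustration.

```python
from collections import defaultdict

# Columns aggregated with set semantics (duplicates collapse) versus list
# semantics (one entry per log line). Field names are assumed from Table 1.
SET_FIELDS = ("client_ip", "hit_status", "query")
LIST_FIELDS = ("resp_code", "attack_type")


def aggregate_by_uri(records):
    """Group log records by URI into one row of set/list columns per URI."""
    rows = defaultdict(lambda: {
        **{f: set() for f in SET_FIELDS},
        **{f: [] for f in LIST_FIELDS},
    })
    for rec in records:
        row = rows[rec["uri"]]
        for f in SET_FIELDS:
            row[f].add(rec[f])       # like collect_set: repeated values collapse
        for f in LIST_FIELDS:
            row[f].append(rec[f])    # like collect_list: duplicates are kept
    return dict(rows)


# Hypothetical log records for a single URI.
logs = [
    {"uri": "/a.php", "client_ip": "1.1.1.1", "hit_status": "MISS",
     "query": "x=1", "resp_code": 200, "attack_type": ""},
    {"uri": "/a.php", "client_ip": "1.1.1.1", "hit_status": "MISS",
     "query": "x=2", "resp_code": 404, "attack_type": "sqli"},
]
agg = aggregate_by_uri(logs)
```

Set columns suit features that count distinct values (client IPs, distinct request parameters), while list columns keep every occurrence so that totals and ratios can be computed.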
Optionally, the method for constructing a URI-based classification model according to the invention further includes the steps of: generating a first validation set from the first positive sample set and the first negative sample set; inputting the URI feature vector of each sample in the first validation set into the URI-based classification model to predict the sample identifier of each sample; and comparing the predicted sample identifier of each sample with its actual sample identifier to calculate the accuracy of the URI-based classification model.
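The accuracy calculation in this step amounts to the share of validation samples whose predicted identifier equals the actual one. A minimal sketch, assuming identifiers are encoded as 1 (positive) and 0 (negative), which the source does not specify:

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted identifier equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("validation set is empty")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Assumed encoding: 1 = positive (normal access), 0 = negative (suspected Webshell).
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```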
Optionally, in the method for constructing a URI-based classification model according to the invention, the first training set and the first validation set are generated as follows: the first positive sample set and the first negative sample set are each randomly divided into two groups; one group of the first positive sample set and one group of the first negative sample set are merged to form the first training set, and the other group of the first positive sample set and the other group of the first negative sample set are merged to form the first validation set.
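The split described above can be sketched as follows. The source only says the sets are randomly divided into two groups; the equal halves, the fixed seed, and the (feature vector, identifier) tuple layout are assumptions of this sketch.

```python
import random


def split_in_two(samples, rng):
    """Shuffle a copy of the samples and cut it into two groups."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2  # assumption: groups of (nearly) equal size
    return shuffled[:half], shuffled[half:]


def make_train_and_validation(positive_set, negative_set, seed=0):
    """Merge one group of each set into training, the other into validation."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    pos_a, pos_b = split_in_two(positive_set, rng)
    neg_a, neg_b = split_in_two(negative_set, rng)
    return pos_a + neg_a, pos_b + neg_b


# Hypothetical samples: (URI feature vector, sample identifier).
positives = [((10.0, 0.1), 1), ((12.0, 0.0), 1), ((9.0, 0.2), 1), ((11.0, 0.1), 1)]
negatives = [((1.0, 0.8), 0), ((2.0, 0.9), 0)]
train, validation = make_train_and_validation(positives, negatives)
```

Splitting each class separately before merging keeps both positive and negative samples in the training set and in the validation set.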
According to another aspect of the invention, there is provided a method for detecting Webshell attacks on websites, adapted to be executed in a computing device in which the URI-based classification model described above is stored. The method includes: obtaining a plurality of access logs to be confirmed within a predetermined period, wherein each access log includes the URI of the requested resource and access data associated with that URI; extracting, from the access logs to be confirmed, the access logs directed at the same URI, calculating a plurality of URI feature values for that URI from the access data of those access logs, and constructing the URI feature values into a URI feature vector; and inputting each URI feature vector of the access logs to be confirmed into the URI-based classification model, obtaining the URIs whose URI feature vectors yield a negative sample identifier as output, and marking them as URIs suspected of Webshell attacks on the website.
Optionally, in the method for detecting Webshell attacks on websites according to the invention, an access-sequence-based classification model is also stored in the computing device. This model is adapted to distinguish, among the URIs suspected of Webshell attacks on the website, URIs of scanner accesses to the website from URIs of Webshell attacks on the website, and is built as follows: obtaining a plurality of access logs confirmed to be normal website accesses as positive sample data, and a plurality of access logs confirmed to be Webshell attacks on websites as negative sample data; extracting, from the positive sample data and the negative sample data respectively, the access logs directed at the same URI, calculating a plurality of access sequence feature values for that URI from the access data of those access logs, and constructing the access sequence feature values into an access sequence feature vector; generating a second positive sample set from the access sequence feature vector of each URI in the positive sample data and its corresponding positive sample identifier, and generating a second negative sample set from the access sequence feature vector of each URI in the negative sample data and its corresponding negative sample identifier; and generating a second training set from the second positive sample set and the second negative sample set and, taking the access sequence feature vector of each sample in the second training set as input and its sample identifier as output, training on the second training set with a predetermined algorithm to obtain the access-sequence-based classification model.
Optionally, the method for detecting Webshell attacks on websites according to the invention further includes the steps of: obtaining the original logs corresponding to the URIs suspected of Webshell attacks on the website; extracting, from these original logs, the access logs directed at the same URI, calculating a plurality of access sequence feature values for that URI from the access data of those access logs, and constructing the access sequence feature values into an access sequence feature vector; and inputting each access sequence feature vector of the original logs into the access-sequence-based classification model, obtaining the URIs whose access sequence feature vectors yield a negative sample identifier as output, and marking them as URIs of Webshell attacks on the website.
Optionally, in the method for detecting Webshell attacks on websites according to the invention, the access sequence feature values include one or more of the following: the GET/POST request ratio, the proportion of successful requests, the mean access time interval, the variance of the access time interval, the mean request message length, and the variance of the request message length.
Optionally, in the method for detecting Webshell attacks on websites according to the invention, the GET/POST request ratio is calculated from the request method; the proportion of successful requests is calculated from the status code returned for the request; the mean and variance of the access time interval are determined from the request start time; and the mean and variance of the request message length are determined from the request message length.
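The six access sequence feature values can be sketched with the Python standard library. Several details are assumptions not fixed by the source: the GET/POST ratio is taken as the share of GET requests among GET and POST requests, a status code from 200 to 399 counts as success, start_time is a numeric timestamp in seconds, and population variance is used.

```python
from statistics import mean, pvariance


def access_sequence_features(records):
    """Build the six access sequence feature values for one URI's log records."""
    if not records:
        raise ValueError("at least one access log is required")
    gets = sum(1 for r in records if r["method"] == "GET")
    posts = sum(1 for r in records if r["method"] == "POST")
    # Assumption: ratio expressed as GET share of GET+POST to avoid division by zero.
    get_share = gets / (gets + posts) if gets + posts else 0.0
    # Assumption: 2xx and 3xx status codes count as successful requests.
    success_share = sum(1 for r in records if 200 <= r["resp_code"] < 400) / len(records)
    times = sorted(r["start_time"] for r in records)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    sizes = [r["req_bytes"] for r in records]
    return [
        get_share,
        success_share,
        mean(gaps) if gaps else 0.0,       # mean access time interval
        pvariance(gaps) if gaps else 0.0,  # variance of the access time interval
        mean(sizes),                       # mean request message length
        pvariance(sizes),                  # variance of the request message length
    ]


# Hypothetical log records for a single URI.
seq = access_sequence_features([
    {"method": "POST", "resp_code": 200, "start_time": 0, "req_bytes": 100},
    {"method": "POST", "resp_code": 200, "start_time": 10, "req_bytes": 300},
    {"method": "GET", "resp_code": 404, "start_time": 30, "req_bytes": 200},
])
```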
Optionally, the method for detecting Webshell attacks on websites according to the invention further includes a step of preprocessing the access logs to be confirmed: filtering out, from the access logs, those corresponding to static paths, whitelisted paths, and paths with non-Webshell suffixes.
Optionally, in the method for detecting Webshell attacks on websites according to the invention, in the output of the URI-based classification model, a negative sample identifier indicates that the URI corresponding to the sample is a URI suspected of a Webshell attack on the website, and a positive sample identifier indicates that the URI corresponding to the sample is a URI of normal website access; in the output of the access-sequence-based classification model, a negative sample identifier indicates that the URI corresponding to the sample is a URI of a Webshell attack on the website, and a positive sample identifier indicates that the URI corresponding to the sample is a URI of scanner access to the website.
According to another aspect of the invention, there is provided a computing device, including: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing the method for constructing a URI-based classification model and the method for detecting Webshell attacks on websites described above.
According to a further aspect of the invention, there is provided a computer-readable storage medium storing one or more programs, the one or more programs including instructions that, when executed by a computing device, cause the computing device to perform the method for constructing a URI-based classification model and the method for detecting Webshell attacks on websites described above.
According to the technical solution of the invention, a URI-based classification model is built from normal access logs and confirmed Webshell attack logs. With this model, URIs suspected of Webshell attacks on a website can be identified in network access logs of unknown nature, and the access logs suspected of Webshell attacks can then be filtered out. This reduces a massive volume of logs to a very small amount of data to be confirmed, greatly reducing the cost of manual confirmation and improving WebShell detection efficiency.
Further, the invention also builds an access-sequence-based classification model, which can identify the genuine Webshell attack access logs among the access logs suspected of Webshell attacks, thereby further pinning down the attack logs. This two-stage classification model mines the network logs for URIs that have been successfully attacked by a WebShell, further reducing the amount of data to be confirmed and improving WebShell detection efficiency.
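The two-stage flow can be sketched as a simple cascade. The two models are passed in as plain callables that return a sample identifier; the threshold-based stand-ins in the example, the identifier encoding (0 for negative), and the sample URIs are hypothetical and only illustrate the control flow, not the trained models.

```python
def two_stage_detect(uri_vectors, seq_vectors, uri_model, seq_model, negative=0):
    """Return (suspected, confirmed) URI lists from the two-stage cascade."""
    # Stage 1: the URI-based model keeps only URIs labelled negative (suspected).
    suspected = [u for u, vec in uri_vectors.items() if uri_model(vec) == negative]
    # Stage 2: the access-sequence-based model runs only on the suspected URIs.
    confirmed = [u for u in suspected if seq_model(seq_vectors[u]) == negative]
    return suspected, confirmed


# Stand-in models (assumptions): few client IPs -> suspected; low GET share -> attack.
uri_model = lambda vec: 0 if vec[0] < 5 else 1
seq_model = lambda vec: 0 if vec[0] < 0.5 else 1

uri_vectors = {"/index.php": [120], "/up/x.php": [1], "/admin.php": [2]}
seq_vectors = {"/up/x.php": [0.1], "/admin.php": [0.9]}
suspected, confirmed = two_stage_detect(uri_vectors, seq_vectors, uri_model, seq_model)
```

Only the URIs flagged by the first stage need access sequence feature vectors, which is why the second stage works on a much smaller set of logs.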
Brief description of the drawings
To accomplish the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the accompanying drawings. These aspects indicate the various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout the disclosure, the same reference numerals generally refer to the same parts or elements.
Fig. 1 shows a schematic diagram of a network request processing system 100 according to an embodiment of the invention;
Fig. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the invention;
Fig. 3 shows a flow chart of a method 300 for constructing a URI-based classification model according to an embodiment of the invention;
Fig. 4 shows a flow chart of a method 400 for detecting Webshell attacks on websites according to an embodiment of the invention;
Fig. 5 shows a flow chart of a method 500 for constructing an access-sequence-based classification model according to an embodiment of the invention; and
Fig. 6 shows a flow chart of a method 600 for detecting Webshell attacks on websites according to another embodiment of the invention.
Detailed description
Exemplary embodiments of the disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a schematic diagram of a network request processing system 100 according to an embodiment of the invention. As shown in Fig. 1, the system includes multiple clients 110 (such as clients 1 to m), a CDN (Content Delivery Network) node 120, a firewall (WAF) node 130, a data storage device 140, a computing device 150, and multiple back-end websites 160 (such as websites 1 and 2). Data storage device 140 is communicatively connected to both CDN node 120 and firewall node 130, and computing device 150 is communicatively connected to data storage device 140. It should be noted that the network request processing system 100 in Fig. 1 is merely exemplary; in practice, system 100 may contain different numbers of clients, CDN nodes, WAF nodes, data storage devices, and computing devices, and the invention does not limit the number of each device in system 100.
Client 110 may be implemented as a web browser for the World Wide Web, instant messaging client software, or the like, and is generally installed on a client machine such as a personal computer, mobile phone, tablet computer, personal media player device, wireless web browsing device, or application-specific device. Data storage device 140 may reside in computing device 150 as a local database, may be located outside computing device 150 as a remote database, or may be deployed at multiple geographic locations as a distributed database such as HBase. In short, data storage device 140 is used to store data, and the invention does not limit its specific deployment. Computing device 150 can read and write the data in data storage device 140 over the Internet in a wired or wireless manner. Generally, all network access logs are collected in data storage device 140; computing device 150 obtains the corresponding access logs from the data storage device and identifies those that belong to Webshell attacks.
Fig. 2 is a block diagram of an example computing device 200. In a basic configuration 202, computing device 200 typically includes system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between processor 204 and system memory 206.
Depending on the desired configuration, processor 204 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. Processor 204 may include one or more levels of cache, such as a level-one cache 210 and a level-two cache 212, a processor core 214, and registers 216. An example processor core 214 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 218 may be used with processor 204, or in some implementations memory controller 218 may be an internal part of processor 204.
Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some embodiments, applications 222 may be arranged to operate with program data 224 on the operating system. Program data 224 includes instructions; in computing device 200 according to the invention, program data 224 includes instructions for performing at least one of the method 300 for constructing a URI-based classification model, the method 500 for constructing an access-sequence-based classification model, and the methods 400 and 600 for detecting Webshell attacks on websites.
Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (for example, output devices 242, peripheral interfaces 244, and communication devices 246) to basic configuration 202 via bus/interface controller 230. Example output devices 242 include a graphics processing unit 248 and an audio processing unit 250, which may be configured to communicate with various external devices such as a display or speakers via one or more A/V ports 252. Example peripheral interfaces 244 may include a serial interface controller 254 and a parallel interface controller 256, which may be configured to communicate via one or more I/O ports 258 with external devices such as input devices (for example, keyboard, mouse, pen, voice input device, touch input device) or other peripherals (for example, printer, scanner). An example communication device 246 may include a network controller 260, which may be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link may be one example of a communication medium. Communication media may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As a non-limiting example, communication media may include wired media such as a wired network or dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein may include both storage media and communication media.
Computing device 200 may be implemented as a server, such as a file server, database server, application server, or web server, or as part of a small-size portable (or mobile) electronic device, such as a cell phone, personal digital assistant (PDA), personal media player device, wireless web browsing device, personal headset device, application-specific device, or a hybrid device that includes any of the above functions. Computing device 200 may also be implemented as a personal computer, including desktop and notebook configurations. In some embodiments, computing device 200 is configured to perform at least one of the method 300 for constructing a URI-based classification model, the method 500 for constructing an access-sequence-based classification model, and the methods 400 and 600 for detecting Webshell attacks on websites according to the invention.
In addition, the URI-based classification model and the access-sequence-based classification model are also stored in computing device 200. The URI-based classification model is based on the log features of URIs, uses a logistic regression model, and is adapted to distinguish URIs of normal website accesses from URIs suspected of Webshell attacks on the website. The access-sequence-based classification model is adapted to distinguish, among the URIs suspected of Webshell attacks on the website, URIs of scanner accesses to the website from URIs of Webshell attacks on the website.
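As an illustration of how a logistic regression classifier over feature vectors could be trained, here is a minimal batch gradient descent sketch in plain Python. It is not the patent's implementation: the learning rate, epoch count, identifier encoding (1 positive, 0 negative), and the one-dimensional toy data are all assumptions, and the features are assumed to be rescaled to a comparable range beforehand.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(vectors, labels, lr=0.5, epochs=5000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, dim = len(vectors), len(vectors[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (positive identifier) or 0 (negative identifier)."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0


# Toy data (hypothetical): one rescaled feature, low values labelled negative.
X = [[0.0], [0.2], [0.8], [1.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y)
predictions = [predict(weights, bias, x) for x in X]
```

In practice a library implementation would normally be preferred; the sketch only makes the input (feature vector) and output (sample identifier) roles concrete.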
Fig. 3 shows a flow chart of a method 300 for constructing a URI-based classification model according to an embodiment of the invention, adapted to be executed in a computing device such as computing device 200. As shown in Fig. 3, the method starts at step S320.
In step S320, a plurality of access logs confirmed to be normal website accesses are obtained as positive sample data, and a plurality of access logs confirmed to be Webshell attacks on websites are obtained as negative sample data, wherein each access log includes the URI of the requested resource and access data associated with that URI.
The access data of an access log includes one or more of the following parameters: the IP of the requesting user (client_ip), the request method (method, GET or POST), the status code returned for the request (resp_code), the CDN hit status (hit_status), the attack type detected by the firewall (attack_type), the request parameters (query), the request start time (start_time), and the request message length (req_bytes). It should be noted that the construction of the URI-based classification model does not actually use the request method, the request start time, or the request message length, so these three items of access data need not be obtained at this stage. The log record format formed by these access data can be as shown in Table 1:
Table 1
URI | client_ip | method | resp_code | hit_status | attack_type | query | start_time | req_bytes |
According to one embodiment of the invention, after the original access logs are obtained, they can first be preprocessed. Specifically, the access logs corresponding to static paths, whitelisted paths, and/or paths with non-Webshell suffixes can be filtered out of both kinds of original access logs.
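The preprocessing filter can be sketched as a single predicate on the request URI. The Webshell suffixes follow the file types named in the background (asp, php, jsp, cgi); the static suffixes and whitelisted paths are hypothetical examples, since the source does not list them.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical examples, not values from the source.
STATIC_SUFFIXES = {".css", ".js", ".png", ".jpg", ".gif", ".ico"}
WHITELIST_PATHS = {"/index.php", "/health.jsp"}
# File types named in the background section as Webshell carriers.
WEBSHELL_SUFFIXES = {".asp", ".php", ".jsp", ".cgi"}


def keep_for_detection(uri):
    """Return True if the access log for this URI should stay in the data set."""
    path = urlsplit(uri).path
    suffix = posixpath.splitext(path)[1].lower()
    if suffix in STATIC_SUFFIXES:       # static path
        return False
    if path in WHITELIST_PATHS:         # whitelisted path
        return False
    return suffix in WEBSHELL_SUFFIXES  # drop non-Webshell suffixes


uris = ["/upload/a.php?cmd=1", "/static/app.js", "/index.php", "/about"]
kept = [u for u in uris if keep_for_detection(u)]
```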
Then, in step S340, the access logs directed at the same URI are extracted from the positive sample data and the negative sample data respectively, a plurality of URI feature values for that URI are calculated from the access data of those access logs, and the URI feature values are constructed into a URI feature vector.
That is, the positive sample data contains many access logs, each with a URI, so the access logs directed at the same URI can be extracted from all access logs of the positive sample data. For example, suppose the positive sample data contains 1000 access logs, 50 of which are directed at URI-1, and each of these 50 access logs carries the access data shown in Table 1. Aggregating the access data of these 50 access logs for URI-1 then yields the URI feature values of URI-1.
The URI feature values include one or more of the following:
1) Number of client IPs accessing the URI: over a sufficiently long period, the number of clients accessing a normal URL is necessarily higher than the number of clients accessing a WebShell. This value is calculated from the IP of the requesting user.
2) Total number of accesses to the URI: over a sufficiently long period, the number of accesses to a normal URL is necessarily higher than the number of accesses to a WebShell URL. This value is calculated from the status code returned for the request or the attack type detected by the firewall.
3) Proportion of accesses to the URI that return a failure: over a sufficiently long period, the proportion of failed accesses to a normal URL is necessarily lower than the proportion of failed accesses to a WebShell URL. This value is calculated from the status code returned for the request.
4) Proportion of requests for the URI intercepted by the WAF: because of the WAF, over a sufficiently long period, the proportion of intercepted accesses to a normal URL is necessarily lower than the proportion of intercepted accesses to a WebShell URL. Even if a WebShell can bypass the WAF, some of its requests will still be intercepted. This value is calculated from the attack type detected by the firewall.
5) Whether accesses to the URI hit the CDN: WebShell attack accesses will not hit the CDN. This value is determined from the CDN hit status; a hit may be set to 1 and a miss to 0, although other values may of course be used, and the invention does not restrict this.
6) Number of request parameter changes in accesses to the URI: the request parameters of WebShell attack accesses normally change continually. This value is calculated from the request parameters.
Specifically, when calculating the URI feature values, the positive sample data and the negative sample data may each be imported from the log file and converted into a data frame (DataFrame) according to the meaning of each field. The generated DataFrame is then aggregated by URI to obtain one data column per access data item. Here client_ip, hit_status, and query use the collect_set method to generate the new columns collect_set(client_ip), collect_set(hit_status), and collect_set(query), while resp_code and attack_type use the collect_list method to generate the new columns collect_list(resp_code) and collect_list(attack_type). The main difference between collect_set and collect_list is that the former deduplicates the collected data items, whereas the latter simply lists all of them. For access data such as client_ip, the same URI may be accessed many times and several of those accesses may come from the same client_ip, so deduplication is needed when counting them. The format of the new data columns generated by aggregating on URI is shown in Table 2:
Table 2
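The aggregation that yields the columns of Table 2 can be sketched in plain Python. The names collect_set and collect_list in the text match Spark SQL aggregate functions; the sketch below only mimics their behavior with built-in sets and lists, and the function name `aggregate_by_uri` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

SET_FIELDS = ("client_ip", "hit_status", "query")  # deduplicated, like collect_set
LIST_FIELDS = ("resp_code", "attack_type")         # every item kept, like collect_list


def aggregate_by_uri(logs):
    """Group log records by URI into one row of sets and lists per URI."""
    grouped = defaultdict(lambda: {
        **{field: set() for field in SET_FIELDS},
        **{field: [] for field in LIST_FIELDS},
    })
    for log in logs:
        row = grouped[log["URI"]]
        for field in SET_FIELDS:
            row[field].add(log[field])
        for field in LIST_FIELDS:
            row[field].append(log[field])
    return dict(grouped)
```

Two requests from the same client_ip leave a single entry in the client_ip set, while both of their status codes remain in the resp_code list, which is what the later counting and ratio features rely on.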
Afterwards, the corresponding URI feature values can be extracted from the new data columns in Table 2. For example, feature 1) is extracted from the data column collect_set(client_ip), generating the new column client_ip_count; feature 2) is extracted from collect_list(resp_code) or collect_list(attack_type), generating the new column req_count; feature 3) is extracted from collect_list(resp_code), generating the new column resp_err_rate; feature 4) is extracted from collect_list(attack_type), generating the new column waf_block_rate; feature 5) is extracted from collect_set(hit_status), generating the new column hit_status; and feature 6) is extracted from collect_set(query), generating the new column query_count. It should be understood that, for extracting each feature value, those skilled in the art may define their own formula or algorithm over the corresponding data column; the present invention does not limit this. The column format of the resulting URI feature values may be as shown in Table 3:
Table 3
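One possible way to derive the six columns of Table 3 from an aggregated row is sketched below. The source leaves the formulas to the implementer, so every rule here is an assumption: a status code of 400 or above is treated as a failure, a non-empty attack_type as a WAF interception, and hit_status as 1 when any request hit the CDN. The names `uri_features`, `FEATURE_ORDER`, and `to_vector` are hypothetical.

```python
def uri_features(row):
    """Compute the six URI feature values from one aggregated row.

    `row` holds sets for client_ip / hit_status / query and lists for
    resp_code / attack_type; it must describe at least one request.
    """
    resp_codes = row["resp_code"]
    req_count = len(resp_codes)
    return {
        "client_ip_count": len(row["client_ip"]),
        "req_count": req_count,
        # Assumption: a status code of 400 or above counts as a failure.
        "resp_err_rate": sum(code >= 400 for code in resp_codes) / req_count,
        # Assumption: a non-empty attack_type means the WAF intercepted it.
        "waf_block_rate": sum(bool(a) for a in row["attack_type"]) / req_count,
        # Assumption: 1 if any request for this URI hit the CDN, else 0.
        "hit_status": 1 if 1 in row["hit_status"] else 0,
        "query_count": len(row["query"]),
    }


# Fixed ordering used when flattening the features into a vector.
FEATURE_ORDER = ("client_ip_count", "req_count", "resp_err_rate",
                 "waf_block_rate", "hit_status", "query_count")


def to_vector(features):
    """Arrange the feature values in a fixed order to form the vector."""
    return [features[name] for name in FEATURE_ORDER]
```

Any ordering works as long as the same `FEATURE_ORDER` is used for training and for detection.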
After the URI feature values of each URI are obtained, they can be assembled into a URI feature vector. The feature values may be ordered in any manner; the present invention does not limit this. Suppose the six URI feature values of a certain URI, in the order of features 1) to 6), are 50, 60, 40%, 30%, 1, and 5; the feature vector constructed in that order is then {50, 60, 40%, 30%, 1, 5}.
Then, in step S360, a first positive sample set is generated from the URI feature vectors constructed from the positive sample data and their corresponding positive sample labels, and a first negative sample set is generated from the URI feature vectors constructed from the negative sample data and their corresponding negative sample labels.
That is, each sample in a first sample set corresponds to one URI, the URI feature vector of that URI, and its sample label, and each URI corresponds to multiple access logs. The first positive sample set is the set of URI feature vectors generated from the positive sample data together with their positive sample labels; the first negative sample set is the set of URI feature vectors generated from the negative sample data together with their negative sample labels. Usually the positive sample label may be 1 and the negative sample label 0, although other values may of course be used; the present invention does not limit this. Table 4 shows a DataFrame example of a first positive sample set, in which URI1 to URIn are URIs confirmed as normal website accesses and their sample label is 1.
Table 4
Then, in step S380, a first training set is generated from the first positive sample set and the first negative sample set. With the URI feature vector of each sample in the first training set as input and its sample label as output, the first training set is trained using a predetermined algorithm to obtain the URI-based classification model.
According to one embodiment of the present invention, to generate the first training set, the first positive sample set may be randomly divided into two groups, wDataFrame0 and wDataFrame1, and the first negative sample set likewise into two groups, bDataFrame0 and bDataFrame1. Any one group of the first positive sample set and any one group of the first negative sample set are then merged to obtain the first training set, for example wDataFrame0 and bDataFrame0. In addition, to keep the computation balanced across the positive and negative sample sets, different weights may be set for them; the negative sample set is generally given the higher weight. According to one embodiment, the weight of the negative sample set may be set to 2 and the weight of the positive sample set to 1.
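The random split and the weighting described above can be sketched as follows. The helper names `split_in_two` and `build_sets`, the fixed seed, and the (vector, label, weight) tuple layout are assumptions for illustration; only the group names and the weights 1 and 2 come from the text.

```python
import random


def split_in_two(samples, seed=0):
    """Randomly divide a sample set into two groups of near-equal size."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def build_sets(positive, negative, pos_weight=1.0, neg_weight=2.0):
    """Return (training, validation) lists of (vector, label, weight)."""
    def tag(group, label, weight):
        return [(vec, label, weight) for vec in group]

    w0, w1 = split_in_two(positive)  # wDataFrame0, wDataFrame1
    b0, b1 = split_in_two(negative)  # bDataFrame0, bDataFrame1
    training = tag(w0, 1, pos_weight) + tag(b0, 0, neg_weight)
    validation = tag(w1, 1, pos_weight) + tag(b1, 0, neg_weight)
    return training, validation
```

Keeping the unused halves as a validation set mirrors the later verification step, which merges wDataFrame1 and bDataFrame1.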
It should be understood that the predetermined algorithm may be any existing classification or regression algorithm, such as logistic regression, a support vector machine classifier, a Bayes classifier, or a maximum entropy classifier. The specific choice depends on the business scenario and its data, and the parameters required by each algorithm may be set by those skilled in the art; the present invention does not limit this.
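As one concrete instance of a "predetermined algorithm", the sketch below fits a logistic regression with per-sample weights by plain batch gradient descent. It is an illustration under stated assumptions rather than the patent's method: the learning rate, the epoch count, the 0.5 decision threshold, and the names `train_logistic` and `predict` are all hypothetical, and feature values are assumed to be scaled to comparable ranges beforehand.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, lr=0.1, epochs=500):
    """Fit a weighted logistic regression by batch gradient descent.

    `samples` is a list of (feature_vector, label, weight) tuples.
    Returns (coefficients, bias).
    """
    dim = len(samples[0][0])
    coef, bias = [0.0] * dim, 0.0
    total = sum(w for _, _, w in samples)
    for _ in range(epochs):
        grad_c, grad_b = [0.0] * dim, 0.0
        for x, y, w in samples:
            z = sum(c * xi for c, xi in zip(coef, x)) + bias
            err = w * (sigmoid(z) - y)  # weighted gradient of the log loss
            grad_b += err
            for i, xi in enumerate(x):
                grad_c[i] += err * xi
        coef = [c - lr * g / total for c, g in zip(coef, grad_c)]
        bias -= lr * grad_b / total
    return coef, bias


def predict(model, x):
    """Return 1 (normal access) or 0 (suspected Webshell) for one vector."""
    coef, bias = model
    z = sum(c * xi for c, xi in zip(coef, x)) + bias
    return 1 if sigmoid(z) >= 0.5 else 0
```

Giving negative samples a weight of 2 makes each misclassified Webshell sample cost twice as much as a misclassified normal sample during fitting.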
According to another embodiment of the present invention, the constructed URI-based classification model may also be verified. Specifically, a first validation set is generated from the first positive sample set and the first negative sample set, and the URI feature vector of each sample in the first validation set is input into the constructed URI-based classification model to predict the sample label of each sample. The predicted sample label of each sample is then compared with its actual sample label to calculate the accuracy of the URI-based classification model. The first validation set may be obtained by merging the other group of the first positive sample set with the other group of the first negative sample set, for example by merging wDataFrame1 and bDataFrame1.
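The accuracy check can be sketched in a few lines. The function name `accuracy` and the representation of the validation set as (feature_vector, label) pairs are assumptions; `predict_fn` stands for whatever trained model is being verified.

```python
def accuracy(predict_fn, validation):
    """Share of validation samples whose predicted label matches the truth.

    `validation` is a list of (feature_vector, label) pairs and
    `predict_fn` maps a feature vector to a predicted label.
    """
    if not validation:
        raise ValueError("validation set is empty")
    correct = sum(predict_fn(vec) == label for vec, label in validation)
    return correct / len(validation)
```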
After the URI-based classification model has been built, it can be used to identify, from network access logs, URIs suspected of being Webshell attacks on a website. Fig. 4 shows a flowchart of a method 400 for detecting Webshell attacks on a website according to one embodiment of the present invention, suitable for execution in a computing device, such as the computing device 200. As shown in Fig. 4, the method begins at step S420.
In step S420, multiple access logs to be confirmed within a predetermined period are obtained, each of which includes the URI of the requested resource and the access data associated with that URI. The access data here are the access data items described in step S320 and are not repeated here. The predetermined period may generally be one or two days; for example, the network access logs of the last two days are obtained, and it is determined which of them are Webshell attack access logs. Other periods are of course possible; the present invention does not limit this. In addition, according to one embodiment of the present invention, after the access logs to be confirmed are obtained, they may likewise first be preprocessed, that is, the access logs corresponding to static paths, whitelisted paths, and/or non-Webshell suffix paths are filtered out.
Then, in step S440, multiple access logs for the same URI are extracted from the access logs to be confirmed, multiple URI feature values of that URI are calculated from the access data of those access logs, and the URI feature values are assembled into one URI feature vector. For the calculation of the URI feature values and the URI feature vector, refer to the description of step S340, which is not repeated here.
Then, in step S460, each URI feature vector of the access logs to be confirmed is input into the URI-based classification model, the URIs whose feature vectors produce the negative sample label as output are obtained, and these URIs are marked as suspected of being Webshell attacks on a website.
As can be seen from the above, in the output of the URI-based classification model, a negative sample label indicates that the URI of the sample is suspected of being a Webshell attack on a website, while a positive sample label indicates that the URI of the sample is a normal website access. The URIs of the samples with a negative sample label (output 0) are therefore selected from the output as the suspected Webshell attack URIs. In this way some suspected Webshell attack URIs can be identified from recent network access logs, and the corresponding original logs can be retrieved via these URIs; the retrieved original logs are the suspected Webshell attack access logs. However, the retrieved logs usually also contain access logs produced by scanners, as shown in Table 5, so it is sometimes also necessary to distinguish scanner access logs from genuine WebShell access logs, that is, to distinguish URIs accessed by scanners from URIs of genuine WebShell accesses.
Table 5
Therefore, according to one embodiment of the present invention, a classification model based on access sequences can also be built, so that URIs of scanner accesses to a website can be distinguished from URIs of Webshell attacks on a website.
Fig. 5 shows a flowchart of a method 500 for building the access-sequence-based classification model according to one embodiment of the present invention, suitable for execution in a computing device, such as the computing device 200. As shown in Fig. 5, the method begins at step S520.
In step S520, multiple access logs confirmed as normal website accesses are obtained as positive sample data, and multiple access logs confirmed as Webshell attacks on a website are obtained as negative sample data, where each access log includes the URI of the requested resource and the access data associated with that URI. This step is similar to step S320, except that the access data needed to build this model are four items: the request method, the status code returned for the request, the request start time, and the request message length. Accordingly, either all access data or only these four items may be obtained; the present invention does not limit this. For other details of this step (including operations such as preprocessing), refer to the description of step S320, which is not repeated here.
Then, in step S540, multiple access logs for the same URI are extracted from the positive sample data and from the negative sample data respectively, multiple access sequence feature values of that URI are calculated from the access data of those access logs, and the access sequence feature values are assembled into one access sequence feature vector.
The access sequence feature values may include one or more of the following:
a) GET/POST request ratio: scanner behavior is fairly fixed, and the sequence of scan requests is generally fixed as well: either all GET requests or all POST requests, and even with mixed requests the ratio of GET to POST requests is relatively fixed. It is suited to being calculated from the request method.
b) Ratio of successful requests: for a given scanner, the requests generally either all succeed or all fail. It is suited to being calculated from the status code returned for the request.
c) Mean access time interval: the time intervals between scanner accesses are fairly regular and stable. The mean of the time intervals of the consecutive access sequence is computed; it is suited to being determined from the request start time.
d) Variance of the access time interval: the time intervals between scanner accesses are fairly regular and stable. The variance of the time intervals of the consecutive access sequence is computed; it is suited to being determined from the request start time.
e) Mean request message length: the scan request payloads sent by a scanner are generally stable. The mean of the request message length is computed; it is suited to being determined from the request message length.
f) Variance of the request message length: the scan request payloads sent by a scanner are generally stable. The variance of the request message length is computed; it is suited to being determined from the request message length.
Similarly to step S340, when calculating the access sequence feature values, the positive sample data and the negative sample data may each be imported from the log file and converted into a data frame (DataFrame) according to the meaning of each field. The generated DataFrame is then aggregated by URI to obtain one data column per access data item, except that the access data items selected here differ from those in step S340. Here method, resp_code, start_time, and req_bytes use the collect_list method to generate the corresponding data columns collect_list(method), collect_list(resp_code), collect_list(start_time), and collect_list(req_bytes). The format of the new data columns generated by aggregating on URI is shown in Table 6:
Table 6
URI | collect_list(method) | collect_list(resp_code) | collect_list(start_time) | collect_list(req_bytes) |
Afterwards, the corresponding access sequence feature values can be extracted from the new data columns in Table 6. For example, feature a) is extracted from the data column collect_list(method), generating the new column get_ratio (the vast majority of access logs are GET or POST requests, so other request methods can be ignored and only get_ratio is computed here); feature b) is extracted from collect_list(resp_code), generating the new column req_ok_ratio; feature c) is extracted from collect_list(start_time), generating the new column req_interval_mean; feature d) is extracted from collect_list(start_time), generating the new column req_interval_var; feature e) is extracted from collect_list(req_bytes), generating the new column req_bytes_mean; and feature f) is extracted from collect_list(req_bytes), generating the new column req_bytes_var. It should likewise be understood that, for extracting each feature value, those skilled in the art may define their own formula or algorithm over the corresponding data column; the present invention does not limit this. The column format of the resulting access sequence feature values may be as shown in Table 7:
Table 7
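A possible computation of the six columns of Table 7 is sketched below with Python's `statistics` module. The formulas are assumptions, since the source leaves them open: status codes in the 2xx and 3xx classes are treated as successful, start_time is assumed to be numeric seconds, the population variance is used, and a URI with a single request is given one interval of 0. The function name `sequence_features` is hypothetical.

```python
from statistics import mean, pvariance


def sequence_features(row):
    """Compute the six access-sequence feature values for one URI.

    `row` holds equally long lists: method, resp_code, start_time
    (assumed to be numeric seconds) and req_bytes.
    """
    n = len(row["method"])
    times = sorted(row["start_time"])
    # Gaps between consecutive requests; a single request yields one 0 gap.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])] or [0]
    return {
        "get_ratio": sum(m == "GET" for m in row["method"]) / n,
        # Assumption: 2xx and 3xx status codes count as successful.
        "req_ok_ratio": sum(code // 100 in (2, 3) for code in row["resp_code"]) / n,
        "req_interval_mean": mean(gaps),
        "req_interval_var": pvariance(gaps),
        "req_bytes_mean": mean(row["req_bytes"]),
        "req_bytes_var": pvariance(row["req_bytes"]),
    }
```

A scanner that fires identical requests at a fixed rate would show interval and length variances close to zero, which is the regularity features c) to f) are meant to capture.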
After the access sequence feature values of each URI are obtained, they can be assembled into an access sequence feature vector. The process is similar to the construction of the URI feature vector and is not repeated here.
Then, in step S560, a second positive sample set is generated from the access sequence feature vector of each URI in the positive sample data and its corresponding positive sample label, and a second negative sample set is generated from the access sequence feature vector of each URI in the negative sample data and its corresponding negative sample label.
Similarly to step S360, each sample in a second sample set corresponds to one URI, the access sequence feature vector of that URI, and its sample label, and each URI corresponds to multiple access logs. The second positive sample set is the set of access sequence feature vectors generated from the positive sample data together with their positive sample labels; the second negative sample set is the set of access sequence feature vectors generated from the negative sample data together with their negative sample labels. As before, the positive sample label may be 1 and the negative sample label 0, although the labels are of course not limited to these values.
Then, in step S580, a second training set is generated from the second positive sample set and the second negative sample set. With the access sequence feature vector of each sample in the second training set as input and its sample label as output, the second training set is trained using a predetermined algorithm to obtain the access-sequence-based classification model.
This step is similar to step S380. To generate the second training set, the second positive sample set and the second negative sample set are likewise each randomly divided into two groups, and any one group of the second positive sample set is merged with any one group of the second negative sample set to form the second training set. Different weights may again be set for positive and negative samples, for example a weight of 1 for the second positive sample set and a weight of 2 for the second negative sample set. A second validation set may also be generated to verify the accuracy of the access-sequence-based classification model. For the details of model training and model verification, refer to the description of step S380, which is not repeated here.
After the access-sequence-based classification model has been built, it can be used to distinguish, among the URIs suspected of being Webshell attacks on a website, the URIs of scanner accesses from the URIs of Webshell attacks. Fig. 6 shows a flowchart of a method 600 for detecting Webshell attacks on a website according to another embodiment of the present invention, suitable for execution in a computing device, such as the computing device 200. As shown in Fig. 6, the method begins at step S620.
In step S620, the multiple original logs corresponding to the URIs suspected of being Webshell attacks on a website are obtained; these original logs include the URI of the requested resource and the access data associated with that URI. As before, the original logs may first be preprocessed, which is not described again here.
Then, in step S640, multiple access logs for the same URI are extracted from the original logs, multiple access sequence feature values of that URI are calculated from the access data of those access logs, and the access sequence feature values are assembled into one access sequence feature vector.
Then, in step S660, each access sequence feature vector of the original logs is input into the access-sequence-based classification model, the URIs whose access sequence feature vectors produce the negative sample label as output are obtained, and these URIs are marked as URIs of Webshell attacks on a website.
For the details of each step in method 600, refer to the descriptions of methods 300 to 500, which are not repeated here. Note that the input of the access-sequence-based classification model in method 600 is the access sequence feature vector of a URI suspected of being a Webshell attack on a website; an output of the negative sample label indicates that the URI of the sample is a URI of a Webshell attack on a website, and an output of the positive sample label indicates that the URI of the sample is a URI of a scanner access. In practice, the screening by the URI-based classification model may of course also be skipped: the original logs to be confirmed are taken directly, the corresponding access sequence feature vectors are generated from them, and these vectors are input into the access-sequence-based classification model. In that case, the samples whose output is the positive sample label are generally URIs of normal website accesses, and the samples whose output is the negative sample label can be approximately regarded as URIs suspected of being Webshell attacks on a website, which reduces, to some extent, the workload of identifying attack data.
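The two-stage screening of methods 400 and 600 can be summarized in a short sketch. It assumes both models are available as callables returning 1 or 0 and that the feature vectors have already been computed per URI; the function name `detect_webshell_uris` and the dictionary inputs are hypothetical.

```python
def detect_webshell_uris(uri_vectors, seq_vectors, uri_model, seq_model):
    """Two-stage screening of URIs.

    uri_vectors / seq_vectors map each URI to its URI feature vector and
    its access-sequence feature vector; uri_model / seq_model are
    callables that return 1 or 0 for a vector.
    Returns (webshell_uris, scanner_uris).
    """
    # Stage 1: URI-based model; label 0 marks a suspected Webshell URI.
    suspected = [uri for uri, vec in uri_vectors.items() if uri_model(vec) == 0]

    # Stage 2: access-sequence model on the suspects only; label 0 marks
    # a Webshell attack URI, label 1 marks a URI visited by a scanner.
    webshell_uris, scanner_uris = [], []
    for uri in suspected:
        if seq_model(seq_vectors[uri]) == 0:
            webshell_uris.append(uri)
        else:
            scanner_uris.append(uri)
    return webshell_uris, scanner_uris
```

Running the second model only on stage-1 suspects keeps the costlier sequence features limited to a small subset of URIs.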
In summary, the present invention builds a URI-based classification model and, with this model and method 400, identifies from network access logs of unknown nature some URIs suspected of being Webshell attacks on a website, and thereby the corresponding suspected access logs. To pin down the attack logs further, the present invention also builds an access-sequence-based classification model and, with this model and method 600, identifies from those suspected access logs the URIs that genuinely are Webshell attacks on a website. These URIs determine the genuine Webshell attack access logs. Table 8 shows, by way of example, some original access logs corresponding to URIs confirmed by methods 400 and 600 as Webshell attacks on a website.
Table 8
According to the technical solution of the present invention, the log features of WebShell attacks are extracted from massive logs, and a two-level classification learning method is used to dig out the websites on which a WebShell has been successfully obtained. After processing, the volume of original logs is greatly reduced: in experiments on production data, the number of URIs to be confirmed dropped from more than one million to just over 400, which significantly reduces the workload of manual confirmation.
B9. The method of B8, wherein an access-sequence-based classification model is also stored in the computing device, the model being suitable for distinguishing, among the URIs suspected of being Webshell attacks on a website, URIs of scanner accesses from URIs of Webshell attacks, and being suitable for construction by the following method: obtaining multiple access logs confirmed as normal website accesses as positive sample data, and multiple access logs confirmed as Webshell attacks on a website as negative sample data; extracting multiple access logs for the same URI from the positive sample data and from the negative sample data respectively, calculating multiple access sequence feature values of that URI from the access data of those access logs, and assembling the access sequence feature values into one access sequence feature vector; generating a second positive sample set from the access sequence feature vector of each URI in the positive sample data and its corresponding positive sample label, and generating a second negative sample set from the access sequence feature vector of each URI in the negative sample data and its corresponding negative sample label; and generating a second training set from the second positive sample set and the second negative sample set, and, with the access sequence feature vector of each sample in the second training set as input and its sample label as output, training the second training set using a predetermined algorithm to obtain the access-sequence-based classification model.
B10. The method of B8 or B9, further comprising the steps of: obtaining the multiple original logs corresponding to the URIs suspected of being Webshell attacks on a website; extracting multiple access logs for the same URI from the original logs, calculating multiple access sequence feature values of that URI from the access data of those access logs, and assembling the access sequence feature values into one access sequence feature vector; and inputting each access sequence feature vector of the original logs into the access-sequence-based classification model, obtaining the URIs whose access sequence feature vectors produce the negative sample label as output, and marking them as URIs of Webshell attacks on a website.
B11. The method of B9, wherein the access sequence feature values include one or more of the following: the GET/POST request ratio, the ratio of successful requests, the mean access time interval, the variance of the access time interval, the mean request message length, and the variance of the request message length.
B12. The method of B11, wherein the GET/POST request ratio is suited to being calculated from the request method; the ratio of successful requests is suited to being calculated from the status code returned for the request; the mean and the variance of the access time interval are suited to being determined from the request start time; and the mean and the variance of the request message length are suited to being determined from the request message length.
B13. The method of B8, further comprising the step of preprocessing the access logs to be confirmed: filtering out of the access logs those corresponding to static paths, whitelisted paths, and non-Webshell suffix paths.
B14. The method of any one of B8 to B10, wherein, in the output of the URI-based classification model, a negative sample label indicates that the URI of the sample is suspected of being a Webshell attack on a website, and a positive sample label indicates that the URI of the sample is a normal website access; and in the output of the access-sequence-based classification model, a negative sample label indicates that the URI of the sample is a URI of a Webshell attack on a website, and a positive sample label indicates that the URI of the sample is a URI of a scanner access.
The various techniques described herein may be implemented in hardware, in software, or in a combination of the two. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (that is, instructions) embodied in tangible media, such as floppy disks, CD-ROMs, hard disk drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute, according to the instructions in the program code stored in the memory, the method of the present invention for building the URI-based classification model, the method for building the access-sequence-based classification model, and the method for detecting Webshell attacks on a website.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
In the description provided here, the algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the examples of the present invention. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the descriptions of specific languages above are given in order to disclose the best mode of the invention.
In the specification provided here, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may additionally be divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may additionally be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
Furthermore, some of the embodiments are described herein as a method, or a combination of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the function. Thus, a processor with the necessary instructions for carrying out such a method or method element forms a means for carrying out the method or method element. Furthermore, an element of an apparatus embodiment described herein is an example of a means for carrying out the function performed by the element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as thus described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. As to the scope of the invention, the disclosure made herein is illustrative and not restrictive, and the scope of the invention is defined by the appended claims.
Claims (10)
1. A method for constructing a URI-based classification model, executed in a computing device and adapted to distinguish URIs of normal website access from URIs suspected of being Webshell attacks on a website, the method comprising:
obtaining a plurality of access logs confirmed as normal website access as positive sample data, and a plurality of access logs confirmed as Webshell attacks on a website as negative sample data, wherein each access log includes the URI of the requested resource and access data associated with that URI;
extracting, from the positive sample data and from the negative sample data respectively, the plurality of access logs directed to a same URI, calculating a plurality of URI feature values of that URI from the access data of those access logs, and constructing the plurality of URI feature values into a URI feature vector;
generating a first positive sample set from the URI feature vector of each URI in the positive sample data and its corresponding positive sample label, and generating a first negative sample set from the URI feature vector of each URI in the negative sample data and its corresponding negative sample label; and
generating a first training set from the first positive sample set and the first negative sample set, and training on the first training set with a predetermined algorithm, taking the URI feature vector of each sample in the first training set as input and its sample label as output, to obtain the URI-based classification model.
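By way of illustration only, the labelling and training steps of claim 1 might be sketched as follows. The claim leaves the "predetermined algorithm" open, so a nearest-centroid classifier is used here purely as a stand-in; the label encoding, the function names, and the toy feature values are all assumptions made for this example.

```python
from statistics import mean

POSITIVE, NEGATIVE = 1, 0  # hypothetical encoding: 1 = normal access, 0 = Webshell attack

def build_sample_set(feature_vectors, label):
    """Pair each URI feature vector with its sample label."""
    return [(list(vec), label) for vec in feature_vectors]

def train_nearest_centroid(training_set):
    """Stand-in for the 'predetermined algorithm': one mean vector per label."""
    centroids = {}
    for label in {lbl for _, lbl in training_set}:
        rows = [vec for vec, lbl in training_set if lbl == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def predict(model, vector):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vector, centroid))
    return min(model, key=lambda label: dist(model[label]))

# Toy feature vectors [client IP count, failure ratio]; the values are invented.
first_positive_set = build_sample_set([[120, 0.02], [90, 0.05]], POSITIVE)
first_negative_set = build_sample_set([[1, 0.60], [2, 0.40]], NEGATIVE)
first_training_set = first_positive_set + first_negative_set
model = train_nearest_centroid(first_training_set)
```

Any supervised classifier that accepts a feature vector as input and a label as output could take the place of `train_nearest_centroid` without changing the surrounding steps.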
2. the method for claim 1, wherein the access data of the access log include following parameter in one kind or
It is a variety of:
Ask the conditional code, CDN hit conditions, fire wall detection that the IP of user, requesting method, request return attack type,
Required parameter, start request time and request message length.
3. The method of claim 2, wherein the plurality of URI feature values include one or more of the following feature values:
the number of client IPs accessing the URI, the total number of accesses to the URI, the proportion of accesses to the URI that return a failure, the proportion of requests to the URI intercepted by the WAF, whether the accessed URI has hit the CDN, and the number of request-parameter changes among accesses to the URI.
4. The method of claim 3, wherein:
the number of client IPs accessing the URI is adapted to be calculated from the IPs of the requesting users;
the total number of accesses to the URI is adapted to be calculated from the status codes returned for the requests or from the attack types detected by the firewall;
the proportion of accesses to the URI that return a failure is adapted to be calculated from the status codes returned for the requests;
the proportion of requests to the URI intercepted by the firewall is adapted to be calculated from the attack types detected by the firewall;
whether the accessed URI has hit the CDN is adapted to be determined from the CDN hit status; and
the number of request-parameter changes among accesses to the URI is adapted to be calculated from the request parameters.
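By way of illustration only, the six feature values of claims 3 and 4 might be computed for one URI as sketched below. The dictionary keys, the treatment of status codes of 400 and above as failures, and the counting of distinct parameter strings as "parameter changes" are assumptions of this example and are not fixed by the claims.

```python
def uri_feature_vector(logs):
    """Compute six URI feature values from the access logs of a single URI.

    Each log is a dict with hypothetical keys: ip, status, cdn_hit,
    attack_type (None when the firewall flagged nothing), and params.
    """
    if not logs:
        raise ValueError("at least one access log is required")
    total = len(logs)                                   # total number of accesses
    client_ips = len({log["ip"] for log in logs})       # distinct client IPs
    # Assumption: a 4xx or 5xx status code counts as a failed response.
    failures = sum(1 for log in logs if log["status"] >= 400)
    # Assumption: a non-empty attack type means the firewall intercepted the request.
    intercepted = sum(1 for log in logs if log["attack_type"] is not None)
    cdn_hit = int(any(log["cdn_hit"] for log in logs))  # 1 if any access hit the CDN
    # Assumption: parameter changes are counted as distinct parameter strings.
    param_changes = len({log["params"] for log in logs})
    return [client_ips, total, failures / total, intercepted / total,
            cdn_hit, param_changes]

logs = [
    {"ip": "10.0.0.1", "status": 200, "cdn_hit": False, "attack_type": None, "params": "cmd=ls"},
    {"ip": "10.0.0.1", "status": 200, "cdn_hit": False, "attack_type": "webshell", "params": "cmd=id"},
    {"ip": "10.0.0.1", "status": 500, "cdn_hit": False, "attack_type": None, "params": "cmd=ls"},
    {"ip": "10.0.0.2", "status": 404, "cdn_hit": False, "attack_type": None, "params": "cmd=pwd"},
]
vector = uri_feature_vector(logs)
```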
5. The method of claim 4, wherein the step of calculating the plurality of URI feature values of the URI from the access data of the plurality of access logs comprises:
converting the positive sample data and the negative sample data into a data frame according to the meaning of each field; and
aggregating the data frame by URI to obtain a data column for each item of access data, and extracting the corresponding URI feature value from each data column;
wherein the IP of the requesting user, the CDN hit status, and the request parameters are adapted to generate data columns using the collect_set method, and the status code returned for the request and the attack type detected by the firewall are adapted to generate data columns using the collect_list method.
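By way of illustration only, the aggregation of claim 5 might be sketched as follows. `collect_set` and `collect_list` are also the names of Spark SQL aggregate functions, but this sketch only imitates their semantics in plain Python and does not call any data-frame framework; the field names and sample rows are assumptions.

```python
from collections import defaultdict

SET_FIELDS = ("ip", "cdn_hit", "params")   # deduplicated, as with collect_set
LIST_FIELDS = ("status", "attack_type")    # duplicates kept, as with collect_list

def aggregate_by_uri(rows):
    """Group access-log rows by URI into one record of data columns per URI."""
    grouped = defaultdict(lambda: {**{f: set() for f in SET_FIELDS},
                                   **{f: [] for f in LIST_FIELDS}})
    for row in rows:
        record = grouped[row["uri"]]
        for field in SET_FIELDS:
            record[field].add(row[field])
        for field in LIST_FIELDS:
            record[field].append(row[field])
    return dict(grouped)

rows = [
    {"uri": "/a.php", "ip": "10.0.0.1", "cdn_hit": False, "params": "x=1",
     "status": 200, "attack_type": None},
    {"uri": "/a.php", "ip": "10.0.0.1", "cdn_hit": False, "params": "x=2",
     "status": 200, "attack_type": "webshell"},
    {"uri": "/b.html", "ip": "10.0.0.2", "cdn_hit": True, "params": "",
     "status": 404, "attack_type": None},
]
columns = aggregate_by_uri(rows)
```

The distinction matters for the later feature values: a deduplicated column is enough to count distinct client IPs or parameter values, whereas the status-code and attack-type columns must keep duplicates so that totals and ratios can be computed from them.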
6. the method for claim 1, wherein also include step:
According to the first positive sample collection and the first negative sample collection generation the first checking collection;
The URI characteristic vectors of each sample are concentrated to be input in the disaggregated model based on URI the first checking, prediction obtains each
The sample identification of sample;And
The sample identification for predicting obtained each sample is compared with its actual sample identification, calculates the classification based on URI
The accuracy of model.
7. The method of claim 6, wherein the first training set and the first validation set are adapted to be generated as follows:
randomly dividing the first positive sample set and the first negative sample set into two groups each; and
merging any one group of the first positive sample set with one group of the first negative sample set to form the first training set, and merging the other group of the first positive sample set with the other group of the first negative sample set to form the first validation set.
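By way of illustration only, the random split of claim 7 and the accuracy calculation of claim 6 might be sketched as follows. The claims do not state the proportion of the split, so an even split is assumed; the seed parameter, the function names, and the toy samples are likewise assumptions.

```python
import random

def split_in_two(samples, rng):
    """Randomly divide one sample set into two groups of near-equal size."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]

def make_training_and_validation(positive_set, negative_set, seed=0):
    """Merge one group of each class into a training set, the rest into a validation set."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    pos_a, pos_b = split_in_two(positive_set, rng)
    neg_a, neg_b = split_in_two(negative_set, rng)
    return pos_a + neg_a, pos_b + neg_b

def accuracy(predicted_labels, actual_labels):
    """Fraction of samples whose predicted label equals the actual label."""
    matches = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return matches / len(actual_labels)

# Toy samples of the form (feature vector, label); the values are invented.
positive_set = [((i, 0.0), 1) for i in range(6)]
negative_set = [((i, 1.0), 0) for i in range(4)]
training, validation = make_training_and_validation(positive_set, negative_set)
```

Splitting each class separately before merging keeps both classes represented in the training set and in the validation set, which a single shuffle of the combined samples would not guarantee.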
8. A method for detecting Webshell attacks on a website, adapted to be executed in a computing device in which the URI-based classification model according to any one of claims 1-7 is stored, the method comprising:
obtaining a plurality of access logs to be confirmed within a predetermined period, wherein each access log includes the URI of the requested resource and access data associated with that URI;
extracting, from the plurality of access logs to be confirmed, the plurality of access logs directed to a same URI, calculating a plurality of URI feature values of that URI from the access data of those access logs, and constructing the plurality of URI feature values into a URI feature vector; and
inputting each URI feature vector of the plurality of access logs to be confirmed into the URI-based classification model, obtaining the URIs corresponding to those URI feature vectors whose output is the negative sample label, and marking them as URIs suspected of being Webshell attacks on the website.
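By way of illustration only, the detection flow of claim 8 might be sketched as follows. The `featurize` and `classify` functions below are simplified stand-ins invented for this example (a two-value feature vector and a hand-written rule); in practice the feature extraction and the trained URI-based classification model would be supplied in their place. The label encoding and log fields are also assumptions.

```python
from collections import defaultdict

NEGATIVE = 0  # hypothetical encoding of the negative (Webshell) sample label

def detect_suspicious_uris(pending_logs, featurize, classify):
    """Return the URIs whose feature vector the model labels as negative."""
    logs_by_uri = defaultdict(list)
    for log in pending_logs:
        logs_by_uri[log["uri"]].append(log)
    return sorted(uri for uri, logs in logs_by_uri.items()
                  if classify(featurize(logs)) == NEGATIVE)

# Stand-ins for illustration only.
def featurize(logs):
    """Reduced feature vector: [distinct client IPs, total accesses]."""
    return [len({log["ip"] for log in logs}), len(logs)]

def classify(vector):
    """Toy rule: a URI hit repeatedly by a single client is labelled negative."""
    client_ips, total = vector
    return NEGATIVE if client_ips == 1 and total >= 3 else 1

pending_logs = [
    {"uri": "/index.html", "ip": "10.0.0.1"},
    {"uri": "/index.html", "ip": "10.0.0.2"},
    {"uri": "/index.html", "ip": "10.0.0.3"},
    {"uri": "/upload/x.php", "ip": "10.0.0.9"},
    {"uri": "/upload/x.php", "ip": "10.0.0.9"},
    {"uri": "/upload/x.php", "ip": "10.0.0.9"},
]
suspicious = detect_suspicious_uris(pending_logs, featurize, classify)
```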
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods according to claims 1-7 or claim 8.
10. A computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods according to claims 1-7 or claim 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711276201.4A CN107888616B (en) | 2017-12-06 | 2017-12-06 | Construction method of classification model based on URI and detection method of Webshell attack website |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107888616A (en) | 2018-04-06 |
CN107888616B (en) | 2020-06-05 |
Family ID=61773179
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711276201.4A Active CN107888616B (en) | 2017-12-06 | 2017-12-06 | Construction method of classification model based on URI and detection method of Webshell attack website |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107888616B (en) |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763470A (en) * | 2018-05-29 | 2018-11-06 | 北京白山耘科技有限公司 | A kind of method and device detecting dangerous information by text message |
CN108920959A (en) * | 2018-07-21 | 2018-11-30 | 杭州安恒信息技术股份有限公司 | A kind of webshell detection method based on Bayesian model optimization |
CN109101527A (en) * | 2018-06-21 | 2018-12-28 | 中国科学院信息工程研究所 | A kind of magnanimity security log information filter method and device |
CN109508542A (en) * | 2018-10-26 | 2019-03-22 | 国家计算机网络与信息安全管理中心江苏分中心 | WEB method for detecting abnormality, system and server under big data environment |
CN109525551A (en) * | 2018-10-07 | 2019-03-26 | 杭州安恒信息技术股份有限公司 | A method of the CC based on statistical machine learning attacks protection |
CN109600382A (en) * | 2018-12-19 | 2019-04-09 | 北京知道创宇信息技术有限公司 | Webshell detection method and device, HMM model training method and device |
CN110175278A (en) * | 2019-05-24 | 2019-08-27 | 新华三信息安全技术有限公司 | The detection method and device of web crawlers |
CN110351299A (en) * | 2019-07-25 | 2019-10-18 | 新华三信息安全技术有限公司 | A kind of network connection detection method and device |
CN110602137A (en) * | 2019-09-25 | 2019-12-20 | 光通天下网络科技股份有限公司 | Malicious IP and malicious URL intercepting method, device, equipment and medium |
CN110868419A (en) * | 2019-11-18 | 2020-03-06 | 杭州安恒信息技术股份有限公司 | Method and device for detecting WEB backdoor attack event and electronic equipment |
CN110933115A (en) * | 2019-12-31 | 2020-03-27 | 上海观安信息技术股份有限公司 | Analysis object behavior abnormity detection method and device based on dynamic session |
CN110968564A (en) * | 2018-09-28 | 2020-04-07 | 阿里巴巴集团控股有限公司 | Data processing method and training method of data state prediction model |
CN111107096A (en) * | 2019-12-27 | 2020-05-05 | 杭州迪普科技股份有限公司 | Web site safety protection method and device |
CN111600894A (en) * | 2020-05-20 | 2020-08-28 | 新华三信息安全技术有限公司 | Network attack detection method and device |
CN113132329A (en) * | 2019-12-31 | 2021-07-16 | 深信服科技股份有限公司 | WEBSHELL detection method, device, equipment and storage medium |
WO2021169239A1 (en) * | 2020-02-24 | 2021-09-02 | 网宿科技股份有限公司 | Crawler data recognition method, system and device |
CN113779571A (en) * | 2020-06-10 | 2021-12-10 | 中国电信股份有限公司 | WebShell detection device, WebShell detection method and computer-readable storage medium |
CN113783889A (en) * | 2021-09-22 | 2021-12-10 | 南方电网数字电网研究院有限公司 | Firewall control method for linkage access of network layer and application layer and firewall thereof |
WO2022117063A1 (en) * | 2020-12-03 | 2022-06-09 | 百果园技术(新加坡)有限公司 | Method and apparatus for training isolation forest, and method and apparatus for recognizing web crawler |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102801698A (en) * | 2011-12-20 | 2012-11-28 | 北京安天电子设备有限公司 | Uniform resource locator (URL) request time sequence-based detection method and system for malicious codes |
CN103684896A (en) * | 2012-09-07 | 2014-03-26 | 中国科学院计算机网络信息中心 | Method of detecting website cheating based on domain name resolution characteristics |
CN104468477A (en) * | 2013-09-16 | 2015-03-25 | 杭州迪普科技有限公司 | WebShell detection method and system |
CN104766014A (en) * | 2015-04-30 | 2015-07-08 | 安一恒通(北京)科技有限公司 | Method and system used for detecting malicious website |
CN105956472A (en) * | 2016-05-12 | 2016-09-21 | 宝利九章(北京)数据技术有限公司 | Method and system for identifying whether webpage includes malicious content or not |
CN106961419A (en) * | 2017-02-13 | 2017-07-18 | 深信服科技股份有限公司 | WebShell detection methods, apparatus and system |
CN107332848A (en) * | 2017-07-05 | 2017-11-07 | 重庆邮电大学 | A kind of exception of network traffic real-time monitoring system based on big data |
CN107404497A (en) * | 2017-09-05 | 2017-11-28 | 成都知道创宇信息技术有限公司 | A kind of method that WebShell is detected in massive logs |
Non-Patent Citations (1)
Title |
---|
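Shi Liuyang et al.: "Research on Webshell Detection Methods Based on Web Logs" (基于Web日志的Webshell检测方法研究), Journal of Information Security Research (《信息安全研究》) * |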
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019228158A1 (en) * | 2018-05-29 | 2019-12-05 | 北京白山耘科技有限公司 | Method and apparatus for detecting dangerous information by means of text information, medium, and device |
CN108763470A (en) * | 2018-05-29 | 2018-11-06 | 北京白山耘科技有限公司 | A kind of method and device detecting dangerous information by text message |
CN109101527A (en) * | 2018-06-21 | 2018-12-28 | 中国科学院信息工程研究所 | A kind of magnanimity security log information filter method and device |
CN108920959A (en) * | 2018-07-21 | 2018-11-30 | 杭州安恒信息技术股份有限公司 | A kind of webshell detection method based on Bayesian model optimization |
CN108920959B (en) * | 2018-07-21 | 2020-12-01 | 杭州安恒信息技术股份有限公司 | Webshell detection method based on Bayesian model optimization |
CN110968564B (en) * | 2018-09-28 | 2023-04-25 | 阿里巴巴集团控股有限公司 | Data processing method and training method of data state prediction model |
CN110968564A (en) * | 2018-09-28 | 2020-04-07 | 阿里巴巴集团控股有限公司 | Data processing method and training method of data state prediction model |
CN109525551A (en) * | 2018-10-07 | 2019-03-26 | 杭州安恒信息技术股份有限公司 | A method of the CC based on statistical machine learning attacks protection |
CN109508542B (en) * | 2018-10-26 | 2019-11-22 | 国家计算机网络与信息安全管理中心江苏分中心 | WEB method for detecting abnormality, system and server under big data environment |
CN109508542A (en) * | 2018-10-26 | 2019-03-22 | 国家计算机网络与信息安全管理中心江苏分中心 | WEB method for detecting abnormality, system and server under big data environment |
CN109600382A (en) * | 2018-12-19 | 2019-04-09 | 北京知道创宇信息技术有限公司 | Webshell detection method and device, HMM model training method and device |
CN109600382B (en) * | 2018-12-19 | 2021-07-13 | 北京知道创宇信息技术股份有限公司 | Webshell detection method and device and HMM model training method and device |
CN110175278A (en) * | 2019-05-24 | 2019-08-27 | 新华三信息安全技术有限公司 | The detection method and device of web crawlers |
CN110351299A (en) * | 2019-07-25 | 2019-10-18 | 新华三信息安全技术有限公司 | A kind of network connection detection method and device |
CN110602137A (en) * | 2019-09-25 | 2019-12-20 | 光通天下网络科技股份有限公司 | Malicious IP and malicious URL intercepting method, device, equipment and medium |
CN110868419A (en) * | 2019-11-18 | 2020-03-06 | 杭州安恒信息技术股份有限公司 | Method and device for detecting WEB backdoor attack event and electronic equipment |
CN111107096A (en) * | 2019-12-27 | 2020-05-05 | 杭州迪普科技股份有限公司 | Web site safety protection method and device |
CN110933115A (en) * | 2019-12-31 | 2020-03-27 | 上海观安信息技术股份有限公司 | Analysis object behavior abnormity detection method and device based on dynamic session |
CN113132329A (en) * | 2019-12-31 | 2021-07-16 | 深信服科技股份有限公司 | WEBSHELL detection method, device, equipment and storage medium |
CN110933115B (en) * | 2019-12-31 | 2022-04-29 | 上海观安信息技术股份有限公司 | Analysis object behavior abnormity detection method and device based on dynamic session |
WO2021169239A1 (en) * | 2020-02-24 | 2021-09-02 | 网宿科技股份有限公司 | Crawler data recognition method, system and device |
CN111600894A (en) * | 2020-05-20 | 2020-08-28 | 新华三信息安全技术有限公司 | Network attack detection method and device |
CN111600894B (en) * | 2020-05-20 | 2023-05-16 | 新华三信息安全技术有限公司 | Network attack detection method and device |
CN113779571A (en) * | 2020-06-10 | 2021-12-10 | 中国电信股份有限公司 | WebShell detection device, WebShell detection method and computer-readable storage medium |
CN113779571B (en) * | 2020-06-10 | 2024-04-26 | 天翼云科技有限公司 | WebShell detection device, webShell detection method and computer readable storage medium |
WO2022117063A1 (en) * | 2020-12-03 | 2022-06-09 | 百果园技术(新加坡)有限公司 | Method and apparatus for training isolation forest, and method and apparatus for recognizing web crawler |
CN113783889A (en) * | 2021-09-22 | 2021-12-10 | 南方电网数字电网研究院有限公司 | Firewall control method for linkage access of network layer and application layer and firewall thereof |
Also Published As
Publication number | Publication date |
---|---|
CN107888616B (en) | 2020-06-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107888616A (en) | Construction method of classification model based on URI and detection method of Webshell attack website | |
US11190562B2 (en) | Generic event stream processing for machine learning | |
Harinahalli Lokesh et al. | Phishing website detection based on effective machine learning approach | |
Sheikhan et al. | Intrusion detection using reduced-size RNN based on feature grouping | |
Ali Alheeti et al. | Intelligent intrusion detection in external communication systems for autonomous vehicles | |
CN111614599B (en) | Webshell detection method and device based on artificial intelligence | |
CN107729532A (en) | A kind of resume matching process and computing device | |
US20130042306A1 (en) | Determining machine behavior | |
US11593475B2 (en) | Security information analysis device, security information analysis method, security information analysis program, security information evaluation device, security information evaluation method, security information analysis system, and recording medium | |
CN106992981B (en) | Website backdoor detection method and device and computing equipment | |
CN107003976A (en) | Based on active rule can be permitted determine that activity can be permitted | |
Chu et al. | Bot or human? A behavior-based online bot detection system | |
CN110830445B (en) | Method and device for identifying abnormal access object | |
CN110855648B (en) | Early warning control method and device for network attack | |
WO2021068563A1 (en) | Sample date processing method, device and computer equipment, and storage medium | |
CN111224941B (en) | Threat type identification method and device | |
Abawajy et al. | Hybrid consensus pruning of ensemble classifiers for big data malware detection | |
CN111680167A (en) | Service request response method and server | |
Eldos et al. | On the KDD'99 Dataset: Statistical Analysis for Feature Selection | |
Hajdu et al. | Use of artificial neural networks to identify fake profiles | |
CN115314239A (en) | Analysis method and related equipment for hidden malicious behaviors based on multi-model fusion | |
RU2745362C1 (en) | System and method of generating individual content for service user | |
CN113822684A (en) | Heikou user recognition model training method and device, electronic equipment and storage medium | |
CN114915434A (en) | Network agent detection method, device, storage medium and computer equipment | |
She et al. | An improved malicious code intrusion detection method based on target tree for space information network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102; Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.; Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing; Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd. |
GR01 | Patent grant | ||