CN107888616A - Method for constructing a URI-based classification model and method for detecting Webshell attacks on websites - Google Patents

Info

Publication number
CN107888616A
Authority
CN
China
Prior art keywords
uri
access
data
website
webshell
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711276201.4A
Other languages
Chinese (zh)
Other versions
CN107888616B (en
Inventor
陈金战
杨旭
张通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Knownsec Information Technology Co Ltd
Original Assignee
Beijing Knownsec Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Knownsec Information Technology Co Ltd
Priority to CN201711276201.4A
Publication of CN107888616A
Application granted
Publication of CN107888616B
Current legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • H04L63/1425 Traffic logging, e.g. anomaly detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a method for constructing a URI-based classification model, executed in a computing device. The method comprises: obtaining multiple access logs confirmed to be normal website accesses as positive sample data, and multiple access logs confirmed to be Webshell attacks on websites as negative sample data, where each access log includes the URI of the requested resource and the access data associated with that URI; extracting, from the positive sample data and the negative sample data respectively, the access logs that target the same URI, computing multiple URI feature values for that URI from the access data of those logs, and assembling the feature values into one URI feature vector; generating a first positive sample set and a first negative sample set from the URI feature vectors of the URIs in the positive and negative sample data together with their corresponding positive or negative sample labels, and generating a first training set from the two sample sets; and training on the first training set with a predetermined algorithm, taking the URI feature vector of each sample as input and its sample label as output, to obtain the URI-based classification model.

Description

Method for constructing a URI-based classification model and method for detecting Webshell attacks on websites
Technical field
The present invention relates to the field of Internet technology, and in particular to a method for constructing a URI-based classification model, a method for detecting Webshell attacks on websites, and a computing device.
Background
A Webshell is a command execution environment that exists in the form of a web page file such as asp, php, jsp, or cgi; it can also be described as a web page backdoor. After compromising a website, an intruder typically places a Webshell backdoor file in the web directory of the web server, mixed in with the normal files in that directory, where it is hard to spot. The intruder can then access the Webshell over the web to obtain a command execution environment and thereby control the website or the web server. The operations available include uploading and downloading files, viewing databases, and executing arbitrary program commands.
Data exchanged with the remote host travels over port 80 and is therefore not blocked by a firewall. Using a Webshell also generally leaves no record in the system log; only some data-submission records remain in the web server's log, so an inexperienced administrator will have difficulty finding traces of the intrusion.
Most existing Webshell detection methods that work on access logs are based on rules and feature libraries. For example, they collect Webshells published on the Internet and analyze their features, or add some sensitive functions, to build a Webshell feature library. They then match these features or sensitive functions against the website's access logs and, on a match, confirm manually whether a Webshell is present. Such detection depends on the accumulated body of existing Webshell attacks and can detect only known attacks; unknown Webshells are hard to find.
Accordingly, a more effective and comprehensive Webshell detection method is needed.
Summary of the invention
To this end, the present invention provides a method for constructing a URI-based classification model, a method for detecting Webshell attacks on websites, and a computing device, in an effort to solve or at least alleviate the problems described above.
According to one aspect of the present invention, there is provided a method for constructing a URI-based classification model, executed in a computing device and adapted to distinguish URIs of normal website accesses from URIs suspected of being Webshell attacks on a website. The method comprises: obtaining multiple access logs confirmed to be normal website accesses as positive sample data, and multiple access logs confirmed to be Webshell attacks on websites as negative sample data, where each access log includes the URI of the requested resource and the access data associated with that URI; extracting, from the positive sample data and the negative sample data respectively, the access logs that target the same URI, computing multiple URI feature values for that URI from the access data of those logs, and assembling the feature values into one URI feature vector; generating a first positive sample set from the URI feature vector of each URI in the positive sample data and its corresponding positive sample label, and generating a first negative sample set from the URI feature vector of each URI in the negative sample data and its corresponding negative sample label; and generating a first training set from the first positive sample set and the first negative sample set, then training on the first training set with a predetermined algorithm, taking the URI feature vector of each sample in the first training set as input and its sample label as output, to obtain the URI-based classification model.
Optionally, in the method for constructing a URI-based classification model according to the present invention, the access data of an access log includes one or more of the following parameters: the IP of the requesting user, the request method, the status code returned for the request, the CDN hit status, the attack type detected by the firewall, the request parameters, the request start time, and the request message length.
Optionally, in the method for constructing a URI-based classification model according to the present invention, the multiple URI feature values include one or more of the following: the number of client IPs accessing the URI, the total number of accesses to the URI, the ratio of failed responses among accesses to the URI, the ratio of requests intercepted by the WAF among accesses to the URI, whether accesses to the URI hit the CDN, and the number of request-parameter variations among accesses to the URI.
Optionally, in the method for constructing a URI-based classification model according to the present invention, the number of client IPs accessing the URI is computed from the IPs of the requesting users; the total number of accesses to the URI is computed from the status codes returned for the requests or from the attack types detected by the firewall; the ratio of failed responses is computed from the status codes returned for the requests; the ratio of requests intercepted by the firewall is computed from the attack types detected by the firewall; whether accesses to the URI hit the CDN is determined from the CDN hit status; and the number of request-parameter variations is computed from the request parameters.
Optionally, in the method for constructing a URI-based classification model according to the present invention, the step of computing multiple URI feature values for the URI from the access data of the access logs comprises: converting the positive sample data and the negative sample data into a data frame according to the meaning of each field; and aggregating the data frame by URI to obtain a data column for each kind of access data, then extracting the corresponding URI feature values from each data column. The requesting user's IP, the CDN hit status, and the request parameters are suited to generating their data columns with the collect_set method, while the returned status code and the firewall-detected attack type are suited to generating their data columns with the collect_list method.
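The aggregation step above can be sketched in plain Python. This is only an illustration of the set-versus-list distinction, not the patented implementation: the function name `aggregate_by_uri`, the lowercase `uri` key, and the sample values such as `"HIT"` and `"MISS"` are assumptions, and the standard-library containers merely imitate what the collect_set and collect_list aggregations would produce.

```python
from collections import defaultdict

# Fields aggregated with set semantics (duplicates dropped) versus list
# semantics (duplicates kept), following the split described in the text.
SET_FIELDS = ("client_ip", "hit_status", "query")
LIST_FIELDS = ("resp_code", "attack_type")


def aggregate_by_uri(records):
    """Group log records (dicts) by URI into one column per field.

    Hypothetical helper: imitates a group-by-URI aggregation in which some
    columns are collected as sets and others as lists.
    """
    grouped = defaultdict(lambda: {
        **{field: set() for field in SET_FIELDS},
        **{field: [] for field in LIST_FIELDS},
    })
    for record in records:
        columns = grouped[record["uri"]]
        for field in SET_FIELDS:
            columns[field].add(record[field])
        for field in LIST_FIELDS:
            columns[field].append(record[field])
    return dict(grouped)


# Made-up sample records; field values are assumptions for illustration.
logs = [
    {"uri": "/a.php", "client_ip": "1.1.1.1", "hit_status": "MISS",
     "query": "x=1", "resp_code": 200, "attack_type": ""},
    {"uri": "/a.php", "client_ip": "1.1.1.1", "hit_status": "MISS",
     "query": "x=2", "resp_code": 404, "attack_type": "webshell"},
    {"uri": "/b.php", "client_ip": "2.2.2.2", "hit_status": "HIT",
     "query": "", "resp_code": 200, "attack_type": ""},
]
columns = aggregate_by_uri(logs)
```

Keeping status codes and attack types as lists preserves how often each value occurred, which ratio-style features need; the set columns only need the distinct values.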
Optionally, the method for constructing a URI-based classification model according to the present invention further comprises the steps of: generating a first validation set from the first positive sample set and the first negative sample set; inputting the URI feature vector of each sample in the first validation set into the URI-based classification model to predict the sample label of each sample; and comparing the predicted sample label of each sample with its actual sample label to compute the accuracy of the URI-based classification model.
Optionally, in the method for constructing a URI-based classification model according to the present invention, the first training set and the first validation set are generated as follows: the first positive sample set and the first negative sample set are each randomly divided into two groups; one group of the first positive sample set and one group of the first negative sample set are merged to form the first training set, and the other group of the first positive sample set and the other group of the first negative sample set are merged to form the first validation set.
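A minimal sketch of this split and of the accuracy check follows. It assumes each sample is a `(feature_vector, label)` pair with label 1 for positive and 0 for negative, and it divides each set into two near-equal halves; the text only says "randomly divided into two groups", so the even split, the fixed seed, and all function names are assumptions.

```python
import random


def split_in_two(samples, rng):
    """Randomly divide a sample list into two groups of near-equal size."""
    shuffled = list(samples)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def build_training_and_validation(positive_set, negative_set, seed=0):
    """Merge one group of each class into training, the rest into validation."""
    rng = random.Random(seed)
    pos_a, pos_b = split_in_two(positive_set, rng)
    neg_a, neg_b = split_in_two(negative_set, rng)
    return pos_a + neg_a, pos_b + neg_b


def accuracy(predicted_labels, actual_labels):
    """Fraction of samples whose predicted label equals the actual label."""
    matches = sum(p == a for p, a in zip(predicted_labels, actual_labels))
    return matches / len(actual_labels)


# Made-up samples: (feature_vector, label), label 1 = positive, 0 = negative.
positives = [([120, 900], 1), ([80, 400], 1), ([60, 350], 1), ([95, 700], 1)]
negatives = [([1, 12], 0), ([2, 30], 0), ([1, 8], 0), ([3, 25], 0)]
training_set, validation_set = build_training_and_validation(positives, negatives)
```

Splitting each class separately before merging keeps both labels present in the training set and in the validation set.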
According to another aspect of the present invention, there is provided a method for detecting Webshell attacks on websites, adapted to be executed in a computing device that stores the URI-based classification model described above. The method comprises: obtaining multiple access logs to be confirmed within a predetermined period, where each access log includes the URI of the requested resource and the access data associated with that URI; extracting, from the access logs to be confirmed, the access logs that target the same URI, computing multiple URI feature values for that URI from the access data of those logs, and assembling the feature values into one URI feature vector; and inputting each URI feature vector of the access logs to be confirmed into the URI-based classification model, obtaining the URIs whose feature vectors yield a negative sample label as output, and marking them as URIs suspected of being Webshell attacks on the website.
Optionally, in the method for detecting Webshell attacks on websites according to the present invention, the computing device also stores an access-sequence-based classification model. This model is adapted to distinguish, among the URIs suspected of being Webshell attacks on the website, URIs accessed by scanners from URIs actually under Webshell attack, and is built as follows: obtaining multiple access logs confirmed to be normal website accesses as positive sample data, and multiple access logs confirmed to be Webshell attacks on websites as negative sample data; extracting, from the positive sample data and the negative sample data respectively, the access logs that target the same URI, computing multiple access-sequence feature values for that URI from the access data of those logs, and assembling the feature values into one access-sequence feature vector; generating a second positive sample set from the access-sequence feature vector of each URI in the positive sample data and its corresponding positive sample label, and generating a second negative sample set from the access-sequence feature vector of each URI in the negative sample data and its corresponding negative sample label; and generating a second training set from the second positive sample set and the second negative sample set, then training on the second training set with a predetermined algorithm, taking the access-sequence feature vector of each sample in the second training set as input and its sample label as output, to obtain the access-sequence-based classification model.
Optionally, the method for detecting Webshell attacks on websites according to the present invention further comprises the steps of: obtaining the original logs corresponding to the URIs suspected of being Webshell attacks on the website; extracting, from those original logs, the access logs that target the same URI, computing multiple access-sequence feature values for that URI from the access data of those logs, and assembling the feature values into one access-sequence feature vector; and inputting each access-sequence feature vector of the original logs into the access-sequence-based classification model, obtaining the URIs whose feature vectors yield a negative sample label as output, and marking them as URIs under Webshell attack.
Optionally, in the method for detecting Webshell attacks on websites according to the present invention, the access-sequence feature values include one or more of the following: the GET/POST request ratio, the request success ratio, the mean access-time interval, the variance of the access-time interval, the mean request message length, and the variance of the request message length.
Optionally, in the method for detecting Webshell attacks on websites according to the present invention, the GET/POST request ratio is computed from the request method; the request success ratio is computed from the status code returned for the request; the mean and variance of the access-time interval are both determined from the request start time; and the mean and variance of the request message length are both determined from the request message length.
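The six access-sequence feature values can be sketched as follows. The text names the features and their source fields but not the exact formulas, so several choices here are assumptions: the GET/POST ratio is taken as the share of GET requests among GET and POST requests, a status code below 400 counts as success, intervals are differences between consecutive start times in seconds, and population variance is used. The function name `access_sequence_features` is hypothetical.

```python
from statistics import mean, pvariance


def access_sequence_features(records):
    """Compute access-sequence features for the log records of ONE URI.

    Each record is a dict with 'method', 'resp_code', 'start_time'
    (seconds) and 'req_bytes'. All formulas are illustrative assumptions.
    """
    ordered = sorted(records, key=lambda r: r["start_time"])
    gets = sum(1 for r in ordered if r["method"] == "GET")
    posts = sum(1 for r in ordered if r["method"] == "POST")
    # Share of GET among GET+POST requests (avoids dividing by zero POSTs).
    get_ratio = gets / (gets + posts) if gets + posts else 0.0
    # Assumption: a returned status code below 400 means the request succeeded.
    success_ratio = sum(1 for r in ordered if r["resp_code"] < 400) / len(ordered)
    times = [r["start_time"] for r in ordered]
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    interval_mean = mean(intervals) if intervals else 0.0
    interval_var = pvariance(intervals) if intervals else 0.0
    lengths = [r["req_bytes"] for r in ordered]
    return [get_ratio, success_ratio, interval_mean, interval_var,
            mean(lengths), pvariance(lengths)]


# Made-up records for a single URI.
sample = [
    {"method": "GET", "resp_code": 200, "start_time": 0, "req_bytes": 100},
    {"method": "POST", "resp_code": 200, "start_time": 10, "req_bytes": 200},
    {"method": "POST", "resp_code": 404, "start_time": 30, "req_bytes": 300},
]
features = access_sequence_features(sample)
```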
Optionally, the method for detecting Webshell attacks on websites according to the present invention further comprises a step of preprocessing the access logs to be confirmed: filtering out of those access logs the entries corresponding to static paths, whitelisted paths, and paths without a Webshell suffix.
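A minimal sketch of such a preprocessing filter is shown below. The Webshell suffixes come from the file types named in the background section (asp, php, jsp, cgi); the static-suffix list, the whitelist entries, and the function name `keep_for_detection` are hypothetical placeholders that a real deployment would replace with its own configuration.

```python
import posixpath
from urllib.parse import urlsplit

# Suffixes named in the background section as typical Webshell file types.
WEBSHELL_SUFFIXES = {".asp", ".php", ".jsp", ".cgi"}
# Hypothetical examples of static resources and whitelisted paths.
STATIC_SUFFIXES = {".css", ".js", ".png", ".jpg", ".gif", ".ico"}
WHITELIST_PATHS = {"/index.php", "/health.php"}


def keep_for_detection(uri):
    """Return True if a log entry for this URI should reach the classifier."""
    path = urlsplit(uri).path
    suffix = posixpath.splitext(path)[1].lower()
    if suffix in STATIC_SUFFIXES:
        return False  # static path
    if path in WHITELIST_PATHS:
        return False  # whitelisted path
    return suffix in WEBSHELL_SUFFIXES  # drop non-Webshell suffixes


uris = ["/upload/x.php?cmd=ls", "/static/app.js", "/index.php", "/about"]
kept = [uri for uri in uris if keep_for_detection(uri)]
```

Filtering before feature extraction reduces the number of URIs that have to be aggregated and classified.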
Optionally, in the method for detecting Webshell attacks on websites according to the present invention, in the output of the URI-based classification model, a negative sample label indicates that the URI corresponding to the sample is suspected of being a Webshell attack on the website, and a positive sample label indicates that the URI corresponding to the sample is a normal website access. In the output of the access-sequence-based classification model, a negative sample label indicates that the URI corresponding to the sample is under Webshell attack, and a positive sample label indicates that the URI corresponding to the sample was accessed by a scanner.
According to another aspect of the present invention, there is provided a computing device comprising: one or more processors; a memory; and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, and include instructions for performing the method for constructing a URI-based classification model and the method for detecting Webshell attacks on websites described above.
According to yet another aspect of the present invention, there is provided a computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform the method for constructing a URI-based classification model and the method for detecting Webshell attacks on websites described above.
According to the technical solution of the present invention, a URI-based classification model is built from normal access logs and confirmed Webshell attack logs. With this model, URIs suspected of being Webshell attacks can be identified in network access logs of unknown nature, and the access logs suspected of being Webshell attacks can then be filtered out. This reduces a massive volume of logs to a very small amount of data to be confirmed, which greatly reduces the cost of manual confirmation and improves the efficiency of Webshell detection.
Further, the present invention builds an access-sequence-based classification model. With this model, the access logs that are genuinely Webshell attacks can be identified among the access logs suspected of being Webshell attacks, narrowing down the attack logs further. This two-stage classification approach mines the network logs for URIs that have been successfully attacked by a Webshell, further reducing the amount of data to be confirmed and improving the efficiency of Webshell detection.
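The two-stage flow can be sketched as below. The two "models" are toy threshold rules standing in for the trained classifiers, and the label encoding (0 for the negative label, 1 for the positive label), the thresholds, and all names are assumptions made only to show how the stages chain together.

```python
NEGATIVE, POSITIVE = 0, 1  # assumed encoding of the sample labels


def two_stage_detect(uri_vectors, sequence_vectors, uri_model, sequence_model):
    """Return (suspected, confirmed) URI lists.

    uri_vectors / sequence_vectors map each URI to its feature vector;
    each model is a callable returning NEGATIVE or POSITIVE.
    """
    # Stage 1: a negative label marks a URI suspected of Webshell attack.
    suspected = [uri for uri, vec in uri_vectors.items()
                 if uri_model(vec) == NEGATIVE]
    # Stage 2: among suspects, a negative label marks an actual attack;
    # a positive label is treated as scanner traffic.
    confirmed = [uri for uri in suspected
                 if sequence_model(sequence_vectors[uri]) == NEGATIVE]
    return suspected, confirmed


def toy_uri_model(vec):
    """Placeholder rule, not a trained model: very few client IPs is suspicious."""
    return NEGATIVE if vec[0] <= 2 else POSITIVE


def toy_sequence_model(vec):
    """Placeholder rule, not a trained model: mostly successful requests."""
    return NEGATIVE if vec[0] >= 0.5 else POSITIVE


uri_vectors = {"/index.php": [250], "/upload/a.php": [1], "/admin/test.php": [1]}
sequence_vectors = {"/upload/a.php": [0.9], "/admin/test.php": [0.0]}
suspected, confirmed = two_stage_detect(
    uri_vectors, sequence_vectors, toy_uri_model, toy_sequence_model)
```

Only the suspects from the first stage need access-sequence features, so the second model runs on a much smaller set of URIs.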
Brief description of the drawings
To achieve the foregoing and related ends, certain illustrative aspects are described herein in conjunction with the following description and the accompanying drawings. These aspects indicate various ways in which the principles disclosed herein may be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout the present disclosure, the same reference numerals generally refer to the same parts or elements.
Fig. 1 shows a schematic diagram of a network request processing system 100 according to an embodiment of the present invention;
Fig. 2 shows a schematic diagram of a computing device 200 according to an embodiment of the present invention;
Fig. 3 shows a flowchart of a method 300 for constructing a URI-based classification model according to an embodiment of the present invention;
Fig. 4 shows a flowchart of a method 400 for detecting Webshell attacks on websites according to an embodiment of the present invention;
Fig. 5 shows a flowchart of a method 500 for constructing an access-sequence-based classification model according to an embodiment of the present invention; and
Fig. 6 shows a flowchart of a method 600 for detecting Webshell attacks on websites according to another embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Fig. 1 shows a schematic diagram of a network request processing system 100 according to an embodiment of the present invention. As shown in Fig. 1, the system includes multiple clients 110 (for example, clients 1 to m), a CDN (Content Delivery Network) node 120, a firewall (WAF) node 130, a data storage device 140, a computing device 150, and multiple back-end websites 160 (for example, websites 1 and 2). The data storage device 140 is communicatively connected to both the CDN node 120 and the firewall node 130, and the computing device 150 is communicatively connected to the data storage device 140. It should be noted that the network request processing system 100 in Fig. 1 is only an example; in practice, the system 100 may contain different numbers of clients, CDN nodes, WAF nodes, data storage devices, and computing devices, and the present invention does not limit the number of each device included in the system 100.
The client 110 may be implemented as a web browser, instant-messaging client software, or the like, and is typically installed on a client computer such as a personal computer, mobile phone, tablet computer, personal media player device, wireless web-browsing device, or application-specific device. The data storage device 140 may reside in the computing device 150 as a local database, may be located outside the computing device 150 as a remote database, or may be a distributed database such as HBase deployed across multiple geographic locations. In short, the data storage device 140 is used to store data, and the present invention does not limit its specific configuration. The computing device 150 can read and write data in the data storage device 140 over the Internet through a wired or wireless connection. In general, all network access logs are collected in the data storage device 140; the computing device 150 obtains the relevant access logs from the data storage device and identifies those that belong to Webshell attacks.
Fig. 2 is a block diagram of an example computing device 200. In a basic configuration 202, the computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 can be used for communication between the processor 204 and the system memory 206.
Depending on the desired configuration, the processor 204 can be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination of them. The processor 204 can include one or more levels of cache, such as a level-one cache 210 and a level-two cache 212, a processor core 214, and registers 216. An example processor core 214 can include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination of them. An example memory controller 218 can be used with the processor 204, or in some implementations the memory controller 218 can be an internal part of the processor 204.
Depending on the desired configuration, the system memory 206 can be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination of them. The system memory 206 can include an operating system 220, one or more applications 222, and program data 224. In some embodiments, the applications 222 can be arranged to operate with the program data 224 on the operating system. The program data 224 includes instructions; in the computing device 200 according to the present invention, the program data 224 includes instructions for performing at least one of the method 300 for constructing a URI-based classification model, the method 500 for constructing an access-sequence-based classification model, and the methods 400 and 600 for detecting Webshell attacks on websites.
The computing device 200 can also include an interface bus 240 that facilitates communication from various interface devices (for example, output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via a bus/interface controller 230. Example output devices 242 include a graphics processing unit 248 and an audio processing unit 250, which can be configured to communicate with various external devices such as a display or speakers via one or more A/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to communicate via one or more I/O ports 258 with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 246 can include a network controller 260, which can be arranged to facilitate communication with one or more other computing devices 262 over a network communication link via one or more communication ports 264.
A network communication link can be one example of a communication medium. A communication medium can typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and can include any information delivery medium. A "modulated data signal" can be a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As non-limiting examples, communication media can include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used herein can include both storage media and communication media.
The computing device 200 can be implemented as a server, such as a file server, database server, application server, or web server. It can also be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cell phone, personal digital assistant (PDA), personal media player device, wireless web-browsing device, personal headset device, application-specific device, or a hybrid device that includes any of the above functions. The computing device 200 can also be implemented as a personal computer, in either desktop or notebook configuration. In some embodiments, the computing device 200 is configured to perform at least one of the method 300 for constructing a URI-based classification model, the method 500 for constructing an access-sequence-based classification model, and the methods 400 and 600 for detecting Webshell attacks on websites according to the present invention.
In addition, the computing device 200 also stores the URI-based classification model and the access-sequence-based classification model. The URI-based classification model is built on URI-based log features, uses a logistic regression model, and is adapted to distinguish URIs of normal website accesses from URIs suspected of being Webshell attacks on the website. The access-sequence-based classification model is adapted to distinguish, among the URIs suspected of being Webshell attacks on the website, URIs accessed by scanners from URIs actually under Webshell attack.
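To make the logistic regression step concrete, the following is a minimal standard-library sketch trained by batch gradient descent. It is not the patent's implementation: the learning rate, the epoch count, the two-feature toy vectors (scaled to the range 0 to 1), and the label encoding (1 for normal access, 0 for suspected Webshell) are all assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, labels, lr=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(samples[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(samples, labels):
            # Prediction error for this sample under the current parameters.
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (positive label) or 0 (negative label)."""
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Toy vectors: [scaled client-IP count, failed-response ratio] (made up).
train_x = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1],   # normal access -> 1
           [0.1, 0.8], [0.2, 0.9], [0.1, 0.6]]   # suspected Webshell -> 0
train_y = [1, 1, 1, 0, 0, 0]
weights, bias = train_logistic_regression(train_x, train_y)
```

In practice the raw feature values would likely need scaling before training, since counts and ratios live on very different ranges; the text does not say how this is handled.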
Fig. 3 shows a flowchart of a method 300 for constructing a URI-based classification model according to an embodiment of the present invention. The method is adapted to be executed in a computing device, such as the computing device 200. As shown in Fig. 3, the method starts at step S320.
In step S320, multiple access logs confirmed to be normal website accesses are obtained as positive sample data, and multiple access logs confirmed to be Webshell attacks on websites are obtained as negative sample data, where each access log includes the URI of the requested resource and the access data associated with that URI.
The access data of an access log includes one or more of the following parameters: the IP of the requesting user (client_ip), the request method (method, GET or POST), the status code returned for the request (resp_code), the CDN hit status (hit_status), the attack type detected by the firewall (attack_type), the request parameters (query), the request start time (start_time), and the request message length (req_bytes). Note that constructing the URI-based classification model does not actually use the request method, the request start time, or the request message length, so these three items need not be collected at this stage. The log record format formed by these access data can be as shown in Table 1:
Table 1
URI | client_ip | method | resp_code | hit_status | attack_type | query | start_time | req_bytes
According to one embodiment of present invention, after getting original access log, first these access logs can also be entered Row pretreatment operation.Specifically, static path, white list path can be filtered out from both original access logs respectively And/or the access log corresponding to non-Webshell suffix paths.
Then, in step S340, a plurality of access logs for the same URI are extracted from the positive sample data and from the negative sample data respectively, multiple URI feature values of that URI are calculated from the access data of those access logs, and the URI feature values are assembled into one URI feature vector.
That is, the positive sample data contains many access logs, each with a URI, so the access logs for the same URI can be extracted from all access logs of the positive sample data. For example, suppose the positive sample data contains 1000 access logs, 50 of which are for URI-1, and each of these 50 access logs carries the access data shown in Table 1. After the access data of these 50 access logs for URI-1 are aggregated, the URI feature values of URI-1 can be obtained.
The URI feature values include one or more of the following:
1) Number of client IPs accessing the URI: over a sufficiently long period, the number of clients accessing a normal URL is necessarily higher than the number of clients accessing a WebShell; this value is calculated from the IP of the requesting user;
2) Total number of accesses to the URI: over a sufficiently long period, the number of accesses to a normal URL is necessarily higher than the number of accesses to a WebShell URL; this value is calculated from the status code returned for the request or from the attack type detected by the firewall;
3) Proportion of failed responses among accesses to the URI: over a sufficiently long period, the proportion of failed accesses to a normal URL is necessarily lower than the proportion of failed accesses to a WebShell URL; this value is calculated from the status code returned for the request;
4) Proportion of requests intercepted by the WAF among accesses to the URI: because a WAF is present, over a sufficiently long period the proportion of intercepted accesses to a normal URL is necessarily lower than the proportion of intercepted accesses to a WebShell URL. Even if a WebShell can bypass the WAF, some of its requests will still be intercepted; this value is calculated from the attack type detected by the firewall;
5) Whether accesses to the URI hit the CDN: WebShell attack accesses necessarily do not hit the CDN; this value is determined from the CDN hit status. Here a hit may be set to 1 and a miss to 0; other values may of course also be used, and the invention places no restriction on this.
6) Number of request parameter changes among accesses to the URI: the request parameters of WebShell attack accesses normally change constantly; this value is calculated from the request parameters.
Specifically, when calculating the URI feature values, the positive sample data or the negative sample data may each be imported from the log file and converted into a data frame (DataFrame) according to the meaning of each field. The generated DataFrame is then aggregated by URI to obtain a data column for each item of access data. Here client_ip, hit_status and query use the collect_set method to generate the corresponding new columns collect_set(client_ip), collect_set(hit_status) and collect_set(query), while resp_code and attack_type use the collect_list method to generate the corresponding new columns collect_list(resp_code) and collect_list(attack_type). The main difference between collect_set and collect_list is that the former deduplicates the collected data items, whereas the latter lists all data items as they are. For access data such as client_ip, the same URI may be accessed many times, and several of those accesses may share the same client_ip, so deduplication is needed when counting them. The format of the new data columns generated by aggregating on URI is shown in Table 2:
Table 2
Afterwards, the corresponding URI feature values can be extracted from the new data columns of each log feature in Table 2. For example, feature 1) is extracted from the data column collect_set(client_ip) to generate the new column client_ip_count; feature 2) is extracted from the data column collect_list(resp_code) or collect_list(attack_type) to generate the new column req_count; feature 3) is extracted from the data column collect_list(resp_code) to generate the new column resp_err_rate; feature 4) is extracted from the data column collect_list(attack_type) to generate the new column waf_block_rate; feature 5) is extracted from the data column collect_set(hit_status) to generate the new column hit_status; and feature 6) is extracted from the data column collect_set(query) to generate the new column query_count. It should be understood that, for the extraction of each feature value, those skilled in the art may define their own formula or algorithm over the corresponding data column, and the invention places no restriction on this. The column format of the resulting URI feature values may be as shown in Table 3:
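The aggregation and extraction just described can be sketched in plain Python, with sets standing in for collect_set and lists for collect_list. The formulas are assumptions, since the source leaves them to the implementer: a response counts as failed when resp_code is 400 or above, a request counts as intercepted when attack_type is non-empty, and a CDN hit is recorded when any access has hit_status equal to "hit". The function name `uri_features` is hypothetical.

```python
from collections import defaultdict


def uri_features(records):
    """Group log records by URI and compute the six URI feature values per URI."""
    groups = defaultdict(list)
    for r in records:
        groups[r["uri"]].append(r)

    features = {}
    for uri, rows in groups.items():
        ips = {r["client_ip"] for r in rows}          # like collect_set(client_ip)
        hits = {r["hit_status"] for r in rows}        # like collect_set(hit_status)
        queries = {r["query"] for r in rows}          # like collect_set(query)
        codes = [r["resp_code"] for r in rows]        # like collect_list(resp_code)
        attacks = [r["attack_type"] for r in rows]    # like collect_list(attack_type)
        n = len(codes)
        features[uri] = {
            "client_ip_count": len(ips),
            "req_count": n,
            # Assumed failure rule: HTTP status 400 or above.
            "resp_err_rate": sum(c >= 400 for c in codes) / n,
            # Assumed interception rule: the firewall reported an attack type.
            "waf_block_rate": sum(bool(a) for a in attacks) / n,
            "hit_status": 1 if "hit" in hits else 0,
            "query_count": len(queries),
        }
    return features
```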
Table 3
After the URI feature values of each URI are obtained, they can be assembled into a URI feature vector. The feature values may be arranged in any order, and the invention places no restriction on this. Suppose the six URI feature values of a certain URI are 50, 60, 40%, 30%, 1 and 5 respectively; the feature vector constructed in the order listed above is then {50, 60, 40%, 30%, 1, 5}.
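A small sketch of the vector construction, using the six feature values from the example (50, 60, 40%, 30%, 1 and 5). The fixed ordering in `FEATURE_ORDER` and the helper name `to_vector` are assumptions; any ordering works as long as training and detection use the same one.

```python
# Assumed ordering: features 1) to 6) as listed in the description.
FEATURE_ORDER = ["client_ip_count", "req_count", "resp_err_rate",
                 "waf_block_rate", "hit_status", "query_count"]


def to_vector(feature_values, order=FEATURE_ORDER):
    """Arrange named feature values into a fixed-order numeric vector."""
    return [float(feature_values[name]) for name in order]


example = {"client_ip_count": 50, "req_count": 60, "resp_err_rate": 0.40,
           "waf_block_rate": 0.30, "hit_status": 1, "query_count": 5}
vector = to_vector(example)
```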
Then, in step S360, a first positive sample set is generated from the URI feature vectors constructed from the positive sample data and their corresponding positive sample labels, and a first negative sample set is generated from the URI feature vectors constructed from the negative sample data and their corresponding negative sample labels.
That is, each sample in the first sample sets corresponds to one URI, the URI feature vector of that URI, and its sample label, and each URI corresponds to a plurality of access logs. The first positive sample set is the set of URI feature vectors generated from the positive sample data together with their positive sample labels, and the first negative sample set is the set of URI feature vectors generated from the negative sample data together with their negative sample labels. Usually the positive sample label may be represented by 1 and the negative sample label by 0; other values may of course also be used, and the invention places no restriction on this. Table 4 shows a DataFrame example of the first positive sample set, where URI1 to URIn are URIs confirmed as normal website access and their sample label is 1.
Table 4
Then, in step S380, a first training set is generated from the first positive sample set and the first negative sample set. With the URI feature vector of each sample in the first training set as input and its sample label as output, the first training set is trained using a predetermined algorithm to obtain the URI-based classification model.
According to one embodiment of the invention, to generate the first training set, the first positive sample set may be randomly divided into two groups, wDataFrame0 and wDataFrame1, and the first negative sample set may likewise be randomly divided into two groups, bDataFrame0 and bDataFrame1. Any one group of the first positive sample set and any one group of the first negative sample set are then merged to obtain the first training set, for example wDataFrame0 and bDataFrame0. In addition, to keep the positive and negative sample sets balanced in the calculation, different weights may be set for them, and the negative sample set is generally given the higher weight. According to one embodiment, the weight of the negative sample set may be set to 2 and the weight of the positive sample set to 1.
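The split-and-merge procedure can be sketched as follows. The half-and-half split, the fixed seed, and the tuple layout (vector, label, weight) are assumptions for this example; the source only states that each sample set is randomly divided into two groups and that the negative set may carry weight 2 against weight 1 for the positive set.

```python
import random


def split_in_two(samples, seed=0):
    """Randomly split a sample set into two groups of (nearly) equal size."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def build_training_set(positive, negative, pos_weight=1.0, neg_weight=2.0, seed=0):
    """Return (training set, validation set) built from feature vectors.

    Training samples are (vector, label, weight); validation samples are
    (vector, label). Labels: 1 = normal access, 0 = Webshell attack.
    """
    w0, w1 = split_in_two(positive, seed)   # plays the role of wDataFrame0 / wDataFrame1
    b0, b1 = split_in_two(negative, seed)   # plays the role of bDataFrame0 / bDataFrame1
    train = [(v, 1, pos_weight) for v in w0] + [(v, 0, neg_weight) for v in b0]
    valid = [(v, 1) for v in w1] + [(v, 0) for v in b1]
    return train, valid
```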
It should be understood that the predetermined algorithm may be any existing classification or regression algorithm, such as logistic regression, a support vector machine classifier, a Bayes classifier, or a maximum entropy classifier. The specific choice depends on the business scenario and data, and the parameters required by each algorithm may be set by those skilled in the art; the invention places no restriction on this.
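As one possible instance of the predetermined algorithm, the sketch below trains a sample-weighted logistic regression with plain batch gradient descent. It is a minimal illustration, not the patent's implementation: the learning rate, epoch count, 0.5 decision threshold, and the names `train_logistic` and `predict` are all assumptions, and a production system would more likely use an existing machine learning library.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, lr=0.1, epochs=2000):
    """Fit weighted logistic regression.

    samples: list of (vector, label, weight) with label 1 = normal, 0 = attack.
    Returns (weights, bias).
    """
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    total = sum(s[2] for s in samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y, sw in samples:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = sw * (p - y)               # weighted log-loss gradient term
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / total for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b


def predict(model, x):
    """Return the predicted sample label (1 or 0) for one feature vector."""
    w, b = model
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0
```

Features on very different scales (for example raw request counts next to ratios) would normally be normalized before gradient descent; that step is omitted here for brevity.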
According to another embodiment of the invention, the constructed URI-based classification model may also be validated. Specifically, a first validation set is generated from the first positive sample set and the first negative sample set, and the URI feature vector of each sample in the first validation set is input into the constructed URI-based classification model to predict the sample label of each sample. The predicted sample label of each sample is then compared with its actual sample label to calculate the accuracy of the URI-based classification model. Here the first validation set may be obtained by merging the other group of the first positive sample set with the other group of the first negative sample set, for example by merging wDataFrame1 and bDataFrame1.
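The accuracy comparison amounts to counting matching labels. A minimal sketch, with the hypothetical helper name `accuracy`:

```python
def accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the actual label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)
```

With imbalanced data, accuracy alone can hide a poor detection rate on the negative (attack) class, so a per-class breakdown may be worth computing alongside it.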
After the URI-based classification model is constructed, it can be used to identify, from network access logs, URIs suspected of Webshell attacks on a website. Fig. 4 shows a flowchart of a method 400 for detecting Webshell attacks on a website according to an embodiment of the invention. The method is suited to being performed in a computing device, such as the computing device 200. As shown in Fig. 4, the method starts at step S420.
In step S420, a plurality of access logs to be confirmed within a predetermined period are obtained, where each access log includes the URI of the requested resource and the access data associated with that URI. The access data here are the same items of access data as in step S320 and are not described again. The predetermined period may generally be one or two days; for example, the network access logs of the last two days are obtained, and it is determined which of them are Webshell attack access logs. Other periods may of course also be used, and the invention places no restriction on this. In addition, according to one embodiment of the invention, after the access logs to be confirmed are obtained, they may likewise first be preprocessed, that is, access logs corresponding to static paths, whitelisted paths, and/or paths with non-Webshell suffixes are filtered out.
Then, in step S440, a plurality of access logs for the same URI are extracted from the access logs to be confirmed, multiple URI feature values of that URI are calculated from the access data of those access logs, and the URI feature values are assembled into one URI feature vector. For the calculation of the URI feature values and the URI feature vector, refer to the description of step S340, which is not repeated here.
Then, in step S460, each URI feature vector of the access logs to be confirmed is input into the URI-based classification model, the URIs whose feature vectors yield the negative sample label as output are obtained, and these URIs are marked as URIs suspected of Webshell attacks on a website.
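Step S460 can be sketched as a filter over the per-URI feature vectors. The `predict_fn` argument stands for any trained classifier that maps a feature vector to a sample label; it and the function name `suspected_uris` are assumptions for this example.

```python
def suspected_uris(uri_vectors, predict_fn, negative_label=0):
    """Return the URIs whose feature vector is classified with the negative label.

    uri_vectors: dict mapping URI -> URI feature vector.
    predict_fn: callable returning the predicted sample label for a vector.
    """
    return [uri for uri, vec in uri_vectors.items()
            if predict_fn(vec) == negative_label]
```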
As can be understood from the above, in the output of the URI-based classification model, the negative sample label indicates that the URI corresponding to the sample is a URI suspected of a Webshell attack on a website, while the positive sample label indicates that the URI corresponding to the sample is a URI of normal website access. The URIs of the samples whose output is the negative sample label (output 0) are therefore selected as the URIs suspected of Webshell attacks. In this way, some URIs suspected of Webshell attacks can be identified from recent network access logs, and the corresponding original logs can be retrieved through these URIs; the retrieved original logs are the suspected Webshell attack access logs. However, these retrieved logs usually also contain access logs produced by scanners, as shown in Table 5, so it is sometimes also necessary to separate scanner access logs from genuine WebShell access logs, that is, to distinguish URIs accessed by scanners from URIs of genuine WebShell access.
Table 5
Therefore, according to one embodiment of the invention, an access-sequence-based classification model may also be constructed so that URIs accessed by scanners can be distinguished from URIs of Webshell attacks on a website.
Fig. 5 shows a flowchart of a method 500 for constructing an access-sequence-based classification model according to an embodiment of the invention. The method is suited to being performed in a computing device, such as the computing device 200. As shown in Fig. 5, the method starts at step S520.
In step S520, a plurality of access logs confirmed as normal website access are obtained as positive sample data, and a plurality of access logs confirmed as Webshell attacks on a website are obtained as negative sample data, where each access log includes the URI of the requested resource and the access data associated with that URI. This step is similar to step S320, except that the access data needed to construct this model are the request method, the status code returned for the request, the request start time, and the request message length. All access data may be obtained, or only these four items; the invention places no restriction on this. For other details of this step (including operations such as preprocessing), refer to the description of step S320, which is not repeated here.
Then, in step S540, a plurality of access logs for the same URI are extracted from the positive sample data and from the negative sample data respectively, multiple access sequence feature values of that URI are calculated from the access data of those access logs, and the access sequence feature values are assembled into one access sequence feature vector.
The access sequence feature values may include one or more of the following:
a) GET/POST request ratio: a scanner's behavior is relatively fixed, and the sequence of its scan requests is generally also fixed, being either all GET requests or all POST requests. Even when requests are mixed, the ratio of GET to POST requests is relatively fixed. This value is calculated from the request method.
b) Proportion of successful requests: for a given scanner, requests generally either all succeed or all fail. This value is calculated from the status code returned for the request.
c) Mean of access time intervals: the time intervals between a scanner's accesses are fairly regular and stable. The mean of the time intervals of the consecutive access sequence is computed, and is determined from the request start time.
d) Variance of access time intervals: the time intervals between a scanner's accesses are fairly regular and stable. The variance of the time intervals of the consecutive access sequence is computed, and is determined from the request start time.
e) Mean of request message length: the scan request payloads sent by a scanner are generally stable. The mean of the request message length is computed, and is determined from the request message length.
f) Variance of request message length: the scan request payloads sent by a scanner are generally stable. The variance of the request message length is computed, and is determined from the request message length.
Similar to step S340, when calculating the access sequence feature values, the positive sample data or the negative sample data may likewise each be imported from the log file and converted into a data frame (DataFrame) according to the meaning of each field. The generated DataFrame is then aggregated by URI to obtain a data column for each item of access data; only the selected items of access data differ from step S340. Here method, resp_code, start_time and req_bytes use the collect_list method to generate the corresponding data columns collect_list(method), collect_list(resp_code), collect_list(start_time) and collect_list(req_bytes). The format of the new data columns generated by aggregating on URI is shown in Table 6:
Table 6
URI collect_list(method) collect_list(resp_code) collect_list(start_time) collect_list(req_bytes)
Afterwards, the corresponding access sequence feature values can be extracted from the new data columns of each item of access data in Table 6. For example, feature a) is extracted from the data column collect_list(method) to generate the new column get_ratio (the vast majority of entries in the access logs are GET/POST accesses, so other request methods can be ignored and only get_ratio is computed here); feature b) is extracted from the data column collect_list(resp_code) to generate the new column req_ok_ratio; feature c) is extracted from the data column collect_list(start_time) to generate the new column req_interval_mean; feature d) is extracted from the data column collect_list(start_time) to generate the new column req_interval_var; feature e) is extracted from the data column collect_list(req_bytes) to generate the new column req_bytes_mean; and feature f) is extracted from the data column collect_list(req_bytes) to generate the new column req_bytes_var. It should likewise be understood that, for the extraction of each feature value, those skilled in the art may define their own formula or algorithm over the corresponding data column, and the invention places no restriction on this. The column format of the resulting access sequence feature values may be as shown in Table 7:
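The six access sequence feature values for one URI can be sketched with Python's statistics module. The formulas are assumptions, since the source leaves them open: a request counts as successful when resp_code is in the range 200 to 399, intervals are taken between consecutive sorted start times, and population variance is used. The output keys follow the naming pattern of the columns above, and `sequence_features` is a hypothetical helper name.

```python
from statistics import mean, pvariance


def sequence_features(rows):
    """Compute access sequence feature values from all log records of one URI.

    Each record needs: method, resp_code, start_time (seconds), req_bytes.
    """
    methods = [r["method"] for r in rows]
    codes = [r["resp_code"] for r in rows]
    times = sorted(r["start_time"] for r in rows)
    sizes = [r["req_bytes"] for r in rows]
    # Gaps between consecutive accesses; a single access yields one zero gap.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "get_ratio": methods.count("GET") / len(methods),
        # Assumed success rule: HTTP status 200 to 399.
        "req_ok_ratio": sum(200 <= c < 400 for c in codes) / len(codes),
        "req_interval_mean": mean(gaps),
        "req_interval_var": pvariance(gaps),
        "req_bytes_mean": mean(sizes),
        "req_bytes_var": pvariance(sizes),
    }
```

A scanner-like sequence would tend to show interval and length variances close to zero, which is the regularity that features c) to f) are meant to capture.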
Table 7
After the access sequence feature values of each URI are obtained, they can be assembled into an access sequence feature vector. The process is similar to the construction of the URI feature vector and is not repeated here.
Then, in step S560, a second positive sample set is generated from the access sequence feature vector of each URI in the positive sample data and its corresponding positive sample label, and a second negative sample set is generated from the access sequence feature vector of each URI in the negative sample data and its corresponding negative sample label.
Similar to step S360, each sample in the second sample sets corresponds to one URI, the access sequence feature vector of that URI, and its sample label, and each URI corresponds to a plurality of access logs. The second positive sample set is the set of access sequence feature vectors generated from the positive sample data together with their positive sample labels, and the second negative sample set is the set of access sequence feature vectors generated from the negative sample data together with their negative sample labels. Likewise, the positive sample label may be represented by 1 and the negative sample label by 0, although other values may also be used.
Then, in step S580, a second training set is generated from the second positive sample set and the second negative sample set. With the access sequence feature vector of each sample in the second training set as input and its sample label as output, the second training set is trained using a predetermined algorithm to obtain the access-sequence-based classification model.
This step is similar to step S380. To generate the second training set, the second positive sample set and the second negative sample set are likewise each randomly divided into two groups, and any one group of the second positive sample set and any one group of the second negative sample set are merged to form the second training set. Different weights may likewise be set for the positive and negative samples, for example a weight of 1 for the second positive sample set and a weight of 2 for the second negative sample set. A second validation set may likewise be generated to validate the accuracy of the access-sequence-based classification model. For details of model training and validation, refer to the description of step S380, which is not repeated here.
After the access-sequence-based classification model is constructed, it can be used to identify, among the URIs suspected of Webshell attacks on a website, the URIs accessed by scanners and the URIs of Webshell attacks. Fig. 6 shows a flowchart of a method 600 for detecting Webshell attacks on a website according to another embodiment of the invention. The method is suited to being performed in a computing device, such as the computing device 200. As shown in Fig. 6, the method starts at step S620.
In step S620, a plurality of original logs corresponding to the URIs suspected of Webshell attacks on a website are obtained, where the original logs include the URI of the requested resource and the access data associated with that URI. Likewise, the original logs may first be preprocessed, which is not described again here.
Then, in step S640, a plurality of access logs for the same URI are extracted from the original logs, multiple access sequence feature values of that URI are calculated from the access data of those access logs, and the access sequence feature values are assembled into one access sequence feature vector.
Then, in step S660, each access sequence feature vector of the original logs is input into the access-sequence-based classification model, the URIs whose access sequence feature vectors yield the negative sample label as output are obtained, and these URIs are marked as URIs of Webshell attacks on a website.
For details of each step in method 600, refer to the descriptions of methods 300 to 500, which are not repeated here. In addition, the input of the access-sequence-based classification model in method 600 is the access sequence feature vector corresponding to a URI suspected of a Webshell attack on a website. When its output is the negative sample label, the URI corresponding to the sample is a URI of a Webshell attack on a website; when its output is the positive sample label, the URI corresponding to the sample is a URI accessed by a scanner. Of course, in practice the screening by the URI-based classification model may also be skipped: the original logs to be confirmed are taken directly, and the corresponding access sequence feature vectors are generated from them and input into the access-sequence-based classification model. In that case, samples whose output is the positive sample label are generally URIs of normal website access, and samples whose output is the negative sample label can be approximately regarded as URIs suspected of Webshell attacks, which reduces the workload of identifying attack data to some extent.
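The way methods 400 and 600 chain together can be sketched as a two-stage cascade. Both model arguments are stand-ins for trained classifiers that return a sample label (0 for negative, 1 for positive); the function name `two_stage_detect` and the dictionary-based inputs are assumptions for this example.

```python
def two_stage_detect(uri_vecs, seq_vecs, uri_model, seq_model):
    """Run the URI-based model first, then the access-sequence-based model.

    uri_vecs: dict URI -> URI feature vector (all URIs to be confirmed).
    seq_vecs: dict URI -> access sequence feature vector.
    uri_model, seq_model: callables returning 0 (negative) or 1 (positive).
    """
    # Stage 1 (method 400): negative label means suspected Webshell attack.
    suspected = [u for u, v in uri_vecs.items() if uri_model(v) == 0]
    # Stage 2 (method 600): negative label means Webshell attack,
    # positive label means scanner access.
    webshell = [u for u in suspected if seq_model(seq_vecs[u]) == 0]
    scanner = [u for u in suspected if u not in webshell]
    return {"suspected": suspected, "webshell": webshell, "scanner": scanner}
```

Only the URIs that survive stage 1 need access sequence features, so in practice the second set of features can be computed lazily for the suspected URIs alone.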
In summary, the invention establishes a URI-based classification model and, using this model and method 400, identifies from network access logs of unknown nature some URIs suspected of Webshell attacks on a website, and thereby determines the corresponding access logs suspected of Webshell attacks. To further narrow down the attack logs, the invention also establishes an access-sequence-based classification model and, using this model and method 600, identifies from those suspected access logs the URIs that are genuine Webshell attacks on a website. These URIs determine the genuine Webshell attack access logs. Table 8 shows, by way of example, some original access logs corresponding to URIs confirmed by methods 400 and 600 as Webshell attacks on a website.
Table 8
According to the technical solution of the invention, log features of WebShell attacks are extracted from massive logs, and a two-level classification learning method is used to uncover websites where a WebShell has been successfully obtained. After processing, the volume of original log data is greatly reduced: in practice, experimental data went from more than 1,000,000 URIs to be confirmed down to just over 400, which significantly reduces the workload of manual confirmation.
B9. The method of B8, wherein an access-sequence-based classification model is also stored in the computing device, the model being suited to distinguishing, among the URIs suspected of Webshell attacks on a website, URIs accessed by scanners from URIs of Webshell attacks, and being constructed by the following method: obtaining a plurality of access logs confirmed as normal website access as positive sample data, and a plurality of access logs confirmed as Webshell attacks on a website as negative sample data; extracting a plurality of access logs for the same URI from the positive sample data and from the negative sample data respectively, calculating multiple access sequence feature values of that URI from the access data of those access logs, and assembling the access sequence feature values into one access sequence feature vector; generating a second positive sample set from the access sequence feature vector of each URI in the positive sample data and its corresponding positive sample label, and generating a second negative sample set from the access sequence feature vector of each URI in the negative sample data and its corresponding negative sample label; and generating a second training set from the second positive sample set and the second negative sample set, and, with the access sequence feature vector of each sample in the second training set as input and its sample label as output, training the second training set using a predetermined algorithm to obtain the access-sequence-based classification model.
B10. The method of B8 or B9, further comprising the steps of: obtaining a plurality of original logs corresponding to the URIs suspected of Webshell attacks on a website; extracting a plurality of access logs for the same URI from the original logs, calculating multiple access sequence feature values of that URI from the access data of those access logs, and assembling the access sequence feature values into one access sequence feature vector; and inputting each access sequence feature vector of the original logs into the access-sequence-based classification model, obtaining the URIs whose access sequence feature vectors yield the negative sample label as output, and marking them as URIs of Webshell attacks on a website.
B11. The method of B9, wherein the access sequence feature values include one or more of the following: GET/POST request ratio, proportion of successful requests, mean of access time intervals, variance of access time intervals, mean of request message length, and variance of request message length.
B12. The method of B11, wherein the GET/POST request ratio is calculated from the request method; the proportion of successful requests is calculated from the status code returned for the request; the mean and variance of access time intervals are determined from the request start time; and the mean and variance of request message length are determined from the request message length.
B13. The method of B8, further comprising a step of preprocessing the plurality of access logs to be confirmed: filtering out of the access logs those corresponding to static paths, whitelisted paths, and paths with non-Webshell suffixes.
B14. The method of any one of B8 to B10, wherein, in the output of the URI-based classification model, the negative sample label indicates that the URI corresponding to the sample is a URI suspected of a Webshell attack on a website, and the positive sample label indicates that the URI corresponding to the sample is a URI of normal website access; and in the output of the access-sequence-based classification model, the negative sample label indicates that the URI corresponding to the sample is a URI of a Webshell attack on a website, and the positive sample label indicates that the URI corresponding to the sample is a URI accessed by a scanner.
The various techniques described herein may be implemented in hardware or software, or a combination of both. Thus, the methods and apparatus of the invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy disks, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on programmable computers, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute, according to the instructions in the program code stored in the memory, the method for constructing the URI-based classification model, the method for constructing the access-sequence-based classification model, and the method for detecting Webshell attacks on a website of the invention.
By way of example and not limitation, computer-readable media include computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
In the description provided here, the algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the examples of the present invention. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in a variety of programming languages, and that the descriptions of specific languages above are given to disclose the best mode of the invention.
Numerous specific details are set forth in the description provided here. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be appreciated that, in order to streamline the disclosure and aid in understanding one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example. The modules in the foregoing examples may be combined into one module or may additionally be divided into a plurality of sub-modules.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may additionally be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
In addition, some of the embodiments are described herein as methods, or combinations of method elements, that can be implemented by a processor of a computer system or by other means of performing the described function. A processor having the necessary instructions for implementing such a method or method element therefore forms a means for implementing the method or method element. Furthermore, an element of a device embodiment described herein is an example of a means for carrying out the function performed by that element for the purpose of implementing the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", and so on to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described in terms of a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as described. It should also be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the subject matter of the invention. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is intended to be illustrative, not restrictive, of its scope, which is defined by the appended claims.

Claims (10)

1. A method for constructing a URI-based classification model, executed in a computing device and adapted to distinguish URIs of normal website access from URIs suspected of being used in a Webshell attack on the website, the method comprising:
obtaining a plurality of access logs confirmed as normal website access as positive sample data, and a plurality of access logs confirmed as Webshell attacks on the website as negative sample data, wherein each access log includes the URI of the requested resource and access data associated with that URI;
extracting, from the positive sample data and the negative sample data respectively, a plurality of access logs for the same URI, calculating a plurality of URI feature values for that URI from the access data of those access logs, and constructing a URI feature vector from the plurality of URI feature values;
generating a first positive sample set from the URI feature vector of each URI in the positive sample data and its corresponding positive sample label, and generating a first negative sample set from the URI feature vector of each URI in the negative sample data and its corresponding negative sample label; and
generating a first training set from the first positive sample set and the first negative sample set, and training on the first training set with a predetermined algorithm, taking the URI feature vector of each sample in the first training set as input and its sample label as output, to obtain the URI-based classification model.
2. The method of claim 1, wherein the access data of the access log includes one or more of the following parameters:
the IP of the requesting user, the request method, the status code returned for the request, the CDN hit status, the attack type detected by the firewall, the request parameters, the request start time, and the request message length.
3. The method of claim 2, wherein the plurality of URI feature values includes one or more of the following feature values:
the number of client IPs accessing the URI, the total number of accesses to the URI, the proportion of accesses to the URI that returned a failure, the proportion of requests to the URI intercepted by the WAF, whether accesses to the URI hit the CDN, and the number of changes in the request parameters used to access the URI.
4. The method of claim 3, wherein:
the number of client IPs accessing the URI is calculated from the IP of the requesting user;
the total number of accesses to the URI is calculated from the status code returned for the request or from the attack type detected by the firewall;
the proportion of accesses to the URI that returned a failure is calculated from the status code returned for the request;
the proportion of requests to the URI intercepted by the firewall is calculated from the attack type detected by the firewall;
whether accesses to the URI hit the CDN is determined from the CDN hit status; and
the number of changes in the request parameters used to access the URI is calculated from the request parameters.
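The mapping from access data to feature values in claims 3 and 4 can be sketched as follows. The record type, its field names, the failure threshold (status code 400 or above), and the reading of "number of parameter changes" as the count of distinct parameter strings are assumptions made for illustration; the claims do not fix them.

```python
from dataclasses import dataclass


@dataclass
class AccessRecord:
    # Hypothetical field names; the claims only name the parameters.
    client_ip: str
    status_code: int
    cdn_hit: bool
    waf_attack_type: str  # empty string when the firewall detected nothing
    query_params: str


def uri_feature_vector(records):
    """Build the six URI feature values from the records of a single URI."""
    total = len(records)
    if total == 0:
        raise ValueError("at least one access record is required")
    unique_ips = len({r.client_ip for r in records})
    # Assumption: a status code of 400 or above counts as a failed request.
    failure_ratio = sum(r.status_code >= 400 for r in records) / total
    # Assumption: a non-empty attack type means the request was intercepted.
    waf_ratio = sum(bool(r.waf_attack_type) for r in records) / total
    cdn_hit = 1.0 if any(r.cdn_hit for r in records) else 0.0
    # Assumption: parameter changes = number of distinct parameter strings.
    param_changes = len({r.query_params for r in records})
    return [float(unique_ips), float(total), failure_ratio,
            waf_ratio, cdn_hit, float(param_changes)]
```

The total access count is taken from the number of records, which is equivalent to counting the status codes (one per request) as claim 4 describes.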
5. The method of claim 4, wherein the step of calculating the plurality of URI feature values for the URI from the access data of the plurality of access logs comprises:
converting the positive sample data and the negative sample data into a data frame according to the meaning of each field; and
aggregating the data frame by URI to obtain a data column for each kind of access data, and extracting the corresponding URI feature values from each data column;
wherein the data columns for the IP of the requesting user, the CDN hit status, and the request parameters are generated with the collect_set method, and the data columns for the status code returned for the request and the attack type detected by the firewall are generated with the collect_list method.
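The aggregation in claim 5 can be illustrated in plain Python. The names collect_set and collect_list match aggregate functions in Spark SQL, but the claim does not name a framework, so the sketch below only mimics their semantics (de-duplicated values versus all values) and does not call Spark. The row keys and the function name `aggregate_by_uri` are hypothetical.

```python
from collections import defaultdict

# Fields aggregated without duplicates (collect_set-style semantics).
SET_FIELDS = ("client_ip", "cdn_hit", "query_params")
# Fields aggregated with duplicates kept (collect_list-style semantics).
LIST_FIELDS = ("status_code", "waf_attack_type")


def aggregate_by_uri(rows):
    """Group log rows by URI and build one value column per field."""
    grouped = defaultdict(lambda: {f: [] for f in SET_FIELDS + LIST_FIELDS})
    for row in rows:
        columns = grouped[row["uri"]]
        for field in SET_FIELDS:
            # Keep first-seen order but drop repeated values.
            if row[field] not in columns[field]:
                columns[field].append(row[field])
        for field in LIST_FIELDS:
            columns[field].append(row[field])
    return dict(grouped)
```

Keeping duplicates for status codes and attack types preserves one entry per request, which the count and ratio features need, while the de-duplicated columns directly give the distinct client IPs and distinct parameter strings.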
6. The method of claim 1, further comprising the steps of:
generating a first validation set from the first positive sample set and the first negative sample set;
inputting the URI feature vector of each sample in the first validation set into the URI-based classification model to predict the sample label of each sample; and
comparing the predicted sample label of each sample with its actual sample label, and calculating the accuracy of the URI-based classification model.
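The comparison step of claim 6 amounts to the fraction of validation samples whose predicted label equals the actual label. A minimal sketch, assuming labels are any comparable values and using a hypothetical function name:

```python
def accuracy(predicted, actual):
    """Return the share of samples whose predicted label matches the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one sample is required")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)
```

Because Webshell samples are usually far rarer than normal ones, accuracy alone can look high even when few attacks are caught; that caveat is a general observation and not something the claim addresses.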
7. The method of claim 6, wherein the first training set and the first validation set are generated as follows:
randomly dividing the first positive sample set and the first negative sample set into two groups each; and
merging any one group of the first positive sample set with one group of the first negative sample set to form the first training set, and merging the other group of the first positive sample set with the other group of the first negative sample set to form the first validation set.
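The split in claim 7 divides the positive and negative sets separately, so both classes appear in the training set and in the validation set. The sketch below assumes an 80/20 split and a fixed seed; the claim specifies neither, and the function name is hypothetical.

```python
import random


def split_train_validation(positive, negative, train_fraction=0.8, seed=0):
    """Randomly split each class into two groups, then merge group by group."""
    rng = random.Random(seed)

    def split(samples):
        shuffled = list(samples)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    pos_train, pos_val = split(positive)
    neg_train, neg_val = split(negative)
    # One group from each class forms the training set, the rest the validation set.
    return pos_train + neg_train, pos_val + neg_val
```

Splitting per class rather than over the merged data keeps the class proportions roughly equal in both sets, which matters when negative samples are scarce.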
8. A method for detecting Webshell attacks on a website, adapted to be executed in a computing device in which the URI-based classification model of any one of claims 1-7 is stored, the method comprising:
obtaining a plurality of access logs to be confirmed within a predetermined period, wherein each access log includes the URI of the requested resource and access data associated with that URI;
extracting, from the plurality of access logs to be confirmed, a plurality of access logs for the same URI, calculating a plurality of URI feature values for that URI from the access data of those access logs, and constructing a URI feature vector from the plurality of URI feature values; and
inputting each URI feature vector of the plurality of access logs to be confirmed into the URI-based classification model, obtaining the URIs corresponding to those URI feature vectors whose output is the negative sample label, and marking them as URIs suspected of being used in a Webshell attack on the website.
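The final step of claim 8 can be sketched as a filter over per-URI feature vectors. The trained model is represented here by any callable that maps a feature vector to a label; the label encoding and the function name are assumptions for illustration.

```python
# Assumed encoding of the negative sample label.
NEGATIVE_LABEL = 0


def flag_suspected_uris(feature_vectors, classify):
    """Return the URIs whose feature vector is classified as negative.

    feature_vectors: dict mapping each URI to its feature vector.
    classify: callable standing in for the trained URI-based model.
    """
    return [uri for uri, vector in feature_vectors.items()
            if classify(vector) == NEGATIVE_LABEL]
```

A usage example would pass the `predict`-style function of whatever model was trained in claim 1; the toy rule used in testing this sketch is not the patent's classifier.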
9. A computing device, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-7 or claim 8.
10. A computer-readable storage medium storing one or more programs, the one or more programs including instructions which, when executed by a computing device, cause the computing device to perform any of the methods of claims 1-7 or claim 8.
CN201711276201.4A 2017-12-06 2017-12-06 Construction method of classification model based on URI and detection method of Webshell attack website Active CN107888616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711276201.4A CN107888616B (en) 2017-12-06 2017-12-06 Construction method of classification model based on URI and detection method of Webshell attack website

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711276201.4A CN107888616B (en) 2017-12-06 2017-12-06 Construction method of classification model based on URI and detection method of Webshell attack website

Publications (2)

Publication Number Publication Date
CN107888616A true CN107888616A (en) 2018-04-06
CN107888616B CN107888616B (en) 2020-06-05

Family

ID=61773179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711276201.4A Active CN107888616B (en) 2017-12-06 2017-12-06 Construction method of classification model based on URI and detection method of Webshell attack website

Country Status (1)

Country Link
CN (1) CN107888616B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102801698A (en) * 2011-12-20 2012-11-28 北京安天电子设备有限公司 Uniform resource locator (URL) request time sequence-based detection method and system for malicious codes
CN103684896A (en) * 2012-09-07 2014-03-26 中国科学院计算机网络信息中心 Method of detecting website cheating based on domain name resolution characteristics
CN104468477A (en) * 2013-09-16 2015-03-25 杭州迪普科技有限公司 WebShell detection method and system
CN104766014A (en) * 2015-04-30 2015-07-08 安一恒通(北京)科技有限公司 Method and system used for detecting malicious website
CN105956472A (en) * 2016-05-12 2016-09-21 宝利九章(北京)数据技术有限公司 Method and system for identifying whether webpage includes malicious content or not
CN106961419A (en) * 2017-02-13 2017-07-18 深信服科技股份有限公司 WebShell detection methods, apparatus and system
CN107332848A (en) * 2017-07-05 2017-11-07 重庆邮电大学 A kind of exception of network traffic real-time monitoring system based on big data
CN107404497A (en) * 2017-09-05 2017-11-28 成都知道创宇信息技术有限公司 A kind of method that WebShell is detected in massive logs

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
石刘洋 等: "基于Web日志的Webshell检测方法研究", 《信息安全研究》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019228158A1 (en) * 2018-05-29 2019-12-05 北京白山耘科技有限公司 Method and apparatus for detecting dangerous information by means of text information, medium, and device
CN108763470A (en) * 2018-05-29 2018-11-06 北京白山耘科技有限公司 A kind of method and device detecting dangerous information by text message
CN109101527A (en) * 2018-06-21 2018-12-28 中国科学院信息工程研究所 A kind of magnanimity security log information filter method and device
CN108920959A (en) * 2018-07-21 2018-11-30 杭州安恒信息技术股份有限公司 A kind of webshell detection method based on Bayesian model optimization
CN108920959B (en) * 2018-07-21 2020-12-01 杭州安恒信息技术股份有限公司 Webshell detection method based on Bayesian model optimization
CN110968564B (en) * 2018-09-28 2023-04-25 阿里巴巴集团控股有限公司 Data processing method and training method of data state prediction model
CN110968564A (en) * 2018-09-28 2020-04-07 阿里巴巴集团控股有限公司 Data processing method and training method of data state prediction model
CN109525551A (en) * 2018-10-07 2019-03-26 杭州安恒信息技术股份有限公司 A method of the CC based on statistical machine learning attacks protection
CN109508542B (en) * 2018-10-26 2019-11-22 国家计算机网络与信息安全管理中心江苏分中心 WEB method for detecting abnormality, system and server under big data environment
CN109508542A (en) * 2018-10-26 2019-03-22 国家计算机网络与信息安全管理中心江苏分中心 WEB method for detecting abnormality, system and server under big data environment
CN109600382A (en) * 2018-12-19 2019-04-09 北京知道创宇信息技术有限公司 Webshell detection method and device, HMM model training method and device
CN109600382B (en) * 2018-12-19 2021-07-13 北京知道创宇信息技术股份有限公司 Webshell detection method and device and HMM model training method and device
CN110175278A (en) * 2019-05-24 2019-08-27 新华三信息安全技术有限公司 The detection method and device of web crawlers
CN110351299A (en) * 2019-07-25 2019-10-18 新华三信息安全技术有限公司 A kind of network connection detection method and device
CN110602137A (en) * 2019-09-25 2019-12-20 光通天下网络科技股份有限公司 Malicious IP and malicious URL intercepting method, device, equipment and medium
CN110868419A (en) * 2019-11-18 2020-03-06 杭州安恒信息技术股份有限公司 Method and device for detecting WEB backdoor attack event and electronic equipment
CN111107096A (en) * 2019-12-27 2020-05-05 杭州迪普科技股份有限公司 Web site safety protection method and device
CN110933115A (en) * 2019-12-31 2020-03-27 上海观安信息技术股份有限公司 Analysis object behavior abnormity detection method and device based on dynamic session
CN113132329A (en) * 2019-12-31 2021-07-16 深信服科技股份有限公司 WEBSHELL detection method, device, equipment and storage medium
CN110933115B (en) * 2019-12-31 2022-04-29 上海观安信息技术股份有限公司 Analysis object behavior abnormity detection method and device based on dynamic session
WO2021169239A1 (en) * 2020-02-24 2021-09-02 网宿科技股份有限公司 Crawler data recognition method, system and device
CN111600894A (en) * 2020-05-20 2020-08-28 新华三信息安全技术有限公司 Network attack detection method and device
CN111600894B (en) * 2020-05-20 2023-05-16 新华三信息安全技术有限公司 Network attack detection method and device
CN113779571A (en) * 2020-06-10 2021-12-10 中国电信股份有限公司 WebShell detection device, WebShell detection method and computer-readable storage medium
CN113779571B (en) * 2020-06-10 2024-04-26 天翼云科技有限公司 WebShell detection device, webShell detection method and computer readable storage medium
WO2022117063A1 (en) * 2020-12-03 2022-06-09 百果园技术(新加坡)有限公司 Method and apparatus for training isolation forest, and method and apparatus for recognizing web crawler
CN113783889A (en) * 2021-09-22 2021-12-10 南方电网数字电网研究院有限公司 Firewall control method for linkage access of network layer and application layer and firewall thereof

Also Published As

Publication number Publication date
CN107888616B (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN107888616A (en) The detection method of construction method and Webshell the attack website of disaggregated model based on URI
US11190562B2 (en) Generic event stream processing for machine learning
Harinahalli Lokesh et al. Phishing website detection based on effective machine learning approach
Sheikhan et al. Intrusion detection using reduced-size RNN based on feature grouping
Ali Alheeti et al. Intelligent intrusion detection in external communication systems for autonomous vehicles
CN111614599B (en) Webshell detection method and device based on artificial intelligence
CN107729532A (en) A kind of resume matching process and computing device
US20130042306A1 (en) Determining machine behavior
US11593475B2 (en) Security information analysis device, security information analysis method, security information analysis program, security information evaluation device, security information evaluation method, security information analysis system, and recording medium
CN106992981B (en) Website backdoor detection method and device and computing equipment
CN107003976A (en) Based on active rule can be permitted determine that activity can be permitted
Chu et al. Bot or human? A behavior-based online bot detection system
CN110830445B (en) Method and device for identifying abnormal access object
CN110855648B (en) Early warning control method and device for network attack
WO2021068563A1 (en) Sample date processing method, device and computer equipment, and storage medium
CN111224941B (en) Threat type identification method and device
Abawajy et al. Hybrid consensus pruning of ensemble classifiers for big data malware detection
CN111680167A (en) Service request response method and server
Eldos et al. On the KDD'99 Dataset: Statistical Analysis for Feature Selection
Hajdu et al. Use of artificial neural networks to identify fake profiles
CN115314239A (en) Analysis method and related equipment for hidden malicious behaviors based on multi-model fusion
RU2745362C1 (en) System and method of generating individual content for service user
CN113822684A (en) Heikou user recognition model training method and device, electronic equipment and storage medium
CN114915434A (en) Network agent detection method, device, storage medium and computer equipment
She et al. An improved malicious code intrusion detection method based on target tree for space information network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 311501, Unit 1, Building 5, Courtyard 1, Futong East Street, Chaoyang District, Beijing 100102

Applicant after: Beijing Zhichuangyu Information Technology Co., Ltd.

Address before: 100097 Jinwei Building 803, 55 Lanindichang South Road, Haidian District, Beijing

Applicant before: Beijing Knows Chuangyu Information Technology Co.,Ltd.

GR01 Patent grant