Detailed Description
In existing approaches to URL detection, a server checks each URL against manually defined security rules. However, on the one hand, the ways in which hackers exploit URLs to attack a network vary widely, and manually defined security rules can hardly cover all of them; on the other hand, manually defined rules often lag behind newly emerging attack techniques.
Therefore, in one or more embodiments of the present specification, a plurality of URLs are obtained, the parameters in the URLs are extracted, a feature vector corresponding to each parameter is determined, and an Isolation Forest model is constructed from the feature vectors corresponding to the parameters. As is well known to those skilled in the art, the isolation forest model is an anomaly detection model and can be used to detect whether a URL is anomalous. An anomalous URL is often one sent by a hacker, and the server can refuse to parse it, thereby preventing the hacker from mounting an attack through the URL.
It should be noted that an isolation forest model can be constructed from the feature vectors corresponding to the parameters in the URLs because, in practice, the main way in which hackers use URLs to attack a server is to add illegal fields to the parameters of the URLs. That is, there is a significant difference between the feature vector of a parameter in a normal URL and that of a parameter in an abnormal URL. The characteristics of parameters in an abnormal URL are rare and clearly distinguishable from those of parameters in a normal URL.
Based on this, the core idea of the technical solution described in this specification is to use the feature vectors of the parameters in a plurality of known URLs as data samples to construct an isolation forest model. The isolation forest model can then judge whether a URL to be detected is abnormal according to the feature vectors of the parameters in that URL.
In order to make the technical solutions in the present specification better understood, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in one or more embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive step through the embodiments of the present description shall fall within the scope of protection of the present description.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a model training method provided in an embodiment of the present specification, including the following steps:
S100: several URLs are obtained.
In the embodiments of the present specification, the execution subject may be a server or another device having data processing capability; a server is used as the example hereinafter.
It is well known that for a URL, the parameters in the URL may contain some information that the user (possibly a hacker) enters.
For example, "http://server/path/document?name1=value1&name2=value2" is a typical URL structure; the data after the "?" are the parameters. A URL may contain more than one parameter, with different parameters typically separated by "&", and each parameter has a parameter name and a parameter value. The parameter values are typically entered by a user. In this example, the URL contains two parameters: "name1=value1" indicates that the parameter named name1 has the parameter value value1, and "name2=value2" indicates that the parameter named name2 has the parameter value value2.
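The URL structure described above can be illustrated with a minimal sketch, assuming the usual `name=value` query syntax and Python's standard `urllib.parse` module. The helper name `extract_params` is hypothetical and is not part of the described method.

```python
from urllib.parse import urlsplit, parse_qsl

def extract_params(url):
    """Return the (name, value) pairs found after the '?' of a URL."""
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters whose value is empty
    return parse_qsl(query, keep_blank_values=True)

params = extract_params("http://server/path/document?name1=value1&name2=value2")
print(params)  # [('name1', 'value1'), ('name2', 'value2')]
```

Note that `parse_qsl` also percent-decodes the values, so the server sees the characters that were actually entered.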
Hackers sometimes add an illegal field to the parameters of a URL in order to attack the server. For example, when a legitimate user logs in to the server, the normal URL that is sent is as follows:
"http://server/path/document?name1=user1&name2=password1", wherein the parameter value of the first parameter is the user name "user1" and the parameter value of the second parameter is the password "password1". The server parses the URL and, after the user name and the password pass verification, the user is logged in to the server.
When a hacker wants to impersonate the user "user1" to log in to the server, the hacker can use an SQL injection attack and send the following abnormal URL to the server:
"http://server/path/document?name1=user1&name2=' or 1=1", wherein the parameter value of the first parameter is the user name "user1", while the parameter value of the second parameter is not the password corresponding to that user name but the illegal field "' or 1=1". Owing to the characteristics of SQL syntax, the server cannot verify the user's password against this illegal field; instead, the server parses the illegal field as executable code and executes it, so that the hacker can log in to the account of the user "user1" without the password and manipulate that user's data.
In step S100, the URLs obtained by the server generally include some normal URLs and some abnormal URLs. Because abnormal URLs are rare, they account for only a small proportion of the obtained URLs.
S102: for each URL, parameters in the URL are extracted.
In this embodiment, when extracting the parameters in a URL, the server may extract both the parameter names and the parameter values contained in the URL, or may extract only the parameter values.
In addition, the server may extract all parameters in each URL, or may extract some parameters in the URL.
In practical applications, some parameter names occur with low probability, and hackers rarely add illegal fields to the parameter values corresponding to such parameter names, so the server need not extract those parameter values.
Specifically, the server may determine, for each URL, the parameters whose parameter names satisfy a specified condition among the parameters contained in the URL, and extract the parameter value of each parameter so determined. The specified condition may be that the occurrence probability of the parameter name is greater than a specified probability value. In this way, parameters with a low occurrence probability are filtered out, which reduces the data processing burden on the server in the subsequent steps.
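The filtering by parameter name can be sketched as follows. This is only an illustration: it assumes that the "occurrence probability" of a parameter name is the fraction of obtained URLs in which that name appears, which the text does not define, and the function name `frequent_param_values` is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

def frequent_param_values(urls, min_probability):
    """Keep (name, value) pairs whose name occurs in more than
    `min_probability` of the URLs (assumed definition of probability)."""
    parsed = [parse_qsl(urlsplit(u).query, keep_blank_values=True) for u in urls]
    # Count each name at most once per URL
    name_counts = Counter(name for params in parsed for name in {n for n, _ in params})
    kept = []
    for params in parsed:
        for name, value in params:
            if name_counts[name] / len(urls) > min_probability:
                kept.append((name, value))
    return kept

urls = ["http://s/p?a=1&b=2", "http://s/p?a=3", "http://s/p?a=4&c=5", "http://s/p?a=6"]
print(frequent_param_values(urls, 0.3))
# [('a', '1'), ('a', '3'), ('a', '4'), ('a', '6')]
```

Here the names `b` and `c` each appear in only one of four URLs (probability 0.25), so their values are dropped before feature extraction.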
S104: for each extracted parameter, a feature vector corresponding to the parameter is determined.
In this embodiment of the present specification, for each extracted parameter, an N-dimensional feature vector corresponding to the parameter may be determined according to the parameter value of the parameter, where N is a natural number greater than 0.
The dimensions of the feature vector corresponding to a parameter may include at least one of the following quantities computed over the parameter value: the total number of characters, the total number of letters, the total number of digits, the total number of specific symbols, the number of different characters, the number of different letters, the number of different digits, and the number of different specific symbols.
Taking the URL "http://server/path/document?name1=user1&name2=password1" as an example, the parameter value of the parameter name1 is user1. This parameter value has a total of 5 characters, 4 letters, 1 digit, and 0 specific symbols, and contains 5 different characters, 4 different letters, 1 different digit, and 0 different specific symbols. The feature vector corresponding to the parameter name1 may therefore be (5, 4, 1, 0, 5, 4, 1, 0).
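The eight counts listed above can be computed with a short sketch such as the following. It assumes that a "specific symbol" is any character that is neither a letter nor a digit; the text does not define the symbol set, and the function name `feature_vector` is hypothetical.

```python
def feature_vector(value):
    """8-dimensional count vector for one parameter value."""
    letters = [c for c in value if c.isalpha()]
    digits = [c for c in value if c.isdigit()]
    # Assumption: everything that is not a letter or digit counts as a symbol
    symbols = [c for c in value if not (c.isalpha() or c.isdigit())]
    return (len(value), len(letters), len(digits), len(symbols),
            len(set(value)), len(set(letters)), len(set(digits)), len(set(symbols)))

print(feature_vector("user1"))     # (5, 4, 1, 0, 5, 4, 1, 0)
print(feature_vector("' or 1=1"))  # (8, 2, 2, 4, 6, 2, 1, 3)
```

Under this assumption, the injected value has a much larger share of symbols than an ordinary user name or password, which is the kind of difference the model relies on.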
Further, the value of each dimension of the feature vector may be normalized. Continuing the above example, the 8 feature vector values corresponding to the parameter name1 may be normalized according to the equation

$$y = \frac{x}{z}$$

where x represents a feature vector value, z represents the total number of characters contained in the parameter value of the parameter name1, and y represents the value obtained after x is normalized. The normalized feature vector of the parameter name1 is then (5/5, 4/5, 1/5, 0/5, 5/5, 4/5, 1/5, 0/5), i.e., (1, 0.8, 0.2, 0, 1, 0.8, 0.2, 0).
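The normalization in this example amounts to dividing every feature value by the total number of characters. A minimal sketch, assuming that the total character count is the first dimension of the vector and that an empty value maps to all zeros (neither is stated in the text; the function name `normalize` is hypothetical):

```python
def normalize(features):
    """Divide every feature value by the total character count."""
    total_chars = features[0]  # assumed to be the first dimension
    if total_chars == 0:
        # Assumed convention for an empty parameter value
        return tuple(0.0 for _ in features)
    return tuple(x / total_chars for x in features)

print(normalize((5, 4, 1, 0, 5, 4, 1, 0)))
# (1.0, 0.8, 0.2, 0.0, 1.0, 0.8, 0.2, 0.0)
```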
S106: an isolation forest model is constructed according to the feature vectors corresponding to the parameters.
In the embodiments of this specification, the isolation forest algorithm is used to construct, from the feature vectors corresponding to the parameters, an isolation forest model for detecting whether a URL is abnormal. The feature vectors corresponding to the parameters do not need to be labeled as normal or abnormal.
The idea of the isolation forest algorithm is briefly introduced here. Referring to fig. 2a, the 10 dots shown in fig. 2a include hollow dots and solid dots; the hollow dots are numerous (8 dots) and concentrated, while the solid dots are few (2 dots) and scattered. The hollow dots may be regarded as normal points and the solid dots as abnormal points. That is, abnormal points are few in number and lie apart from the rest. The following operations are then carried out:
Division 1: a line appears at random and divides the points in fig. 2a into part A and part B, resulting in fig. 2b.
Division 2: for part A, another line appears at random and divides the points in part A into part C and part D; likewise, for part B, a line appears at random and divides the points in part B into part E and part F, as shown in fig. 2c.
The division continues until the plane shown in fig. 2a is divided into 10 parts, each containing only 1 point, i.e., each point is divided into a dedicated part (if a part contains only one point, that part is the dedicated part of that point). Obviously, the solid dots are divided into dedicated parts more easily and sooner; as shown in fig. 2c, the solid dot in the upper right corner has already been divided into its dedicated part (part F). That is, the more easily a point is divided into a dedicated part, the more abnormal the point is.
Based on the above idea, the isolation forest algorithm uses S classification trees (which may be binary trees). For each binary tree, the points shown in fig. 2a are placed in the root node; starting from the root node, the condition of each split is random (i.e., each split divides the points with a randomly placed line). In a binary tree, the earlier a point falls into a leaf node, the more likely it is to be abnormal.
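The intuition that scattered points are isolated by fewer random divisions can be checked with a small one-dimensional experiment. This is a sketch for illustration only: the point values, the number of trials, and the function name `cuts_to_isolate` are assumptions and do not come from the figures.

```python
import random

def cuts_to_isolate(points, target, rng):
    """Number of random cuts until `target` is alone in its part."""
    part = list(points)
    cuts = 0
    while len(part) > 1:
        cut = rng.uniform(min(part), max(part))
        # Keep only the side of the cut that contains the target
        part = [p for p in part if (p < cut) == (target < cut)]
        cuts += 1
    return cuts

rng = random.Random(0)
points = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 9.0]  # 9.0 lies far from the rest
trials = 1000
avg_outlier = sum(cuts_to_isolate(points, 9.0, rng) for _ in range(trials)) / trials
avg_inlier = sum(cuts_to_isolate(points, 1.4, rng) for _ in range(trials)) / trials
print(avg_outlier < avg_inlier)  # True
```

On average the distant point needs far fewer cuts than a point inside the dense cluster, which is why an early arrival at a leaf node is treated as a sign of abnormality.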
Taking the isolation forest algorithm as an example, the construction of the isolation forest model from the feature vectors corresponding to the parameters in step S106 is briefly described below.
The isolation forest includes S binary trees (iTrees). For each iTree, the training process can be described as follows:
Step one: randomly select M feature vectors from among all the feature vectors and place them in the root node of the iTree;
Step two: randomly designate one of the N dimensions of the feature vector (the designated dimension), and randomly choose a cutting value that lies between the minimum and the maximum of the values of the M feature vectors in the designated dimension;
Step three: divide the M feature vectors into two parts according to the cutting value, one part consisting of the feature vectors whose value in the designated dimension is not less than the cutting value, and the other part consisting of the feature vectors whose value in the designated dimension is less than the cutting value;
Step four: recursively perform steps two and three until the iTree reaches a specified height or each leaf node of the iTree holds only one feature vector. The specified height can be set as required and is typically $\log_2 M$.
Through the above four steps, one iTree is trained.
When the next iTree is trained, the M feature vectors in step one may be randomly selected from all the feature vectors, or from the feature vectors that have not yet been selected.
The above four steps are repeated to obtain S trained iTrees, which together form the isolation forest model.
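The four training steps can be sketched in plain Python as follows. This is a minimal illustration rather than a definitive implementation: the dictionary-based tree layout, the function names `build_itree` and `build_forest`, and the choice to cut only along dimensions whose values differ are all assumptions made for the example.

```python
import math
import random

def build_itree(vectors, height, max_height, rng):
    # Step four: stop at the height limit or when at most one vector remains
    if height >= max_height or len(vectors) <= 1:
        return {"size": len(vectors)}
    # Assumption: only dimensions whose values differ are candidates for a cut
    dims = [d for d in range(len(vectors[0]))
            if min(v[d] for v in vectors) < max(v[d] for v in vectors)]
    if not dims:
        return {"size": len(vectors)}
    dim = rng.choice(dims)                              # step two: designated dimension
    cut = rng.uniform(min(v[dim] for v in vectors),
                      max(v[dim] for v in vectors))     # cutting value in [min, max]
    left = [v for v in vectors if v[dim] < cut]         # step three: less than cut
    right = [v for v in vectors if v[dim] >= cut]       # not less than cut
    return {"dim": dim, "cut": cut,
            "left": build_itree(left, height + 1, max_height, rng),
            "right": build_itree(right, height + 1, max_height, rng)}

def build_forest(vectors, num_trees, sample_size, seed=0):
    rng = random.Random(seed)
    m = min(sample_size, len(vectors))
    max_height = math.ceil(math.log2(m)) if m > 1 else 0
    # Step one: each tree gets its own random sample of M feature vectors
    return [build_itree(rng.sample(vectors, m), 0, max_height, rng)
            for _ in range(num_trees)]

vectors = [(1.0, 0.8, 0.2, 0.0), (1.0, 0.7, 0.3, 0.0), (1.0, 0.9, 0.1, 0.0),
           (1.0, 0.6, 0.4, 0.0), (1.0, 0.25, 0.25, 0.5)]
forest = build_forest(vectors, num_trees=10, sample_size=4)
print(len(forest))  # 10
```

Here each tree samples from all feature vectors, which corresponds to the first of the two sampling options mentioned above.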
Fig. 3 is a flowchart of a method for detecting a URL according to an embodiment of the present specification, including the following steps:
S300: a URL is acquired.
S302: the parameters in the URL are extracted.
S304: for each extracted parameter, a feature vector corresponding to the parameter is determined.
S306: the feature vectors corresponding to the parameters are input into a pre-constructed isolation forest model to perform anomaly detection on the URL.
The URL in fig. 3 is the URL to be detected. For the description of steps S300 to S304, refer to steps S100 to S104; details are not repeated here.
In step S306, the feature vectors corresponding to the parameters may be input into the isolation forest model to obtain the model output result corresponding to each parameter, and whether an abnormal parameter exists among the parameters is determined according to these model output results.
Further, for each parameter, the feature vector corresponding to the parameter is input into the isolation forest model and classified by each classification tree in the model, and the average height of the leaf nodes into which the feature vector falls across the classification trees is taken as the model output result corresponding to the parameter. Then, for each parameter, if the model output result is smaller than a specified threshold, the parameter is determined to be abnormal; otherwise, the parameter is determined to be normal. When any parameter is determined to be abnormal, it is determined that an abnormal parameter exists among the parameters; when all the parameters are determined to be normal, it is determined that no abnormal parameter exists among the parameters.
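The decision rule can be sketched as follows, interpreting the "height" of a leaf node as the number of splits between the root and that leaf. The two hand-written trees, the dictionary layout, the threshold value, and all function names are hypothetical and serve only to make the example self-contained.

```python
def leaf_depth(node, vector, depth=0):
    """Number of splits passed before `vector` reaches a leaf of one tree."""
    if "cut" not in node:
        return depth
    child = node["left"] if vector[node["dim"]] < node["cut"] else node["right"]
    return leaf_depth(child, vector, depth + 1)

def parameter_is_abnormal(forest, vector, threshold):
    # Model output: average leaf depth over all trees in the forest
    average = sum(leaf_depth(tree, vector) for tree in forest) / len(forest)
    return average < threshold

def url_is_abnormal(forest, vectors, threshold):
    # The URL is abnormal as soon as any one of its parameters is abnormal
    return any(parameter_is_abnormal(forest, v, threshold) for v in vectors)

# Two tiny hand-made trees over 2-dimensional vectors (illustrative only)
tree_a = {"dim": 0, "cut": 0.5, "left": {"size": 1},
          "right": {"dim": 1, "cut": 0.3, "left": {"size": 2}, "right": {"size": 3}}}
tree_b = {"dim": 1, "cut": 0.9, "right": {"size": 1},
          "left": {"dim": 0, "cut": 0.7, "left": {"size": 1}, "right": {"size": 4}}}
forest = [tree_a, tree_b]

print(parameter_is_abnormal(forest, (0.8, 0.2), threshold=1.5))   # False
print(parameter_is_abnormal(forest, (0.1, 0.95), threshold=1.5))  # True
print(url_is_abnormal(forest, [(0.8, 0.2), (0.1, 0.95)], threshold=1.5))  # True
```

The first vector reaches a leaf after two splits in both trees (average 2), while the second is isolated after a single split in each (average 1), so only the second falls below the threshold.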
With the methods shown in fig. 1 and fig. 3, an isolation forest model is constructed according to the feature vectors of the parameters in URLs, so that the server can check each received URL with the isolation forest model; if the received URL is determined to be abnormal, the server can refuse to parse it, thereby avoiding attacks by hackers and improving network security.
In addition, the embodiments of this specification make it possible to discover potential network attack techniques. Specifically, the isolation forest model can determine whether a given URL is abnormal. If the URL is abnormal, the parameter value of one of its parameters is abnormal, and this abnormal parameter value can prompt staff to analyze the attack technique used by the hacker, which makes it easier for them to improve the security rules.
Based on the model training method shown in fig. 1, an embodiment of the present specification further provides a model training apparatus, as shown in fig. 4, including:
an obtaining module 401, configured to obtain a plurality of Uniform Resource Locators (URLs);
an extracting module 402, configured to extract, for each URL, the parameters in the URL;
a determining module 403, configured to determine, for each extracted parameter, a feature vector corresponding to the parameter;
a processing module 404, configured to construct an Isolation Forest model according to the feature vectors corresponding to the parameters, the Isolation Forest model being used to detect whether a URL is abnormal.
The extraction module is used for determining parameters of which parameter names meet specified conditions in the parameters contained in each URL; for each determined parameter, a parameter value for that parameter is extracted.
The determining module is used for determining an N-dimensional feature vector corresponding to each extracted parameter according to the parameter value of the parameter; n is a natural number greater than 0.
The dimensions of the N-dimensional feature vector specifically include: the parameter value of the parameter contains at least one of a total number of characters, a total number of letters, a total number of numerals, a total number of symbols, a number of different characters, a number of different letters, a number of different numerals, and a number of different symbols.
Based on the method for detecting a URL shown in fig. 3, an embodiment of the present specification further provides an apparatus for detecting a URL, as shown in fig. 5, including:
an obtaining module 501, configured to obtain a URL;
an extracting module 502, configured to extract the parameters in the URL;
a determining module 503, configured to determine, for each extracted parameter, a feature vector corresponding to the parameter;
an anomaly detection module, configured to input the feature vectors corresponding to the parameters into a pre-constructed Isolation Forest model so as to perform anomaly detection on the URL, the Isolation Forest model being constructed according to the above model training method.
The anomaly detection module inputs the feature vectors corresponding to the parameters into the pre-constructed Isolation Forest model to obtain the model output result corresponding to each parameter, and judges, according to these model output results, whether an abnormal parameter exists among the parameters; if so, the URL is determined to be abnormal; otherwise, the URL is determined to be normal.
For each parameter, the anomaly detection module inputs the feature vector corresponding to the parameter into the pre-constructed Isolation Forest model, classifies the feature vector through each classification tree in the model, and takes the average height of the leaf nodes into which the feature vector falls across the classification trees as the model output result corresponding to the parameter; for each parameter, if the model output result is smaller than a specified threshold, the parameter is determined to be abnormal, and if the model output result is not smaller than the specified threshold, the parameter is determined to be normal.
Based on the model training method shown in fig. 1, the present specification further provides a model training apparatus, as shown in fig. 6, including one or more processors and a memory, where the memory stores a program that is configured to be executed by the one or more processors to perform the following steps:
acquiring a plurality of Uniform Resource Locators (URLs);
for each URL, extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
and constructing an Isolation Forest model according to the characteristic vectors corresponding to the parameters respectively, wherein the Isolation Forest model is used for detecting whether the URL is abnormal or not.
Based on the method for detecting a URL shown in fig. 3, the present specification embodiment further provides an apparatus for detecting a URL, as shown in fig. 7, including one or more processors and a memory, where the memory stores a program and is configured to be executed by the one or more processors to perform the following steps:
acquiring a URL;
extracting parameters in the URL;
determining a feature vector corresponding to each extracted parameter;
inputting the feature vectors corresponding to the parameters into a pre-constructed Isolation Forest model to perform anomaly detection on the URL, the Isolation Forest model being constructed according to the above model training method.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the apparatuses shown in fig. 6 and fig. 7 are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.
In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Furthermore, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can readily be obtained merely by performing a little logic programming of the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the same functionality can be achieved by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included in it for performing the various functions may also be regarded as structures within the hardware component. Alternatively, the means for performing the functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.