CN114372267A

CN114372267A - Malicious webpage identification and detection method based on static domain, computer and storage medium

Info

Publication number: CN114372267A
Application number: CN202111340418.3A
Authority: CN
Inventors: 余翔湛; 刘立坤; 陈巍; 史建焘; 葛蒙蒙; 叶麟; 于喜东; 王永强; 冯帅; 赵跃; 王久金; 宋赟祖; 郭明昊; 胡智超; 苗钧重; 刘凡; 李精卫; 石开宇; 韦贤葵; 孔德文
Original assignee: Harbin Institute of Technology; Shanghai Pudong Development Bank Co Ltd
Current assignee: Harbin Institute of Technology; Shanghai Pudong Development Bank Co Ltd
Priority date: 2021-11-12
Filing date: 2021-11-12
Publication date: 2022-04-19
Anticipated expiration: 2041-11-12
Also published as: CN114372267B

Abstract

The invention provides a malicious webpage identification and detection method based on a static domain, a computer, and a storage medium, and belongs to the technical field of webpage identification and detection. The method comprises: step one, monitoring webpage traffic in real time and extracting the URL (uniform resource locator) address from the HTTP header; step two, matching the URL address against the URL addresses stored in a blacklist library; step three, parsing the webpage traffic that failed to match; step four, crawling the JS and CSS files in the parsed webpage traffic; step five, extracting the webpage fingerprint of the target webpage; step six, identifying the webpage traffic; step seven, comparing the URL addresses of the two webpages: if the URL addresses are the same, the webpage in the traffic is a normal webpage and a matching log is stored; if the URL addresses are different, the webpage in the traffic is a malicious webpage and is blocked. The invention solves the technical problem that the real-time detection requirement of practical applications cannot be met, and achieves the technical effect of reducing the time cost of the webpage matching process.

Description

Malicious webpage identification and detection method based on static domain, computer and storage medium

Technical Field

The present application relates to a detection method, and in particular to a malicious webpage identification and detection method based on a static domain, a computer, and a storage medium, and belongs to the technical field of webpage identification and detection.

Background

Phishing attacks are cybercrimes that steal users' private data through social engineering or technical means. In recent years, many criminals have carried out illegal activities by building malicious websites and have used various means (such as URL obfuscation) to make their webpages harder to detect, causing traditional defense and detection techniques to fail.

A webpage fingerprint is a byte sequence generated by hashing the key-value pairs of the response message header and a series of special elements (tags, attributes, and the like) extracted from the webpage document. Webpage identification means finding, in a candidate webpage library, the webpage that best matches the target webpage.

Webpage deduplication algorithm based on webpage fingerprints. When the algorithm is executed, the detected webpage is first preprocessed and denoised so that only its plain text is retained; the plain text is normalized, and keywords and their position vectors are extracted to form the webpage fingerprint; the fingerprint is then compared for similarity against a fingerprint database to judge whether the webpage is a duplicate.

Machine learning based detection method. Phishing website recognition is treated as a text classification or clustering problem: webpages are examined using the words that make up the URL, DNS (domain name system) information, and Whois information, and a machine learning method is applied to judge the nature of the webpage.

The fingerprint used in the fingerprint-based deduplication algorithm consists of feature keywords extracted from the webpage and their position vectors. Because the feature words are extracted from the plain text of the webpage, the fingerprint may occupy too much storage space if the text is large. Considering only the plain text displayed on the webpage is also too one-sided: the fingerprint extraction technique of that algorithm is suitable only for identifying webpages that contain a large amount of text, and is not general.

In the machine learning based detection method, feature extraction and model training consume a large amount of resources, so the method cannot meet the real-time detection requirement of practical applications.

Disclosure of Invention

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.

In view of this, in order to solve the technical problem in the prior art that the need for real-time detection in practical applications cannot be met, the present invention provides a malicious web page identification and detection method based on a static domain, a computer, and a storage medium.

The first scheme is as follows: the invention provides a malicious webpage identification and detection method based on a static domain, which comprises the following steps:

step one, monitoring webpage traffic in real time and extracting the URL (uniform resource locator) address from the HTTP header;

step two, matching the URL address from step one against the URL addresses stored in a blacklist library; if the match succeeds, blocking the traffic; if the match fails, executing step three;

step three, parsing the webpage traffic that failed to match;

step four, crawling the JS and CSS files in the parsed webpage traffic;

step five, extracting the webpage fingerprint of the target webpage;

step six, identifying the webpage traffic; if identification succeeds, executing step seven; if identification fails, returning to step one;

step seven, comparing the URL addresses of the two webpages; if the URL addresses are the same, the webpage in the traffic is a normal webpage and a matching log is stored; if the URL addresses are different, the webpage in the traffic is a malicious webpage and is blocked.
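The control flow of the seven steps can be sketched as follows. This is only an illustration under assumptions: the names `detect`, `Verdict`, and the `identify` callback are hypothetical, and `identify` stands in for steps three to six as a whole (it returns the URL of the known webpage whose fingerprint matches, or `None`).

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    action: str  # "block", "allow", or "unidentified"
    reason: str


def detect(url: str,
           blacklist: set,
           identify: Callable[[str], Optional[str]]) -> Verdict:
    """Hypothetical driver for steps one to seven.

    `url` is the address taken from the HTTP header (step one).
    `identify` stands in for steps three to six (assumption).
    """
    # Step two: blacklist lookup.
    if url in blacklist:
        return Verdict("block", "URL is in the blacklist library")
    # Steps three to six: parse, crawl, fingerprint, identify.
    matched_url = identify(url)
    if matched_url is None:
        # Identification failed: go back to monitoring (step one).
        return Verdict("unidentified", "no known webpage matched")
    # Step seven: compare the two URL addresses.
    if matched_url == url:
        return Verdict("allow", "same URL as the matched webpage; log the match")
    return Verdict("block", "matches a known webpage but has a different URL")
```

A page that looks like a known login page but is served from another address is therefore blocked, while the genuine page is allowed and logged.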

Preferably, parsing the webpage traffic that failed to match in step three specifically comprises the following steps:

step 3.1, extracting the webpage source code from the response message;

step 3.2, reading the input source code string sequentially and calling a recursive algorithm to parse it, with the initial parent node set to empty;

step 3.3, when a start tag or text is parsed, creating a tag node or text node, setting its parent node, and adding the node to the child list of the parent node;

step 3.4, parsing and extracting the attributes and values of tag nodes and the text information of text nodes;

step 3.5, returning when an end tag is parsed, and continuing until the whole input string has been parsed, thereby obtaining a complete DOM structure.

Specifically, the complete DOM structure serves as the data input for extracting the webpage fingerprint in step five.
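A minimal sketch of such a DOM builder is given below. It is not the recursive parser of the invention: it swaps in the event-based `html.parser.HTMLParser` from the Python standard library with an explicit stack of open elements. The class names `Node` and `DomBuilder`, the `max_depth` default, and the list of void tags are assumptions made for illustration; nodes deeper than `max_depth` are simply dropped.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (assumed list).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag                  # element name, or "#text"
        self.attrs = dict(attrs or [])  # attribute name -> value
        self.text = text
        self.parent = None
        self.children = []


class DomBuilder(HTMLParser):
    def __init__(self, max_depth=16):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]  # stack[-1] is the current parent
        self.max_depth = max_depth
        self.skipped = 0          # open elements dropped for being too deep

    def _too_deep(self):
        return self.skipped > 0 or len(self.stack) > self.max_depth

    def _append(self, node):
        node.parent = self.stack[-1]
        node.parent.children.append(node)

    def handle_starttag(self, tag, attrs):
        if self._too_deep():
            if tag not in VOID_TAGS:
                self.skipped += 1
            return
        node = Node(tag, attrs)
        self._append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skipped:
            self.skipped -= 1
            return
        # Tolerate malformed markup: close the nearest matching open tag.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip() and not self._too_deep():
            self._append(Node("#text", text=data.strip()))
```

Typical use is `builder = DomBuilder(); builder.feed(source); builder.close()`, after which `builder.root` holds the tree.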

Preferably, crawling the JS and CSS files in the parsed webpage traffic in step four specifically comprises the following steps:

step 4.1, initiating a request to the target site through an HTTP library, that is, sending a Request;

step 4.2, receiving the response content returned by the server;

step 4.3, parsing the content and storing the parsed JS and CSS resources as webpage features.
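The crawling steps above might look like the following sketch, which uses only the Python standard library. The names `StaticResourceFinder`, `find_static_resources`, and `crawl` are hypothetical, and the rule for what counts as a JS or CSS resource (`<script src>` and `<link rel="stylesheet" href>`) is an assumption, since the text does not specify it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class StaticResourceFinder(HTMLParser):
    """Collects absolute URLs of external scripts and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.js, self.css = [], []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("src"):
            self.js.append(urljoin(self.base_url, a["src"]))
        elif tag == "link" and a.get("href"):
            rel = (a.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.css.append(urljoin(self.base_url, a["href"]))


def find_static_resources(html, base_url):
    # Parse the content and keep the JS and CSS references.
    finder = StaticResourceFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.js, finder.css


def crawl(url, timeout=5.0):
    # Send the request and read the response returned by the server.
    with urlopen(url, timeout=timeout) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return find_static_resources(html, url)
```

Separating `find_static_resources` from `crawl` keeps the parsing logic usable on page source that was already captured from monitored traffic, without issuing a second request.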

Preferably, extracting the webpage fingerprint of the target webpage in step five comprises extracting fingerprints from the response message header, the head subtree of the HTML DOM tree, and the body subtree of the HTML DOM tree;

the response message header is processed as follows: the response header fields are classified, and the fields directly related to the webpage source code are hashed to extract fingerprint segments;

the head subtree of the HTML DOM tree is processed as follows: selected attributes of the element nodes in the head subtree are treated as a series of key-value pairs, and fingerprints are extracted from them in the same way as from the response header;

the body subtree of the HTML DOM tree is processed as follows: a layered fingerprint is extracted from each of the first M layers of nodes of the body part, with each element node contributing one byte of fingerprint data; the layered fingerprints are concatenated in extraction order to form the fingerprint of the body tree, which converts the multi-way tree structure into a linear structure.
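The sketch below illustrates the header and body parts of this extraction. Several choices are assumptions, not taken from the text: SHA-256 as the hash, two bytes per header field, the particular header fields in `SOURCE_RELATED_HEADERS`, hashing the tag name to obtain the one byte per element node, and representing the body subtree as nested `(tag, children)` tuples.

```python
import hashlib

# Assumption: which header fields count as related to the page source.
SOURCE_RELATED_HEADERS = ("content-type", "server", "x-powered-by")


def _digest(text, size):
    """First `size` bytes of the SHA-256 digest of `text`."""
    return hashlib.sha256(text.encode("utf-8")).digest()[:size]


def header_fingerprint(headers, fields=SOURCE_RELATED_HEADERS, size=2):
    """Concatenate one short hash per selected key-value pair."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return b"".join(_digest(f"{name}:{lowered[name]}", size)
                    for name in fields if name in lowered)


def body_fingerprint(body, max_layers=3):
    """Layered fingerprint of the first `max_layers` layers (the M of the text).

    `body` is a (tag, [children]) tuple. Returns one bytes object per
    layer, with one byte per element node in breadth-first order.
    """
    layers, current = [], [body]
    for _ in range(max_layers):
        if not current:
            break
        layers.append(b"".join(_digest(tag, 1) for tag, _ in current))
        current = [child for _, children in current for child in children]
    return layers
```

Joining the returned layers with `b"".join(layers)` gives the linear body fingerprint; keeping them separate is convenient when each layer is compared on its own. The head subtree can reuse `header_fingerprint` by passing the chosen attribute key-value pairs in place of header fields.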

Preferably, identifying the webpage traffic in step six specifically comprises the following steps:

step 6.1, querying the feature library for the target webpage feature words, taking the feature library as the candidate webpage set, removing from the candidate set those webpages that contain fewer of the target webpage feature words than a predefined threshold, and updating the candidate webpage set;

step 6.2, obtaining a candidate webpage list $P_i = (p_1, p_2, \ldots, p_n)$ from the updated candidate webpage set; if a feature word of a candidate webpage $V_p$ is contained in the target webpage feature words, the feature word is added to the candidate webpage vector $V_p = (w_1, w_2, \ldots, w_n)$, where the feature words $w_i$ are sorted according to their extraction order in the target webpage feature words, forming the set of candidate webpage feature vectors; $w_i$ is calculated as follows:

$$\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

where tf is the term frequency of the feature word, idf is the inverse document frequency, $n_{k,j}$ is the number of occurrences of the $k$-th word in webpage $j$, $|D|$ is the total number of webpages in the candidate set, and $|\{j : t_i \in d_j\}|$ is the number of webpages containing the word $t_i$;

step 6.3, computing the similarity between each candidate webpage $V_p$ and the target webpage $V_t$, and filtering out the webpages whose similarity measure is below a set threshold to obtain the final candidate webpage set;

step 6.4, matching the final candidate webpage set against the webpage fingerprint of the target webpage from step five; specifically, the response header fingerprint and the HTML head fingerprint are linear sequences, and the LCS algorithm is applied to them to calculate the similarity between the candidate webpage and the target webpage;

step 6.5, the HTML body fingerprint is formed by concatenating layered fingerprints, each of which is a linear sequence; the LCS algorithm is applied to calculate the similarity of the fingerprints of each layer, and the average is taken as the fingerprint matching similarity of the body part;

step 6.6, finally computing a weighted sum of the fingerprint similarities of the three parts to obtain the final fingerprint similarity; if the similarity is greater than a set threshold, the fingerprint match is judged successful, otherwise it is judged failed, and the result is fed back.
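The LCS-based matching of the three fingerprint parts can be sketched as follows. The text does not say how an LCS length is turned into a similarity, how layers of unequal depth are handled, or which weights are used, so the normalisation by the longer sequence, the padding with empty layers, and the weights `(0.2, 0.3, 0.5)` are all assumptions; the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b))."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a, b):
    # Assumption: normalise by the longer of the two sequences.
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def body_similarity(layers_a, layers_b):
    """Average of the per-layer LCS similarities.

    Assumption: a layer missing on one side is treated as empty.
    """
    depth = max(len(layers_a), len(layers_b))
    if depth == 0:
        return 1.0
    pad_a = list(layers_a) + [b""] * (depth - len(layers_a))
    pad_b = list(layers_b) + [b""] * (depth - len(layers_b))
    return sum(lcs_similarity(x, y) for x, y in zip(pad_a, pad_b)) / depth


def fingerprint_similarity(header_sim, head_sim, body_sim,
                           weights=(0.2, 0.3, 0.5)):
    # Weighted sum of the three part similarities (weights are assumed).
    return sum(w * s for w, s in zip(weights, (header_sim, head_sim, body_sim)))
```

A match would then be declared when `fingerprint_similarity(...)` exceeds the configured threshold. Because the candidate set is small after filtering, the quadratic cost of LCS on short fingerprints stays modest.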

Preferably, in step three, a webpage fingerprint structure is defined when parsing the webpage traffic that failed to match.

Preferably, the webpage fingerprint structure in step three consists of a 4-bit flag, a 12-bit fingerprint length, and 0-4096 bytes of fingerprint data.
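One way to serialise such a structure is sketched below. The layout is an assumption: the 4-bit flag and the 12-bit length are packed into one big-endian 16-bit word with the flag in the high bits, followed by the data. A 12-bit field can hold values only up to 4095, and the text does not say how a 4096-byte payload would be encoded, so this sketch simply rejects anything longer than 4095 bytes. The function names are hypothetical.

```python
import struct

MAX_LEN = 0xFFF  # largest value a 12-bit length field can hold


def pack_fingerprint(flag, data):
    """Serialise as [4-bit flag | 12-bit length][data] (assumed layout)."""
    if not 0 <= flag <= 0xF:
        raise ValueError("flag must fit in 4 bits")
    if len(data) > MAX_LEN:
        raise ValueError("data too long for a 12-bit length field")
    return struct.pack(">H", (flag << 12) | len(data)) + data


def unpack_fingerprint(blob):
    """Inverse of pack_fingerprint; returns (flag, data)."""
    (word,) = struct.unpack_from(">H", blob)
    flag, length = word >> 12, word & MAX_LEN
    data = blob[2:2 + length]
    if len(data) != length:
        raise ValueError("truncated fingerprint")
    return flag, data
```

The fixed two-byte prefix lets a reader skip over a fingerprint without interpreting its contents, which is useful when several fingerprint segments are stored back to back.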

The second scheme is as follows: a computer comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the static domain-based malicious webpage identification and detection method when executing the computer program.

The third scheme is as follows: a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the static domain-based malicious webpage identification and detection method according to the first scheme.

The invention has the following beneficial effects. The invention provides a malicious webpage identification and detection method based on a static domain, which uses the static resources and webpage structure information directly related to a webpage during webpage transmission, and applies fingerprint extraction and webpage identification to detect phishing websites. The fingerprint extraction method focuses on the static resources of the webpage and takes full account of the characteristics of the webpage DOM tree and the response message, so the extracted webpage fingerprint is moderate in length and more representative; a set of feature vectors is also maintained for each webpage and used to screen the webpage library to be matched. Fingerprint identification uses a fingerprint-based similarity matching algorithm, so the matching stage takes little time and the real-time detection requirement is met. Deep nodes of the webpage DOM tree are denoised to improve identification accuracy. During webpage parsing the invention also extracts a set of feature vectors for each webpage, which is used to filter the original candidate webpage library and thereby reduces the time cost of the webpage matching process. The technical problem that the prior art cannot meet the real-time detection requirement of practical applications is thus solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

FIG. 1 is a schematic flow diagram of the process of the present invention;

FIG. 2 is a diagram illustrating a structure of a fingerprint of a web page according to the present invention;

FIG. 3 is a flow chart illustrating the process of identifying web page traffic according to the present invention.

Detailed Description

To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application and are not an exhaustive list of all embodiments. It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other where no conflict arises.

Embodiment 1. This embodiment is described with reference to FIGS. 1 to 3. It provides a malicious webpage identification and detection method based on a static domain, comprising the following steps:

step three, parsing the webpage traffic that failed to match; the parsing method improves the efficiency of webpage parsing, can handle and work around syntax and format errors present in the webpage, can limit the program running time by setting a maximum parsing depth, and can denoise deep-level nodes. The method specifically comprises the following steps:

step 3.1, extracting the webpage source code from the response message;

Specifically, limiting the program running time by setting the maximum parsing depth and denoising the deep-level nodes is achieved by restricting the webpage fingerprint structure to a 4-bit flag, a 12-bit fingerprint length, and 0-4096 bytes of fingerprint data.

Step four, crawling the JS and CSS files in the parsed webpage traffic; the method comprises the following steps:

step 4.2, receiving the response content returned by the server;

Step five, extracting the webpage fingerprint of the target webpage; the fingerprint extraction only focuses on the static domain resources of the webpage, and in order to enable the webpage fingerprint to be more representative, the source of the webpage fingerprint is divided into three parts, namely a response message header, an HTML DOM tree head subtree and an HTML DOM tree body subtree. The specific fingerprint extraction process is as follows:

Specifically, before the webpage traffic is identified, a feature library is built from the feature words of all target webpages.

Specifically, identifying the webpage traffic comprises a filtering stage and a matching stage; the filtering stage comprises a first round and a second round of filtering.

The first round of filtering is based on the number and order of the target webpage feature words; the second round of filtering uses a cosine similarity measure and filters out the webpages whose similarity is below a certain threshold.

In the filtering stage, the candidate webpage feature library is first screened according to the order and frequency of the feature words extracted from the target webpage, filtering out webpages that have too few of the feature words or whose feature word order differs too much. A feature vector is then built for each remaining webpage, and the second round of filtering is performed on the current candidate webpage set by comparing the cosine similarity between vectors, filtering out the webpages whose similarity is below a certain threshold.
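The second round of filtering can be illustrated with a small cosine similarity sketch. Feature vectors are represented here as sparse `{word: weight}` dictionaries, which is an assumption about the data layout; the function names and the default threshold of 0.5 are likewise hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {word: weight} vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def second_round_filter(target, candidates, threshold=0.5):
    """Keep the candidates whose similarity to the target reaches the threshold."""
    return {pid: vec for pid, vec in candidates.items()
            if cosine_similarity(target, vec) >= threshold}
```

A candidate that shares no feature words with the target has similarity 0 and is always dropped, while a candidate with the same words in similar proportions scores close to 1.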

In the matching stage, the candidate webpage set remaining after the filtering stage is typically in the single digits. A matching algorithm is then applied: the fingerprint data of each candidate webpage is matched one by one against the fingerprint data extracted from the target webpage. A hit is declared when the matching degree is higher than a certain threshold; otherwise the identification result is fed back to an administrator.

The identification of the webpage traffic specifically comprises the following steps:

First define $V_t = (w_1, w_2, \ldots, w_n)$ with $w_i = (\text{content}, \text{id}, \text{index}, f)$, where $V_t$ is the feature vector of the target webpage and each dimension represents a webpage feature word $w_i$. Each feature word structure stores the content of the feature word, the webpage ID, the dimension number of the feature word in the webpage vector, and the frequency of the feature word.

Step 6.2, obtaining a candidate webpage list $P_i = (p_1, p_2, \ldots, p_n)$ from the updated candidate webpage set; if a feature word of a candidate webpage $V_p$ is contained in the target webpage feature words, the feature word is added to the candidate webpage vector $V_p = (w_1, w_2, \ldots, w_n)$, where the feature words $w_i$ are sorted according to their extraction order in the target webpage feature words, forming the set of candidate webpage feature vectors; $w_i$ is calculated as follows:

$$\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

Specifically, because the feature words of the candidate webpage $V_p$ are all contained in the target webpage feature words and the target indexes are sorted in ascending order, the similarity measure compares the longest increasing subsequence of the feature word indexes.
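The order comparison described here can be sketched with a standard longest increasing subsequence routine. The normalisation by the number of target feature words and the assumption that the target feature words are unique are choices made for illustration; the function names are hypothetical.

```python
from bisect import bisect_left


def longest_increasing_run(indexes):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k] = smallest tail of an increasing subsequence of length k + 1
    for x in indexes:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


def order_similarity(candidate_words, target_words):
    """Share of target feature words that the candidate has in the same order.

    Assumption: target feature words are unique, so each has one index.
    """
    if not target_words:
        return 0.0
    position = {w: i for i, w in enumerate(target_words)}
    indexes = [position[w] for w in candidate_words if w in position]
    return longest_increasing_run(indexes) / len(target_words)
```

A candidate whose feature words appear in the same order as in the target scores 1.0, and the score drops as words are missing or out of order.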

Specifically, the weighting is based on the Vector Space Model (VSM): each webpage feature word corresponds to one dimension of the space, and each dimension is orthogonal to the others. The weights follow the TF-IDF algorithm.

Specifically, each webpage has its own feature vector, and each dimension of the feature vector is a weight value. The second round of screening is then performed on the feature vectors.
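The TF-IDF weighting can be sketched as follows. The individual tf and idf formulas are not reproduced in this text, so the sketch assumes the standard definitions: tf is a word's count divided by the total word count of the webpage, and idf is the natural logarithm of the number of webpages divided by the number of webpages containing the word. The function name `tf_idf` and the input layout are hypothetical.

```python
import math
from collections import Counter


def tf_idf(pages):
    """pages: {page_id: [feature words]} -> {page_id: {word: weight}}."""
    n_pages = len(pages)
    # Number of webpages that contain each word at least once.
    doc_freq = Counter(w for words in pages.values() for w in set(words))
    weights = {}
    for pid, words in pages.items():
        counts = Counter(words)
        total = sum(counts.values())
        weights[pid] = {
            w: (c / total) * math.log(n_pages / doc_freq[w])
            for w, c in counts.items()
        }
    return weights
```

Under these definitions a word that occurs in every candidate webpage gets weight 0, so it does not help to tell the candidates apart, while a word concentrated in a few webpages gets a larger weight.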

Step 6.3, computing the similarity between each candidate webpage $V_p$ and the target webpage $V_t$, and filtering out the webpages whose similarity measure is below a set threshold to obtain the final candidate webpage set; the formula is as follows:

specifically, the size of the candidate web page set may not be 1.

In the matching stage, the fingerprint matching of the invention is based on the longest common subsequence (LCS) algorithm, and specifically comprises the following steps:

Step 6.4, the response header fingerprint and the HTML head fingerprint are linear sequences, and the LCS algorithm is applied to them to calculate the similarity between the candidate webpage and the target webpage;

Explanation of terms used in the invention:

Static domain: the static domain of a webpage refers to the elements of the webpage-related resources that are essentially fixed and unchanging, including the response message header, the DOM structure, the CSS structure, some of the JS files, and the like.

Static domain webpage fingerprint: the result obtained by hashing part of the static domain resources of a webpage (the DOM structure and the response header).

Embodiment 2. The computer device of the present invention may be a device including a processor, a memory, and the like, for example a single-chip microcomputer including a central processing unit. The processor is configured to implement the steps of the static domain-based malicious webpage identification and detection method described above when executing the computer program stored in the memory.

The processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or any conventional processor.

The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.

Embodiment 3. Computer-readable storage medium

The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, and the like. A computer program is stored on the computer-readable storage medium, and when the computer program is read and executed by the processor of the computer device, the steps of the static domain-based malicious webpage identification and detection method described above can be implemented.

The computer program comprises computer program code which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, telecommunications signals, software distribution medium, and the like. It should be noted that the computer readable medium may contain content that is subject to appropriate increase or decrease as required by legislation and patent practice in jurisdictions, for example, in some jurisdictions, computer readable media does not include electrical carrier signals and telecommunications signals as is required by legislation and patent practice.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims

1. A malicious webpage identification and detection method based on a static domain is characterized by comprising the following steps:

step three, parsing the webpage traffic that failed to match;

step four, crawling the JS and CSS files in the parsed webpage traffic;

step five, extracting the webpage fingerprint of the target webpage;

2. The method according to claim 1, wherein parsing the webpage traffic that failed to match in step three specifically comprises the following steps:

step 3.1, extracting the webpage source code from the response message;

3. The method according to claim 1, wherein crawling the JS and CSS files in the parsed webpage traffic in step four specifically comprises the following steps:

step 4.2, receiving the response content returned by the server;

4. The method according to claim 3, wherein extracting the webpage fingerprint of the target webpage in step five comprises extracting fingerprints from the response message header, the head subtree of the HTML DOM tree, and the body subtree of the HTML DOM tree;

5. The method according to claim 4, wherein identifying the webpage traffic in step six specifically comprises the following steps:

step 6.2, obtaining a candidate webpage list $P_i = (p_1, p_2, \ldots, p_n)$ from the updated candidate webpage set; if a feature word of a candidate webpage $V_p$ is contained in the target webpage feature words, the feature word is added to the candidate webpage vector $V_p = (w_1, w_2, \ldots, w_n)$, where the feature words $w_i$ are sorted according to their extraction order in the target webpage feature words, forming the set of candidate webpage feature vectors; $w_i$ is calculated as follows:

$$\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

where tf is the term frequency of the feature word, idf is the inverse document frequency, $n_{k,j}$ is the number of occurrences of the $k$-th word in webpage $j$, $|D|$ is the total number of webpages in the candidate set, and $|\{j : t_i \in d_j\}|$ is the number of webpages containing the word $t_i$;

6. The method according to claim 2, wherein in step three a webpage fingerprint structure is defined when parsing the webpage traffic that failed to match.

7. The method according to claim 6, wherein the webpage fingerprint structure in step three consists of a 4-bit flag, a 12-bit fingerprint length, and 0-4096 bytes of fingerprint data.

8. A computer comprising a memory storing a computer program and a processor implementing the steps of the method of any one of claims 1 to 7 when the computer program is executed by the processor.

9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.