CN114372267A - Malicious webpage identification and detection method based on static domain, computer and storage medium - Google Patents


Info

Publication number
CN114372267A
Authority
CN
China
Prior art keywords
webpage
fingerprint
matching
candidate
web page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111340418.3A
Other languages
Chinese (zh)
Other versions
CN114372267B (en)
Inventor
余翔湛
刘立坤
陈巍
史建焘
葛蒙蒙
叶麟
于喜东
王永强
冯帅
赵跃
王久金
宋赟祖
郭明昊
胡智超
苗钧重
刘凡
李精卫
石开宇
韦贤葵
孔德文
羿天阳
刘奉哲
李竑杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Shanghai Pudong Development Bank Co Ltd
Original Assignee
Harbin Institute of Technology
Shanghai Pudong Development Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology, Shanghai Pudong Development Bank Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202111340418.3A priority Critical patent/CN114372267B/en
Publication of CN114372267A publication Critical patent/CN114372267A/en
Application granted granted Critical
Publication of CN114372267B publication Critical patent/CN114372267B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G06F21/563 Static detection by source code analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Virology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a malicious webpage identification and detection method based on a static domain, a computer and a storage medium, and belongs to the technical field of webpage identification and detection. The method comprises: step one, monitoring webpage traffic in real time and extracting the URL (uniform resource locator) address from the HTTP header; step two, matching the URL address against the URL addresses stored in a blacklist library; step three, parsing the webpage traffic for which matching failed; step four, crawling the JS and CSS files in the parsed webpage traffic; step five, extracting the webpage fingerprint of the target webpage; step six, identifying the webpage traffic; step seven, comparing the URL addresses of the two webpages: if the URL addresses are the same, the webpage in the traffic is a normal webpage and a matching log is stored; if the URL addresses are different, the webpage in the traffic is a malicious webpage and is blocked. The invention solves the technical problem that the requirement of real-time detection in practical applications cannot be met, and achieves the technical effect of reducing the time cost of the webpage matching process.

Description

Malicious webpage identification and detection method based on static domain, computer and storage medium
Technical Field
The present application relates to a detection method, and in particular, to a malicious web page identification detection method based on a static domain, a computer, and a storage medium, and belongs to the technical field of web page identification detection.
Background
Phishing attacks are cybercrimes that steal users' private data through social engineering or technical means. In recent years, many offenders have carried out illegal activities by building malicious websites and have used various techniques (such as URL obfuscation) to make these web pages harder to detect, so that traditional defense and detection techniques fail.
A web page fingerprint is a byte sequence generated by hashing the key-value pairs of the response message header and a series of special elements (tags, attributes and the like) extracted from the web page document. Web page identification means finding, in a candidate web page library, the web page that best matches the target web page.
One existing approach is a web page deduplication algorithm based on web page fingerprints. The algorithm first preprocesses and denoises the web page under test, keeps only its plain text, normalizes that text, and extracts keywords and their position vectors to form the web page fingerprint. It then compares this fingerprint with those in a fingerprint database and uses the similarity to judge whether the page is a duplicate.
Another existing approach is detection based on machine learning. It treats phishing website recognition as a text classification or clustering problem, detects web pages using the words that make up the URL together with DNS (domain name system) and Whois information, and applies a machine learning method to judge the nature of each page.
The fingerprint used by the deduplication algorithm consists of feature keywords extracted from the web page and their position vectors. Because the feature words come from the plain text of the page, a large text can make the fingerprint occupy too much storage. Considering only the plain text displayed on the page is also too one-sided: the fingerprint extraction technique is suitable only for pages that contain a large amount of text, and is not general.
In the machine learning based detection method, feature extraction and model training consume a large amount of resources, so the method cannot meet the requirement of real-time detection in practical applications.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to determine the key or critical elements of the present invention, nor is it intended to limit the scope of the present invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later.
In view of this, in order to solve the technical problem in the prior art that the need for real-time detection in practical applications cannot be met, the present invention provides a malicious web page identification and detection method based on a static domain, a computer, and a storage medium.
Scheme one: the invention provides a malicious webpage identification and detection method based on a static domain, which comprises the following steps:
step one, monitoring webpage traffic in real time and extracting the URL (uniform resource locator) address from the HTTP header;
step two, matching the URL address from step one against the URL addresses stored in a blacklist library; if the matching succeeds, blocking the traffic, and if the matching fails, executing step three;
step three, parsing the webpage traffic for which matching failed;
step four, crawling the JS and CSS files in the parsed webpage traffic;
step five, extracting the webpage fingerprint of the target webpage;
step six, identifying the webpage traffic; if the identification succeeds, executing step seven, and if the identification fails, returning to step one;
step seven, comparing the URL addresses of the two webpages; if the URL addresses are the same, the webpage in the traffic is a normal webpage and a matching log is stored; if the URL addresses are different, the webpage in the traffic is a malicious webpage and is blocked.
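By way of non-limiting illustration, steps one and two may be sketched as follows. The sketch assumes plain HTTP/1.1 request bytes are available from the traffic monitor and that the blacklist library is an in-memory set of exact URLs; the names `extract_url`, `is_blacklisted` and `BLACKLIST`, and the example host, are hypothetical and do not appear in the original disclosure.

```python
from typing import Optional


def extract_url(raw_request: bytes) -> Optional[str]:
    """Rebuild the requested URL from the request line and the Host header."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    lines = head.split("\r\n")
    parts = lines[0].split(" ")
    if len(parts) != 3:
        return None  # not a well-formed "METHOD target VERSION" request line
    target = parts[1]
    if target.startswith(("http://", "https://")):
        return target  # absolute-form target, as sent to proxies
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            # Assumption: unencrypted traffic, so the scheme is http.
            return "http://" + value.strip() + target
    return None  # no Host header, URL cannot be rebuilt


def is_blacklisted(url: str, blacklist: set) -> bool:
    """Step two: exact-match lookup in the blacklist library."""
    return url in blacklist


# Hypothetical blacklist entry and captured request.
BLACKLIST = {"http://evil.example/login"}
request = b"GET /login HTTP/1.1\r\nHost: evil.example\r\nAccept: */*\r\n\r\n"
url = extract_url(request)
blocked = is_blacklisted(url, BLACKLIST)
```

A set gives constant-time lookup, which suits the real-time requirement; a production blacklist might instead normalize URLs or match on domains, which the disclosure does not specify.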
Preferably, the specific method for parsing the webpage traffic for which matching failed in step three comprises the following steps:
step 3.1, extracting the webpage source code from the response message;
step 3.2, reading the input source code string sequentially and calling a recursive algorithm to parse it, with the initial parent node empty;
step 3.3, when a start tag or a text is parsed, adding the tag node or text node to the child list of the current parent node and setting that node as the parent node for subsequent parsing;
step 3.4, parsing and extracting the attributes and values of tag nodes and the text information of text nodes;
step 3.5, returning when an end tag is parsed, until the whole input string has been parsed and a complete DOM structure is obtained.
Specifically, the complete DOM structure is the data input for extracting the webpage fingerprint in step five.
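As a hedged sketch of the parsing in step three, the following builds a depth-limited DOM tree. It is not the recursive parser of the disclosure: it substitutes the event-driven `html.parser.HTMLParser` from the Python standard library with an explicit stack of open nodes, which plays the role of the recursion. The class names `Node` and `DomBuilder`, the function `parse_dom`, and the default `max_depth` are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = text
        self.children = []


class DomBuilder(HTMLParser):
    def __init__(self, max_depth=8):
        super().__init__()
        self.max_depth = max_depth
        self.root = Node("#document")
        self.stack = [self.root]  # open nodes; the last one is the current parent
        self.skipped = 0          # open elements that lie below max_depth

    def handle_starttag(self, tag, attrs):
        if self.skipped or len(self.stack) > self.max_depth:
            if tag not in VOID_TAGS:
                self.skipped += 1  # drop this element and its subtree
            return
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skipped:
            self.skipped -= 1
            return
        # Tolerate stray end tags: close back to the nearest matching open tag.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.skipped or not data.strip():
            return
        self.stack[-1].children.append(Node("#text", text=data.strip()))


def parse_dom(html, max_depth=8):
    builder = DomBuilder(max_depth)
    builder.feed(html)
    builder.close()
    return builder.root
```

Elements deeper than `max_depth` are discarded together with their subtrees, which bounds the work per page. Unclosed tags inside a discarded subtree are not tracked precisely in this sketch.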
Preferably, the specific method for crawling the JS and CSS files in the parsed webpage traffic in step four comprises the following steps:
step 4.1, sending a Request to the target site through an HTTP library;
step 4.2, receiving the response content returned by the server;
step 4.3, parsing the content and storing the parsed JS and CSS resources as webpage features.
Preferably, in step five the webpage fingerprint of the target webpage is extracted from three parts: the response message header, the head subtree of the HTML DOM tree, and the body subtree of the HTML DOM tree;
the response message header is extracted as follows: the response header fields are classified, and the fields directly related to the webpage source code are hashed to extract fingerprint segments;
the head subtree of the HTML DOM tree is extracted as follows: some attributes of the element nodes in the head subtree are treated as a series of key-value pairs, and fingerprints are extracted in the same way as for the response header;
the body subtree of the HTML DOM tree is extracted as follows: hierarchical fingerprint extraction is performed on each of the first M levels of nodes of the body part, with each element node contributing one byte of fingerprint data; the per-level fingerprints are concatenated in extraction order into the fingerprint of the body tree, which converts the multi-way tree structure into a linear structure.
Preferably, the specific method for identifying the webpage traffic in step six comprises the following steps:
step 6.1, querying the target webpage feature words in a feature library, taking the feature library as the webpage candidate set, screening out of the candidate set the webpages that contain fewer target webpage feature words than a predefined threshold, and updating the webpage candidate set;
step 6.2, obtaining a candidate webpage list $P_i = (p_1, p_2, \ldots, p_n)$ from the updated webpage candidate set; if a feature word of a candidate webpage is contained in the target webpage feature words, adding that feature word to the candidate webpage vector $V_p = (w_1, w_2, \ldots, w_n)$, where the feature words $w_i$ are sorted according to their extraction order in the target webpage feature words, forming the set of candidate webpage feature vectors; $w_i$ is calculated as follows:
$$\text{tf-idf}_{i,j} = tf_{i,j} \times idf_i$$
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where tf is the term frequency of a feature word, idf is the inverse document frequency, $n_{k,j}$ is the number of occurrences of the k-th word in webpage $j$, $|D|$ is the total number of webpages in the candidate set, and $|\{j : t_i \in d_j\}|$ is the number of webpages that contain the word $t_i$;
step 6.3, comparing the similarity of the candidate webpage vector $V_p$ and the target webpage vector $V_t$, and filtering out the webpages whose similarity measure is lower than a set threshold to obtain the final candidate webpage set;
step 6.4, matching the final candidate webpage set against the webpage fingerprint of the target webpage from step five; specifically, the response header fingerprint and the HTML head fingerprint are linear sequences, so the LCS algorithm is applied to them to calculate the similarity between a candidate webpage and the target webpage;
step 6.5, the HTML body fingerprint is a concatenation of per-level fingerprints, each of which is a linear sequence; the LCS algorithm is applied to calculate the similarity of each level, and the average is taken as the fingerprint matching similarity of the body part;
step 6.6, computing a weighted sum of the fingerprint similarities of the three parts to obtain the final fingerprint similarity; if the similarity is greater than a set threshold the fingerprint matching succeeds, otherwise it fails, and the result is fed back.
Preferably, a webpage fingerprint structure is defined in step three, when the webpage traffic for which matching failed is parsed.
Preferably, the webpage fingerprint structure in step three consists of a 4-bit flag field, a 12-bit fingerprint length field, and 0 to 4096 bytes of fingerprint data.
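For illustration only, the fingerprint structure above may be serialized as a two-byte header followed by the data. The sketch assumes the flag occupies the high 4 bits and the length the low 12 bits of a big-endian 16-bit header; the disclosure does not state the bit order or the meaning of the flag values. Note that a plain 12-bit field can encode lengths 0 to 4095, so the sketch caps the data at 4095 bytes; how the stated upper bound of 4096 is encoded is not specified. The function names are hypothetical.

```python
import struct

MAX_LEN = 0xFFF  # largest value a 12-bit length field can hold (4095)


def pack_fingerprint(flag: int, data: bytes) -> bytes:
    """Serialize as [4-bit flag | 12-bit length] followed by the data bytes."""
    if not 0 <= flag <= 0xF:
        raise ValueError("flag must fit in 4 bits")
    if len(data) > MAX_LEN:
        raise ValueError("data too long for a 12-bit length field")
    header = (flag << 12) | len(data)
    return struct.pack(">H", header) + data


def unpack_fingerprint(blob: bytes):
    """Inverse of pack_fingerprint; returns (flag, data)."""
    (header,) = struct.unpack(">H", blob[:2])
    flag, length = header >> 12, header & MAX_LEN
    return flag, blob[2:2 + length]
```

Packing flag `0x3` with the two data bytes `AA BB` yields the four bytes `30 02 AA BB`.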
Scheme two: a computer comprising a memory and a processor, the memory storing a computer program, the processor implementing the steps of the static domain-based malicious web page identification and detection method when executing the computer program.
Scheme three: a computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the static domain-based malicious web page identification and detection method of scheme one.
The invention has the following beneficial effects. The invention provides a malicious webpage identification and detection method based on a static domain, which uses the static resources and webpage structure information directly related to a webpage during its transmission, and applies fingerprint extraction and webpage identification to detect phishing websites. The fingerprint extraction method focuses on the static resources of the webpage and takes full account of the characteristics of the webpage DOM tree and the response message, so the extracted webpage fingerprint is of moderate length and more representative. A group of feature vectors is also extracted for each webpage during webpage parsing and used to filter the original candidate webpage library, which reduces the time cost of the webpage matching process. Fingerprint identification uses a fingerprint-based similarity matching algorithm, so the matching stage takes little time and the requirement of real-time detection is met. The deep nodes of the webpage DOM tree are denoised to improve identification accuracy. The invention thereby solves the technical problem in the prior art that the requirement of real-time detection in practical applications cannot be met.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a diagram illustrating a structure of a fingerprint of a web page according to the present invention;
FIG. 3 is a flow chart illustrating the process of identifying web page traffic according to the present invention.
Detailed Description
To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not an exhaustive list. It should be noted that the embodiments and the features of the embodiments may be combined with each other where no conflict arises.
Embodiment 1: this embodiment is described with reference to FIG. 1 to FIG. 3. The invention provides a malicious web page identification and detection method based on a static domain, comprising the following steps:
step one, monitoring webpage traffic in real time and extracting the URL (uniform resource locator) address from the HTTP header;
step two, matching the URL address from step one against the URL addresses stored in a blacklist library; if the matching succeeds, blocking the traffic, and if the matching fails, executing step three;
step three, parsing the webpage traffic for which matching failed. The parsing method improves the efficiency of webpage parsing, handles and works around the syntax and format errors present in a webpage, limits the running time of the program by setting a maximum parsing depth, and denoises deep nodes. It specifically comprises the following steps:
step 3.1, extracting the webpage source code from the response message;
step 3.2, reading the input source code string sequentially and calling a recursive algorithm to parse it, with the initial parent node empty;
step 3.3, when a start tag or a text is parsed, adding the tag node or text node to the child list of the current parent node and setting that node as the parent node for subsequent parsing;
step 3.4, parsing and extracting the attributes and values of tag nodes and the text information of text nodes;
step 3.5, returning when an end tag is parsed, until the whole input string has been parsed and a complete DOM structure is obtained.
Specifically, setting the maximum parsing depth to limit the running time and denoising the deep nodes serves to constrain the webpage fingerprint structure to a 4-bit flag field, a 12-bit fingerprint length field, and 0 to 4096 bytes of fingerprint data.
Step four, crawling the JS and CSS files in the parsed webpage traffic, which comprises the following steps:
step 4.1, sending a Request to the target site through an HTTP library;
step 4.2, receiving the response content returned by the server;
step 4.3, parsing the content and storing the parsed JS and CSS resources as webpage features.
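Steps 4.1 to 4.3 may be sketched as follows, purely as an assumption-laden illustration. The sketch uses the Python standard library (`urllib.request` as the HTTP library, `html.parser` to find resources) and treats external `<script src>` and `<link rel="stylesheet">` references as the JS and CSS files; the disclosure names neither a library nor the selection rule. `StaticResourceFinder`, `find_static_resources` and `fetch` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class StaticResourceFinder(HTMLParser):
    """Collect absolute URLs of external scripts and stylesheets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.js = []
        self.css = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "script" and a.get("src"):
            self.js.append(urljoin(self.base_url, a["src"]))
        elif tag == "link" and a.get("href"):
            rel = (a.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.css.append(urljoin(self.base_url, a["href"]))


def find_static_resources(html, base_url):
    """Step 4.3: parse the content and list the JS and CSS resources."""
    finder = StaticResourceFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.js, finder.css


def fetch(url, timeout=5.0):
    """Steps 4.1 and 4.2: send the request and return the response body."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A caller would typically run `find_static_resources` on the page source and then call `fetch` on each returned URL, storing the bodies as webpage features. Inline scripts are ignored in this sketch.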
Step five, extracting the webpage fingerprint of the target webpage. Fingerprint extraction considers only the static domain resources of the webpage. To make the webpage fingerprint more representative, it is drawn from three sources: the response message header, the head subtree of the HTML DOM tree, and the body subtree of the HTML DOM tree. The specific fingerprint extraction process is as follows:
the response message header is extracted as follows: the response header fields are classified, and the fields directly related to the webpage source code are hashed to extract fingerprint segments;
the head subtree of the HTML DOM tree is extracted as follows: some attributes of the element nodes in the head subtree are treated as a series of key-value pairs, and fingerprints are extracted in the same way as for the response header;
the body subtree of the HTML DOM tree is extracted as follows: hierarchical fingerprint extraction is performed on each of the first M levels of nodes of the body part, with each element node contributing one byte of fingerprint data; the per-level fingerprints are concatenated in extraction order into the fingerprint of the body tree, which converts the multi-way tree structure into a linear structure.
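The three-part extraction can be illustrated with the following minimal sketch. Several details are assumptions not found in the disclosure: SHA-256 stands in for the unspecified hash operation, each key-value pair and each element node is reduced to the first byte of its digest, the body node itself is counted as the first level, and DOM nodes are represented as plain dictionaries built by the hypothetical helper `el`.

```python
import hashlib


def one_byte(text: str) -> int:
    """Reduce a string to one fingerprint byte (first byte of its SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).digest()[0]


def el(tag, attrs=None, children=None):
    """Hypothetical DOM node representation used only by this sketch."""
    return {"tag": tag, "attrs": attrs or {}, "children": children or []}


def kv_fingerprint(pairs: dict, keys: list) -> bytes:
    """One byte per selected key-value pair, in the order given by keys."""
    return bytes(one_byte(f"{k}={pairs[k]}") for k in keys if k in pairs)


def head_fingerprint(head: dict, attr_names: list) -> bytes:
    """Treat selected attributes of head elements as key-value pairs."""
    out = bytearray()
    for child in head["children"]:
        pairs = {f"{child['tag']}.{k}": v for k, v in child["attrs"].items()}
        keys = [f"{child['tag']}.{a}" for a in attr_names]
        out += kv_fingerprint(pairs, keys)
    return bytes(out)


def body_fingerprint(body: dict, max_levels: int) -> list:
    """Breadth-first walk: one byte per element, one byte string per level."""
    levels, current = [], [body]
    while current and len(levels) < max_levels:
        levels.append(bytes(one_byte(n["tag"]) for n in current))
        current = [c for n in current for c in n["children"]]
    return levels
```

Joining the returned levels with `b"".join(levels)` gives the linear body fingerprint, while keeping them as a list preserves the per-level boundaries needed when the levels are compared one by one.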
Step six, identifying the webpage traffic; if the identification succeeds, executing step seven, and if the identification fails, returning to step one.
Specifically, before the web page traffic is identified, a feature library is built from all target web page feature words.
Specifically, identifying the web page traffic comprises a filtering stage and a matching stage; the filtering stage comprises round-one filtering and round-two filtering.
Round-one filtering is based on the number and order of the target web page feature words; round-two filtering is based on the cosine similarity measure and filters out the web pages whose similarity is lower than a certain threshold.
In the filtering stage, the candidate web page feature library is first screened according to the order and frequency of the feature words extracted from the target web page, and web pages with too few feature words or with too large a difference in feature word order are filtered out. A feature vector is then built for each remaining web page, and round-two filtering is applied to the current candidate web page set by comparing the cosine similarity between vectors, filtering out the web pages whose similarity is lower than a certain threshold.
In the matching stage, the candidate web page set left by the filtering stage is typically in the single digits in size. A matching algorithm is then applied to match the fingerprint data of each candidate web page against the fingerprint data extracted from the target web page one by one. A match whose degree exceeds a certain threshold is judged a hit; otherwise the identification result is fed back to an administrator.
The identification of the webpage traffic specifically comprises the following steps.
First define $V_t = (w_1, w_2, \ldots, w_n)$ with $w_i$ = (content, id, index, f), where $V_t$ is the feature vector of the target webpage and each dimension represents one webpage feature word $w_i$. Each feature word structure stores the content of the feature word, the webpage ID, the index of the feature word's dimension in the webpage vector, and the frequency of the feature word.
Step 6.1, querying the target webpage feature words in a feature library, taking the feature library as the webpage candidate set, screening out of the candidate set the webpages that contain fewer target webpage feature words than a predefined threshold, and updating the webpage candidate set;
step 6.2, obtaining a candidate webpage list $P_i = (p_1, p_2, \ldots, p_n)$ from the updated webpage candidate set; if a feature word of a candidate webpage is contained in the target webpage feature words, adding that feature word to the candidate webpage vector $V_p = (w_1, w_2, \ldots, w_n)$, where the feature words $w_i$ are sorted according to their extraction order in the target webpage feature words, forming the set of candidate webpage feature vectors; $w_i$ is calculated as follows:
$$\text{tf-idf}_{i,j} = tf_{i,j} \times idf_i$$
$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where tf is the term frequency of a feature word, idf is the inverse document frequency, $n_{k,j}$ is the number of occurrences of the k-th word in webpage $j$, $|D|$ is the total number of webpages in the candidate set, and $|\{j : t_i \in d_j\}|$ is the number of webpages that contain the word $t_i$;
step 6.3, comparing the similarity of the candidate web page vector $V_p$ and the target web page vector $V_t$, and filtering out the web pages whose similarity measure is lower than a set threshold to obtain the final candidate web page set.
Specifically, because the feature words of the candidate web page $V_p$ are all contained in the target web page feature words, and the indexes in the target are sorted in ascending order, the similarity measure compares the longest ascending subsequence of the feature word indexes.
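The longest ascending subsequence comparison can be sketched as below. The sketch assumes each feature word of a candidate web page has already been mapped to its index in the target web page's feature word list, and it normalizes the result by the number of indexes to obtain a score between 0 and 1; that normalization and the function names are illustrative assumptions, not part of the disclosure.

```python
from bisect import bisect_left


def longest_ascending_length(indices):
    """Length of the longest strictly ascending subsequence, O(n log n)."""
    tails = []  # tails[k] = smallest last value of an ascending run of length k+1
    for x in indices:
        pos = bisect_left(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(tails)


def order_score(indices):
    """Fraction of the candidate's feature words that appear in target order."""
    if not indices:
        return 0.0
    return longest_ascending_length(indices) / len(indices)
```

For the index list `[0, 3, 1, 2, 5, 4]` the longest ascending subsequence has length 4 (for example 0, 1, 2, 5), giving a score of 4/6; a candidate whose words appear in fully reversed order scores 1/n.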
Specifically, the feature vectors are based on the Vector Space Model (VSM): each web page feature word corresponds to one dimension of the space, and each dimension is orthogonal to the others. The weights follow the TF-IDF algorithm.
Specifically, each web page corresponds to one feature vector, and each dimension of the feature vector is a weight. Round-two filtering is then performed according to the feature vectors.
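As a hedged illustration of the TF-IDF weighting referred to above, the following sketch builds one weight vector per web page over a shared vocabulary. It assumes each page is given as a list of already extracted feature words, uses the natural logarithm for idf, and assigns weight 0 to words that occur in no page; `tf_idf_vectors` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(pages: dict, vocabulary: list) -> dict:
    """Map each page id to a TF-IDF weight vector ordered like vocabulary."""
    n_pages = len(pages)
    doc_freq = Counter()  # number of pages that contain each word
    for words in pages.values():
        doc_freq.update(set(words))
    vectors = {}
    for page_id, words in pages.items():
        counts = Counter(words)
        total = len(words)
        vec = []
        for term in vocabulary:
            tf = counts[term] / total if total else 0.0
            idf = math.log(n_pages / doc_freq[term]) if doc_freq[term] else 0.0
            vec.append(tf * idf)
        vectors[page_id] = vec
    return vectors


# Hypothetical candidate set with two pages.
pages = {"a": ["login", "bank", "bank"], "b": ["login", "shop"]}
vectors = tf_idf_vectors(pages, ["login", "bank", "shop"])
```

In this example "login" occurs in every page, so its idf is log(1) = 0 and it contributes nothing to either vector, while "bank" receives weight (2/3)·log 2 in page "a".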
Step 6.3, comparing the similarity of the candidate web page vector $V_p$ and the target web page vector $V_t$, and filtering out the web pages whose similarity measure is lower than a set threshold to obtain the final candidate web page set; the cosine similarity is calculated as follows:
$$\cos(V_p, V_t) = \frac{V_p \cdot V_t}{\|V_p\| \, \|V_t\|}$$
Specifically, the size of the resulting candidate web page set is not necessarily 1.
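Round-two filtering by cosine similarity may be sketched as follows. The sketch assumes the feature vectors are equal-length lists of weights, treats a zero vector as having similarity 0, and keeps candidates whose similarity is at or above the threshold; the function names and the threshold value in the usage note are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a page with no weighted feature words matches nothing
    return dot / (norm_u * norm_v)


def filter_candidates(target, candidates, threshold):
    """Keep the ids of candidates whose similarity to target reaches threshold."""
    return [page_id for page_id, vec in candidates.items()
            if cosine_similarity(target, vec) >= threshold]
```

With a target vector `[1.0, 1.0]` and a threshold of 0.9, a candidate `[2.0, 2.0]` is kept (same direction), while `[1.0, 0.0]` (similarity about 0.71) and the zero vector are filtered out.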
In the matching stage, the fingerprint matching of the invention is based on the longest common subsequence (LCS) algorithm, and specifically comprises the following steps:
step 6.4, matching the final candidate webpage set against the webpage fingerprint of the target webpage from step five; specifically, the response header fingerprint and the HTML head fingerprint are linear sequences, so the LCS algorithm is applied to them to calculate the similarity between a candidate webpage and the target webpage;
step 6.5, the HTML body fingerprint is a concatenation of per-level fingerprints, each of which is a linear sequence; the LCS algorithm is applied to calculate the similarity of each level, and the average is taken as the fingerprint matching similarity of the body part;
step 6.6, computing a weighted sum of the fingerprint similarities of the three parts to obtain the final fingerprint similarity; if the similarity is greater than a set threshold the fingerprint matching succeeds, otherwise it fails, and the result is fed back.
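Steps 6.4 to 6.6 may be illustrated by the sketch below. The LCS length is computed with the standard dynamic program; dividing it by the longer sequence length to obtain a similarity, padding missing body levels with empty fingerprints, and the weights (0.2, 0.3, 0.5) are all assumptions chosen for illustration, since the disclosure gives neither the normalization nor the weights. A fingerprint is represented here as a dictionary with hypothetical keys `header`, `head` and `body`.

```python
def lcs_length(a: bytes, b: bytes) -> int:
    """Length of the longest common subsequence, rolling one-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: bytes, b: bytes) -> float:
    """LCS length divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def body_similarity(levels_a: list, levels_b: list) -> float:
    """Step 6.5: average of the per-level LCS similarities."""
    depth = max(len(levels_a), len(levels_b))
    if depth == 0:
        return 1.0
    total = 0.0
    for i in range(depth):
        la = levels_a[i] if i < len(levels_a) else b""
        lb = levels_b[i] if i < len(levels_b) else b""
        total += lcs_similarity(la, lb)
    return total / depth


def fingerprint_similarity(cand: dict, target: dict, weights=(0.2, 0.3, 0.5)) -> float:
    """Step 6.6: weighted sum over header, head and body similarities."""
    w_header, w_head, w_body = weights
    return (w_header * lcs_similarity(cand["header"], target["header"])
            + w_head * lcs_similarity(cand["head"], target["head"])
            + w_body * body_similarity(cand["body"], target["body"]))
```

A candidate would be reported as a match when `fingerprint_similarity` exceeds the configured threshold. Because the candidate set is small after filtering, the quadratic cost of the LCS computation stays bounded per request.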
Step seven, comparing the URL address of the webpage in the traffic with the URL address of the matched webpage; if the URL addresses are the same, the webpage in the traffic is a normal webpage and a matching log is stored; if the URL addresses are different, the webpage in the traffic is a malicious webpage and is blocked.
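The decision in step seven can be sketched as follows. The disclosure does not define when two URL addresses count as "the same"; the sketch assumes a comparison of host name and path that ignores letter case in the host and a trailing slash, and both that rule and the names `normalize` and `classify` are hypothetical.

```python
from urllib.parse import urlsplit


def normalize(url: str):
    """Reduce a URL to (host, path) for comparison (assumed rule)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path.rstrip("/") or "/"
    return host, path


def classify(traffic_url: str, matched_url: str) -> str:
    """Same URL as the fingerprint-matched page: normal; otherwise malicious."""
    if normalize(traffic_url) == normalize(matched_url):
        return "normal"
    return "malicious"
```

The underlying reasoning is that a page whose fingerprint matches a known page but which is served from a different address is imitating that page. A caller would store a matching log for a "normal" result and block the traffic for a "malicious" one.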
Definitions of terms used in the invention:
Static domain: the static domain of a web page is the set of elements, among the resources related to the web page, that are essentially fixed and unchanging. It includes the response message header, the DOM structure, the CSS structure, some of the JS files, and the like.
Static domain web page fingerprint: the result obtained by hashing part of the static domain resources of the web page (the DOM structure and the response header).
Embodiment 2 discloses a computer. The computer device of the present invention may be a device including a processor, a memory, and the like, for example a single-chip microcomputer including a central processing unit. The processor is configured to implement the steps of the static domain-based malicious web page identification and detection method described above when executing the computer program stored in the memory.
The Processor may be a Central Processing Unit (CPU), another general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor.
The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
Embodiment 3: computer-readable storage medium
The computer-readable storage medium of the present invention may be any form of storage medium that can be read by a processor of a computer device, including but not limited to non-volatile memory, ferroelectric memory, etc. A computer program is stored on the computer-readable storage medium, and when the computer program is read and executed by the processor of the computer device, the steps of the static-domain-based malicious webpage identification and detection method described above can be implemented.
The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed in an illustrative rather than a restrictive sense, and the scope of the present invention is defined by the appended claims.

Claims (9)

1. A malicious webpage identification and detection method based on a static domain, characterized by comprising the following steps:
step one, monitoring webpage traffic in real time, and extracting the URL (uniform resource locator) address from the HTTP header;
step two, matching the URL address from step one against the URL addresses stored in a blacklist library; if the matching succeeds, blocking the traffic, and if the matching fails, executing step three;
step three, parsing the webpage traffic that failed matching;
step four, crawling the JS and CSS files in the parsed webpage traffic;
step five, extracting the webpage fingerprint of the target webpage;
step six, identifying the webpage traffic; if the identification succeeds, executing step seven, and if the identification fails, executing step one;
step seven, comparing the URL addresses of the two webpages; if the URL addresses are the same, the webpage in the traffic is a normal webpage, and a matching log is stored; if the URL addresses are different, the webpage in the traffic is a malicious webpage, and blocking is carried out.
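For illustration only, the control flow of claim 1 can be sketched in Python as follows. The function name `handle_request`, the result labels, and the `identify` callable (which stands in for steps three to six and is assumed to return the URL of the known webpage whose fingerprint matched, or None) are hypothetical and not part of the claim.

```python
from typing import Callable, Optional


def handle_request(url: str,
                   blacklist: set[str],
                   identify: Callable[[str], Optional[str]]) -> str:
    """Return a decision label for one monitored request (illustrative labels)."""
    # Step two: a blacklist hit is blocked immediately.
    if url in blacklist:
        return "blocked:blacklist"
    # Steps three to six (parsing, crawling, fingerprinting, identification)
    # are folded into the hypothetical `identify` callable.
    matched_url = identify(url)
    if matched_url is None:
        # Identification failed: go back to monitoring (step one).
        return "unidentified"
    # Step seven: same URL means the known page itself; a different URL
    # means another site is serving a page with the same fingerprint.
    if matched_url == url:
        return "normal"
    return "blocked:malicious"
```

The sketch reads "the two webpages" in step seven as the webpage in the traffic and the library webpage it was identified as, which is an interpretation of the claim wording.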
2. The method according to claim 1, wherein parsing the webpage traffic that failed matching in step three comprises the following steps:
step 3.1, extracting the webpage source code from the response message;
step 3.2, reading the input source code character string sequentially and calling a recursive algorithm to parse it, with the initial parent node being empty;
step 3.3, when a start tag or text is parsed, setting the parent of the tag node or text node to the current parent node, and adding the node to the child list of that parent node;
step 3.4, parsing and extracting the attributes and values of tag nodes and the text information of text nodes;
step 3.5, returning when an end tag is parsed, until the entire input character string has been parsed, to obtain the complete DOM structure.
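The DOM construction of claim 2 can be sketched in Python as below. Note the substitution: instead of a hand-written recursive parser, the sketch uses the standard-library `html.parser.HTMLParser` with an explicit stack of open elements, which plays the role of the recursion. The `Node` class, the `#document` and `#text` labels, and the `VOID_TAGS` set are assumptions made for the example.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Elements that never have an end tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


@dataclass
class Node:
    tag: str                                   # "#text" for text nodes
    attrs: dict = field(default_factory=dict)  # attribute name -> value
    text: str = ""
    children: list = field(default_factory=list)


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]               # stack[-1] is the current parent

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self.stack[-1].children.append(node)   # add to the parent's child list
        if tag not in VOID_TAGS:
            self.stack.append(node)            # descend into the new element

    def handle_endtag(self, tag):
        # Return to the parent of the nearest matching open element, if any.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node("#text", text=data.strip()))


def parse_dom(source: str) -> Node:
    builder = DomBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

Calling `parse_dom(page_source)` returns the synthetic root node, whose children form the parsed tree; unmatched end tags are ignored rather than reported.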
3. The method according to claim 1, wherein crawling the JS and CSS files in the parsed webpage traffic in step four comprises the following steps:
step 4.1, initiating a request to the target site through an HTTP library, that is, sending a Request;
step 4.2, receiving the response content returned by the server;
step 4.3, parsing the content, and storing the parsed JS and CSS resources as webpage features.
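The parsing part of claim 3 might look like the following Python sketch, which collects the JS and CSS resource URLs referenced by a page. Sending the request and receiving the response are left out so the example needs no network access; the class and function names and the example URL are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect external script and stylesheet URLs from response content."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.scripts = []
        self.stylesheets = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "script" and attr.get("src"):
            # Resolve relative references against the page URL.
            self.scripts.append(urljoin(self.base_url, attr["src"]))
        elif tag == "link" and attr.get("href"):
            rel = (attr.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.stylesheets.append(urljoin(self.base_url, attr["href"]))


def collect_resources(html: str, base_url: str):
    collector = ResourceCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.scripts, collector.stylesheets
```

Inline scripts without a `src` attribute are skipped in this sketch; whether the method treats inline code as a webpage feature is not stated in the claim.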
4. The method of claim 3, wherein extracting the webpage fingerprint of the target webpage in step five comprises extracting fingerprints from the response message header, the head subtree of the HTML DOM tree, and the body subtree of the HTML DOM tree;
the response message header fingerprint is extracted as follows: classifying the response header fields, and performing a hash operation on the fields directly related to the webpage source code to extract fingerprint segments;
the fingerprint of the head subtree of the HTML DOM tree is extracted as follows: treating some attributes of the element nodes in the head subtree as a series of key-value pairs, and extracting fingerprints in a manner similar to that used for the response header;
the fingerprint of the body subtree of the HTML DOM tree is extracted as follows: performing layered fingerprint extraction on each of the first M layers of nodes of the body part, wherein one byte of fingerprint data is extracted from each element node, and the layered fingerprints are spliced in extraction order into the fingerprint of the body tree, thereby converting the multi-way tree structure into a linear structure.
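The layered extraction for the body subtree can be sketched in Python as a level-by-level traversal. The tree representation (nested `(tag, children)` tuples) and the way one byte is derived per element node (the first byte of a SHA-256 digest of the tag name) are assumptions for the example; the claim only states that each element node contributes one byte.

```python
import hashlib


def node_byte(tag: str) -> int:
    # Hypothetical one-byte digest of an element node: first byte of SHA-256(tag).
    return hashlib.sha256(tag.encode("utf-8")).digest()[0]


def body_fingerprint(body: tuple, max_depth: int) -> list[bytes]:
    """Return one fingerprint per layer for the first `max_depth` layers.

    `body` is a (tag, [children]) tuple; each child has the same shape.
    """
    layers = []
    current = [body]
    for _ in range(max_depth):
        if not current:
            break
        # One byte per element node, in left-to-right order within the layer.
        layers.append(bytes(node_byte(tag) for tag, _ in current))
        # Move to the next layer: all children of the current layer's nodes.
        current = [child for _, children in current for child in children]
    return layers
```

Joining the result with `b"".join(layers)` gives the linear body fingerprint; keeping the layers separate is convenient when each layer is compared on its own.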
5. The method according to claim 4, wherein identifying the webpage traffic in step six comprises the following steps:
step 6.1, querying the feature words of the target webpage in a feature library, taking the feature library as the webpage candidate set, filtering out of the candidate set the webpages that contain fewer target webpage feature words than a predefined threshold, and updating the webpage candidate set;
step 6.2, obtaining a candidate webpage list $P_i = (p_1, p_2, \ldots, p_n)$ from the updated webpage candidate set; if a candidate webpage $V_p$ contains a feature word that is among the target webpage feature words, adding that feature word to the candidate webpage vector $V_p = (w_1, w_2, \ldots, w_n)$, wherein the feature words $w_i$ are ordered according to the order in which the feature words were extracted from the target webpage, thereby forming the set of candidate webpage feature vectors; $w_i$ is calculated as follows:
$$\text{tf-idf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$
$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}, \qquad \mathrm{idf}_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
wherein tf denotes the term frequency of a feature word, idf denotes the inverse document frequency, $n_{k,j}$ denotes the number of occurrences of the k-th word in webpage $j$, $|D|$ denotes the total number of webpages in the candidate set, and $|\{j : t_i \in d_j\}|$ denotes the number of webpages containing the word $t_i$;
step 6.3, calculating the similarity between each candidate webpage vector $V_p$ and the target webpage vector $V_t$, and filtering out the webpages whose similarity measure is lower than a set threshold to obtain the final candidate webpage set;
step 6.4, matching the final candidate webpage set against the webpage fingerprint of the target webpage from step five, specifically: the response header fingerprint and the HTML head fingerprint are linear sequences, so the similarity between the candidate webpage and the target webpage is calculated by applying the LCS algorithm;
step 6.5, the fingerprint of the HTML body part is formed by splicing layered fingerprints, each of which is a linear sequence; the similarity of each layer's fingerprints is calculated by applying the LCS algorithm, and the average value is taken as the fingerprint matching similarity of the body part;
step 6.6, finally, calculating a weighted combination of the fingerprint similarities of the three parts to obtain the final fingerprint similarity; if the similarity is greater than a set threshold, the fingerprint matching is judged successful, otherwise it is judged failed, and the result is fed back.
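The LCS-based matching and weighting at the end of claim 5 can be sketched in Python as follows. The claim does not say how the LCS length is normalized, what the weights are, or what the threshold is; dividing by the longer sequence length, the weights (0.3, 0.3, 0.4), the threshold 0.8, and the dictionary keys `header`, `head`, and `body` are all assumptions for the example.

```python
from itertools import zip_longest


def lcs_length(a: bytes, b: bytes) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_similarity(a: bytes, b: bytes) -> float:
    # Assumed normalization: LCS length over the longer sequence length.
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def body_similarity(layers_a: list, layers_b: list) -> float:
    # Compare layer by layer and average; a missing layer counts as empty.
    pairs = list(zip_longest(layers_a, layers_b, fillvalue=b""))
    if not pairs:
        return 1.0
    return sum(lcs_similarity(x, y) for x, y in pairs) / len(pairs)


def fingerprint_match(target: dict, candidate: dict,
                      weights=(0.3, 0.3, 0.4), threshold=0.8):
    """Return (score, matched) for two fingerprints with header/head/body parts."""
    header = lcs_similarity(target["header"], candidate["header"])
    head = lcs_similarity(target["head"], candidate["head"])
    body = body_similarity(target["body"], candidate["body"])
    score = weights[0] * header + weights[1] * head + weights[2] * body
    return score, score > threshold
```

Averaging over layers rather than running LCS on the concatenated body fingerprint keeps a difference deep in the tree from being masked by long identical upper layers, which appears to be the motivation for the per-layer comparison.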
6. The method according to claim 2, wherein the webpage fingerprint structure is defined when the webpage traffic that failed matching is parsed in step three.
7. The method according to claim 6, wherein the webpage fingerprint structure in step three consists of a 4-bit flag field, a 12-bit fingerprint length field, and 0 to 4096 bytes of fingerprint data.
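A possible byte layout for the fingerprint structure of claim 7 is sketched below in Python. Placing the flag in the high 4 bits of a 16-bit big-endian header, followed by the 12-bit length, is an assumption; the claim does not give the bit order. A 12-bit field can represent lengths 0 to 4095 only, so this sketch rejects a 4096-byte payload; how the patent encodes that upper bound is not stated.

```python
import struct


def pack_fingerprint(flag: int, data: bytes) -> bytes:
    """Pack a 4-bit flag and a 12-bit length into 2 bytes, followed by the data."""
    if not 0 <= flag <= 0xF:
        raise ValueError("flag must fit in 4 bits")
    if len(data) > 0xFFF:
        raise ValueError("data length must fit in 12 bits")
    header = (flag << 12) | len(data)       # assumed layout: ffff llll llll llll
    return struct.pack(">H", header) + data


def unpack_fingerprint(blob: bytes):
    """Inverse of pack_fingerprint; returns (flag, data)."""
    (header,) = struct.unpack(">H", blob[:2])
    flag, length = header >> 12, header & 0xFFF
    data = blob[2:2 + length]
    if len(data) != length:
        raise ValueError("truncated fingerprint")
    return flag, data
```

Under this layout the fixed header costs two bytes per fingerprint, and the length field lets several fingerprints be stored back to back and split again without separators.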
8. A computer comprising a memory storing a computer program and a processor, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202111340418.3A 2021-11-12 2021-11-12 Malicious webpage identification detection method based on static domain, computer and storage medium Active CN114372267B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111340418.3A CN114372267B (en) 2021-11-12 2021-11-12 Malicious webpage identification detection method based on static domain, computer and storage medium

Publications (2)

Publication Number Publication Date
CN114372267A (en) 2022-04-19
CN114372267B (en) 2024-05-28

Family

ID=81137816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111340418.3A Active CN114372267B (en) 2021-11-12 2021-11-12 Malicious webpage identification detection method based on static domain, computer and storage medium

Country Status (1)

Country Link
CN (1) CN114372267B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103902889A (en) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 Malicious message cloud detection method and server
WO2015172567A1 (en) * 2014-05-12 2015-11-19 中国科学院计算机网络信息中心 Internet information searching, aggregating and presentation method
CN108509794A (en) * 2018-03-09 2018-09-07 中山大学 A kind of malicious web pages defence detection method based on classification learning algorithm
WO2020248379A1 (en) * 2019-06-11 2020-12-17 平安科技(深圳)有限公司 Method for searching for similar network pages, and apparatus
CN111967063A (en) * 2020-09-02 2020-11-20 开普云信息科技股份有限公司 Data tampering monitoring and identifying method and device based on multi-dimensional analysis, electronic equipment and storage medium thereof
CN113609246A (en) * 2021-08-04 2021-11-05 上海犇众信息技术有限公司 Webpage similarity detection method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
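Wen Kai: "Design of a malicious webpage detection system and its application in cloud architecture", 31 December 2013 (2013-12-31) *
Wang Zhengqi; Feng Xiaobing; Zhang Chi: "Research on a fast malicious webpage detection system based on a two-layer classifier", Chinese Journal of Network and Information Security, no. 08, 15 August 2017 (2017-08-15) *
Mo Qianqian; Zhang Yuan: "Identification and risk assessment of cross-domain access behavior in Cordova applications", Computer Applications and Software, no. 02, 12 February 2020 (2020-02-12) *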

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114900363A (en) * 2022-05-18 2022-08-12 杭州安恒信息技术股份有限公司 Malicious website identification method and device, electronic equipment and storage medium
CN114900363B (en) * 2022-05-18 2024-05-14 杭州安恒信息技术股份有限公司 Malicious website identification method and device, electronic equipment and storage medium
CN116305296A (en) * 2023-05-19 2023-06-23 北京长亭科技有限公司 Web fingerprint identification method, system, equipment and storage medium
CN116305296B (en) * 2023-05-19 2023-07-21 北京长亭科技有限公司 Web fingerprint identification method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN114372267B (en) 2024-05-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant