CN111177619B

CN111177619B - Webpage identification method and device, storage medium and processor

Info

Publication number: CN111177619B
Application number: CN201911320564.2A
Authority: CN
Inventors: 蒋自立; 贺志强; 许勇
Original assignee: Hillstone Networks Co Ltd
Current assignee: Hillstone Networks Co Ltd
Priority date: 2019-12-19
Filing date: 2019-12-19
Publication date: 2022-09-09
Anticipated expiration: 2039-12-19
Also published as: CN111177619A

Abstract

The invention discloses a webpage identification method, a webpage identification device, a storage medium and a processor. The method comprises the following steps: acquiring target information obtained by accessing a target network resource address; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and determining, based on the abnormal information, whether the target webpage corresponding to the target network resource address is abnormal. The invention solves the technical problem of low accuracy of webpage identification.

Description

Webpage identification method and device, storage medium and processor

Technical Field

The invention relates to the field of internet, in particular to a webpage identification method, a webpage identification device, a storage medium and a processor.

Background

Currently, whether a web page is abnormal is usually determined from its Hypertext Transfer Protocol (HTTP) status code: the status line of the HTTP response is obtained, the status code is extracted from it, and the web page to be processed is judged directly on that code. For example, when the returned status code is an abnormal status code, the web page is directly determined to be abnormal; when it is not, the web page is directly determined to be normal.

However, this method cannot effectively identify an abnormal web page when the site returns a self-defined status code rather than the status code of an abnormal page. In addition, if the site performs fault-tolerance processing, the server may return the current page or a default page when the requested page is wrong, so the web page is mistakenly determined to be normal and cannot be identified as abnormal. Web page identification therefore suffers from the technical problem of low accuracy.

In view of the above technical problem of low accuracy in identifying web pages, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the invention provides a webpage identification method, a webpage identification device, a storage medium and a processor, and at least solves the technical problem of low accuracy of webpage identification.

According to one aspect of the embodiment of the invention, a webpage identification method is provided. The method comprises the following steps: acquiring target information obtained by accessing a target network resource address; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information.

Optionally, the format of the abnormal network resource address is the same as the format of the target network resource address.

Optionally, the method further comprises: under the condition that the file name exists in the target network resource address, modifying the file name, and determining the target network resource address after the file name is modified as an abnormal network resource address; and under the condition that the file name does not exist in the target network address, randomly generating a target file name, and determining the target network resource address comprising the target file name as an abnormal network resource address.

Optionally, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: and in the case that the target information does not comprise an abnormal state code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal or not based on the abnormal information.

Optionally, in a case where the target information does not include the abnormal status code, determining whether the target webpage is abnormal based on the abnormal information includes: if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage.

Optionally, the method further comprises: if the abnormal information does not include the abnormal state code, comparing a first response head value in the target information with a second response head value in the abnormal information; and under the condition that the first response head value is different from the second response head value, determining that the target webpage is a non-target abnormal webpage.

Optionally, the method further comprises: under the condition that the first response head value and the second response head value are the same, acquiring a first similarity between the target information and the abnormal information; determining the target webpage as a target abnormal webpage under the condition that the first similarity is larger than a first threshold value; and under the condition that the first similarity is not larger than a first threshold value, determining that the target webpage is a non-target abnormal webpage.

Optionally, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: dividing a target abnormal webpage corresponding to the abnormal network resource address into a plurality of first sub-webpages; identifying an abnormal sub-web page comprising abnormal information from a plurality of first sub-web pages; dividing the target webpage into a plurality of second sub-webpages, and determining the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages; acquiring a second similarity between the abnormal sub-web page and the target sub-web page; determining the target webpage as a target abnormal webpage under the condition that the second similarity is larger than a second threshold value; and under the condition that the second similarity is not larger than a second threshold value, determining that the target webpage is a non-target abnormal webpage.

According to another aspect of the embodiment of the invention, another webpage identification method is also provided. The method can comprise the following steps: displaying target information obtained by accessing the target network resource address in the interactive interface; displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information.

According to another aspect of the embodiment of the invention, a webpage identification device is also provided. The device comprises: a first acquisition unit configured to acquire target information obtained by accessing a target network resource address; the second acquisition unit is used for acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and the third acquisition unit is used for determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information.

According to another aspect of the embodiment of the invention, another webpage identification device is also provided. The device includes: the first display unit is used for displaying target information obtained by accessing the target network resource address in the interactive interface; the second display unit is used for displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and the output unit is used for outputting the identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium. The storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the web page identification method according to any one of the embodiments of the present invention.

According to another aspect of the embodiments of the present invention, there is also provided a processor. The processor is used for running the program, wherein the program executes the webpage identification method in the embodiment of the invention when running.

In the embodiment of the invention, target information obtained by accessing a target network resource address is acquired; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information. That is to say, the method and the device forge an abnormal network resource address based on the target network resource address, determine whether the target webpage corresponding to the target network resource address is abnormal or not by accessing the abnormal information obtained by the abnormal network resource address, and avoid directly determining whether the target webpage is abnormal or not based on the returned status code, thereby achieving the purpose of effectively identifying whether the target webpage is abnormal or not, achieving the technical effect of improving the accuracy of identifying the webpage, and further solving the technical problem of low accuracy of identifying the webpage.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method of web page identification according to an embodiment of the invention;

FIG. 2 is a flow diagram of another method for web page identification according to an embodiment of the invention;

FIG. 3 is a flow chart of a method for collecting response information of a true 404 according to an embodiment of the present invention;

FIG. 4 is a flow diagram of a method of spoofing a URL when a file name exists in a true URL in accordance with an embodiment of the present invention;

FIG. 5 is a flow diagram of a method of forging a URL when a filename does not exist in the URL, in accordance with embodiments of the present invention;

FIG. 6 is a flow diagram of a method of identifying a 404 page in accordance with an embodiment of the present invention;

FIG. 7 is a flowchart of an apparatus for determining 404 pages according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a location of 404 relevant content in a web page according to an embodiment of the invention;

FIG. 9 is a diagram illustrating an apparatus for identifying web pages according to an embodiment of the present invention; and

FIG. 10 is a schematic diagram of another web page recognition apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present invention, a method embodiment of a web page identification method is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.

Fig. 1 is a flowchart of a web page identification method according to an embodiment of the present invention. As shown in fig. 1, the method may include the steps of:

Step S102, obtaining the target information obtained by accessing the target network resource address.

In the technical solution provided by step S102 of the present invention, the target network resource address is the address of the target web page whose abnormality is to be determined. It may be a Uniform Resource Locator (URL); that is, the target network resource address is an actually existing URL.

In this embodiment, the client accesses the target network resource address by sending a request to the server of the target site where the target network resource address is located. Before the client receives and displays the target webpage corresponding to the target network resource address, the server responds to the request by returning the target information to the client, so that the client obtains the target information obtained by accessing the target network resource address. The client may be a browser. The target information may include a response line, which includes the status code returned by the server; a response header, which includes the type of the target web page; and response body information, which includes the content of the target web page.
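As an illustration of the three parts of the target information described above, the following minimal Python sketch splits a raw HTTP/1.x response into its status code, response headers, and response body. The names `ResponseInfo` and `parse_response` are hypothetical and do not appear in the embodiment; the sketch assumes a well-formed textual response and is not a full HTTP parser.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ResponseInfo:
    """Hypothetical container for the parts of a response used later on."""
    status_code: int          # taken from the response line
    headers: Dict[str, str]   # response headers, names lower-cased
    body: str                 # response body (page content)


def parse_response(raw: str) -> ResponseInfo:
    """Split a raw HTTP/1.x response into status code, headers and body."""
    # Headers and body are separated by an empty line.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # The response line looks like "HTTP/1.1 200 OK".
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return ResponseInfo(status_code, headers, body)


raw = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
info = parse_response(raw)
print(info.status_code, info.headers["content-type"])
```

In practice an HTTP client library would normally supply these fields directly; the sketch only shows which pieces of the response the later comparison steps rely on.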

In this embodiment, the status code, i.e. the return code, may be an HTTP status code for indicating the HTTP response status of the web server, where HTTP is an application layer protocol for distributed, collaborative and hypermedia information systems, and a user may easily access resources on the internet through the HTTP protocol based on web communication.

Step S104, acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address.

In the technical solution provided in step S104 of the present invention, the target network resource address may be analyzed in advance and modified to obtain an abnormal network resource address, which may be a URL. That is, the abnormal network resource address of this embodiment may be a network resource address that does not actually exist and is forged from the real target network resource address. It can nevertheless be accessed to obtain abnormal information, and the abnormal information may be stored in advance for subsequent use.

The abnormal information of this embodiment may be the response information with which the server responds to the request for accessing the abnormal network resource address, for example, 404 response information. Such response information includes the 404 Not Found status code, which is the status code returned under the HTTP protocol when a web page cannot be found. When a user enters a network resource address in the client, the server determines from that address whether corresponding web page information exists. If it does not, the user has entered an invalid link, and the server may return the 404 Not Found status code to tell the user that there is no corresponding web page information. The abnormal information may also include the type of the target abnormal web page obtained from the abnormal network resource address and the content of the target abnormal web page, where the target abnormal web page may be a 404 page. An abnormal page need not be analyzed or processed further, and its network resource address may be deleted from the constructed site tree. Optionally, the response information with which the server responds to a request for accessing a normal network resource address may include a 200 status code.

After the target information obtained by accessing the target network resource address is acquired, the above-mentioned abnormality information corresponding to the target information may be acquired.

Step S106, determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information.

In the technical solution provided by step S106 of the present invention, after the abnormal information corresponding to the target information is obtained, whether the target webpage corresponding to the target network resource address is abnormal may be determined based on the abnormal information. Specifically, the status codes, types, and contents of the target webpage and of the abnormal page corresponding to the abnormal network resource address may be compared and analyzed to determine whether the target webpage is indeed an abnormal page, for example, whether it is a 404 page. This avoids determining whether the target webpage is abnormal directly from the returned status code. It covers both the case where the site uses a self-defined status code rather than the 404 Not Found status code of a 404 page, and the case where the site performs fault-tolerance processing and returns the current page or a default page when an access is wrong. Whether the target webpage is abnormal can therefore be identified accurately, which achieves the technical effect of improving the accuracy of identifying the webpage.

Through steps S102 to S106, target information obtained by accessing the target network resource address is acquired; abnormal information corresponding to the target information is acquired, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and whether the target webpage corresponding to the target network resource address is abnormal is determined based on the abnormal information. That is to say, in this embodiment an abnormal network resource address is forged from the target network resource address, and whether the target webpage is abnormal is determined using the abnormal information obtained by accessing the abnormal network resource address, rather than directly from the returned status code. This achieves the purpose of effectively identifying whether the target webpage is abnormal, achieves the technical effect of improving the accuracy of identifying the webpage, and solves the technical problem of low accuracy of identifying the webpage.

The above-described method of this embodiment is further described below.

As an alternative embodiment, the format of the abnormal network resource address is the same as the format of the target network resource address.

In this embodiment, the abnormal network resource address is obtained by modifying the target network resource address. Some applications are very sensitive to the format of a network resource address and even apply different processing flows, including different 404 processing flows, to addresses with different formats; for example, accessing different directories, or files with different suffixes, may produce different 404 responses. The key point of this embodiment is therefore to ensure that the format of the forged abnormal network resource address is consistent with the format of the real target network resource address. When this embodiment modifies the target network resource address, the format of the address is kept unchanged and only its main content is modified, so as to obtain the abnormal network resource address.

As an optional implementation, the method further comprises: under the condition that the file name exists in the target network resource address, modifying the file name, and determining the target network resource address after the file name is modified as an abnormal network resource address; and under the condition that the file name does not exist in the target network address, randomly generating a target file name, and determining the target network resource address comprising the target file name as an abnormal network resource address.

In this embodiment, when the target network resource address is modified to obtain the abnormal network resource address, it may first be determined whether a file name exists in the target network resource address; for example, the file name may be that of a structured file. If a file name exists in the target network resource address, the file name can be modified while the format of the target network resource address is kept unchanged, for example, with regard to the "extension" in the file name. Optionally, in this embodiment, the letter characters in the file name are replaced with random letters, the digit characters in the file name are replaced with random digits, and the special symbols in the file name are kept unchanged. The modified file name is thus obtained, and the target network resource address with the modified file name is determined as the abnormal network resource address.

Alternatively, if it is determined that no file name exists in the target network resource address, the embodiment may randomly generate a file name, for example, randomly generate a fixed-length target file name containing numbers and letters, and determine the target network resource address including the target file name as the abnormal network resource address.
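The two forging cases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function names `forge_url` and `_randomize` are hypothetical, the last path segment is treated as the file name, the file extension is left untouched so that the address format stays the same, and the generated name length of 12 is an arbitrary choice.

```python
import posixpath
import random
import string
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def _randomize(name: str, rng: random.Random) -> str:
    """Replace letters with random letters and digits with random digits;
    special symbols are kept unchanged."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalpha():
            pool = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
            out.append(rng.choice(pool))
        elif ch.isascii() and ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)
    return "".join(out)


def forge_url(target_url: str, rng: Optional[random.Random] = None,
              length: int = 12) -> str:
    """Forge an abnormal URL that has the same format as target_url."""
    rng = rng or random.Random()
    parts = urlsplit(target_url)
    directory, filename = posixpath.split(parts.path)
    if filename:
        # Case 1: a file name exists. Assumption: keep the extension as-is.
        stem, ext = posixpath.splitext(filename)
        new_stem = stem
        for _ in range(10):  # retry if the random name happens to equal the original
            new_stem = _randomize(stem, rng)
            if new_stem != stem:
                break
        new_name = new_stem + ext
    else:
        # Case 2: no file name. Generate a fixed-length name of letters and digits.
        alphabet = string.ascii_lowercase + string.digits
        new_name = "".join(rng.choice(alphabet) for _ in range(length))
    new_path = posixpath.join(directory or "/", new_name)
    # Scheme, host, query and fragment are carried over unchanged.
    return urlunsplit(parts._replace(path=new_path))


rng = random.Random(7)
print(forge_url("http://example.com/news/item-2019.html?id=3", rng))
print(forge_url("http://example.com/docs/", rng))
```

The forged address keeps the directory, extension, and query string of the original, so a site that routes 404 handling by directory or suffix is likely to treat both addresses the same way.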

As an optional implementation manner, in step S106, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: and in the case that the target information does not comprise an abnormal state code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal or not based on the abnormal information.

In this embodiment, the abnormal status code corresponding to the abnormal network resource address may be a 404 status code. When determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormal information, it may first be determined whether the target information includes the abnormal status code. If it does, the target webpage may be determined to be an abnormal page; for example, if the target information includes the 404 status code, the target webpage is determined to be a 404 page.

If the target information does not include the abnormal status code, the site may have self-defined its abnormal information (including its 404 pages), so the target webpage cannot be directly determined to be a normal webpage. Whether the target webpage is abnormal needs to be further determined based on the status code, content, and the like included in the abnormal information obtained by accessing the abnormal network resource address.

As an optional implementation manner, in the case that the target information does not include the abnormal status code, determining whether the target webpage is abnormal based on the abnormal information includes: if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, when the target information does not include the abnormal state code, it may be determined whether the abnormal information includes the abnormal state code first, for example, it may be determined whether the abnormal information includes the 404 state code, and if it is determined that the abnormal information includes the abnormal state code, it may be determined that the current target webpage is a non-target abnormal webpage, for example, a non-404 page.

As an optional implementation, the method further comprises: if the abnormal information does not include the abnormal state code, comparing a first response head value in the target information with a second response head value in the abnormal information; and under the condition that the first response head value is different from the second response head value, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, if it is determined that the abnormal information does not include the abnormal status code, a first response header value in the target information and a second response header value in the abnormal information need to be further compared, where the first response header value may indicate the content type (Content-Type) of the target web page, and the second response header value may indicate the content type of the target abnormal web page. The embodiment may determine whether the first response header value and the second response header value are the same, that is, whether the content types of the target information and the abnormal information are the same. If the first response header value and the second response header value are different, the target webpage is determined to be a non-target abnormal webpage, for example, a non-404 page.

As an optional implementation, the method further comprises: under the condition that the first response head value and the second response head value are the same, acquiring a first similarity between the target information and the abnormal information; determining the target webpage as a target abnormal webpage under the condition that the first similarity is larger than a first threshold value; and under the condition that the first similarity is not larger than a first threshold value, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, if it is determined that the first response header value and the second response header value are the same, a first similarity indicating a degree of similarity between the target information and the abnormal information may be further calculated, for example, a first similarity between an entity (body) of the target information and an entity of the abnormal information may be calculated, and the first similarity may be used to represent a similarity between a target webpage corresponding to the target information and a target abnormal webpage corresponding to the abnormal information. The embodiment may determine whether the first similarity is greater than a first threshold, and the first threshold may be configured according to a specific scenario. If the first similarity is judged to be larger than the first threshold value, namely the similarity between the target information and the abnormal information is higher, the target webpage can be determined to be a target abnormal webpage because the webpage corresponding to the abnormal information is the target abnormal webpage; optionally, if it is determined that the first similarity is not greater than the first threshold, that is, the degree of similarity between the target information and the abnormal information is low, the target webpage is determined to be a non-target abnormal webpage, for example, the target webpage is determined to be a non-404 page.
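The sequence of checks described in the optional implementations above (abnormal status code, then Content-Type, then first similarity) can be sketched as one decision function. This is a hedged illustration only: `ResponseInfo` and `is_target_abnormal` are hypothetical names, the first threshold of 0.9 is an assumed value (the embodiment leaves it configurable), and Python's `difflib.SequenceMatcher` ratio stands in for whatever similarity measure an implementation actually uses.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class ResponseInfo:
    """Hypothetical holder for the fields the comparison needs."""
    status_code: int
    content_type: str
    body: str


def is_target_abnormal(target: ResponseInfo, abnormal: ResponseInfo,
                       abnormal_code: int = 404,
                       first_threshold: float = 0.9) -> bool:
    """Return True if the target page is judged to be a target abnormal page."""
    # The target itself carries the abnormal status code: it is a 404 page.
    if target.status_code == abnormal_code:
        return True
    # The forged address yields a real 404 while the target does not,
    # so the site reports missing pages honestly: non-404 page.
    if abnormal.status_code == abnormal_code:
        return False
    # Different content types: non-404 page.
    if target.content_type != abnormal.content_type:
        return False
    # Same content type: compare the response bodies (first similarity).
    first_similarity = SequenceMatcher(
        None, target.body, abnormal.body, autojunk=False).ratio()
    return first_similarity > first_threshold


soft_404 = ResponseInfo(200, "text/html",
                        "<html><body>Sorry, page not found</body></html>")
article = ResponseInfo(
    200, "text/html",
    "<html><body><h1>Quarterly report</h1>"
    "<p>Revenue grew this year.</p></body></html>")
print(is_target_abnormal(soft_404, soft_404))  # identical bodies
print(is_target_abnormal(article, soft_404))   # clearly different bodies
```

Ordering the checks from cheapest to most expensive means the body comparison runs only when the status code and Content-Type cannot settle the question.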

As an optional implementation manner, in step S106, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: dividing a target abnormal webpage corresponding to the abnormal network resource address into a plurality of first sub-webpages; identifying an abnormal sub-web page comprising abnormal information from a plurality of first sub-web pages; dividing the target webpage into a plurality of second sub-webpages, and determining the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages; acquiring a second similarity between the abnormal sub-web page and the target sub-web page; determining the target webpage as a target abnormal webpage under the condition that the second similarity is larger than a second threshold value; and under the condition that the second similarity is not larger than a second threshold value, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, in the web pages returned by the servers of some sites, the content indicating the web page abnormality may occupy only a very small part of the page. For example, such a web page includes a title bar, a navigation bar, a copyright notice, a unit identifier, 404-related content, other content, and the like.

In the above case, the content of the HTML of the target abnormal web page may be:

The first similarity between the HTML content of the target abnormal web page and the HTML content of a normal web page may then be very high, for example, above 99%. If the target web page were determined to be the target abnormal web page whenever the first similarity exceeds the first threshold, even a normal target web page would be determined to be the target abnormal web page, which causes misjudgment of the target web page.

In view of the above problem, in this embodiment the target web page and the target abnormal web page may each be analyzed as a whole and divided into different regions. The target abnormal web page corresponding to the abnormal network resource address may be divided into a plurality of first sub-webpages, and the abnormal sub-webpage that includes the abnormal information may be identified from the plurality of first sub-webpages; for example, the abnormal sub-webpage may be the region where the 404 content is located. The embodiment may further divide the target webpage into a plurality of second sub-webpages, and determine the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages.

In the embodiment, the inverse of the maximum (longest) common substring algorithm can be used to extract the content that differs between the target webpage and the target abnormal webpage. This content is analyzed and processed and irrelevant parts are removed, finally yielding the abnormal sub-webpage and the target sub-webpage. The second similarity between the abnormal sub-webpage and the target sub-webpage is then calculated and can be used as the similarity between the target webpage as a whole and the target abnormal webpage. The embodiment then determines whether the second similarity is greater than a second threshold: if so, the target webpage can be determined to be the target abnormal webpage; if not, the target webpage can be determined to be a non-target abnormal webpage.
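The region-based comparison can be illustrated with a minimal Python sketch. The text does not spell out the inverse common-substring procedure, so the sketch substitutes a line-level diff from the standard `difflib` module as a stand-in: lines shared by both pages are treated as irrelevant parts, and the remaining lines are treated as the sub-webpages. The function names `differing_regions` and `region_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def differing_regions(page_404: str, target: str) -> Tuple[str, str]:
    """Return the lines that differ between the two pages (assumed stand-in
    for the abnormal sub-webpage and the target sub-webpage)."""
    lines_404 = page_404.splitlines()
    lines_target = target.splitlines()
    matcher = SequenceMatcher(None, lines_404, lines_target, autojunk=False)
    only_404, only_target = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # shared lines (title bar, footer, ...) are dropped
            only_404.extend(lines_404[i1:i2])
            only_target.extend(lines_target[j1:j2])
    return "\n".join(only_404), "\n".join(only_target)


def region_similarity(page_404: str, target: str) -> float:
    """Second similarity: character-level ratio of the differing regions."""
    region_404, region_target = differing_regions(page_404, target)
    if not region_404 and not region_target:
        return 1.0  # the pages are identical
    return SequenceMatcher(None, region_404, region_target, autojunk=False).ratio()


chrome = "<header>Site</header>\n<nav>Home | News</nav>"
footer = "<footer>Copyright</footer>"
page_404 = chrome + "\n<p>Sorry, page /aaa was not found</p>\n" + footer
soft_404 = chrome + "\n<p>Sorry, page /bbb was not found</p>\n" + footer
normal = chrome + "\n<p>Quarterly results and product news for all regions</p>\n" + footer
```

With these sample pages, the shared header, navigation bar, and footer no longer inflate the score: `region_similarity(page_404, soft_404)` stays high while `region_similarity(page_404, normal)` drops, so a second threshold can separate the two cases.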

For example, the site performs fault-tolerant processing on the user's access through the client. When the client is displaying a target webpage and then accesses an incorrect, non-existent abnormal network resource address, the server does not directly return an abnormal status code to the client, but returns the target webpage currently displayed by the client together with a status code of 200. The returned webpage actually indicates that the client accessed an abnormal network resource address, and should therefore be regarded as the target abnormal webpage. In this case, the target webpage and the target abnormal webpage of the embodiment are the same, and similarity calculation between them shows that the target webpage is in fact the target abnormal webpage. In the prior art, by contrast, the status code returned with the target webpage is 200 rather than 404, so the target webpage is determined to be a normal webpage on the basis of the 200 status code, and the target abnormal webpage is misjudged as normal. The method of this embodiment can therefore effectively improve the accuracy of identifying the webpage and solves the technical problem of low accuracy of webpage identification.

The embodiment of the invention also provides a flow chart of another webpage identification method from the interactive side.

Fig. 2 is a flowchart of another web page identification method according to an embodiment of the present invention. As shown in fig. 2, the method may include the steps of:

Step S202, displaying target information obtained by accessing the target network resource address in the interactive interface.

In the technical solution provided in step S202 of the present invention, the target network resource address is the address of the web page to be checked, that is, it is to be determined whether the target web page obtained by accessing it is abnormal; the target network resource address may be an actually existing URL.

In this embodiment, the client accesses the target network resource address by sending a request to the server of the target site where the target network resource address is located. Before the client receives and displays the target webpage corresponding to the target network resource address, the server responds to the request and returns to the client the target information obtained by accessing the target network resource address. The client thus obtains the target information and displays it in the interactive interface. The client may be a browser, and the target information may include a response line, a response header, and response body information, where the response line contains the status code returned by the server, the response header contains the type of the target web page, and the response body contains the content of the target web page.

Step S204, displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address.

In the technical solution provided in step S204 of the present invention, the target network resource address may be analyzed and modified in advance to obtain an abnormal network resource address, which may be a URL. That is, the abnormal network resource address of this embodiment may be a network resource address that does not actually exist and is forged on the basis of the real target network resource address. It can nevertheless be accessed to obtain abnormal information, and the abnormal information may be stored in advance for subsequent use.

The exception information of this embodiment may be response information of the server responding to the request for accessing the exception network resource address, for example, 404 status response information. Optionally, the response information of the server responding to the request for access to the normal network resource address may include a 200 status code.

After the target information obtained by accessing the target network resource address is displayed in the interactive interface, the abnormal information corresponding to the target information can be obtained, and the abnormal information corresponding to the target information is displayed in the interactive interface.

Step S206, outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information.

In the technical solution provided by step S206 of the present invention, after the abnormal information corresponding to the target information is displayed in the interactive interface, whether the target webpage corresponding to the target network resource address is abnormal may be determined based on the abnormal information. The identification result, for example whether the target webpage is a 404 page, may be obtained by comparing and analyzing the status codes, types, and contents of the target webpage corresponding to the target network resource address and of the abnormal page corresponding to the abnormal network resource address, and the identification result is then output. This avoids determining whether the target webpage is abnormal directly from the returned status code. The embodiment therefore also covers sites that use a custom status code instead of the 404 (404 Not Found) status code, and sites that perform fault-tolerant processing and return the current page or a default page when an access error occurs. Whether the target webpage is abnormal can thus be identified accurately, which achieves the technical effect of improving the accuracy of webpage identification.

Through the steps S202 to S206, target information obtained by accessing the target network resource address is displayed in the interactive interface; displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information. That is to say, in the embodiment, an abnormal network resource address is forged based on a target network resource address, whether a target webpage corresponding to the target network resource address is abnormal is determined by accessing abnormal information obtained by the abnormal network resource address, and an identification result for indicating whether the target webpage corresponding to the target network resource address is abnormal is output, so that it is avoided that whether the target webpage is abnormal is determined directly based on a returned abnormal state code, and thus the purpose of effectively identifying whether the target webpage is abnormal is achieved, thereby achieving the technical effect of improving the accuracy of identifying the webpage, and further solving the technical problem of low accuracy of identifying the webpage.

Example 2

The web page recognition method of the present invention will be described below by way of example with reference to preferred embodiments.

In this embodiment, when a user accesses a web page through a browser, the browser sends a request to the server of the site where the web page is located. Before the browser receives and displays the web page, that server returns a response line containing an HTTP status code in response to the browser's request. The HTTP status code represents the status of the server's response to the HTTP request.

In this embodiment, the abnormal status code may be the 404 status code (404 Not Found), which is a status code returned under the HTTP protocol for a web page error condition. When a user enters a web address in a browser, the server determines from that address whether corresponding web page information exists. If it does not, the link entered by the user is probably invalid, and the server returns a 404 status code to tell the user that the corresponding web page information cannot be found.

The page corresponding to the abnormal status code is a 404 page, and the 404 page plays an important role in the field of data crawlers. When a crawler requests a wrong URL and obtains a 404 status code and a 404 page, it knows that the URL is invalid, does not analyze or process the page, and deletes the URL from the site tree it is building. If the crawler wrongly processes the 404 page as a normal page, resources are wasted (because a 404 page usually contains no valid content) and invalid links are added to the site tree.

Therefore, correctly identifying whether a webpage is a 404 page plays an important role in improving the performance and accuracy of the crawler.

In the related art, whether the currently processed target web page is a 404 page may be determined according to the HTTP status code: the status line of the HTTP response is obtained, and the status code is extracted from it. However, this status-code-based 404 page identification technique cannot identify the target web page in the following cases: (1) the site uses a custom status code rather than the 404 (404 Not Found) status code; and (2) the site performs fault-tolerant processing and returns the current page or a default page when an access error occurs.

This embodiment provides a web page identification method that improves the above-described HTTP status code based 404 determination method.

This embodiment parses the real URL (true) to obtain a forged URL (false) that is consistent in format and valid but does not exist, and accesses both the real page and the forged page. Whether the page (true) is a 404 page is determined by comparing and analyzing the status codes, types, and contents of the page (true) and the page (false), which improves the accuracy of identifying web pages.

The identification method of the web page of this embodiment is further described below.

Fig. 3 is a flowchart of a method for collecting the real 404 response information of a target according to an embodiment of the present invention. As shown in fig. 3, the method may include the steps of:

In step S301, a URL that really exists on the site is acquired (the status code of its HTTP response is 200).

Step S302, a non-existent URL is forged according to a real URL.

Step S303, accessing the forged URL, obtaining the 404 response information of the site, and storing it for subsequent judgment.

The 404 response information in this embodiment may be information such as the 404 response page, the 404 status code, and the like.
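Step S303 can be sketched in Python as follows, using only the standard library. The names `StoredResponse`, `fetch`, and `collect_404_baseline`, the choice to key the store by the real URL, and the stored fields are assumptions made for illustration; the text only says that the response is stored for later judgment. Note that `urllib.request.urlopen` raises `HTTPError` for 4xx and 5xx responses, and the error object still carries the status code, headers, and body.

```python
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class StoredResponse:
    status: int        # status code from the response line
    content_type: str  # Content-Type response header value
    body: bytes        # response entity (body)


def fetch(url: str, timeout: float = 10.0) -> StoredResponse:
    """Access a URL and keep the parts of the response used for comparison."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return StoredResponse(resp.status, resp.headers.get("Content-Type", ""), resp.read())
    except urllib.error.HTTPError as err:
        # A real 404 arrives here; the error still exposes the full response.
        return StoredResponse(err.code, err.headers.get("Content-Type", ""), err.read())


def collect_404_baseline(
    real_url: str,
    forged_url: str,
    store: Dict[str, StoredResponse],
    fetcher: Callable[[str], StoredResponse] = fetch,
) -> StoredResponse:
    """Access the forged URL and store its response for later judgment."""
    baseline = fetcher(forged_url)
    store[real_url] = baseline  # keyed by the real URL (an assumption)
    return baseline
```

Passing the fetcher as a parameter lets a crawler reuse its own HTTP client, and lets the function be exercised without network access.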

Since some applications are very sensitive to the format of the URL, and even have different 404 processing flows for different URLs, for example, accessing different directories and files with different suffix names may have different 404 responses, it is necessary to ensure that the forged URL is consistent with the real URL format.

Fig. 4 is a flowchart of a method for forging a URL when a file name exists in the real URL according to an embodiment of the present invention. As shown in fig. 4, the method may include the steps of:

Step S401, a character in the file name of the real URL is acquired.

In this embodiment, if a file name exists in the real URL, the non-existent URL may be forged by modifying the file name.

Step S402, the type of the acquired character in the file name is determined.

In step S403, the characters of the alphabet type are replaced with random letters.

In step S404, the characters of the numeric type are replaced with random numbers.

Step S405, the special symbol is kept unchanged.

Step S406, determining whether all characters in the file name have been processed.

If it is determined that not all characters in the file name have been processed, step S401 is executed again; if it is determined that all characters in the file name have been processed, step S407 is executed.

In step S407, the forged URL including the modified filename is output.
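The character loop of steps S401 to S407 can be sketched in Python as below. The function name `forge_filename`, the restriction to ASCII letters and digits, and the choice to preserve letter case are assumptions for illustration; the text only specifies that letters become random letters, digits become random digits, and special symbols stay unchanged.

```python
import random
import string
from typing import Optional


def forge_filename(filename: str, rng: Optional[random.Random] = None) -> str:
    """Forge a file name of the same shape as the input file name."""
    rng = rng or random.Random()
    forged = []
    for ch in filename:                          # S401: take the next character
        if ch.isascii() and ch.isalpha():        # S402/S403: letter -> random letter
            pool = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
            forged.append(rng.choice(pool))
        elif ch.isascii() and ch.isdigit():      # S402/S404: digit -> random digit
            forged.append(rng.choice(string.digits))
        else:                                    # S405: special symbols are kept
            forged.append(ch)
    return "".join(forged)                       # S406/S407: all characters processed


example = forge_filename("index_01.php", random.Random(0))
```

Applied to the whole file name, the loop also randomizes the letters of the suffix. Because sites may answer differently for different suffix names, a caller may prefer to apply it only to the part before the last dot; that refinement is likewise an assumption and not stated in the text.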

Fig. 5 is a flowchart of a method for forging a URL when a file name does not exist in the URL according to an embodiment of the present invention. As shown in fig. 5, the method may include the steps of:

In step S501, a file name is randomly generated.

In the case where no file name exists in the real URL, a fixed-length file name consisting of letters and numbers can be randomly generated.

In step S502, a forged URL including a randomly generated file name is output.

Through the above processing, a forged URL basically conforming to the format of the real URL can be obtained.
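For the case of Fig. 5, a minimal Python sketch is given below. The function name `forge_url_without_filename`, the default length of 8, and the lowercase-plus-digits alphabet are assumptions; the text only requires a fixed-length name made of letters and numbers. The sketch assumes the caller has already established that the path of the real URL contains no file name.

```python
import random
import string
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def forge_url_without_filename(
    real_url: str, length: int = 8, rng: Optional[random.Random] = None
) -> str:
    """Append a random fixed-length file name to a URL whose path is a directory."""
    rng = rng or random.Random()
    alphabet = string.ascii_lowercase + string.digits
    name = "".join(rng.choice(alphabet) for _ in range(length))   # S501
    parts = urlsplit(real_url)
    directory = parts.path if parts.path.endswith("/") else parts.path + "/"
    # S502: scheme, host, and query string are kept; only the path is extended.
    return urlunsplit(parts._replace(path=directory + name))


forged = forge_url_without_filename("http://example.com/news/?page=2", rng=random.Random(3))
```

Keeping the scheme, host, directory, and query string unchanged is one way to keep the forged URL in the same format as the real one.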

The forged URL is then accessed to obtain the valid 404 page corresponding to the forged URL, and this valid 404 page is compared with the target webpage to be processed to determine whether that target webpage is a 404 page.

Fig. 6 is a flowchart of a method for identifying 404 pages according to an embodiment of the invention. As shown in fig. 6, the method may include the steps of:

In step S601, the status code of the target web page being processed is acquired.

In this embodiment, the target web page being processed, that is, the web page that needs to be determined whether it is abnormal or not, may be represented by Response.

In step S602, it is determined whether the status code of the target web page is 404.

Step S603, directly determining that the target web page is a 404 page.

If the status code of the target webpage is determined to be the 404 status code, the target webpage can be directly determined to be a 404 page.

In step S604, the status code of the forged valid 404 page is acquired.

If the status code of the target webpage is determined not to be the 404 status code, the target webpage may still be a 404 page customized by the site, so the status code of the forged valid 404 page is acquired.

The forged valid 404 page of this embodiment may be represented by Response 404.

In step S605, it is determined whether the status code of the valid 404 page of the forged URL is the 404 status code.

In step S606, it is determined that the target web page is not the 404 page.

If the status code of the target webpage is determined not to be the 404 status code, and the status code of the valid 404 page of the forged URL is determined to be the 404 status code, the target webpage is determined not to be a 404 page.

In step S607, it is determined whether the HTTP response header values of the target web page and the valid 404 page of the forged URL are the same.

If the status code of the valid 404 page of the forged URL is determined not to be the 404 status code, it is determined whether the HTTP response header values of the target webpage and the valid 404 page of the forged URL are the same.

When determining whether the HTTP response header values of the target web page and the valid 404 page of the forged URL are the same, this embodiment may compare whether the content types of the two pages are the same.

In step S608, it is determined that the target web page is not the 404 page.

If the HTTP response header values of the target webpage and the valid 404 page of the forged URL are determined to be different, the target webpage is determined not to be a 404 page.

In step S609, the similarity between the target web page and the valid 404 page of the forged URL is calculated.

If the HTTP response header values of the target webpage and the valid 404 page of the forged URL are determined to be the same, the similarity between the target webpage and the valid 404 page of the forged URL is calculated.

The similarity may be calculated by comparing the entity body of the target web page with the entity body of the valid 404 page of the forged URL.

In step S610, it is determined whether the similarity between the target web page and the valid 404 page of the forged URL is greater than a threshold.

The threshold value of this embodiment may be configured according to a specific scenario.

In step S611, it is determined that the target web page is not a 404 page.

If the similarity between the target webpage and the valid 404 page of the forged URL is determined to be not greater than the threshold, the target webpage is determined not to be a 404 page.

In step S612, the target web page is determined to be a 404 page.

If the similarity between the target webpage and the valid 404 page of the forged URL is determined to be greater than the threshold, the target webpage is determined to be a 404 page.
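The decision flow of steps S601 to S612 can be summarized in a short Python sketch. The class `PageResponse`, the function `is_404_page`, the default threshold of 0.9, and the use of `difflib.SequenceMatcher` as the similarity measure are assumptions for illustration; the text leaves both the threshold and the similarity algorithm open.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class PageResponse:
    status: int        # status code from the response line
    content_type: str  # Content-Type response header value
    body: str          # response entity (body)


def is_404_page(target: PageResponse, forged: PageResponse, threshold: float = 0.9) -> bool:
    """Decide whether `target` is a 404 page, given the forged URL's response."""
    if target.status == 404:                         # S602 -> S603
        return True
    if forged.status == 404:                         # S605 -> S606
        # The site does return 404 for missing pages, and the target did not.
        return False
    if target.content_type != forged.content_type:   # S607 -> S608
        return False
    similarity = SequenceMatcher(                    # S609
        None, target.body, forged.body, autojunk=False
    ).ratio()
    return similarity > threshold                    # S610 -> S611 / S612


not_found = PageResponse(200, "text/html", "<p>Sorry, this page does not exist</p>")
article = PageResponse(200, "text/html", "<h1>Release notes</h1><ul><li>v2</li></ul>")
```

For example, `is_404_page(not_found, not_found)` returns `True` although the status code is 200, which is the fault-tolerant case that a status-code-only check misses, while `is_404_page(article, not_found)` returns `False`.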

Fig. 7 is a flowchart of a device for determining 404 pages according to an embodiment of the present invention. As shown in fig. 7, the 404 page judging device 70 may include: a URL falsification unit 71 and a page analysis processing unit 72.

The device for judging 404 pages in this embodiment may be configured to execute the method for identifying 404 pages in the embodiment of the present invention. The URL falsification unit 71 may be configured to execute the method of falsifying a URL when a file name exists in the real URL shown in fig. 4 and the method of falsifying a URL when no file name exists in the URL shown in fig. 5 according to the embodiment of the present invention. The page analysis processing unit 72 may be configured to execute the method for identifying 404 pages shown in fig. 6 according to the embodiment of the present invention, and finally output a result indicating whether the target web page is abnormal.

When the method and the apparatus provided by the present invention are applied in this embodiment, the following optimization schemes may be considered.

In the pages returned by some sites, the 404-related content accounts for only a small proportion of the page, as shown in fig. 8. Fig. 8 is a schematic diagram of the position of the 404-related content in a web page according to the embodiment of the present invention; the web page in which the 404-related content is located further includes a title bar, a navigation bar, a copyright notice, a unit identifier, and other content.

The HTML content of a normal web page may be:

The first similarity calculated from the HTML content of the target abnormal web page and the HTML content of the normal web page may be very high, for example, above 99%. If the target web page were determined to be the target abnormal web page whenever the first similarity exceeds the first threshold, even a normal target web page would be determined to be the target abnormal web page, which results in misjudgment of the target web page.

For such a situation, in this embodiment, before the similarity is calculated, the target web page and the valid 404 page of the forged URL are each analyzed as a whole, so that the valid 404 page of the forged URL can be divided into different regions and the region where the 404 content is located can be identified and extracted. The method uses the inverse of the maximum (longest) common substring algorithm to extract the content that differs between the target web page and the valid 404 page of the forged URL, analyzes and processes it, and removes irrelevant parts. This finally yields the region where the 404-related content is located in the valid 404 page of the forged URL and the corresponding region in the target web page. The similarity between the two regions is calculated and is used in the subsequent calculation as the similarity between the whole target web page and the valid 404 page of the forged URL.

An application scenario of this embodiment is exemplified below.

This embodiment takes a classic web information crawler as an example application scenario: the target site performs fault-tolerant processing on user access and, when an incorrect, non-existent link is accessed, does not return 404 directly but returns the current page. In this case, the target web page obtained in the embodiment is actually identical to the valid 404 page of the forged URL, and the page can be determined to be a 404 page through similarity calculation. The conventional 404 page recognition technique obtains a status code of 200 for the returned page and therefore mistakenly determines the current page to be a normal page.

This embodiment analyzes the real URL with the above forging algorithm, which keeps the file name format consistent, to obtain a forged URL that is consistent in format and valid but does not exist. It then determines whether the target webpage is a 404 page using the 404 page determination method based on page content analysis and the optimized 404 page determination method for large pages. The embodiment can therefore support sites that use a custom status code rather than the 404 status code, and sites that perform fault-tolerant processing and return the current page or a default page when an access error occurs. This improves the accuracy of identifying a site's 404 pages and erroneous URLs and solves the technical problem of low accuracy of webpage identification.

Example 3

The embodiment of the invention also provides a webpage identification device. It should be noted that the web page identification apparatus of this embodiment may be used to execute the web page identification method of the embodiment shown in fig. 1 of the present invention.

Fig. 9 is a schematic diagram of a web page recognition apparatus according to an embodiment of the present invention. As shown in fig. 9, the web page identification device 90 may include: a first acquisition unit 91, a second acquisition unit 92, and a third acquisition unit 93.

A first obtaining unit 91, configured to obtain target information obtained by accessing the target network resource address.

A second obtaining unit 92, configured to obtain exception information corresponding to the target information, where the exception information is obtained in advance by accessing an exception network resource address, and the exception network resource address is obtained by modifying the target network resource address.

A third obtaining unit 93, configured to determine, based on the exception information, whether the target web page corresponding to the target network resource address is abnormal.

The embodiment of the invention also provides another webpage identification device. It should be noted that the web page identification apparatus of this embodiment may be used to execute the web page identification method of the embodiment shown in fig. 2 of the present invention.

Fig. 10 is a schematic diagram of another web page recognition apparatus according to an embodiment of the present invention. As shown in fig. 10, the web page recognition apparatus 100 may include: a first display unit 101, a second display unit 102, and an output unit 103.

The first display unit 101 is configured to display target information obtained by accessing the target network resource address in the interactive interface.

The second display unit 102 is configured to display abnormal information corresponding to the target information in the interactive interface, where the abnormal information is obtained in advance by accessing an abnormal network resource address, and the abnormal network resource address is obtained by modifying the target network resource address.

The output unit 103 is configured to output a recognition result, where the recognition result is used to indicate whether the target webpage corresponding to the target network resource address is abnormal or not, and is obtained through abnormal information.

The webpage identification device of the embodiment forges the abnormal network resource address based on the target network resource address, determines whether the target webpage corresponding to the target network resource address is abnormal or not through the abnormal information obtained by accessing the abnormal network resource address, and avoids the situation that whether the target webpage is abnormal or not directly determined based on the returned abnormal state code, so that the purpose of effectively identifying whether the target webpage is abnormal or not is achieved, the technical effect of improving the accuracy of identifying the webpage is achieved, and the technical problem of low accuracy of identifying the webpage is solved.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technical content can be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or may not be executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements should also be regarded as falling within the protection scope of the present invention.

Claims

1. A method for identifying a web page, comprising:

acquiring target information obtained by accessing a target network resource address;

acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address, and the format of the abnormal network resource address is the same as that of the target network resource address;

determining whether a target webpage corresponding to the target network resource address is abnormal based on the abnormal information, wherein determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormal information comprises: determining whether the target web page is abnormal based on the abnormality information in a case where the target information does not include an abnormality status code corresponding to the abnormal network resource address, wherein determining whether the target web page is abnormal based on the abnormality information in a case where the target information does not include an abnormality status code corresponding to the abnormal network resource address includes: if the target information does not include the abnormal state code, if the abnormal information includes the abnormal state code, determining that the target webpage is a non-target abnormal webpage, if the abnormal information does not include the abnormal state code, comparing a first response head value in the target information with a second response head value in the abnormal information, if the first response head value and the second response head value are not the same, determining that the target webpage is the non-target abnormal webpage, if the first response head value and the second response head value are the same, acquiring a first similarity between the target information and the abnormal information, if the first similarity is greater than a first threshold, determining that the target webpage is the target abnormal webpage, if the first similarity is not greater than the first threshold, determining the target webpage to be the non-target abnormal webpage;

in a case where a file name exists in the target network resource address, modifying the file name, and determining the target network resource address with the modified file name as the abnormal network resource address;

and in a case where no file name exists in the target network resource address, randomly generating a target file name, and determining the target network resource address including the target file name as the abnormal network resource address.
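
For illustration only, the steps of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `Response`, `make_abnormal_url` and `classify`, the use of 404 as the only abnormal status code, the threshold of 0.9, the choice of `difflib.SequenceMatcher` as the similarity measure, and the handling of a target response that itself carries an abnormal status code are all hypothetical choices that the claim does not specify.

```python
import posixpath
import secrets
from dataclasses import dataclass
from difflib import SequenceMatcher
from urllib.parse import urlsplit, urlunsplit

ABNORMAL_STATUS_CODES = {404}  # assumption: the claim does not list specific codes
TARGET_ABNORMAL = "target abnormal webpage"
NON_TARGET_ABNORMAL = "non-target abnormal webpage"


@dataclass
class Response:
    """Hypothetical container for the information obtained by accessing an address."""
    status: int
    header_value: str  # assumption: one response header value, e.g. Content-Type
    body: str


def make_abnormal_url(url: str) -> str:
    """Derive an address of the same format that is unlikely to exist."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    token = secrets.token_hex(4)
    if filename:
        # A file name exists: modify it while keeping the extension.
        stem, ext = posixpath.splitext(filename)
        new_name = f"{stem}{token}{ext}"
    else:
        # No file name exists: use a randomly generated one.
        new_name = token
    new_path = posixpath.join(directory or "/", new_name)
    return urlunsplit(parts._replace(path=new_path))


def classify(target: Response, abnormal: Response, threshold: float = 0.9) -> str:
    if target.status in ABNORMAL_STATUS_CODES:
        # Outside the branch recited in the claim; treated as abnormal here
        # by assumption.
        return TARGET_ABNORMAL
    if abnormal.status in ABNORMAL_STATUS_CODES:
        # The server reports missing pages correctly, so a normal status
        # code for the target can be trusted.
        return NON_TARGET_ABNORMAL
    if target.header_value != abnormal.header_value:
        return NON_TARGET_ABNORMAL
    similarity = SequenceMatcher(None, target.body, abnormal.body).ratio()
    return TARGET_ABNORMAL if similarity > threshold else NON_TARGET_ABNORMAL
```

In use, the response for `make_abnormal_url(url)` would be fetched in advance and passed as `abnormal`, and the response for `url` itself would be passed as `target`.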

2. The method of claim 1, wherein determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information comprises:

dividing the target abnormal webpage corresponding to the abnormal network resource address into a plurality of first sub-webpages;

identifying, from the plurality of first sub-webpages, an abnormal sub-webpage that includes the abnormal information;

dividing the target webpage into a plurality of second sub-webpages, and determining, from the plurality of second sub-webpages, a target sub-webpage corresponding to the abnormal sub-webpage;

acquiring a second similarity between the abnormal sub-webpage and the target sub-webpage;

in a case where the second similarity is greater than a second threshold, determining that the target webpage is a target abnormal webpage;

and in a case where the second similarity is not greater than the second threshold, determining that the target webpage is a non-target abnormal webpage.
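
The sub-webpage comparison of claim 2 can be sketched in the same hedged way. The claim does not say how a page is divided, how the abnormal sub-webpage is recognised, or how the corresponding target sub-webpage is chosen; the sketch below assumes blank-line-separated blocks, a caller-supplied `marker` string standing in for the abnormal information, correspondence by block position, `difflib.SequenceMatcher` as the similarity measure, and a threshold of 0.8. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List

TARGET_ABNORMAL = "target abnormal webpage"
NON_TARGET_ABNORMAL = "non-target abnormal webpage"


def split_page(page: str) -> List[str]:
    """Divide a page into sub-webpages (assumption: blank-line-separated blocks)."""
    return [block.strip() for block in page.split("\n\n") if block.strip()]


def classify_by_block(target_page: str, abnormal_page: str,
                      marker: str, threshold: float = 0.8) -> str:
    abnormal_blocks = split_page(abnormal_page)  # first sub-webpages
    target_blocks = split_page(target_page)      # second sub-webpages

    # Identify the abnormal sub-webpage: the first block containing the marker.
    index = next((i for i, b in enumerate(abnormal_blocks) if marker in b), None)
    if index is None or index >= len(target_blocks):
        # No abnormal block, or no block at the same position in the target.
        return NON_TARGET_ABNORMAL

    # Assumption: the corresponding target sub-webpage is the one at the
    # same position.
    similarity = SequenceMatcher(
        None, abnormal_blocks[index], target_blocks[index]
    ).ratio()
    return TARGET_ABNORMAL if similarity > threshold else NON_TARGET_ABNORMAL
```

Comparing only the block that carries the abnormal information keeps shared navigation, headers and footers from inflating the similarity between an ordinary page and an error page.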

3. A method for identifying a web page, comprising:

displaying, in an interactive interface, target information obtained by accessing a target network resource address;

displaying, in the interactive interface, abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, the abnormal network resource address is obtained by modifying the target network resource address, and the format of the abnormal network resource address is the same as that of the target network resource address;

outputting a recognition result, wherein the recognition result indicates whether a target webpage corresponding to the target network resource address is abnormal and is obtained through the abnormal information, and wherein, in a case where the target information does not include an abnormal status code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal based on the abnormal information comprises: if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage; if the abnormal information does not include the abnormal status code, comparing a first response header value in the target information with a second response header value in the abnormal information; if the first response header value is not the same as the second response header value, determining that the target webpage is the non-target abnormal webpage; if the first response header value is the same as the second response header value, acquiring a first similarity between the target information and the abnormal information; if the first similarity is greater than a first threshold, determining that the target webpage is a target abnormal webpage; and if the first similarity is not greater than the first threshold, determining that the target webpage is the non-target abnormal webpage;

and in a case where a file name exists in the target network resource address, modifying the file name, and determining the target network resource address with the modified file name as the abnormal network resource address.

4. A web page recognition apparatus, comprising:

a first obtaining unit, configured to obtain target information obtained by accessing a target network resource address;

a second obtaining unit, configured to obtain abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, the abnormal network resource address is obtained by modifying the target network resource address, and the format of the abnormal network resource address is the same as that of the target network resource address;

a third obtaining unit, configured to determine, based on the abnormal information, whether a target webpage corresponding to the target network resource address is abnormal, by: in a case where the target information does not include an abnormal status code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal based on the abnormal information, which includes: if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage; if the abnormal information does not include the abnormal status code, comparing a first response header value in the target information with a second response header value in the abnormal information; if the first response header value is not the same as the second response header value, determining that the target webpage is the non-target abnormal webpage; if the first response header value is the same as the second response header value, acquiring a first similarity between the target information and the abnormal information; if the first similarity is greater than a first threshold, determining that the target webpage is a target abnormal webpage; and if the first similarity is not greater than the first threshold, determining that the target webpage is the non-target abnormal webpage;

wherein the apparatus is further configured to, in a case where a file name exists in the target network resource address, modify the file name, and determine the target network resource address with the modified file name as the abnormal network resource address;

and the apparatus is further configured to, in a case where no file name exists in the target network resource address, randomly generate a target file name, and determine the target network resource address including the target file name as the abnormal network resource address.

5. A web page recognition apparatus, comprising:

a first display unit, configured to display, in an interactive interface, target information obtained by accessing a target network resource address;

a second display unit, configured to display, in the interactive interface, abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, the abnormal network resource address is obtained by modifying the target network resource address, and the format of the abnormal network resource address is the same as that of the target network resource address;

an output unit, configured to output a recognition result, wherein the recognition result indicates whether a target webpage corresponding to the target network resource address is abnormal and is obtained through the abnormal information, and wherein obtaining the recognition result through the abnormal information includes: in a case where the target information does not include an abnormal status code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal based on the abnormal information, which includes: if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage; if the abnormal information does not include the abnormal status code, comparing a first response header value in the target information with a second response header value in the abnormal information; if the first response header value is not the same as the second response header value, determining that the target webpage is the non-target abnormal webpage; if the first response header value is the same as the second response header value, acquiring a first similarity between the target information and the abnormal information; if the first similarity is greater than a first threshold, determining that the target webpage is a target abnormal webpage; and if the first similarity is not greater than the first threshold, determining that the target webpage is the non-target abnormal webpage;

wherein the apparatus is further configured to, in a case where a file name exists in the target network resource address, modify the file name, and determine the target network resource address with the modified file name as the abnormal network resource address.

6. A computer-readable storage medium, comprising a stored program, wherein, when the program runs, a device on which the computer-readable storage medium is located is controlled to execute the web page identification method according to any one of claims 1 to 3.

7. A processor, configured to run a program, wherein the program, when running, executes the web page identification method according to any one of claims 1 to 3.