CN111177619A - Webpage identification method and device, storage medium and processor

Info

Publication number: CN111177619A
Application number: CN201911320564.2A
Authority: CN
Inventors: 蒋自立; 贺志强; 许勇
Original assignee: Hillstone Networks Corp
Current assignee: Hillstone Networks Corp
Priority date: 2019-12-19
Filing date: 2019-12-19
Publication date: 2020-05-19
Anticipated expiration: 2039-12-19
Also published as: CN111177619B

Abstract

The invention discloses a webpage identification method, a webpage identification device, a storage medium and a processor. The method comprises the following steps: acquiring target information obtained by accessing a target network resource address; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained in advance by accessing an abnormal network resource address, and the abnormal network resource address is obtained by modifying the target network resource address; and determining, based on the abnormal information, whether the target webpage corresponding to the target network resource address is abnormal. The invention solves the technical problem of low accuracy of webpage identification.

Description

Webpage identification method and device, storage medium and processor

Technical Field

The invention relates to the field of the internet, and in particular to a webpage identification method, a webpage identification device, a storage medium and a processor.

Background

Currently, whether a web page is an abnormal page is usually determined from the Hypertext Transfer Protocol (HTTP) status code. The status line of the HTTP response is obtained, the status code is extracted from the status line, and whether the web page to be processed is abnormal is determined directly from that status code. For example, when the returned status code is an abnormal status code, the web page is directly determined to be abnormal; when the returned status code is not an abnormal status code, the web page is directly determined to be normal.

However, this method cannot effectively identify an abnormal web page when the site returns a self-defined status code instead of the standard abnormal-page status code. In addition, if the site performs fault-tolerant processing, the server may return the current page or a default page when a wrong page is accessed, so that the web page is mistakenly determined to be normal and the abnormal web page cannot be effectively identified. The accuracy of webpage identification is therefore low.

Aiming at the technical problem of low accuracy of webpage identification, no effective solution is provided at present.

Disclosure of Invention

The embodiment of the invention provides a webpage identification method, a webpage identification device, a storage medium and a processor, and at least solves the technical problem of low accuracy of webpage identification.

According to one aspect of the embodiment of the invention, a webpage identification method is provided. The method comprises the following steps: acquiring target information obtained by accessing a target network resource address; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained in advance by accessing an abnormal network resource address, and the abnormal network resource address is obtained by modifying the target network resource address; and determining, based on the abnormal information, whether the target webpage corresponding to the target network resource address is abnormal.

Optionally, the format of the abnormal network resource address is the same as the format of the target network resource address.

Optionally, the method further comprises: under the condition that the file name exists in the target network resource address, modifying the file name, and determining the target network resource address after the file name is modified as an abnormal network resource address; and under the condition that the file name does not exist in the target network address, randomly generating a target file name, and determining the target network resource address comprising the target file name as an abnormal network resource address.

Optionally, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: and in the case that the target information does not comprise an abnormal state code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal or not based on the abnormal information.

Optionally, in a case where the target information does not include the abnormal status code, determining whether the target webpage is abnormal based on the abnormal information includes: in the case that the target information does not include the abnormal status code, if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage.

Optionally, the method further comprises: if the abnormal information does not include the abnormal state code, comparing a first response head value in the target information with a second response head value in the abnormal information; and under the condition that the first response head value is different from the second response head value, determining that the target webpage is a non-target abnormal webpage.

Optionally, the method further comprises: under the condition that the first response head value and the second response head value are the same, acquiring a first similarity between the target information and the abnormal information; determining the target webpage as a target abnormal webpage under the condition that the first similarity is larger than a first threshold value; and under the condition that the first similarity is not larger than a first threshold value, determining that the target webpage is a non-target abnormal webpage.

Optionally, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: dividing a target abnormal webpage corresponding to the abnormal network resource address into a plurality of first sub-webpages; identifying an abnormal sub-web page comprising abnormal information from a plurality of first sub-web pages; dividing the target webpage into a plurality of second sub-webpages, and determining the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages; acquiring a second similarity between the abnormal sub-web page and the target sub-web page; determining the target webpage as a target abnormal webpage under the condition that the second similarity is larger than a second threshold value; and under the condition that the second similarity is not larger than a second threshold value, determining that the target webpage is a non-target abnormal webpage.

According to another aspect of the embodiment of the invention, another webpage identification method is also provided. The method can comprise the following steps: displaying target information obtained by accessing the target network resource address in the interactive interface; displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information.

According to another aspect of the embodiment of the invention, a webpage identification device is also provided. The device includes: a first acquisition unit configured to acquire target information obtained by accessing a target network resource address; the second acquisition unit is used for acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and the third acquisition unit is used for determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information.

According to another aspect of the embodiment of the invention, another webpage identification device is also provided. The device includes: the first display unit is used for displaying target information obtained by accessing the target network resource address in the interactive interface; the second display unit is used for displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and the output unit is used for outputting the identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information.

According to another aspect of the embodiments of the present invention, there is also provided a storage medium. The storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the web page identification method according to any one of the embodiments of the present invention.

According to another aspect of the embodiments of the present invention, there is also provided a processor. The processor is used for running the program, wherein the program executes the webpage identification method in the embodiment of the invention when running.

In the embodiment of the invention, target information obtained by accessing a target network resource address is acquired; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information. That is to say, the method and the device forge an abnormal network resource address based on the target network resource address, determine whether the target webpage corresponding to the target network resource address is abnormal or not by accessing the abnormal information obtained by the abnormal network resource address, and avoid directly determining whether the target webpage is abnormal or not based on the returned status code, thereby achieving the purpose of effectively identifying whether the target webpage is abnormal or not, achieving the technical effect of improving the accuracy of identifying the webpage, and further solving the technical problem of low accuracy of identifying the webpage.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a flow chart of a method of web page identification according to an embodiment of the invention;

FIG. 2 is a flow diagram of another method of web page identification according to an embodiment of the invention;

FIG. 3 is a flow chart of a method for collecting true 404 response information according to an embodiment of the present invention;

FIG. 4 is a flow diagram of a method of forging a URL when a file name exists in the real URL, in accordance with an embodiment of the present invention;

FIG. 5 is a flow diagram of a method of forging a URL when a filename does not exist in the URL, in accordance with embodiments of the present invention;

FIG. 6 is a flow diagram of a method of identifying a 404 page according to an embodiment of the invention;

FIG. 7 is a flowchart of an apparatus for determining 404 pages according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a location of 404 relevant content in a web page according to an embodiment of the invention;

FIG. 9 is a diagram illustrating an apparatus for identifying web pages according to an embodiment of the present invention; and

FIG. 10 is a schematic diagram of another web page recognition apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

In accordance with an embodiment of the present invention, a method embodiment of a web page identification method is provided. It should be noted that the steps illustrated in the flowchart of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different from the one given here.

Fig. 1 is a flowchart of a web page identification method according to an embodiment of the present invention. As shown in fig. 1, the method may include the steps of:

step S102, obtaining the target information obtained by accessing the target network resource address.

In the technical solution provided in step S102 of the present invention, the target network resource address is the address of the target web page that is to be checked for abnormality. It may be a Uniform Resource Locator (URL), that is, the target network resource address is a URL that actually exists.

In this embodiment, the client accesses the target network resource address by sending a request to the server of the target site where the target network resource address is located. Before the client receives and displays the target webpage corresponding to the target network resource address, the server responds to the request by returning to the client the target information obtained by accessing the target network resource address, so that the client obtains the target information. The client may be a browser. The target information may include a response line, a response header and a response body, where the response line includes the status code returned by the server, the response header includes the type of the target web page, and the response body includes the content of the target web page.
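The three parts of the target information can be illustrated with a small parser. This is only a minimal sketch: the `ResponseInfo` container and the `parse_response` function are hypothetical names introduced here for illustration, and the raw response text is an invented example rather than data from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class ResponseInfo:
    """Hypothetical container for the parts of an HTTP response."""
    status_code: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def parse_response(raw: str) -> ResponseInfo:
    """Split a raw HTTP/1.x response into status code, headers and body."""
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    # Response line, e.g. "HTTP/1.1 200 OK": the status code is the second token.
    status_code = int(lines[0].split(" ", 2)[1])
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return ResponseInfo(status_code, headers, body)


# Invented example: a "soft 404" that reports 200 but carries an error page.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "\r\n"
    "<html><body>Page not found</body></html>"
)
info = parse_response(raw)
print(info.status_code, info.headers["content-type"])
```

The example response shows why the status code alone is not sufficient: the response line reports success while the body describes a missing page.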

In this embodiment, the status code, i.e. the return code, may be an HTTP status code for indicating the HTTP response status of the web server, where HTTP is an application layer protocol for distributed, collaborative and hypermedia information systems, and a user may easily access resources on the internet through the HTTP protocol based on web communication.

And step S104, acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address.

In the technical solution provided in step S104 of the present invention, the target network resource address may be analyzed and modified in advance to obtain an abnormal network resource address, which may also be a URL. That is, the abnormal network resource address of this embodiment may be a network resource address that does not actually exist and is forged from the real target network resource address. It can nevertheless be accessed to obtain abnormal information, and the abnormal information may be stored in advance for subsequent use.

The abnormal information of this embodiment may be the response information returned by the server in response to a request for the abnormal network resource address. For example, the response information may be 404-state response information that includes the status code 404 Not Found, which is the status code defined by the HTTP protocol for a web page that cannot be found. When a user enters a network resource address in the client, the server determines from that address whether corresponding web page information exists. If none exists, the user has entered an invalid link, and the server may return the 404 Not Found status code to tell the user that the corresponding web page information was not found. The server may also return information about the abnormal web page, such as the type and the content of the target abnormal web page obtained from the abnormal network resource address, where the target abnormal web page may be a 404 page. Such an abnormal page need not be analyzed or processed further, and the entered network resource address is deleted from the constructed site tree. Optionally, the response information returned by the server for a request to a normal network resource address may include a 200 status code.

After the target information obtained by accessing the target network resource address is acquired, the above-mentioned abnormality information corresponding to the target information may be acquired.
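Storing the abnormal information in advance and looking it up for a given target address could be organized as follows. This is a sketch under stated assumptions, not the storage scheme of the embodiment: the `BaselineStore` class, the `baseline_key` function and the choice of (host, directory, file suffix) as the lookup key are all hypothetical, and the stored values are invented.

```python
import posixpath
from urllib.parse import urlsplit


def baseline_key(url: str) -> tuple:
    """Assumed key: host, directory and file suffix of the address."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    suffix = posixpath.splitext(filename)[1].lower()
    return (parts.netloc.lower(), directory or "/", suffix)


class BaselineStore:
    """Hypothetical in-memory store for previously collected abnormal information."""

    def __init__(self):
        self._items = {}

    def put(self, abnormal_url: str, abnormal_info: dict) -> None:
        # Record the response obtained by accessing the forged address.
        self._items[baseline_key(abnormal_url)] = abnormal_info

    def get(self, target_url: str):
        # Return the abnormal information matching the target address, if any.
        return self._items.get(baseline_key(target_url))


store = BaselineStore()
store.put("http://example.com/docs/qzkw.html",
          {"status": 200, "body": "<html>Page not found</html>"})
print(store.get("http://example.com/docs/index.html"))
print(store.get("http://example.com/docs/index.php"))
```

Because the forged address keeps the directory and suffix of the real one, both addresses map to the same key, while an address with a different suffix has no stored baseline yet.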

And step S106, determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information.

In the technical solution provided by step S106 of the present invention, after the abnormal information corresponding to the target information is obtained, whether the target webpage corresponding to the target network resource address is abnormal may be determined based on the abnormal information. Specifically, the status codes, types and contents of the target webpage and of the abnormal page corresponding to the abnormal network resource address may be compared and analyzed to determine whether the target webpage is indeed an abnormal page, for example, whether the target webpage is a 404 page. This covers the case in which the site returns a self-defined status code instead of the 404 page status code (404 Not Found), and the case in which the site performs fault-tolerant processing and returns the current page or a default page when a wrong page is accessed. Because the determination is not made directly from the returned status code, whether the target webpage is abnormal can be identified accurately, which achieves the technical effect of improving the accuracy of webpage identification.

Through the steps S102 to S106, target information obtained by accessing the target network resource address is acquired; acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information. That is to say, in the embodiment, an abnormal network resource address is forged based on a target network resource address, whether a target webpage corresponding to the target network resource address is abnormal is determined by accessing abnormal information obtained by the abnormal network resource address, and whether the target webpage is abnormal is avoided being directly determined based on a returned abnormal state code, so that the purpose of effectively identifying whether the target webpage is abnormal is achieved, the technical effect of improving the accuracy of identifying the webpage is achieved, and the technical problem of low accuracy of identifying the webpage is solved.

The above-described method of this embodiment is further described below.

As an alternative embodiment, the format of the abnormal network resource address is the same as the format of the target network resource address.

In this embodiment, the abnormal network resource address is obtained by modifying the target network resource address. Some applications are very sensitive to the format of a network resource address and even apply different processing flows, including different 404 processing flows, to addresses of different formats. For example, accessing different directories or files with different suffix names may produce different 404 responses. The key point of this embodiment is therefore to ensure that the format of the forged abnormal network resource address is consistent with the format of the real target network resource address. When the target network resource address is modified, its format is kept unchanged and only the main content of the address is modified, which yields the abnormal network resource address.

As an optional implementation, the method further comprises: under the condition that the file name exists in the target network resource address, modifying the file name, and determining the target network resource address after the file name is modified as an abnormal network resource address; and under the condition that the file name does not exist in the target network address, randomly generating a target file name, and determining the target network resource address comprising the target file name as an abnormal network resource address.

In this embodiment, when the target network resource address is modified to obtain the abnormal network resource address, it may first be determined whether a file name exists in the target network resource address. If a file name exists, it can be modified while the format of the target network resource address is kept unchanged, for example, by leaving the extension part of the file name as it is. Optionally, in this embodiment, the letter characters in the file name are replaced with random letters, the digit characters in the file name are replaced with random digits, and the special symbols in the file name are kept unchanged. The modified file name is thus obtained, and the target network resource address with the modified file name is determined as the abnormal network resource address.

Alternatively, if it is determined that no file name exists in the target network resource address, the embodiment may randomly generate a file name, for example, randomly generate a fixed-length target file name containing numbers and letters, and determine the target network resource address including the target file name as the abnormal network resource address.
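The two forging cases described above can be sketched in a few lines. The function names, the default length of 12 for a generated name, the rule that the last non-empty path segment counts as the file name, and the decision to keep the extension unchanged are assumptions made for this illustration and are not fixed by the embodiment.

```python
import random
import string
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

ALNUM = string.ascii_lowercase + string.digits


def randomize_name(name: str, rng: random.Random) -> str:
    """Replace letters with random letters and digits with random digits."""
    out = []
    for ch in name:
        if ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # special symbols are kept unchanged
    return "".join(out)


def random_name(rng: random.Random, length: int) -> str:
    """Generate a fixed-length name made of letters and digits."""
    return "".join(rng.choice(ALNUM) for _ in range(length))


def forge_abnormal_url(url: str, rng: Optional[random.Random] = None,
                       length: int = 12) -> str:
    """Forge an address with the same format as `url` but a different file name."""
    if rng is None:
        rng = random.Random()
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename:
        # Case 1: a file name exists, so randomize it and keep its extension.
        stem, dot, ext = filename.rpartition(".")
        if dot and stem:
            new_name = randomize_name(stem, rng) + dot + ext
        else:
            new_name = randomize_name(filename, rng)
        if new_name == filename:
            # Nothing changed (e.g. only symbols); fall back to a random name.
            new_name = random_name(rng, length)
    else:
        # Case 2: no file name, so generate a fixed-length random one.
        new_name = random_name(rng, length)
    return urlunsplit(parts._replace(path=directory + "/" + new_name))


print(forge_abnormal_url("http://example.com/docs/page12.html?id=3"))
print(forge_abnormal_url("http://example.com/docs/"))
```

The scheme, host, directory, extension and query string are passed through untouched, so a server that routes requests by directory or suffix should treat the forged address the same way as the real one.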

As an optional implementation manner, in step S106, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: and in the case that the target information does not comprise an abnormal state code corresponding to the abnormal network resource address, determining whether the target webpage is abnormal or not based on the abnormal information.

In this embodiment, the abnormal status code corresponding to the abnormal network resource address may be a 404 status code, and when determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormal information, it may be determined whether the target information includes the abnormal status code, and if it is determined that the target information includes the abnormal status code, it may be determined that the target webpage is an abnormal page, for example, if it is determined that the target information includes the 404 status code, it is determined that the target webpage is a 404 page.

If the target information does not include the abnormal status code, the target webpage still cannot be directly determined to be a normal webpage, because the site may customize its abnormal responses (including 404 pages). Whether the target webpage is abnormal needs to be further determined based on the status code, the content and the like included in the abnormal information obtained by accessing the abnormal network resource address.

As an optional implementation manner, in the case that the target information does not include the abnormal status code, determining whether the target webpage is abnormal based on the abnormal information includes: in the case that the target information does not include the abnormal status code, if the abnormal information includes the abnormal status code, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, when the target information does not include the abnormal state code, it may be determined whether the abnormal information includes the abnormal state code first, for example, it may be determined whether the abnormal information includes the 404 state code, and if it is determined that the abnormal information includes the abnormal state code, it may be determined that the current target webpage is a non-target abnormal webpage, for example, a non-404 page.

As an optional implementation, the method further comprises: if the abnormal information does not include the abnormal state code, comparing a first response head value in the target information with a second response head value in the abnormal information; and under the condition that the first response head value is different from the second response head value, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, if it is determined that the abnormal information does not include the abnormal status code, a first response header value in the target information and a second response header value in the abnormal information need to be further compared. The first response header value may indicate the content type (Content-Type) of the target web page, and the second response header value may indicate the content type of the target abnormal web page. The embodiment may determine whether the first response header value and the second response header value are the same, that is, whether the content types of the target information and the abnormal information are the same. If the two values are different, the target webpage is determined to be a non-target abnormal webpage, for example, a non-404 page.

As an optional implementation, the method further comprises: under the condition that the first response head value and the second response head value are the same, acquiring a first similarity between the target information and the abnormal information; determining the target webpage as a target abnormal webpage under the condition that the first similarity is larger than a first threshold value; and under the condition that the first similarity is not larger than a first threshold value, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, if it is determined that the first response header value and the second response header value are the same, a first similarity indicating a degree of similarity between the target information and the abnormal information may be further calculated, for example, a first similarity between an entity (body) of the target information and an entity of the abnormal information may be calculated, and the first similarity may be used to represent a similarity between a target webpage corresponding to the target information and a target abnormal webpage corresponding to the abnormal information. The embodiment may determine whether the first similarity is greater than a first threshold, and the first threshold may be configured according to a specific scenario. If the first similarity is judged to be larger than the first threshold value, namely the similarity between the target information and the abnormal information is higher, the target webpage can be determined to be a target abnormal webpage because the webpage corresponding to the abnormal information is the target abnormal webpage; optionally, if it is determined that the first similarity is not greater than the first threshold, that is, the degree of similarity between the target information and the abnormal information is low, the target webpage is determined to be a non-target abnormal webpage, for example, the target webpage is determined to be a non-404 page.
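The checks described so far (status code of the target information, status code of the abnormal information, response header value, first similarity) form a short decision chain, which might be sketched as follows. The dictionary layout, the default threshold of 0.9, the stripping of parameters such as `charset` before comparing Content-Type values, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions of this sketch; the embodiment does not specify how the first similarity is computed.

```python
from difflib import SequenceMatcher

ABNORMAL_STATUS = 404  # abnormal status code used in the examples of this embodiment


def media_type(header_value: str) -> str:
    """Assumed normalization: drop parameters such as charset before comparing."""
    return header_value.split(";", 1)[0].strip().lower()


def is_target_abnormal(target: dict, abnormal: dict, threshold: float = 0.9) -> bool:
    """Return True if the target page is judged to be the target abnormal page."""
    # The target information itself carries the abnormal status code.
    if target["status"] == ABNORMAL_STATUS:
        return True
    # The site answers the forged address with a real 404, so a target page
    # without that code is a non-target abnormal page.
    if abnormal["status"] == ABNORMAL_STATUS:
        return False
    # Different response header values: non-target abnormal page.
    if media_type(target["content_type"]) != media_type(abnormal["content_type"]):
        return False
    # Same header values: compare the bodies (first similarity).
    similarity = SequenceMatcher(None, target["body"], abnormal["body"]).ratio()
    return similarity > threshold


# Invented example data.
baseline = {"status": 200, "content_type": "text/html; charset=utf-8",
            "body": "<html>Sorry, page not found</html>"}
soft_404 = {"status": 200, "content_type": "text/html",
            "body": "<html>Sorry, page not found</html>"}
real_page = {"status": 200, "content_type": "text/html",
             "body": "<html><h1>Product list</h1><p>Widgets and gadgets</p></html>"}
print(is_target_abnormal(soft_404, baseline))
print(is_target_abnormal(real_page, baseline))
```

Ordering the checks from cheapest to most expensive means the body comparison only runs when the status codes and header values cannot settle the question.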

As an optional implementation manner, in step S106, determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information includes: dividing a target abnormal webpage corresponding to the abnormal network resource address into a plurality of first sub-webpages; identifying an abnormal sub-web page comprising abnormal information from a plurality of first sub-web pages; dividing the target webpage into a plurality of second sub-webpages, and determining the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages; acquiring a second similarity between the abnormal sub-web page and the target sub-web page; determining the target webpage as a target abnormal webpage under the condition that the second similarity is larger than a second threshold value; and under the condition that the second similarity is not larger than a second threshold value, determining that the target webpage is a non-target abnormal webpage.

In this embodiment, in the web pages returned by the servers of some sites, the content indicating that the web page is abnormal may occupy only a very small part of the page; for example, such a web page may include a title bar, a navigation bar, a copyright notice, a unit identifier, 404-related content, other content, and the like.

In the above case, the content of the HTML of the target abnormal web page may be:

In this case, the first similarity between the HTML content of the target abnormal web page and the HTML content of a normal web page may be very high, for example exceeding 99%. If the target web page is determined to be the target abnormal web page whenever the first similarity exceeds the first threshold, even a normal target web page would be determined to be the target abnormal web page, which causes misjudgment of the target web page.

In view of the above problem, in this embodiment the target web page and the target abnormal web page may each be analyzed as a whole and divided into different regions. The target abnormal web page corresponding to the abnormal network resource address may be divided into a plurality of first sub-webpages, and the abnormal sub-webpage including the abnormal information is identified from the plurality of first sub-webpages; for example, the abnormal sub-webpage may be the region where the 404 content is located. The embodiment may further divide the target webpage into a plurality of second sub-webpages, and determine the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages.

In this embodiment, the inverse algorithm of the maximum common substring may be adopted to extract the content that differs between the target webpage and the target abnormal webpage; the differing content is analyzed and processed and irrelevant parts are removed, finally yielding the abnormal sub-webpage and the target sub-webpage. The second similarity between the abnormal sub-webpage and the target sub-webpage is then calculated and may be used as the similarity between the whole target webpage and the target abnormal webpage. The embodiment determines whether the second similarity is greater than a second threshold: if so, it may be determined that the target webpage is indeed the target abnormal webpage; if not, it may be determined that the target webpage is a non-target abnormal webpage.
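By way of example only, the extraction of differing content and the calculation of the second similarity may be sketched as follows. The embodiment does not define the "inverse algorithm of the maximum common substring" in detail; this sketch substitutes a line-level diff from Python's `difflib` as a stand-in, and it omits the unspecified step of removing irrelevant parts. The function names `differing_regions` and `second_similarity` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def differing_regions(target_html: str, abnormal_html: str) -> Tuple[str, str]:
    """Return the lines that are NOT shared by the two pages.

    Assumption: shared lines (title bar, navigation bar, footer, ...) are
    treated as page chrome and discarded; what remains approximates the
    target sub-webpage and the abnormal sub-webpage.
    """
    target_lines = target_html.splitlines()
    abnormal_lines = abnormal_html.splitlines()
    matcher = SequenceMatcher(None, target_lines, abnormal_lines, autojunk=False)
    target_parts, abnormal_parts = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replaced, inserted, or deleted lines
            target_parts.extend(target_lines[i1:i2])
            abnormal_parts.extend(abnormal_lines[j1:j2])
    return "\n".join(target_parts), "\n".join(abnormal_parts)


def second_similarity(target_html: str, abnormal_html: str) -> float:
    """Similarity of the differing regions, used for the whole pages."""
    target_region, abnormal_region = differing_regions(target_html, abnormal_html)
    if not target_region and not abnormal_region:
        return 1.0  # the two pages are identical, nothing differs
    matcher = SequenceMatcher(None, target_region, abnormal_region, autojunk=False)
    return matcher.ratio()
```

With two pages that share a large amount of chrome but differ in one content line, the whole-page similarity stays high while `second_similarity` drops, which is the effect the region-based comparison is intended to achieve. A practical implementation would also need the filtering of irrelevant differences that the embodiment mentions.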

For example, the site performs fault-tolerant processing on user access through the client. When the client is displaying a web page and then accesses an erroneous, non-existent abnormal network resource address, the server does not return an abnormal status code to the client; instead it returns the page currently displayed by the client together with a 200 status code. The returned page actually indicates that the client accessed an abnormal network resource address, and should therefore be treated as the target abnormal webpage. In this case, the target webpage and the target abnormal webpage of the embodiment are the same, and the similarity calculation between them determines that the target webpage is in fact the target abnormal webpage. In the prior art, by contrast, the status code returned for the target webpage is 200 rather than 404, so the target webpage is determined to be a normal webpage based on the 200 status code, that is, the target abnormal webpage is misjudged as normal. The method of this embodiment can therefore effectively improve the accuracy of identifying the webpage and solves the technical problem of low accuracy of identifying the webpage.

The embodiment of the invention also provides a flow chart of another webpage identification method from the interactive side.

Fig. 2 is a flowchart of another web page identification method according to an embodiment of the present invention. As shown in fig. 2, the method may include the steps of:

Step S202, displaying target information obtained by accessing the target network resource address in the interactive interface.

In the technical solution provided in step S202 of the present invention, the target network resource address is the address of the web page for which it is to be determined whether the target web page obtained by accessing it is abnormal, and may be an actually existing URL.

In this embodiment, the client accesses the target network resource address by sending a request to the server of the target site where the target network resource address is located. Before the client receives and displays the target webpage corresponding to the target network resource address, the server responds to the request and returns the target information to the client, so that the client obtains the target information and displays it in the interactive interface. The client may be a browser, and the target information may include a response line, a response header, and a response body, where the response line includes the status code returned by the server, the response header includes the type of the target web page, and the response body includes the content of the target web page.

Step S204, displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address.

In the technical solution provided in step S204 of the present invention, the target network resource address may be analyzed and modified in advance to obtain the abnormal network resource address, which may be a URL. That is, the abnormal network resource address of this embodiment may be a network resource address that is forged based on the real target network resource address and does not actually exist, but can still be accessed to obtain the abnormal information; the abnormal information may be stored in advance for subsequent use.

The abnormal information of this embodiment may be the response information returned by the server in response to the request for accessing the abnormal network resource address, for example, 404 status response information. Optionally, the response information of the server in response to a request for accessing a normal network resource address may include a 200 status code.

After the target information obtained by accessing the target network resource address is displayed in the interactive interface, the abnormal information corresponding to the target information can be obtained, and the abnormal information corresponding to the target information is displayed in the interactive interface.

Step S206, outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information.

In the technical solution provided by step S206 of the present invention, after the abnormal information corresponding to the target information is displayed in the interactive interface, whether the target webpage corresponding to the target network resource address is abnormal may be determined based on the abnormal information. The identification result, which indicates whether the target webpage is indeed an abnormal page (for example, whether the target webpage is a 404 page), may be obtained by comparing and analyzing the status codes, types, and contents of the target webpage corresponding to the target network resource address and of the abnormal page corresponding to the abnormal network resource address, and is then output. Because the embodiment does not determine whether the target webpage is abnormal directly from the returned status code, it can handle the case where the site uses a customized status code rather than the 404 (404 Not Found) status code, and the case where the site performs fault-tolerant processing and returns the current page or a default page when an access error occurs. Whether the target webpage is abnormal can thus be accurately identified, achieving the technical effect of improving the accuracy of identifying the webpage.

Through the steps S202 to S206, target information obtained by accessing the target network resource address is displayed in the interactive interface; displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address; and outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through abnormal information. That is to say, in the embodiment, an abnormal network resource address is forged based on a target network resource address, whether a target webpage corresponding to the target network resource address is abnormal is determined by accessing abnormal information obtained by the abnormal network resource address, and an identification result for indicating whether the target webpage corresponding to the target network resource address is abnormal is output, so that it is avoided that whether the target webpage is abnormal is determined directly based on a returned abnormal state code, and thus the purpose of effectively identifying whether the target webpage is abnormal is achieved, thereby achieving the technical effect of improving the accuracy of identifying the webpage, and further solving the technical problem of low accuracy of identifying the webpage.

Example 2

The web page recognition method of the present invention will be described below by way of example with reference to preferred embodiments.

In this embodiment, when a user accesses a web page through a browser, the browser sends a request to the server of the site where the web page is located. Before the browser receives and displays the web page, the server returns a response line containing an HTTP status code in response to the browser's request. The HTTP status code represents the server's response status under the Hypertext Transfer Protocol.

In this embodiment, the abnormal status code may be the 404 status code, for example 404 Not Found, which is a status code returned under the HTTP protocol for a web page error condition. When a user enters a URL in a browser, the server judges from the URL whether corresponding webpage information exists; if it does not, the link entered by the user is probably invalid, and the server returns the 404 status code to tell the user that the corresponding webpage information cannot be found.

The page corresponding to the abnormal status code is a 404 page, which plays an important role in the field of data crawlers. When the crawler requests a wrong URL and obtains the 404 status code and the 404 page, it knows that the URL is invalid, does not analyze or process the page, and deletes the URL from the constructed site tree. If the crawler wrongly processes the 404 page as a normal page, resources are wasted (since a 404 page usually does not contain valid content) and invalid links are added to the site tree.

Therefore, correctly identifying whether a webpage is a 404 page plays an important role in improving the performance and accuracy of the crawler.

In the related art, whether the currently processed target web page is a 404 page may be determined according to the HTTP status code: the status line of the HTTP response is obtained, and the status code is extracted from it to determine whether the currently processed target web page is a 404 page. However, 404 page identification based on the status code cannot identify the target web page in the following cases: the site uses a customized status code rather than the 404 (404 Not Found) status code; and the site performs fault-tolerant processing and returns the current page or a default page when an access error occurs.

This embodiment provides a web page identification method that improves on the above 404 determination method based on the HTTP status code.

This embodiment analyzes and processes a real URL (true) to obtain a URL (false) that has the same format and is valid but does not exist, and accesses both URLs to obtain the page (true) and the page (false). Whether the page (true) is a 404 page is judged through comparative analysis of the status codes, types and contents of the page (true) and the page (false), thereby improving the accuracy of identifying web pages.

The identification method of the web page of this embodiment is further described below.

Fig. 3 is a flowchart of a method for collecting real 404 response information according to an embodiment of the invention. As shown in fig. 3, the method may include the steps of:

Step S301, acquiring a URL that really exists on the site (the status code of the HTTP response is 200).

Step S302, a non-existent URL is forged according to a real URL.

Step S303, accessing the forged URL, obtaining the 404 response information of the site, and storing it for subsequent judgment.

The 404 response information in this embodiment may be information such as a 404 response page, a 404 status code, and the like.
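Purely as an illustrative sketch, steps S301 to S303 may be expressed as follows. The function name `collect_404_baseline`, the `(status code, body)` tuple shape, and the injected `forge`, `fetch`, and `store` parameters are hypothetical; the embodiment does not prescribe how pages are fetched or where the response information is stored.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical minimal response shape: (status code, body).
Fetched = Tuple[int, str]


def collect_404_baseline(real_url: str,
                         forge: Callable[[str], str],
                         fetch: Callable[[str], Fetched],
                         store: Dict[str, Fetched]) -> Optional[Fetched]:
    """Collect and store the site's response to a forged, non-existent URL."""
    status, _body = fetch(real_url)       # S301: the real URL must answer 200
    if status != 200:
        return None
    forged_url = forge(real_url)          # S302: forge a non-existent URL
    baseline = fetch(forged_url)          # S303: access the forged URL
    store[real_url] = baseline            # keep it for the later judgment
    return baseline
```

Passing the fetch function in as a parameter keeps the sketch independent of any particular HTTP client library.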

Since some applications are very sensitive to the format of the URL, and even have different 404 processing flows for different URLs, for example, accessing different directories and files with different suffix names may have different 404 responses, it is necessary to ensure that the forged URL is consistent with the real URL format.

Fig. 4 is a flowchart of a method for forging a URL when a file name of a real URL exists according to an embodiment of the present invention. As shown in fig. 4, the method may include the steps of:

Step S401, acquiring characters in the filename of the real URL.

In this embodiment, if a file name exists in the real URL, the file name may be modified to construct a new file name when forging the nonexistent URL.

Step S402, judging the type of the acquired character in the filename.

In step S403, the characters of the alphabet type are replaced with random letters.

In step S404, the characters of the numeric type are replaced with random numbers.

Step S405, the special symbol is kept unchanged.

Step S406, determining whether all characters in the filename have been processed.

If it is judged that not all characters in the filename have been processed, step S401 is executed again; if it is judged that all characters in the filename have been processed, step S407 is executed.

In step S407, the forged URL including the modified filename is output.
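The character-by-character replacement of steps S401 to S407 may, for illustration only, be sketched as follows. The function names `forge_filename` and `forge_url` are hypothetical. Preserving letter case and leaving the suffix (the text after the last dot) unchanged are assumptions of this sketch, motivated by the earlier remark that files with different suffix names may receive different 404 responses; the embodiment itself only states that letters, digits, and special symbols are handled as described.

```python
import random
import string
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def forge_filename(filename: str, rng: random.Random) -> str:
    """Replace letters and digits with random ones; keep special symbols."""
    forged = []
    for ch in filename:                          # S401: next character
        if ch in string.ascii_lowercase:         # S402/S403: letter
            forged.append(rng.choice(string.ascii_lowercase))
        elif ch in string.ascii_uppercase:       # S403, case kept (assumption)
            forged.append(rng.choice(string.ascii_uppercase))
        elif ch in string.digits:                # S404: digit
            forged.append(rng.choice(string.digits))
        else:                                    # S405: special symbol
            forged.append(ch)
    return "".join(forged)                       # S406: all characters done


def forge_url(real_url: str, rng: Optional[random.Random] = None) -> str:
    """Return the real URL with its file name replaced by a forged one (S407)."""
    if rng is None:
        rng = random.Random()
    parts = urlsplit(real_url)
    directory, slash, filename = parts.path.rpartition("/")
    stem, dot, suffix = filename.rpartition(".")
    if dot:   # keep the suffix unchanged (assumption of this sketch)
        new_name = forge_filename(stem, rng) + dot + suffix
    else:
        new_name = forge_filename(filename, rng)
    return urlunsplit(parts._replace(path=directory + slash + new_name))
```

Because the replacement is random, the forged name could by chance equal the original or match another existing file; a caller might regenerate the name in that case.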

Fig. 5 is a flowchart of a method for forging a URL when a file name does not exist in the URL according to an embodiment of the present invention. As shown in fig. 5, the method may include the steps of:

In step S501, a file name is randomly generated.

In the case where no file name exists in the real URL, a fixed-length file name consisting of letters and numbers can be randomly generated.

Step S502, a forged URL including a randomly generated file name is output.

Through the above processing, a forged URL basically conforming to the format of the real URL can be obtained.
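For the case in which the real URL contains no file name, steps S501 and S502 may be sketched as follows. The function name `forge_url_without_filename`, the default length of 8, and the use of lowercase letters are assumptions of this sketch; the embodiment only requires a fixed-length name made of letters and numbers.

```python
import random
import string
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def forge_url_without_filename(real_url: str, length: int = 8,
                               rng: Optional[random.Random] = None) -> str:
    """Append a random fixed-length file name to a URL that has none."""
    if rng is None:
        rng = random.Random()
    alphabet = string.ascii_lowercase + string.digits
    # S501: randomly generate a fixed-length name of letters and digits.
    name = "".join(rng.choice(alphabet) for _ in range(length))
    parts = urlsplit(real_url)
    # Make sure the directory part ends with "/" before appending the name.
    directory = parts.path if parts.path.endswith("/") else parts.path + "/"
    # S502: output the forged URL containing the generated file name.
    return urlunsplit(parts._replace(path=directory + name))
```

The caller is assumed to have already decided that the URL has no file name; the sketch does not attempt to make that decision itself.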

The forged URL is then accessed to obtain the valid 404 page corresponding to the forged URL, and this valid 404 page is compared with the target webpage to be processed to judge whether the target webpage is a 404 page.

Fig. 6 is a flowchart of a method for identifying 404 pages according to an embodiment of the invention. As shown in fig. 6, the method may include the steps of:

In step S601, the status code of the target web page being processed is acquired.

In this embodiment, the target web page being processed, that is, the web page that needs to be determined whether it is abnormal or not, may be represented by Response.

In step S602, it is determined whether the status code of the target web page is 404.

Step S603, directly determining that the target web page is a 404 page.

If the status code of the target webpage is judged to be the 404 status code, the target webpage can be directly determined to be a 404 page.

In step S604, the status code of the forged valid 404 page is acquired.

If the status code of the target webpage is judged not to be the 404 status code, the target webpage may still be a 404 page customized by the site, so the status code of the valid 404 page of the forged URL is obtained.

The forged valid 404 page of this embodiment may be represented by Response 404.

In step S605, it is determined whether or not the status code of the valid 404 page of the forged URL is 404 status code.

In step S606, it is determined that the target web page is not the 404 page.

If the status code of the target webpage is judged not to be the 404 status code, and the status code of the valid 404 page of the forged URL is judged to be the 404 status code, the target webpage is determined not to be a 404 page, since the site evidently returns the 404 status code for non-existent URLs.

In step S607, it is determined whether the HTTP response header values of the target web page and the valid 404 page of the forged URL are the same.

If the status code of the valid 404 page of the forged URL is judged not to be the 404 status code, whether the HTTP response header values of the target webpage and the valid 404 page of the forged URL are the same is judged.

When determining whether the HTTP response header values of the target web page and the valid 404 page of the forged URL are the same, this embodiment may compare whether the content types of the two pages are the same.

In step S608, it is determined that the target web page is not the 404 page.

If the HTTP response header values of the target webpage and the valid 404 page of the forged URL are judged to be different, the target webpage is determined not to be a 404 page.

In step S609, the similarity between the target web page and the valid 404 page of the forged URL is calculated.

If the HTTP response header values of the target webpage and the valid 404 page of the forged URL are judged to be the same, the similarity between the target webpage and the valid 404 page of the forged URL is calculated.

The entity body of the target web page may be compared with the entity body of the valid 404 page of the forged URL to obtain the similarity.

In step S610, it is determined whether the similarity between the target web page and the valid 404 page of the forged URL is greater than a threshold.

The threshold value of this embodiment may be configured according to a specific scenario.

In step S611, it is determined that the target web page is not a 404 page.

If the similarity between the target webpage and the valid 404 page of the forged URL is judged to be not greater than the threshold value, the target webpage is determined not to be a 404 page.

In step S612, the target web page is determined to be a 404 page.

If the similarity between the target webpage and the valid 404 page of the forged URL is judged to be greater than the threshold value, the target webpage is determined to be a 404 page.
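The decision flow of steps S601 to S612 may be summarized, as a non-limiting sketch, by the function below. The `PageResponse` class, the function name `is_404_page`, the `similarity` callable, and the default threshold of 0.9 are hypothetical; the parameter names `response` and `response404` follow the Response and Response 404 notation used above, and the content type is used as the compared response header value, as the embodiment suggests.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PageResponse:
    status_code: int    # status code from the response line
    content_type: str   # response header value compared in S607
    body: str           # entity body compared in S609


def is_404_page(response: PageResponse,
                response404: PageResponse,
                similarity: Callable[[str, str], float],
                threshold: float = 0.9) -> bool:
    """Decide whether `response` is a 404 page, given the forged-URL response."""
    if response.status_code == 404:              # S602 -> S603
        return True
    if response404.status_code == 404:           # S605 -> S606
        return False
    if response.content_type != response404.content_type:   # S607 -> S608
        return False
    # S609/S610: compare entity bodies; S612 if greater than the threshold,
    # otherwise S611.
    return similarity(response.body, response404.body) > threshold
```

For a fault-tolerant site that answers a non-existent URL with the current page and a 200 status code, `response` and `response404` carry the same body, so the function returns true even though neither status code is 404.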

Fig. 7 is a flowchart of a device for determining 404 pages according to an embodiment of the present invention. As shown in fig. 7, the 404 page judging device 70 may include: a URL falsification unit 71 and a page analysis processing unit 72.

The device for judging 404 pages in this embodiment may be used to execute the method for identifying 404 pages in the embodiment of the present invention. The URL falsification unit 71 may be configured to execute the method of forging a URL when a file name exists in the real URL shown in fig. 4 and the method of forging a URL when a file name does not exist in the URL shown in fig. 5 according to the embodiment of the present invention. The page analysis processing unit 72 may be configured to execute the method for identifying 404 pages shown in fig. 6 according to the embodiment of the present invention, and finally output a result indicating whether the target web page is abnormal.

In the embodiment, when the method and the apparatus provided by the present invention are applied, the following optimization schemes may be considered.

In the pages returned by some sites, the 404-related content accounts for only a small proportion of the page, as shown in fig. 8. Fig. 8 is a schematic diagram of the position of the 404-related content in the web page according to the embodiment of the present invention; the web page where the 404-related content is located further includes a title bar, a navigation bar, a copyright notice, a unit identifier, and other content.

the HTML content of a normal web page may be:

For such a situation, in this embodiment, before the similarity is calculated, the target web page and the valid 404 page of the forged URL are each analyzed as a whole, so that the valid 404 page of the forged URL can be divided into different regions and the region where the 404 content is located can be identified and extracted. The method adopts the inverse algorithm of the maximum common substring to extract the content that differs between the target webpage and the valid 404 page of the forged URL, analyzes and processes it, and removes irrelevant parts, finally obtaining the region where the 404-related content is located in the valid 404 page of the forged URL and the corresponding region in the target webpage. The similarity between the two regions is calculated and taken as the similarity between the whole target webpage and the valid 404 page of the forged URL in the subsequent calculation.

An application scenario of this embodiment is exemplified below.

The embodiment takes the application scenario of a classical web information crawler as an example: the target site performs fault-tolerant processing on user access and, when an incorrect, non-existent link is accessed, does not return 404 directly but returns the current page. In this case, the target web page obtained in the embodiment is actually identical to the valid 404 page of the forged URL, and the page can be determined to be a 404 page through the similarity calculation. The conventional 404 page identification technique obtains a status code of 200 for the returned page, and therefore mistakenly determines the current page to be a normal page.

In this embodiment, the real URL is analyzed and processed by the above forgery algorithm, which keeps the file name format consistent, to obtain a forged URL that has a consistent format and is valid but does not exist. Whether the target webpage is a 404 page can then be determined by the 404 page determination method based on page content analysis, together with the optimized 404 page determination method for large pages. The embodiment can therefore support the case where the site uses a customized status code rather than the 404 status code, and the case where the site performs fault-tolerant processing and returns the current page or a default page when an access error occurs, thereby improving the accuracy of identifying the site's 404 pages and erroneous URLs and solving the technical problem of low accuracy in identifying webpages.

Example 3

The embodiment of the invention also provides a webpage identification device. It should be noted that the web page identification apparatus of this embodiment may be used to execute the web page identification method of the embodiment shown in fig. 1 of the present invention.

Fig. 9 is a schematic diagram of a web page recognition apparatus according to an embodiment of the present invention. As shown in fig. 9, the web page identification device 90 may include: a first acquisition unit 91, a second acquisition unit 92, and a third acquisition unit 93.

A first obtaining unit 91, configured to obtain target information obtained by accessing the target network resource address.

A second obtaining unit 92, configured to obtain exception information corresponding to the target information, where the exception information is obtained in advance by accessing an exception network resource address, and the exception network resource address is obtained by modifying the target network resource address.

A third obtaining unit 93, configured to determine whether the target webpage corresponding to the target network resource address is abnormal based on the abnormality information.

The embodiment of the invention also provides another webpage identification device. It should be noted that the web page identification apparatus of this embodiment may be used to execute the web page identification method of the embodiment shown in fig. 2 of the present invention.

Fig. 10 is a schematic diagram of another web page recognition apparatus according to an embodiment of the present invention. As shown in fig. 10, the web page recognition apparatus 100 may include: a first display unit 101, a second display unit 102, and an output unit 103.

The first display unit 101 is configured to display target information obtained by accessing the target network resource address in the interactive interface.

The second display unit 102 is configured to display abnormal information corresponding to the target information in the interactive interface, where the abnormal information is obtained in advance by accessing an abnormal network resource address, and the abnormal network resource address is obtained by modifying the target network resource address.

The output unit 103 is configured to output a recognition result, where the recognition result is used to indicate whether the target webpage corresponding to the target network resource address is abnormal or not, and is obtained through abnormal information.

The webpage identification device of the embodiment forges the abnormal network resource address based on the target network resource address, determines whether the target webpage corresponding to the target network resource address is abnormal or not through the abnormal information obtained by accessing the abnormal network resource address, and avoids the situation that whether the target webpage is abnormal or not directly determined based on the returned abnormal state code, so that the purpose of effectively identifying whether the target webpage is abnormal or not is achieved, the technical effect of improving the accuracy of identifying the webpage is achieved, and the technical problem of low accuracy of identifying the webpage is solved.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims

1. A method for identifying a web page, comprising:

acquiring target information obtained by accessing a target network resource address;

acquiring abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address;

and determining whether the target webpage corresponding to the target network resource address is abnormal or not based on the abnormal information.

2. The method of claim 1, wherein the format of the exception network resource address is the same as the format of the target network resource address.

3. The method of claim 1, further comprising:

under the condition that a file name exists in the target network resource address, modifying the file name, and determining the target network resource address after the file name is modified as the abnormal network resource address;

and under the condition that no file name exists in the target network address, randomly generating a target file name, and determining the target network resource address comprising the target file name as the abnormal network resource address.

4. The method of claim 1, wherein determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormal information comprises:

determining whether the target webpage is abnormal based on the abnormal information in the case that the target information does not comprise an abnormal status code corresponding to the abnormal network resource address.

5. The method of claim 4, wherein, in the case that the target information does not comprise the abnormal status code, determining whether the target webpage is abnormal based on the abnormal information comprises:

in the case that the target information does not comprise the abnormal status code, if the abnormal information comprises the abnormal status code, determining that the target webpage is a non-target abnormal webpage.

6. The method of claim 5, further comprising:

if the abnormal information does not comprise the abnormal status code, comparing a first response header value in the target information with a second response header value in the abnormal information;

and in the case that the first response header value and the second response header value are different, determining that the target webpage is a non-target abnormal webpage.

7. The method of claim 6, further comprising:

in the case that the first response header value and the second response header value are the same, acquiring a first similarity between the target information and the abnormal information;

determining the target webpage to be a target abnormal webpage under the condition that the first similarity is larger than a first threshold value;

and under the condition that the first similarity is not larger than a first threshold value, determining that the target webpage is the non-target abnormal webpage.
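
The decision order set out in claims 4 to 7 can be sketched as below. This is a minimal sketch under stated assumptions: the `Response` record, the rule that status codes of 400 and above are abnormal, the 0.9 threshold, and the use of `difflib.SequenceMatcher` as the similarity measure are all hypothetical choices, since the claims name neither a specific header, a similarity algorithm, nor a threshold value. The first branch (target response already carrying an abnormal status code) lies outside claims 4 to 7 and follows the conventional status-code check described in the Background.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Response:
    """Hypothetical container for the information obtained from one access."""
    status: int         # HTTP status code
    header_value: str   # value of the response header being compared
    body: str           # page content


def is_abnormal_status(status: int) -> bool:
    # Assumption: 4xx and 5xx codes are treated as abnormal status codes.
    return status >= 400


def is_target_abnormal(target: Response, abnormal: Response,
                       threshold: float = 0.9) -> bool:
    """Return True if the target webpage is judged a target abnormal webpage."""
    if is_abnormal_status(target.status):
        # Outside claims 4-7: conventional status-code check (assumption).
        return True
    if is_abnormal_status(abnormal.status):
        # Claim 5: the site reports missing pages properly, so a normal
        # status on the target is trusted.
        return False
    if target.header_value != abnormal.header_value:
        # Claim 6: differing response header values.
        return False
    # Claim 7: compare the two responses and apply the first threshold.
    similarity = SequenceMatcher(None, target.body, abnormal.body).ratio()
    return similarity > threshold
```

The ordering puts the cheap checks (status code, header value) before the content comparison, so the similarity is only computed when the earlier checks cannot separate the two responses.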

8. The method of claim 1, wherein determining whether the target webpage corresponding to the target network resource address is abnormal based on the abnormal information comprises:

dividing the target abnormal webpage corresponding to the abnormal network resource address into a plurality of first sub-webpages;

identifying an abnormal sub-webpage comprising the abnormal information from the plurality of first sub-webpages;

dividing the target webpage into a plurality of second sub-webpages, and determining the target sub-webpage corresponding to the abnormal sub-webpage from the plurality of second sub-webpages;

acquiring a second similarity between the abnormal sub-webpage and the target sub-webpage;

determining the target webpage to be a target abnormal webpage under the condition that the second similarity is larger than a second threshold value;

and under the condition that the second similarity is not larger than a second threshold value, determining that the target webpage is a non-target abnormal webpage.
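
One possible reading of the sub-webpage comparison in claim 8 is sketched below. Every concrete choice here is an assumption for illustration: pages are divided at blank lines, the abnormal sub-webpage is the first block containing one of a few marker phrases, the corresponding target sub-webpage is the block at the same position, and similarity is measured with `difflib.SequenceMatcher` against a threshold of 0.8. The claim itself does not fix any of these.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def split_page(page: str) -> List[str]:
    # Assumption: sub-webpages are blocks of text separated by blank lines.
    return [block.strip() for block in page.split("\n\n") if block.strip()]


def is_abnormal_by_subpage(target_page: str, abnormal_page: str,
                           markers: Sequence[str] = ("not found", "error"),
                           threshold: float = 0.8) -> bool:
    """Compare only the block that carries the abnormal information."""
    first_subpages = split_page(abnormal_page)
    second_subpages = split_page(target_page)

    # Identify the abnormal sub-webpage among the first sub-webpages.
    for index, block in enumerate(first_subpages):
        if any(marker in block.lower() for marker in markers):
            break
    else:
        return False  # no abnormal sub-webpage found (assumed outcome)

    # Assumption: the corresponding target sub-webpage has the same position.
    if index >= len(second_subpages):
        return False

    similarity = SequenceMatcher(None, block, second_subpages[index]).ratio()
    return similarity > threshold
```

Restricting the comparison to one block keeps shared headers, menus and footers from inflating the similarity between a normal page and an error page of the same site.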

9. A method for identifying a web page, comprising:

displaying, in an interactive interface, target information obtained by accessing a target network resource address;

displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address;

and outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through the abnormal information.

10. A web page recognition apparatus, comprising:

a first acquisition unit configured to acquire target information obtained by accessing a target network resource address;

a second acquisition unit configured to acquire abnormal information corresponding to the target information, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address;

and a third acquisition unit configured to determine whether the target webpage corresponding to the target network resource address is abnormal based on the abnormal information.

11. A web page recognition apparatus, comprising:

a first display unit configured to display, in an interactive interface, target information obtained by accessing a target network resource address;

the second display unit is used for displaying abnormal information corresponding to the target information in the interactive interface, wherein the abnormal information is obtained by accessing an abnormal network resource address in advance, and the abnormal network resource address is obtained by modifying the target network resource address;

and the output unit is used for outputting an identification result, wherein the identification result is used for indicating whether the target webpage corresponding to the target network resource address is abnormal or not and is obtained through the abnormal information.

12. A storage medium, characterized in that the storage medium comprises a stored program, wherein when the program runs, a device where the storage medium is located is controlled to execute the web page identification method according to any one of claims 1 to 9.

13. A processor, configured to run a program, wherein the program, when run, performs the web page identification method according to any one of claims 1 to 9.