CN111104617A

CN111104617A - Webpage data acquisition method and device, electronic equipment and storage medium

Info

Publication number: CN111104617A
Application number: CN201911265567.0A
Authority: CN
Inventors: 方琦
Original assignee: Xian Yep Telecommunication Technology Co Ltd
Current assignee: Xian Yep Telecommunication Technology Co Ltd
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2020-05-05
Anticipated expiration: 2039-12-11
Also published as: CN111104617B

Abstract

The application provides a webpage data acquisition method and device, electronic equipment and a storage medium. The method comprises: determining a queue to be crawled according to a set of URLs to be crawled and a knowledge pattern base, wherein the queue to be crawled comprises a first candidate queue and the first candidate queue comprises at least one first URL; acquiring the webpage data pointed to by the first URL; updating the sub-depth attribute value of each sub-URL according to the relevance between the webpage keywords of the webpage content included in the webpage data and a target theme; stopping the acquisition of the webpage data pointed to by a sub-URL when its sub-depth attribute value is zero; and determining the webpage data pointed to by all sub-URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, as the target webpage data. The method and the device realize vertical search of webpage data, provide a reference for services such as text clustering, improve the accuracy and pertinence of the acquired webpage data, and reduce the space-time complexity of the acquisition process.

Description

Webpage data acquisition method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for acquiring webpage data, an electronic device, and a storage medium.

Background

With the rapid development of computer technology, the internet has become one of the main carriers of all kinds of information, and the volume of web page data has grown explosively. Currently, in many scenarios, web page data needs to be extracted from the internet for reference. How to quickly and accurately acquire web page data related to a target theme from massive network information data is therefore particularly important.

A conventional method for acquiring web page data is usually implemented by using a web crawler technology, for example, acquiring all web page data that may be pointed to by one or several initial Uniform Resource Locators (URLs).

However, the existing method treats every URL identically and cannot use the crawling technology to acquire the required web page data selectively and with priority, so that the acquired web page data has poor pertinence, and services such as vertical search and text clustering of the data cannot be realized.

Disclosure of Invention

The application provides a webpage data acquisition method, a webpage data acquisition device, electronic equipment and a storage medium, and aims to solve the technical problems that the existing webpage data acquisition method is poor in pertinence and cannot realize services such as vertical search and text clustering of data.

In a first aspect, the present application provides a method for acquiring webpage data, including:

determining a queue to be crawled according to a set of uniform resource locators (URLs) to be crawled and a knowledge pattern base, wherein the queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, the first URL is a URL whose regular pattern is known, the regular pattern is used for representing the theme attribute of the webpage data pointed to by the URL to be crawled, the URL to be crawled comprises a depth attribute value, and the depth attribute value is used for representing the degree of affinity between the webpage data pointed to by the URL to be crawled and parent webpage data;

acquiring webpage data pointed by the first URL, wherein the webpage data comprise webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub depth attribute values;

updating the sub-depth attribute value according to the relevancy of the webpage keyword and the target theme;

and stopping acquiring the webpage data pointed to by the sub URL when the sub-depth attribute value is zero, and determining the webpage data pointed to by all the sub URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, as target webpage data.

In one possible design, the updating the sub-depth attribute value according to the relevance of the webpage keyword and the target topic includes:

if the correlation degree is larger than a preset threshold value, resetting the sub-depth attribute value to a maximum value so as to update the sub-depth attribute value;

and if the correlation degree is not larger than the preset threshold value, attenuating the sub-depth attribute value once to update the sub-depth attribute value.

In one possible design, the queue to be crawled further includes:

a second candidate queue comprising at least one second URL, the second URL being a URL for which the regular pattern is unknown;

and when the acquisition of the webpage data pointed by all the sub URLs in the first candidate queue is stopped, acquiring the webpage data pointed by the second URL.

In one possible design, before determining the queue to be crawled according to the set of URLs to be crawled and the knowledge pattern base, the method further includes:

judging whether the regular pattern of each URL in a preset URL set is known or not, wherein the webpage data pointed by each URL in the preset URL set and the target webpage data have theme correlation;

if the judgment result is yes, determining the knowledge pattern base according to the regular pattern;

and if the judgment result is negative, adding the URL into a URL set to be learned, and determining the knowledge pattern base according to the URL set to be learned.

In one possible design, after the obtaining the data of the webpage pointed by the first URL, the method further includes:

determining a priority of the first URL, the priority being used to indicate an order of arrangement of the first URL in the first candidate queue;

and acquiring the webpage data pointed by the sub URLs according to the arrangement sequence.

In one possible design, the determining the priority of the first URL includes:

and determining the priority of the first URL through a preset algorithm according to the webpage keywords and the depth attribute value.

In a second aspect, the present application provides a web page data acquiring apparatus, including:

a first processing module, configured to determine a queue to be crawled according to a set of uniform resource locators (URLs) to be crawled and a knowledge pattern base, wherein the queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, the first URL is a URL whose regular pattern is known, the regular pattern is used for representing the theme attribute of the webpage data pointed to by the URL to be crawled, the URL to be crawled comprises a depth attribute value, and the depth attribute value is used for representing the degree of affinity between the webpage data pointed to by the URL to be crawled and parent webpage data;

the acquisition module is used for acquiring webpage data pointed by the first URL, wherein the webpage data comprise webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub depth attribute values;

the updating module is used for updating the sub-depth attribute value according to the relevancy of the webpage keyword and the target theme;

and the second processing module is used for stopping acquiring the webpage data pointed by the sub URL when the sub-depth attribute value is zero, and determining the webpage data pointed by all the sub URLs with the sub-depth attribute value being nonzero and the webpage data pointed by the first URL as target webpage data.

In one possible design, the update module is specifically configured to:

In one possible design, the first processing module is further configured to:

determining a second candidate queue, wherein the second candidate queue comprises at least one second URL, and the second URL is a URL for which the regular pattern is unknown;

In one possible design, the apparatus further includes:

a third processing module, configured to:

In one possible design, the apparatus further includes:

a fourth processing module, configured to:

In one possible design, the fourth processing module is specifically configured to:

In a third aspect, the present application provides an electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method for web page data acquisition according to the first aspect and optional aspects.

In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the method for acquiring webpage data according to the first aspect and optional aspects.

The application provides a webpage data acquisition method, a device, electronic equipment and a storage medium. First, a queue to be crawled is determined according to a set of URLs to be crawled and a knowledge pattern base; the determined queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, and the regular pattern of the first URL is known. Then the webpage data pointed to by the first URL is acquired; the webpage data comprises webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub-depth attribute values. The sub-depth attribute values are updated according to the relevance between the webpage keywords and a target theme. When a sub-depth attribute value is zero, acquisition of the webpage data pointed to by the corresponding sub URL is stopped, and the webpage data pointed to by all sub URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, is determined as the target webpage data. Vertical search of the webpage data is thereby achieved, the acquired webpage data provides a reference for services such as text clustering, the accuracy and pertinence of the acquired webpage data are improved, and the space-time complexity of the webpage data acquisition process is reduced.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.

Fig. 1 is an application scenario diagram of a webpage data acquisition method provided by the present application;

Fig. 2 is a schematic flowchart of a method for acquiring webpage data according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of a method for determining a knowledge pattern base according to an embodiment of the present application;

Fig. 4 is a schematic flowchart of another method for acquiring webpage data according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a webpage data acquisition apparatus according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of methods and apparatus consistent with certain aspects of the present application, as detailed in the appended claims.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

With the rapid development of computer technology, the internet has become one of the main carriers of all kinds of information, and the volume of webpage data has grown explosively. Currently, in many scenarios, webpage data needs to be extracted from the internet for reference. For example, for webpage data about after-sales service, a user may obtain the webpage data of a certain product related to the after-sales theme, so as to optimize the product and improve user experience. How to quickly and accurately acquire webpage data related to a target theme from massive network information data is therefore particularly important, and remains a challenge. Conventional webpage data acquisition methods are usually implemented with web crawler technology, for example by starting from one or several initial URLs (uniform resource locators) and acquiring all the webpage data they may point to. However, the existing method treats every URL identically and cannot use the crawling technology to acquire webpage data related to a target theme selectively and with priority, so that the acquired webpage data has poor pertinence, and services such as vertical search and text clustering of the data cannot be realized.

In view of the above problems in the prior art, the present application provides a method and an apparatus for acquiring webpage data, an electronic device, and a storage medium. First, a queue to be crawled is determined according to a set of URLs to be crawled and a knowledge pattern base; the determined queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, and the regular pattern of the first URL is known. Then the webpage data pointed to by the first URL is acquired; the webpage data comprises webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub-depth attribute values. The sub-depth attribute values are updated according to the relevance between the webpage keywords and a target theme. When a sub-depth attribute value is zero, acquisition of the webpage data pointed to by the corresponding sub URL is stopped, and the webpage data pointed to by all sub URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, is determined as the target webpage data. Vertical search of webpages is thereby realized, the acquired webpage data provides a reference for services such as text clustering, and the accuracy and pertinence of the acquired webpage data are improved.

The technical solution of the present application will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

Fig. 1 is an application scenario diagram of the webpage data acquisition method provided by the present application. As shown in Fig. 1, the method is executed by an electronic device, which may be a terminal device such as a mobile phone, a computer or a tablet computer, or may be a server; Fig. 1 takes a server 100 as an example. With this webpage data acquisition method, vertical search of webpages can be achieved, and the acquired webpage data provides a reference for subsequent services such as text clustering. Compared with the prior art, the accuracy and pertinence of the acquired webpage data are also improved.

Referring to Fig. 1, the notebook computer 200 represents the source of the large amount of webpage data on the internet. Each item of webpage data corresponds to a URL, which is a compact representation of the location and access method of a network resource and is the address of a standard resource on the internet. Acquiring the webpage data related to the theme from a large amount of webpage data is therefore equivalent to acquiring it through the URLs to be crawled.

In the webpage data acquisition method, a queue to be crawled is first determined according to a set of URLs to be crawled and a knowledge pattern base. The queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, and the regular pattern of the first URL is known. The regular pattern represents the theme attribute of the webpage data pointed to by the URL to be crawled. For example, the theme expressed by the webpage data may relate to a mobile phone, or to different categories such as home furnishing, clothes and electric appliances; the mobile phone name and the categories home furnishing, clothes and electric appliances are then regular patterns. The URL to be crawled also comprises a depth attribute value, which represents the degree of affinity between the webpage data pointed to by the URL to be crawled and the parent webpage data, so each URL has a corresponding depth attribute value. After the queue to be crawled is determined, the webpage data pointed to by a first URL in the first candidate queue is acquired, wherein the webpage data comprises webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub-depth attribute values.
Further, the sub-depth attribute value is updated according to the relevance between the webpage keywords and the target theme. When the sub-depth attribute value is zero, further acquisition of the webpage data pointed to by the sub-URL is stopped, and the webpage data pointed to by all sub-URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, is determined as the target webpage data. Vertical search of the webpage data is thereby achieved, a reference is provided for services such as text clustering, and the accuracy and pertinence of the acquired webpage data are improved.

Fig. 2 is a schematic flowchart of a method for acquiring webpage data according to an embodiment of the present application. As shown in fig. 2, the method for acquiring web page data provided by this embodiment is executed by an electronic device, and includes:

S201: Determining a queue to be crawled according to the set of URLs to be crawled and the knowledge pattern base.

The queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, and the first URL is a URL whose regular pattern is known. The regular pattern is used for representing the theme attribute of the webpage data pointed to by the URL to be crawled, the URL to be crawled comprises a depth attribute value, and the depth attribute value is used for representing the degree of affinity between the webpage data pointed to by the URL to be crawled and parent webpage data.

S202: Acquiring the webpage data pointed to by the first URL.

The webpage data comprise webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub depth attribute values.

Step S201 and step S202 will be described in combination.

The URL set to be crawled comprises a plurality of URLs to be crawled, wherein the number of the URLs to be crawled is at least two. The set of URLs to be crawled may be understood to represent a vast amount of web page data that may include web page data related to the target topic. Each URL to be crawled is an arbitrary URL, and the embodiment of the present application is not limited thereto.

The URL to be crawled comprises a depth attribute value; that is, each URL to be crawled in the set of URLs to be crawled corresponds to a determined depth attribute value. The depth attribute value represents the degree of affinity between the webpage data pointed to by the URL to be crawled and the parent webpage data; in other words, it represents the extent to which the current URL may inherit the topic relevance of its parent URL.

The knowledge pattern base comprises a plurality of regular patterns, each of which represents the theme attribute of the webpage data pointed to by a URL to be crawled. For example, if the webpage content pointed to by the URLs to be crawled relates to household articles, electric appliances and clothes, then the regular patterns of the URLs to which such webpage data belongs are household articles, electric appliances and clothes. The regular pattern of each URL may therefore be set by a person skilled in the art according to the topic of the webpage data to be obtained, and the embodiment of the present application is not limited thereto. It should be noted that, in the embodiment of the present application, each URL to be crawled corresponds to a fixed regular pattern.

Determining the queue to be crawled according to the set of URLs to be crawled and the knowledge pattern base means grouping the URLs to be crawled in the set according to the regular patterns in the knowledge pattern base. Specifically, the regular pattern of each URL to be crawled is matched against the regular patterns already present in the knowledge pattern base. If the regular pattern of a URL to be crawled exists in the knowledge pattern base, in other words, if the regular pattern of that URL is known to the knowledge pattern base, the URL is placed in the first candidate queue and is a first URL. It is understood that the first candidate queue contains at least one URL to be crawled. Thus, the first candidate queue includes at least one first URL, where a first URL is a URL whose regular pattern is known to the knowledge pattern base.

Acquiring the webpage data pointed to by the first URL means downloading and parsing the page at the first URL; the result of the downloading and parsing is the webpage data pointed to by the first URL. The webpage data comprises webpage content and sub URLs, the webpage content comprises webpage keywords, and the sub URLs comprise sub-depth attribute values. It is to be understood that there may be one webpage keyword or a plurality of webpage keywords, and the embodiment of the present application is not limited thereto. The sub-depth attribute value represents the degree of affinity between the webpage data pointed to by the sub-URL and the webpage data pointed to by the first URL.
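As a non-limiting illustration, the parsing part of step S202 may be sketched as follows. The sketch assumes the page has already been downloaded as an HTML string; the names `LinkAndTextParser` and `parse_page`, the whitespace-based keyword extraction, and the choice to let every sub-URL start from the parent's depth attribute value are assumptions made for the example and are not specified by the present application.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects the href targets of <a> tags and the visible text of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sub_urls = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the parent page.
                self.sub_urls.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url, parent_depth):
    """Return (keywords, sub_urls); each sub-URL is paired with a sub-depth value."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    # Assumed keyword extraction: lowercase words of the visible text.
    keywords = [w.lower() for part in parser.text_parts for w in part.split()]
    # Assumption: a sub-URL starts from the depth attribute value of its parent.
    sub_urls = [(url, parent_depth) for url in parser.sub_urls]
    return keywords, sub_urls
```

A real crawler would replace the whitespace split with a proper keyword extraction step; the sketch only shows which pieces of webpage data (keywords, sub-URLs, sub-depth attribute values) the later steps consume.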

It should be noted that the first URL is a URL to be crawled, in the set of URLs to be crawled, whose regular pattern exists in the knowledge pattern base. Determining whether the regular pattern of a URL to be crawled is known to the knowledge pattern base requires acquiring the webpage data pointed to by that URL, so the embodiment of the present application does not limit the order of step S201 and step S202.

S203: Updating the sub-depth attribute value according to the relevance between the webpage keywords and the target theme.

As described above, the sub-depth attribute value is used to represent the affinity between the webpage data pointed by the sub-URL and the webpage data pointed by the first URL, and after the webpage data pointed by the first URL is obtained, the sub-depth attribute value is updated according to the relevance between the webpage keyword and the target topic. In other words, the correlation degree between the webpage keywords and the target subject is determined, and the sub-depth attribute value is updated according to the correlation degree.
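The present application does not prescribe how the relevance is computed. Purely as an illustrative assumption, it may be taken as the fraction of webpage keywords that belong to a set of terms describing the target theme; the function name `relevance` and the parameter `topic_terms` are hypothetical.

```python
def relevance(page_keywords, topic_terms):
    """Fraction of page keywords that belong to the target theme's term set.

    This keyword-overlap ratio is an assumed stand-in; the source leaves the
    relevance measure open.
    """
    if not page_keywords:
        return 0.0
    topic = {t.lower() for t in topic_terms}
    hits = sum(1 for kw in page_keywords if kw.lower() in topic)
    return hits / len(page_keywords)
```

Any other measure that yields a value comparable with the preset threshold could be substituted without changing the surrounding steps.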

In one possible implementation, if the correlation is greater than the preset threshold, the sub-depth attribute value is reset to the maximum value, so as to update the sub-depth attribute value. It can be understood that if the correlation degree is greater than the preset threshold, it indicates that the webpage data pointed by the sub URL has a great crawling potential for the target webpage data, so that the sub depth attribute value is reset to the maximum value. The maximum value may be set according to the number of layers of the web pages to be crawled for the specific URL to be crawled, for example, the maximum value is set to 30, that is, the number of layers of the web pages that may need to be crawled from the current child URL is 30. The present embodiment is not limited to this.

In another possible implementation manner, if the relevance is not greater than the preset threshold, the sub-depth attribute value is attenuated once to update it. That is, if the relevance between the target theme and the webpage keywords included in the webpage content of the webpage data pointed to by the first URL does not exceed the preset threshold, the current sub-depth attribute value of the sub-URL is attenuated once.

The preset threshold of the relevancy can be set according to the URL to be crawled, and the embodiment of the application is not limited.
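The two update branches described above may be sketched as follows. The maximum value 30 is the example given in the text; the threshold 0.5 is hypothetical, and interpreting "attenuated once" as a decrement by one (never below zero) is an assumption, since the size of one attenuation step is not specified.

```python
MAX_DEPTH = 30   # example maximum value taken from the text
THRESHOLD = 0.5  # hypothetical preset relevance threshold


def update_sub_depth(sub_depth, relevance, threshold=THRESHOLD, max_depth=MAX_DEPTH):
    """Reset the sub-depth value on a relevant page, otherwise attenuate it once."""
    if relevance > threshold:
        # Relevant page: the sub-URL regains the full crawling budget.
        return max_depth
    # Assumed attenuation step: decrement by one, never below zero.
    return max(sub_depth - 1, 0)
```

For instance, under these assumptions `update_sub_depth(2, 0.9)` yields 30, while `update_sub_depth(2, 0.5)` yields 1 because a relevance equal to the threshold is not greater than it.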

Updating the sub-depth attribute value according to the relevance between the webpage keywords and the target theme allows the crawling path to be controlled within limits. For example, the webpage data pointed to by the first URL may include a sub-URL whose own page has webpage keywords with a relevance to the target theme that does not exceed the preset threshold, while a page reached further along the same path through that sub-URL has webpage keywords whose relevance exceeds the preset threshold. In this case, by updating the sub-depth attribute value, the crawling path can get past the URL whose webpage keywords have a relevance not greater than the preset threshold. Limited, tentative crawling along the path is thereby realized, which improves the coverage of the crawler and reduces the interference of noisy webpages.

It can be understood that steps S202 and S203 form a loop. After the sub-depth attribute value is updated, the webpage data pointed to by the sub-URL is acquired in turn; this webpage data again includes webpage content and sub-URLs, the webpage content includes webpage keywords, and each of these sub-URLs includes a sub-depth attribute value. The current sub-depth attribute value is then updated according to the relevance between the current webpage keywords and the target theme, and step S202 is repeated with the current sub-URL taking the place of the first URL. When the sub-depth attribute value reaches zero, step S204 is performed.

S204: Stopping acquiring the webpage data pointed to by the sub URL when the sub-depth attribute value is zero, and determining the webpage data pointed to by all the sub URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, as target webpage data.

Step S202 and step S203 are executed in a loop, and acquisition of the webpage data pointed to by a sub-URL stops when its sub-depth attribute value is zero. It is understood that the sub-URLs here are all sub-URLs under the first URL determined in step S201, not only the URLs whose webpage data is one layer away from the webpage data pointed to by the first URL. Correspondingly, the sub-depth attribute values are those of all such sub-URLs, not only those of the URLs one layer away. When the current sub-depth attribute value is zero, acquisition along the crawling path on which the corresponding sub-URL lies is complete: further acquisition of the data pointed to by the current sub-URL is stopped, and the webpage data pointed to by all sub-URLs on that crawling path whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, is determined as target webpage data. The acquisition of all webpage data under the first URL of step S201 is thereby completed.

It is understood that the foregoing steps S202 to S204 are performed for all the first URLs in the first candidate queue.
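The loop formed by steps S202 to S204 may be sketched as follows for a single first URL. To keep the example self-contained, the web is replaced by a hypothetical in-memory dictionary `pages` mapping each URL to its keywords and sub-URLs. The keyword-overlap relevance, the decrement-by-one attenuation, the breadth-first visiting order and the `seen` set used to avoid revisiting a URL are all assumptions of the sketch rather than features stated in the text.

```python
from collections import deque


def crawl(first_url, pages, topic_terms, threshold=0.5, max_depth=2):
    """Collect the first URL and every sub-URL reached with a nonzero sub-depth value.

    `pages` is a hypothetical stand-in for the web:
    url -> (list of keywords, list of sub-URLs).
    """
    topic = set(topic_terms)
    target = []                       # URLs whose pages count as target data
    seen = {first_url}                # assumed de-duplication of URLs
    frontier = deque([(first_url, max_depth)])
    while frontier:
        url, depth = frontier.popleft()
        keywords, sub_urls = pages.get(url, ([], []))
        target.append(url)
        # Relevance of the current page (assumed keyword-overlap ratio).
        hits = sum(1 for kw in keywords if kw in topic)
        score = hits / len(keywords) if keywords else 0.0
        # Sub-URLs start from the current depth value, which is reset or attenuated.
        sub_depth = max_depth if score > threshold else max(depth - 1, 0)
        if sub_depth == 0:
            continue                  # stop following this crawling path
        for sub in sub_urls:
            if sub not in seen:
                seen.add(sub)
                frontier.append((sub, sub_depth))
    return target
```

With a maximum value of 2, a path of two consecutive low-relevance pages exhausts the sub-depth attribute value, so pages behind them are not acquired, whereas a single low-relevance page between relevant ones does not end the path.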

In the method for acquiring webpage data provided by this embodiment, a queue to be crawled is determined according to a set of URLs to be crawled and a knowledge pattern base, where the determined queue to be crawled includes a first candidate queue containing at least one first URL whose regular pattern is known. The webpage data pointed to by the first URL is acquired, where the webpage data includes webpage content and sub-URLs, the webpage content includes webpage keywords, and the sub-URLs include sub-depth attribute values. The sub-depth attribute values are updated according to the relevance between the webpage keywords and a target theme, and these steps are performed in a loop. When a sub-depth attribute value is zero, further acquisition of the webpage data pointed to by the corresponding sub-URL is stopped, and the webpage data pointed to by all sub-URLs whose sub-depth attribute values are nonzero, together with the webpage data pointed to by the first URL, is determined as target webpage data. Vertical search of the webpage data is thereby realized, and the crawling path is controlled within limits. Because the queue to be crawled is determined first, the URLs to be crawled in the first candidate queue are all URLs with known regular patterns, and the webpage data they point to provides a reference for services such as text clustering. The accuracy and pertinence of the acquired webpage data are also improved.

Optionally, the queue to be crawled further includes a second candidate queue, where the second candidate queue includes at least one second URL, and the second URL is a URL whose regular pattern is unknown.

The queue to be crawled, determined according to the URL set to be crawled and the knowledge pattern base, further includes a second candidate queue. URLs to be crawled whose regular patterns are known relative to the knowledge pattern base are placed in the first candidate queue, and URLs to be crawled whose regular patterns are unknown relative to the knowledge pattern base are placed in the second candidate queue. That is, the second candidate queue includes at least one second URL, and the second URL is a URL whose regular pattern is unknown.

It can be understood that once the webpage data pointed to by all sub-URLs in the first candidate queue has been acquired, that is, once acquisition of that webpage data has stopped, the webpage data pointed to by the second URL, that is, by the URL to be crawled whose regular pattern is unknown, is acquired. It should be noted that the steps performed on the URLs to be crawled in the second candidate queue are the same as those performed on the URLs to be crawled in the first candidate queue in the embodiment shown in fig. 1. In other words, steps S202 to S204 are also performed on the URLs to be crawled in the second candidate queue until the sub-depth attribute value of the sub-URL included in the webpage data pointed to by the second URL is zero; acquisition of the webpage data pointed to by that sub-URL then ends, and the webpage data pointed to by all sub-URLs with nonzero sub-depth attribute values, together with the webpage data pointed to by the second URL, are determined as the target webpage data.
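As a rough illustration of the split into two candidate queues, the sketch below assumes that a "regular pattern" can be represented as a compiled regular expression over the URL string; the application does not state this representation, and the names `build_queues`, `pattern_base` and the example URLs are hypothetical.

```python
import re
from typing import Iterable, List, Pattern, Tuple


def build_queues(
    urls: Iterable[str], pattern_base: List[Pattern[str]]
) -> Tuple[List[str], List[str]]:
    """Split URLs into (first candidate queue, second candidate queue)."""
    first: List[str] = []   # regular pattern known relative to the pattern base
    second: List[str] = []  # regular pattern unknown
    for url in urls:
        if any(p.search(url) for p in pattern_base):
            first.append(url)
        else:
            second.append(url)
    return first, second


# Hypothetical pattern base and URL set, for illustration only.
pattern_base = [re.compile(r"^https://news\.example\.com/tech/\d+$")]
urls = [
    "https://news.example.com/tech/101",
    "https://blog.example.org/post/7",
    "https://news.example.com/tech/102",
]
first_queue, second_queue = build_queues(urls, pattern_base)

# The first candidate queue is exhausted before the second one is started.
crawl_order = first_queue + second_queue
print(crawl_order)
```

Keeping the two queues separate lets the crawler spend its effort on URLs that are already known to match the target topic before it turns to URLs of unknown pattern.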

In general, target webpage data within a preset range related to the target topic can be obtained through the URLs to be crawled in the first candidate queue. The first candidate queue is therefore determined first, according to the URL set to be crawled and the knowledge pattern base, so as to improve the accuracy and pertinence of the acquired webpage data.

As described above, for any URL to be crawled, whether it belongs to the first candidate queue must be determined according to the regular patterns in the knowledge pattern base. Therefore, before the queue to be crawled is determined according to the URL set to be crawled and the knowledge pattern base, the knowledge pattern base itself needs to be determined.

In a possible design, a method for determining a knowledge pattern base is shown in fig. 3, where fig. 3 is a flowchart illustrating a method for determining a knowledge pattern base according to an embodiment of the present application, where the method includes:

S301: Judge whether the regular pattern of each URL in the preset URL set is known.

The webpage data pointed by each URL in the preset URL set and the target webpage data have topic relevance.

A preset URL set is acquired from a data analysis unit, where the webpage data pointed to by each URL in the preset URL set has topic relevance to the target webpage data. It can be understood that the degree of topic relevance may be determined according to the topic of the target webpage data to be acquired, which is not limited in this embodiment of the present application. The data analysis unit is a unit that precedes the webpage data acquisition step in the overall engineering system for acquiring webpage data, and preset URLs having a topic relation to the target webpage data can be obtained from it.

In this step, for each URL in the preset URL set, it is first judged whether the regular pattern of the URL is known, where "known" can be understood as "determined". When the regular pattern is known, that is, determined, step S302 is executed; when it is unknown, that is, undetermined, step S303 is executed.

S302: If so, determine the knowledge pattern base according to the regular pattern.

When the regular pattern of a URL in the preset URL set is known, the knowledge pattern base is constructed according to the regular pattern of the URL; that is, the knowledge pattern base is determined according to the regular pattern.

S303: If not, add the URL to the URL set to be learned, and determine the knowledge pattern base according to the URL set to be learned.

When the regular pattern of a URL in the preset URL set is unknown, the URL is added to the URL set to be learned, and the knowledge pattern base is then determined according to the URL set to be learned. Specifically, learning may be performed once each time N URLs with unknown regular patterns have been added to the URL set to be learned, so as to determine a new regular pattern and add it to the knowledge pattern base. The learning described here may be understood as inducing, from the URLs, new regular patterns associated with the target topic. N is any positive integer greater than 1, which is not limited in the embodiments of the present application.
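The batching behaviour of steps S301 to S303 can be sketched as follows. This is only an illustration under stated assumptions: regular patterns are modelled as compiled regular expressions, already-known patterns are supplied up front as seed patterns, and the learning step is a stand-in that generalizes each URL to its host plus first path segment. The application does not specify the learning procedure, so `KnowledgePatternBase`, `observe` and `learn_prefix_patterns` are hypothetical names.

```python
import re
from typing import Callable, List, Pattern
from urllib.parse import urlsplit


def learn_prefix_patterns(urls: List[str]) -> List[Pattern[str]]:
    """Stand-in learner (assumption): one pattern per (host, first path segment)."""
    prefixes = set()
    for url in urls:
        parts = urlsplit(url)
        segment = parts.path.strip("/").split("/")[0]
        prefixes.add((parts.netloc, segment))
    return [
        re.compile(r"^https?://" + re.escape(host) + "/" + re.escape(seg) + r"(/|$)")
        for host, seg in sorted(prefixes)
    ]


class KnowledgePatternBase:
    def __init__(
        self,
        seed_patterns: List[Pattern[str]],
        batch_size: int,
        learn: Callable[[List[str]], List[Pattern[str]]],
    ) -> None:
        self.patterns = list(seed_patterns)
        self.batch_size = batch_size      # the N of the description, N > 1
        self.learn = learn
        self.to_learn: List[str] = []     # URL set to be learned

    def is_known(self, url: str) -> bool:
        return any(p.search(url) for p in self.patterns)

    def observe(self, url: str) -> None:
        if self.is_known(url):
            return                        # pattern already in the base
        self.to_learn.append(url)         # pattern unknown: queue the URL for learning
        if len(self.to_learn) >= self.batch_size:
            self.patterns.extend(self.learn(self.to_learn))
            self.to_learn.clear()


base = KnowledgePatternBase(
    [re.compile(r"^https://a\.example/tech/")], batch_size=2, learn=learn_prefix_patterns
)
for u in ["https://a.example/tech/1", "https://b.example/ai/1", "https://b.example/ml/7"]:
    base.observe(u)
print(base.is_known("https://b.example/ai/2"))
```

Learning in batches of N rather than per URL gives the learner several examples to generalize from; any real learner would replace `learn_prefix_patterns`.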

In the method for determining a knowledge pattern base provided by this embodiment, it is judged whether the regular pattern of each URL in the preset URL set is known. When it is known, the regular pattern is added directly to the knowledge pattern base. When it is unknown, the URL is added to the URL set to be learned; once the number of newly added URLs with unknown regular patterns reaches a certain number, these URLs are learned to determine new regular patterns, which are added to the knowledge pattern base. The knowledge pattern base is thereby determined. It provides explicit knowledge for determining the queue to be crawled in the subsequent crawling strategy stage, so that whether a URL to be crawled belongs to the first candidate queue can be determined according to its regular pattern, which improves the accuracy and pertinence of the acquired target webpage data.

On the basis of the above embodiment, in order to avoid a system crash caused by excessive memory usage while the target webpage data is being acquired, the length of the queue to be crawled needs to be bounded; in other words, an upper limit needs to be set on the length of the queue to be crawled. Within the queue to be crawled, the URLs to be crawled are arranged in a definite order, and the webpage data pointed to by the sub-URLs included in the webpage data pointed to by each URL to be crawled is acquired in that order. This reduces the time complexity and space complexity of acquiring the target webpage data and speeds up the acquisition process.

In a possible design, after acquiring the webpage data pointed by the first URL, the method further includes the steps shown in fig. 4, where fig. 4 is a schematic flow chart of another method for acquiring webpage data according to an embodiment of the present application, and as shown in fig. 4, the method includes:

S401: Determine the priority of the first URL.

Wherein the priority is used to indicate an order of arrangement of the first URL in the first candidate queue.

As described above, an upper limit needs to be set on the length of the queue to be crawled. For example, the limit may be set to 30 URLs, meaning that the queue to be crawled holds 30 URLs to be crawled; this is not limited in this embodiment of the present application. Within the queue to be crawled, the arrangement order of the URLs to be crawled therefore needs to be determined, that is, the priority of each URL to be crawled is determined. For the first candidate queue, the priority of the first URL is determined, and the priority is used to indicate the arrangement order of the first URL in the first candidate queue.

Similarly, for the second candidate queue, a priority of the second URL is determined, and the priority is used for indicating the arrangement order of the second URL in the second candidate queue.

Taking the first candidate queue as an example, the priority of the first URL may be determined, for example, according to the relevance of the URL to be crawled to the target topic, or according to the possible length of the crawling path of the URL to be crawled, which is not limited in the embodiments of the present application.

One possible implementation is to determine the priority of the first URL through a preset algorithm according to the webpage keywords and the depth attribute value. For example, the preset algorithm can be expressed by the following formula:

$$\mathrm{priority}(\mathrm{URL}) = \alpha_1 \sum_i \mathrm{keywordPriority}_i + \alpha_2 \cdot \mathrm{depth}$$

where $\mathrm{priority}(\mathrm{URL})$ represents the priority of the first URL; $\sum_i \mathrm{keywordPriority}_i$ is the sum of the weight values of the relevance between the webpage keywords included in the webpage data pointed to by the first URL and the target topic, and can represent the possible relevance of the first URL to the target topic; $\mathrm{depth}$ represents the depth attribute value of the first URL; and $\alpha_1$ and $\alpha_2$ are the weights of the two terms, which can be determined according to the specific URL to be crawled and the target topic, which is not limited in the embodiments of the present application.
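The weighted sum above can be evaluated directly. The following sketch computes the priority of each URL and orders a candidate queue by it; the function names, the example weights and the choice to place higher priorities first are assumptions made for illustration, since the application leaves the weights and the ordering direction open.

```python
from typing import List, Sequence, Tuple


def url_priority(
    keyword_weights: Sequence[float], depth: float, alpha1: float, alpha2: float
) -> float:
    """Weighted sum of the keyword relevance weights and the depth attribute value."""
    return alpha1 * sum(keyword_weights) + alpha2 * depth


def order_queue(
    entries: List[Tuple[str, Sequence[float], float]], alpha1: float, alpha2: float
) -> List[str]:
    """Order (url, keyword_weights, depth) entries, highest priority first (assumed)."""
    ranked = sorted(
        entries,
        key=lambda e: url_priority(e[1], e[2], alpha1, alpha2),
        reverse=True,
    )
    return [url for url, _, _ in ranked]


# Hypothetical entries: (URL, keyword relevance weights, depth attribute value).
entries = [
    ("https://example.com/u1", [0.25], 1),
    ("https://example.com/u2", [0.5, 0.25], 2),
]
ordered = order_queue(entries, alpha1=2.0, alpha2=1.0)
print(ordered)
```

With these example weights, the second URL scores 2.0 × 0.75 + 1.0 × 2 = 3.5 and the first scores 2.0 × 0.25 + 1.0 × 1 = 1.5, so the second URL is placed ahead of the first.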

Similarly, the priority of the second URL in the second candidate queue may also be determined by the above formula, which is not described herein again.

S402: Acquire the webpage data pointed to by the sub-URLs according to the arrangement order.

After the arrangement sequence of the first URL in the first candidate queue is determined, further acquiring webpage data pointed by sub URLs included in the webpage data pointed by the first URL according to the arrangement sequence, and finally determining target webpage data.

Similarly, for the second candidate queue, after the ranking order of the second URL in the second candidate queue is determined, the web page data pointed by the sub URL included in the web page data pointed by the second URL is further acquired according to the ranking order, and the target web page data is finally determined.

According to the webpage data acquiring method provided by the embodiment, after the webpage data pointed by the first URL are acquired, the priority of the first URL is determined, the priority is used for indicating the arrangement sequence of the first URL in the first candidate queue, and the webpage data pointed by the sub URLs are further acquired according to the arrangement sequence, so that the crawling path determined according to the first URL in the first candidate queue is orderly performed, the time complexity and the space complexity of acquiring the target webpage data can be reduced, and the process of acquiring the target webpage data is accelerated.

Similarly, for the second candidate queue, after acquiring the webpage data pointed by the second URL, the priority of the second URL is determined, and the webpage data pointed by the child URLs of the second candidate queue is further acquired according to the ranking order, so that the crawling path determined according to the second URL in the second candidate queue is performed in order, the time complexity and the space complexity of acquiring the target webpage data can be reduced, and the process of acquiring the target webpage data is accelerated.

Fig. 5 is a schematic structural diagram of a web page data acquiring apparatus according to an embodiment of the present application, and as shown in fig. 5, the web page data acquiring apparatus 500 according to the embodiment includes:

The first processing module 501 is configured to determine a queue to be crawled according to a URL set to be crawled and a knowledge pattern base.

The queue to be crawled comprises a first candidate queue, the first candidate queue comprises at least one first URL, and the first URL is a URL whose regular pattern is known. The regular pattern is used to represent the topic attribute of the webpage data pointed to by the URL to be crawled. The URL to be crawled comprises a depth attribute value, which is used to represent the closeness between the webpage data pointed to by the URL to be crawled and its parent webpage data.

The obtaining module 502 is configured to obtain the webpage data pointed to by the first URL.

The updating module 503 is configured to update the sub-depth attribute value according to the relevance between the webpage keywords and the target topic.

The second processing module 504 is configured to stop acquiring the webpage data pointed by the sub URLs when the sub-depth attribute value is zero, and determine the webpage data pointed by all sub URLs with the sub-depth attribute value being nonzero and the webpage data pointed by the first URL as target webpage data.

The implementation principle and technical effect of the webpage data obtaining apparatus 500 provided in this embodiment are similar to those of the method embodiment shown in fig. 2, and are not described herein again.

Optionally, the updating module 503 is specifically configured to:

if the correlation degree is larger than a preset threshold value, resetting the sub-depth attribute value to the maximum value so as to update the sub-depth attribute value;

and if the correlation degree is not greater than the preset threshold value, attenuating the sub-depth attribute value once to update the sub-depth attribute value.
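The two branches above amount to a small update function. The sketch below is one possible reading: it assumes that "attenuating once" means subtracting a fixed step (1 by default) and that the value never drops below zero; the function and parameter names are hypothetical.

```python
def update_sub_depth(
    sub_depth: int, relevance: float, threshold: float, max_depth: int, decay: int = 1
) -> int:
    """Reset the sub-depth attribute value on a relevant page, otherwise attenuate it once."""
    if relevance > threshold:
        return max_depth                  # relevant: reset to the maximum value
    return max(sub_depth - decay, 0)      # not relevant: attenuate once (assumed step of 1)


print(update_sub_depth(1, relevance=0.9, threshold=0.5, max_depth=3))
print(update_sub_depth(3, relevance=0.5, threshold=0.5, max_depth=3))
```

A relevance exactly equal to the threshold falls into the second branch, because the reset applies only when the relevance is greater than the threshold.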

In one possible design, the first processing module 501 is further configured to:

determining a second candidate queue, wherein the second candidate queue comprises at least one second URL, and the second URL is a URL with an unknown regular pattern;

In one possible design, the web page data obtaining apparatus 500 further includes a third processing module 505, configured to:

judging whether the regular pattern of each URL in the preset URL set is known, wherein the webpage data pointed to by each URL in the preset URL set has topic relevance to the target webpage data;

if so, determining the knowledge pattern base according to the regular pattern;

if not, adding the URL to the URL set to be learned, and determining the knowledge pattern base according to the URL set to be learned.

The implementation principle and technical effect of the webpage data obtaining apparatus 500 provided in this embodiment are similar to those of the method embodiment shown in fig. 3, and are not described herein again.

In one possible design, the web page data obtaining apparatus 500 further includes a fourth processing module 506, configured to:

determining the priority of the first URL, wherein the priority is used for indicating the arrangement sequence of the first URL in the first candidate queue;

Optionally, the fourth processing module 506 is further configured to:

determining the priority of the second URL, wherein the priority is used for indicating the arrangement sequence of the second URL in the second candidate queue;

The implementation principle and technical effect of the webpage data obtaining apparatus 500 provided in this embodiment are similar to those of the method embodiment shown in fig. 4, and are not described herein again.

Optionally, the fourth processing module 506 is specifically configured to:

determining the priority of the first URL through a preset algorithm according to the webpage keywords and the depth attribute value.

Optionally, the fourth processing module 506 is further specifically configured to:

and determining the priority of the second URL through a preset algorithm according to the webpage keywords and the depth attribute value.

Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 6, an electronic device 600 according to the embodiment includes:

at least one processor 601; and

a memory 602 communicatively coupled to the at least one processor 601; wherein

the memory 602 stores instructions executable by the at least one processor 601, where the instructions are executed by the at least one processor 601, so that the at least one processor 601 can perform the steps of the web page data obtaining method in the foregoing embodiments, and reference may be made to the relevant description in the foregoing method embodiments.

In an exemplary embodiment, the present application provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the steps of the web page data acquisition method in the above embodiments. For example, the readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims

1. A method for acquiring webpage data is characterized by comprising the following steps:

2. The method for acquiring webpage data according to claim 1, wherein the updating the sub-depth attribute value according to the relevance between the webpage keyword and the target topic comprises:

3. The method for acquiring webpage data according to claim 1, wherein the queue to be crawled further comprises:

4. The method for acquiring webpage data according to claim 2, wherein before determining the queue to be crawled according to the set of URLs to be crawled and the knowledge pattern base, the method further comprises:

5. The method for acquiring webpage data according to claim 4, further comprising, after acquiring the webpage data pointed by the first URL:

6. The method for acquiring webpage data according to claim 5, wherein the determining the priority of the first URL includes:

7. A web page data acquisition apparatus, comprising:

8. The device for acquiring webpage data according to claim 7, wherein the update module is specifically configured to:

9. An electronic device, comprising:

at least one processor; and

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the web page data acquisition method of any one of claims 1-6.

10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the web page data acquisition method according to any one of claims 1 to 6.