CN110968770B - Method and device for stopping crawling of crawler tool - Google Patents

Info

Publication number
CN110968770B
Authority
CN
China
Prior art keywords
crawling
data
crawled
target data
crawler tool
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811145418.6A
Other languages
Chinese (zh)
Other versions
CN110968770A (en)
Inventor
张鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201811145418.6A
Publication of CN110968770A
Application granted
Publication of CN110968770B
Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for stopping crawling of a crawler tool, which are used for solving the problem that data crawled by the crawler tool are inaccurate when the crawler tool crawls according to different crawling tasks. The method comprises the following steps: obtaining a crawling result of a crawler tool; judging whether the crawling result meets a termination condition or not, wherein the termination condition can be configured according to crawling requirements; and if the crawling result meets the termination condition, controlling the crawler tool to finish crawling.

Description

Method and device for stopping crawling of crawler tool
Technical Field
The invention relates to the technical field of data crawling, in particular to a method and a device for stopping crawling of a crawler tool.
Background
A web crawler, also called a web spider or web robot, is a program or script that automatically captures web information according to certain rules.
While crawling data, a web crawler finishes crawling according to a termination condition. For example, it decides whether to finish based on whether the page has finished loading, on the number of page turns, or on the crawling depth.
However, traditional termination conditions are rigid. When the web crawler crawls according to different crawling tasks, it may over-crawl, collecting data outside the target data, or under-crawl, missing part of the target data, so the data it crawls is inaccurate.
Disclosure of Invention
In view of the above problems, an object of an embodiment of the present invention is to provide a method and an apparatus for terminating crawling of a crawler tool, so as to solve the problem that when the crawler tool performs crawling according to different crawling tasks, data crawled by the crawler tool is inaccurate.
In order to solve the technical problems, the embodiment of the invention provides the following technical scheme:
In a first aspect, an embodiment of the present invention provides a method for terminating crawling of a crawler tool, the method including: obtaining a crawling result of a crawler tool; judging whether the crawling result meets a termination condition, wherein the termination condition can be configured according to crawling requirements; and if the crawling result meets the termination condition, controlling the crawler tool to finish crawling.
In other embodiments of the present invention, the determining whether the crawling result meets a termination condition includes: when the crawling result comprises crawled data, judging whether the crawled data comprises target data or not; if so, judging that the crawling result meets a termination condition; and/or when the crawling result comprises crawling parameters, judging whether the value of the crawling parameters reaches a preset value or not; and if so, judging that the crawling result meets the termination condition.
In other embodiments of the present invention, the determining whether the crawled data includes target data includes: respectively obtaining the characteristics of the crawled data and the characteristics of the target data; determining whether the crawled data comprises the target data according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result shows that the characteristics of the crawled data are matched with the characteristics of the target data, determining that the crawled data comprise the target data; and if the comparison result shows that the characteristics of the crawled data are not matched with the characteristics of the target data, determining that the crawled data do not comprise the target data.
In other embodiments of the present invention, the determining whether the crawled data includes target data includes: determining whether the crawled data comprises the target data according to whether preset content is obtained, wherein the preset content is generated in a current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises the target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
In other embodiments of the present invention, the preset content includes: at least one of a page element and page request data.
In other embodiments of the present invention, the crawling parameters include: at least one of single page operation times, page turning times, crawling path depth, the number of pages to be crawled by the crawler tool at the next node and the ratio of the number of nodes behind the current node of the crawler tool to the total number of nodes.
In a second aspect, an embodiment of the present invention provides an apparatus for terminating crawling of a crawler tool, the apparatus comprising: an acquisition module configured to obtain a crawling result of the crawler tool; the judging module is configured to judge whether the crawling result meets a termination condition or not, and the termination condition can be configured according to crawling requirements; and the control module is configured to control the crawler tool to finish crawling if the crawling result meets the termination condition.
In other embodiments of the present invention, the judging module is configured to judge, when the crawling result includes crawled data, whether the crawled data includes target data, and if so, to judge that the crawling result meets the termination condition; and/or, when the crawling result includes crawling parameters, to judge whether the value of the crawling parameters reaches a preset value, and if so, to judge that the crawling result meets the termination condition.
In other embodiments of the present invention, the judging module is configured to obtain the characteristics of the crawled data and the characteristics of the target data, respectively; to determine whether the crawled data includes the target data according to a result of comparing the characteristics of the crawled data with the characteristics of the target data; if the comparison result shows that the characteristics of the crawled data match the characteristics of the target data, to determine that the crawled data includes the target data; and if the comparison result shows that they do not match, to determine that the crawled data does not include the target data.
In other embodiments of the present invention, the judging module is configured to determine whether the crawled data includes the target data according to whether preset content is obtained, where the preset content is generated in the current page after the crawler tool crawls the target data; if the preset content is obtained, to determine that the crawled data includes the target data; and if the preset content is not obtained, to determine that the crawled data does not include the target data.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method described in one or more of the above technical solutions when executing the program.
In a fourth aspect, embodiments of the present invention provide a computer readable storage medium, on which a computer program is stored, the program, when executed by a processor, implementing the method described in one or more of the above technical solutions.
An embodiment of the invention provides a method and a device for stopping crawling of a crawler tool. First, a crawling result of the crawler tool is obtained; then, it is judged whether the crawling result meets a termination condition configured according to the crawling requirements; finally, if the crawling result meets the termination condition, the crawler tool is controlled to finish crawling. Because the termination condition is configured according to the crawling requirements, the crawler tool finishes crawling once those requirements are met. This prevents the crawler tool from over-crawling data that the crawling requirements do not need or under-crawling data that they do need, and so improves the accuracy of the crawled data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of terminating crawling of a crawler tool in an embodiment of the present invention;
FIG. 2 is a schematic diagram of an apparatus for terminating crawling of a crawler tool according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. Other embodiments may be made by those of ordinary skill in the art without undue burden from these embodiments.
An embodiment of the invention provides a method for stopping crawling of a crawler tool. In practice, the method can be applied to purposeful crawling, for example crawling stock information, weather information, or news. During purposeful crawling, a termination condition can be configured according to the crawling requirements, so that the crawler tool ends crawling when its crawling result satisfies those requirements, which improves the accuracy of the data crawled.
A method for terminating crawling of a crawler tool according to an embodiment of the present invention will be described with reference to fig. 1.
FIG. 1 is a flow chart of a method for terminating crawling of a crawler tool according to an embodiment of the present invention. Referring to FIG. 1, the method includes:
S110: Obtain a crawling result of the crawler tool.
The execution subject of the method may be the crawler tool itself, where the crawler tool is any tool with a data crawling function, for example a web crawler. Using the crawler tool as the execution subject improves its crawling performance, that is, the accuracy of the data it crawls. Alternatively, the execution subject may be a computer program other than the crawler tool, which likewise improves the accuracy of the crawled data while preventing the crawler tool from occupying excessive space resources.
Here, the crawler tool crawls according to a crawling task. The crawling result may be the data that the crawler tool has crawled according to the crawling task, or the crawling parameters recorded while it crawls, for example one or more of: the number of operations on a single page, the number of page turns, the crawling path depth, the number of pages to be crawled at the next node, and the ratio of the number of nodes after the crawler tool's current node to the total number of nodes.
S120: Judge whether the crawling result meets the termination condition.
The termination condition can be configured each time before the crawler tool crawls according to a crawling task, so that the data crawled for each crawling task meets that task's requirements. The termination condition can be taken from a termination condition pre-configured in the crawling task currently to be executed; when the crawling task contains no pre-configured termination condition, it can be configured according to the crawling task; it may also be configured directly according to the crawling requirements. The reference used when configuring the termination condition is not specifically limited herein.
Specifically, a plurality of termination conditions may be preset in the crawler tool, and before crawling, the user selects one or more of them according to the crawling task. For example, termination conditions A, B, and C are preset in the crawler tool, and the termination condition pre-configured in crawling task D is termination condition A. Before the crawler tool crawls according to crawling task D, the user can select termination condition A as the termination condition for this crawl, so that the data crawled by the crawler tool is the data required by crawling task D.
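The selection of preset termination conditions described above can be sketched as follows. This is only an illustrative sketch: the registry name `PRESET_CONDITIONS`, the function `select_conditions`, the dictionary keys, and the concrete thresholds behind conditions A, B, and C are all hypothetical, and combining several selected conditions with a logical OR is an assumption rather than something the description specifies.

```python
from typing import Callable, Dict, List

# A crawling result is modelled here as a plain dict (an assumption for illustration).
Condition = Callable[[dict], bool]

# Hypothetical preset termination conditions A, B and C.
PRESET_CONDITIONS: Dict[str, Condition] = {
    "A": lambda result: result.get("page_turns", 0) >= 20,
    "B": lambda result: result.get("depth", 0) >= 3,
    "C": lambda result: "target" in result.get("data", ()),
}


def select_conditions(names: List[str]) -> Condition:
    """Build one termination condition from the presets the user selected."""
    chosen = [PRESET_CONDITIONS[name] for name in names]
    # Assumption: crawling stops as soon as any selected condition is met.
    return lambda result: any(condition(result) for condition in chosen)


# Crawling task D is pre-configured with termination condition A.
condition_for_task_d = select_conditions(["A"])
```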
S130: If the crawling result meets the termination condition, control the crawler tool to finish crawling.
Judging whether the crawling result meets the termination condition may consist of matching the crawling result against the termination condition. If all the data in the crawling result falls within the termination condition, the match succeeds, the crawling result is determined to meet the termination condition, and the crawler tool is controlled to finish crawling.
In addition, if the crawling result does not meet the termination condition, the judgment can be repeated, which avoids failing to stop the crawler tool because the first judgment was wrong. Alternatively, the crawler tool can be controlled to continue crawling, its crawling result re-acquired after an interval, and the re-acquired result judged against the termination condition, so that the crawler tool does not crawl indefinitely.
Note that until crawling is finished, the crawler tool crawls continuously according to the crawling task. Controlling the crawler tool to continue crawling therefore means simply not controlling it to end crawling, so that it can keep crawling data according to the crawling task.
In practical application, firstly, a user configures a termination condition according to crawling requirements; then, the crawler tool performs crawling according to the crawling task; then, the crawler tool obtains a crawling result and judges whether the crawling result meets a termination condition; and finally, if the crawling result meets the termination condition, finishing crawling by the crawler tool.
Therefore, the termination condition can be flexibly configured according to the crawling requirements before the crawler tool crawls according to the crawling task, and the crawler tool finishes crawling when its crawling result meets the termination condition, that is, once the crawling requirements are met. This prevents the crawler tool from over-crawling data that the crawling requirements do not need or under-crawling data that they do need, and improves the accuracy of the crawled data.
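Steps S110 to S130 can be sketched as a simple loop. This is a minimal sketch under stated assumptions, not the patented implementation: the crawler is simulated by iterating over a list of page names, and the names `crawl_until` and `TerminationCondition` are hypothetical.

```python
from typing import Callable, Iterable, List

# Assumption: the crawling result is the list of items crawled so far.
TerminationCondition = Callable[[List[str]], bool]


def crawl_until(pages: Iterable[str], should_stop: TerminationCondition) -> List[str]:
    """Crawl page by page and stop as soon as the configured condition is met."""
    crawled: List[str] = []
    for page in pages:
        crawled.append(page)      # S110: obtain the latest crawling result
        if should_stop(crawled):  # S120: judge it against the termination condition
            break                 # S130: control the crawler to finish crawling
    return crawled


# The user configures the condition before crawling: stop after three pages.
result = crawl_until(["p1", "p2", "p3", "p4", "p5"], lambda crawled: len(crawled) >= 3)
```

Because the condition is passed in as a parameter, a different crawling task can supply a different condition without changing the loop itself.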
Based on the foregoing embodiments, in order to judge more accurately and conveniently whether the crawling result satisfies the termination condition, S120 further includes:
S121: Judge whether the crawled data includes target data;
S122: Judge whether the value of the crawling parameter reaches a preset value.
S121 and S122 have no fixed order; both may be executed, or only one of them.
Specifically, when the termination condition configured for the crawler tool is a data termination condition, the crawling result may be the data that the crawler tool has crawled, page request data, or an element in a page. If the crawled data includes the target data, or the request data appears in the page, or a certain element disappears from the page, the crawling result is judged to meet the termination condition. In this way, whether the crawling result meets the termination condition can be judged accurately and conveniently.
In general, the data required by a crawling task constitutes the target data. By judging whether the crawled data includes the target data, it can be determined whether the data currently crawled by the crawler tool is the data required by the crawling task, and hence whether to control the crawler tool to finish crawling.
For example, suppose the data required by the crawling task is A and B. If the data currently crawled by the crawler tool is A, it is not yet all the data required by the crawling task, so the crawler tool is not controlled to finish crawling and continues crawling. If the data currently crawled is A and B, it is all the data required by the crawling task, so the crawler tool is controlled to finish crawling. If the data currently crawled is C, it is not data required by the crawling task, and prompt information is generated to remind the user of this.
Furthermore, when the termination condition configured for the crawler tool is a behavior termination condition, the crawling result may be a crawling parameter, that is, data related to the operations performed by the crawler tool during crawling, for example at least one of: the number of operations on a single page, the number of page turns, the crawling path depth, the number of pages to be crawled at the next node, and the ratio of the number of nodes after the crawler tool's current node to the total number of nodes. If the crawling parameter reaches the preset value specified in the termination condition, the crawling result is judged to meet the termination condition. In this way, whether the crawling result meets the termination condition can be judged accurately and conveniently.
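The data termination condition (S121) and the behavior termination condition (S122) can be sketched together as below. The class `CrawlResult`, the function names, and the parameter keys are hypothetical. Two further assumptions are made for illustration: "includes the target data" is read as "every required item has been crawled", and when both kinds of condition are configured, meeting either one is enough.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class CrawlResult:
    crawled_data: Set[str] = field(default_factory=set)
    parameters: Dict[str, float] = field(default_factory=dict)


def data_condition_met(result: CrawlResult, target_data: Set[str]) -> bool:
    """S121: the crawled data includes all of the target data."""
    return target_data <= result.crawled_data


def behavior_condition_met(result: CrawlResult, preset_values: Dict[str, float]) -> bool:
    """S122: some crawling parameter has reached its preset value."""
    return any(
        result.parameters.get(name, 0) >= limit
        for name, limit in preset_values.items()
    )


def satisfies_termination(
    result: CrawlResult,
    target_data: Optional[Set[str]] = None,
    preset_values: Optional[Dict[str, float]] = None,
) -> bool:
    """Evaluate whichever of S121 and S122 has been configured."""
    met = False
    if target_data is not None:
        met = met or data_condition_met(result, target_data)
    if preset_values is not None:
        met = met or behavior_condition_met(result, preset_values)
    return met
```

With target data A and B, a result containing only A does not end crawling, while a result containing both does, matching the example in the text.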
Based on the foregoing embodiments, in order to make the data crawled by the crawler tool more accurate, S121 further includes:
S1211a: Obtain the characteristics of the crawled data and the characteristics of the target data, respectively.
The characteristics of the crawled data are obtained from the crawled data, which is the data crawled by the crawler tool according to the crawling task; each time the crawler tool crawls a piece of data, the characteristics of that piece of data can be obtained. The characteristics of the target data are obtained from the target data, which is the data required by the crawling task; they are the characteristics common to the data required by the crawling task.
Here, a characteristic is a quantity that indicates an attribute of the data, for example the time the data was generated or the type of the data.
S1211b: Determine whether the crawled data includes the target data according to the result of comparing the characteristics of the crawled data with the characteristics of the target data.
The comparison can be performed by the crawler tool, which improves how promptly the crawler tool processes data, or by a computer program other than the crawler tool, which prevents the crawler tool from occupying excessive space resources. The execution subject of the comparison is not particularly limited herein.
Specifically, if the characteristics of the crawled data match the characteristics of the target data, the data currently crawled by the crawler tool is determined to be target data and is stored; if they do not match, the data currently crawled is determined not to be target data and is deleted.
Taking forum posts as an example, after the crawler tool has crawled for a period of time according to the crawling task, if the number of posts it has stored equals the number of posts required by the crawling task, the crawled data is determined to include the target data. If instead the crawler tool has stored no posts, or significantly fewer posts than the crawling task requires, the crawled data is determined not to include the target data.
Therefore, the comparison result makes it possible to determine accurately whether the crawled data includes the target data, which more reliably prevents the crawler tool from over-crawling data unrelated to the target data or under-crawling the target data, so that the crawled data is more accurate.
S121 is illustrated below with a specific example.
The crawling task requires all posts published in August 2018 in a certain automobile forum. The crawler tool crawls the posts in the forum according to the crawling task, and each time it crawls a post it obtains the post's posting time. There are two cases:
In the first case, the posting time of the post is August 2018, which is the same as the posting time required by the crawling task, so the post is determined to be a post required by the crawling task and is stored;
In the second case, the posting time of the post is not August 2018, for example July 2018, which differs from the posting time required by the crawling task, so the post is determined not to be a post required by the crawling task and is deleted.
Then, after the crawler tool has crawled for a period of time according to the crawling task, if the number of posts it has stored equals the number of posts required by the crawling task, the crawled data is determined to include the target data, where the number of posts required by the crawling task can be preset or contained in the termination condition. If instead the crawler tool has stored no posts, or significantly fewer posts than the crawling task requires, the crawled data is determined not to include the target data.
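The forum example can be sketched as follows, using the posting month as the compared characteristic. The names `Post`, `matches_target`, `filter_posts`, and `includes_target_data` are hypothetical, and the sample posts are invented purely for illustration.

```python
from datetime import date
from typing import List, NamedTuple, Tuple


class Post(NamedTuple):
    title: str
    posted: date


def matches_target(post: Post, target_month: Tuple[int, int]) -> bool:
    """S1211a/S1211b: compare the post's characteristic (year, month) with the target's."""
    return (post.posted.year, post.posted.month) == target_month


def filter_posts(posts: List[Post], target_month: Tuple[int, int]) -> List[Post]:
    """Store matching posts and drop the rest."""
    return [post for post in posts if matches_target(post, target_month)]


def includes_target_data(stored: List[Post], required_count: int) -> bool:
    """The crawled data includes the target data once the stored count equals the required count."""
    return len(stored) == required_count


crawled = [
    Post("first", date(2018, 8, 3)),
    Post("second", date(2018, 7, 30)),
    Post("third", date(2018, 8, 15)),
]
stored = filter_posts(crawled, (2018, 8))
```

Here the July post is discarded, and with a required count of two the stored posts satisfy the condition.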
Based on the foregoing embodiments, in order to determine conveniently whether the crawled data includes the target data, S121 further includes:
S1212a: Determine whether the crawled data includes the target data according to whether preset content is obtained.
The preset content is generated in the current page after the crawler tool crawls the target data. That is, the page generates the preset content only after the crawler tool has crawled the target data, so the crawler tool can learn from the preset content that the target data has been crawled.
Here, the preset content may be page request data, that is, a Request; page response data, that is, a Response; or a specific element on the page. The form of the preset content is not particularly limited herein.
Specifically, if, while the crawler tool crawls the page according to the crawling task, the crawler tool or a computer program other than the crawler tool obtains the preset content, it can be determined that the crawler tool has crawled the target data, that is, the crawled data includes the target data. If the preset content has still not been obtained after crawling for a period of time, it can be determined that the crawler tool has not crawled the target data, that is, the crawled data does not include the target data.
Therefore, whether the crawled data comprise target data can be conveniently determined by confirming whether the page generates preset content, and the situation that the crawler tool crawls more data irrelevant to the target data or crawls less part of data in the target data can be avoided.
Based on the foregoing embodiments, in order to confirm more conveniently whether the page has generated the preset content, the preset content further includes at least one of a page element and page request data.
The page element may be an element in a page associated with the target data; when the target data is crawled, the page element may appear where it did not previously exist, or disappear where it previously existed. Likewise, the page request data may be request data in a page associated with the target data, generated in the current page when the target data is crawled.
Specifically, it may be determined that the crawler tool has crawled the target data from a page element appearing on the page, or from a page element disappearing from the page. Here, the page element may be a symbol or a piece of information, which is not particularly limited herein. Whether the page has generated the preset content can thus be confirmed conveniently by judging whether a page element has appeared on or disappeared from the page.
For example, when the crawling task requires all posts published in August 2018 on a certain automobile forum, and the crawler tool has crawled all of them, the page may generate a prompt message indicating that all posts published in August 2018 have been crawled, or the elements on the page bearing the words "August 2018" may disappear. Either event clearly and conveniently confirms that the page has generated the preset content.
Furthermore, whether the page has generated the preset content can be determined according to whether page request data is generated on the page. Here, page request data may also refer to page response data.
Specifically, the page request data is related to the crawling task. When the target data is crawled, the page generates the page request data, and the crawler tool can intercept it to determine that the page has generated the preset content. When the crawler tool has not crawled the target data, the page generates no page request data, so there is nothing for the crawler tool to intercept, and it is determined that the page has not generated the preset content. Therefore, whether the page has generated the preset content can be confirmed clearly and conveniently by whether page request data is generated on the page.
For example, when the crawling task requires all posts published in August 2018 on a certain automobile forum, and the crawler tool has crawled all of them, the page generates page request data asking the user whether to continue crawling posts published outside August 2018. Once this page request data is acquired, it can be clearly and conveniently confirmed that the page has generated the preset content.
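Detecting preset content can be sketched as below. The sketch assumes the page is represented by two snapshots of its element identifiers (before and after a crawl step) and that intercepted requests are available as a list of URLs. The function names, the element identifiers, and the example URL are all hypothetical, and no real browser-automation API is used.

```python
from typing import Iterable, Optional, Set


def element_changed(before: Set[str], after: Set[str], element: str) -> bool:
    """True if the watched page element appeared or disappeared between two snapshots."""
    return (element in before) != (element in after)


def request_observed(intercepted_urls: Iterable[str], marker: str) -> bool:
    """True if any intercepted page request contains the expected marker."""
    return any(marker in url for url in intercepted_urls)


def preset_content_obtained(
    before: Set[str],
    after: Set[str],
    element: Optional[str] = None,
    intercepted_urls: Iterable[str] = (),
    marker: Optional[str] = None,
) -> bool:
    """Preset content is a page element, page request data, or both."""
    if element is not None and element_changed(before, after, element):
        return True
    if marker is not None and request_observed(intercepted_urls, marker):
        return True
    return False
```

For instance, a snapshot pair in which a hypothetical "aug-2018-label" element disappears would count as preset content, as would an intercepted request whose URL contains a marker chosen for the crawling task.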
Based on the foregoing embodiments, in order to make the data crawled by the crawler tool more accurate, the crawling parameters further include at least one of: the number of single-page operations, the number of page turns, the crawling path depth, the number of pages the crawler tool needs to crawl at the next node, and the ratio of the number of nodes after the crawler tool's current node to the total number of nodes.
In particular, when the crawling parameter is the number of single-page operations, the number of operations may be the number of clicks the crawler tool makes on page elements. When the crawler tool crawls according to the crawling task, if its number of clicks on the page is greater than 50, indicating that the crawler tool has crawled the page sufficiently, it is determined that the value of the crawling parameter is greater than the preset value; if its number of clicks on the page is less than or equal to 50, indicating that the crawler tool has not crawled the page sufficiently, it is determined that the value of the crawling parameter is less than or equal to the preset value.
Here, the number of clicks of the crawler tool on the page element of the current page can be easily obtained through the operation behavior of the crawler tool, so that whether the value of the crawling parameter is larger than the preset value can be conveniently judged according to the number of clicks of the crawler tool on the page element of the current page. The number of clicks 50 is merely an example, and the number of clicks may be 40 or 60, and is not particularly limited.
Alternatively, when the crawling parameter is the number of page turns, this refers to the number of pages the crawler tool has turned while crawling multiple pages. When the crawler tool crawls according to the crawling task, if its number of page turns is greater than 20, indicating that the crawler tool has crawled sufficiently, it is determined that the value of the crawling parameter is greater than the preset value; if its number of page turns is less than or equal to 20, indicating that the crawler tool has not crawled the pages sufficiently, it is determined that the value of the crawling parameter is less than or equal to the preset value.
Here, the number of page turns is also easily obtained from the operation behavior of the crawler tool, so whether the value of the crawling parameter is greater than the preset value can be conveniently judged from it. The value of 20 page turns is merely an example; the number of page turns may also be 10 or 30 and is not particularly limited.
Furthermore, when the crawling parameter is a crawling path depth, the crawling path depth refers to the depth that the crawler tool has crawled in the crawling path. When the crawler tool performs crawling according to the crawling task, if the crawling path depth of the crawler tool is greater than 3, indicating that the crawler tool has performed full crawling, determining that the value of the crawling parameter is greater than a preset value; if the crawling path depth of the crawler tool is smaller than or equal to 3, the crawler tool is not fully crawled, and the value of the crawling parameter is determined to be smaller than or equal to a preset value.
Here, the crawling path depth of the crawler tool is also easily obtained from the operation behavior of the crawler tool, so whether the value of the crawling parameter is greater than the preset value can be conveniently judged from it. The crawling path depth of 3 is merely an example; the depth may also be 2 or 4 and is not particularly limited.
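The three counter-style parameters above (clicks on a page, page turns, and crawling path depth) share the same comparison against a preset value. A minimal sketch follows, assuming hypothetical counter names and using the example presets 50, 20 and 3 from the text; how the counters are updated is left to the crawler tool.

```python
from dataclasses import dataclass


@dataclass
class CrawlCounters:
    """Counters a crawler tool could maintain as it operates (names are hypothetical)."""
    clicks_on_page: int = 0  # number of single-page operations
    page_turns: int = 0      # number of pages turned so far
    path_depth: int = 0      # depth reached in the crawling path


# Example preset values taken from the text; each one is configurable.
PRESETS = {"clicks_on_page": 50, "page_turns": 20, "path_depth": 3}


def exceeded_parameters(counters: CrawlCounters, presets=None) -> list[str]:
    """Return the names of the crawling parameters whose value is greater than its preset."""
    limits = PRESETS if presets is None else presets
    return [name for name, limit in limits.items() if getattr(counters, name) > limit]
```

A non-empty return value would mean that at least one behavior-based threshold has been passed.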
Also, consider the case where the crawling parameter is the number of pages the crawler tool needs to crawl at the next node. If the crawling task requires crawling the information of all users in a certain forum, the crawling chain corresponding to the crawling task is: forum list page -> content detail page -> user information page. The forum list page is the home page of the forum and contains a list of the titles of various items of information; the content detail page is the page reached by entering through the title of an item in the list; the user information page is the page reached from the content detail page, on which user information is displayed.
Specifically, after the crawler tool enters the content detail page of information A, if the number of user information pages to be crawled in the next step is greater than or equal to 50, it is determined that the value of the crawling parameter is greater than the preset value. Although target data may still exist on the next page of the content detail page of information A, acquiring it is less important than acquiring all the data the crawling task requires, so treating the value as greater than the preset value at this point improves crawling efficiency. If, after the crawler tool enters the content detail page of information A, the number of user information pages to be crawled in the next step is less than 50, it is determined that the crawled data does not yet include the target data. In this way crawling time is not wasted, missing target data can be avoided, and the completeness of the crawled data can be improved.
Here, the number of pages 50 is merely an example, and the number of pages may be 30, 40, 60, 70, etc., which is not particularly limited herein.
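The next-node page count can be illustrated with a short sketch. It assumes, hypothetically, that each content detail page contributes a known number of user information pages to a queue, and it uses the example preset of 50 from the text; the function names are introduced here for illustration only.

```python
def should_leave_node(pending_next_pages: int, preset: int = 50) -> bool:
    """True when enough next-node pages are queued that the crawler tool should move on."""
    return pending_next_pages >= preset


def crawl_detail_pages(user_links_per_page: list[int], preset: int = 50) -> tuple[int, int]:
    """Walk content detail pages until the queued user information pages reach the preset.

    Returns (detail pages visited, user information pages queued).
    """
    queued = 0
    visited = 0
    for links in user_links_per_page:
        visited += 1
        queued += links
        if should_leave_node(queued, preset):
            break  # stop paging through detail pages and go crawl the queued pages
    return visited, queued
```

With four detail pages of 20 user links each, the loop stops on the third page, once 60 user information pages are queued.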
Furthermore, consider the case where the crawling parameter is the ratio of the number of nodes after the crawler tool's current node to the total number of nodes. If the ratio is less than or equal to 30%, that is, the crawler tool has completed crawling 70% or more of the nodes on the crawling chain, this indicates that the crawler tool has crawled most of the data required by the crawling task, and it is determined that the value of the crawling parameter is greater than the preset value. If the ratio is greater than 30%, that is, the nodes completed by the crawler tool amount to less than 70% of the total number of nodes on the crawling chain, this indicates that the crawler tool has not yet crawled most of the data required by the crawling task, and it is determined that the value of the crawling parameter is not greater than the preset value. In this way, both the accuracy of the data crawled by the crawler tool and its crawling efficiency can be improved.
It should be noted that the above ratio of 30% is only an example, and the above ratio may be 20%, 40%, etc., which is not particularly limited herein.
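The ratio check can be written down directly. The sketch below assumes a 0-based index for the current node on the crawling chain and counts the current node as completed; both conventions are assumptions made here, as is the function naming. The 30% preset is the example value from the text.

```python
def remaining_ratio(current_index: int, total_nodes: int) -> float:
    """Share of chain nodes that come after the current node (0-based index)."""
    if total_nodes <= 0 or not 0 <= current_index < total_nodes:
        raise ValueError("current_index must identify a node on a non-empty chain")
    return (total_nodes - current_index - 1) / total_nodes


def mostly_done(current_index: int, total_nodes: int, preset: float = 0.30) -> bool:
    """True when the remaining-node ratio is at or below the preset."""
    return remaining_ratio(current_index, total_nodes) <= preset
```

On the three-node chain from the example (forum list page, content detail page, user information page), only the last node has a remaining ratio at or below 30%.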
In addition, before the crawler tool performs each crawling action, the crawling result is obtained and it is judged whether the crawling result meets the termination condition. This effectively prevents the crawler tool from crawling data unrelated to the target data and allows it to crawl accurately.
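Checking before every action amounts to placing the termination test at the top of the crawl loop. A minimal sketch, assuming each crawling action is modelled as a callable that appends to a shared result list (a simplification introduced here, not something the text specifies):

```python
from typing import Callable, Iterable


def run_crawl(actions: Iterable[Callable[[list], None]],
              is_terminated: Callable[[list], bool]) -> list:
    """Run crawling actions in order, judging the termination condition before each one."""
    crawled: list = []
    for action in actions:
        # Obtain the crawling result so far and judge it before acting again.
        if is_terminated(crawled):
            break
        action(crawled)
    return crawled
```

Because the test runs before the action, no action is executed once the condition holds, so nothing beyond the required data is fetched.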
The following describes the whole working process of the method for terminating crawling of a crawler tool according to an embodiment of the present invention with a specific example.
While the crawler tool crawls according to the crawling task, the process is as follows. First, the crawling result of the crawler tool is obtained; the crawling result may be the data the crawler tool has already crawled, or the crawling parameters of the crawler tool. Next, it is judged whether the crawling result meets a termination condition, which the user configures before using the crawler tool to execute the crawling task. The termination condition may be a data termination condition, used to judge whether the crawled data includes the target data, or a behavior termination condition, used to judge whether the value of a crawling parameter reaches a preset value. When judging whether the crawling result meets the termination condition, whether the crawled data includes the target data can be determined by comparing the characteristics of the crawled data with the characteristics of the target data, or by judging whether a page element or page request data appears on the current page. Finally, when it is determined that the crawling result meets the termination condition, the crawler tool ends crawling.
In this way, the crawler tool ends crawling once the crawling requirement has been met, which avoids crawling more data than the crawling requirement calls for, or crawling only part of the data it calls for, and thus improves the accuracy of the data the crawler tool obtains.
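The two kinds of user-configured termination condition can be modelled as interchangeable predicates over the crawling result. The sketch below is one possible arrangement under stated assumptions: `CrawlResult`, `data_condition`, `behavior_condition` and `meets_termination` are hypothetical names, and treating the configured conditions as alternatives (any one suffices) is a choice made here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CrawlResult:
    """Hypothetical crawling result: data crawled so far plus crawling parameters."""
    crawled_data: list[str] = field(default_factory=list)
    parameters: dict[str, float] = field(default_factory=dict)


Condition = Callable[[CrawlResult], bool]


def data_condition(target: set[str]) -> Condition:
    """Data termination condition: every target item appears in the crawled data."""
    return lambda result: target.issubset(result.crawled_data)


def behavior_condition(name: str, preset: float) -> Condition:
    """Behavior termination condition: a crawling parameter reaches its preset value."""
    return lambda result: result.parameters.get(name, 0) >= preset


def meets_termination(result: CrawlResult, conditions: list[Condition]) -> bool:
    """Assumed policy: the crawl ends as soon as any configured condition holds."""
    return any(condition(result) for condition in conditions)
```

A user would build the `conditions` list before starting the crawling task, and the crawler tool would call `meets_termination` on each crawling result.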
Based on the same inventive concept, an embodiment of the present invention also provides an apparatus for terminating crawling of a crawler tool. Fig. 2 is a schematic structural diagram of the apparatus according to an embodiment of the present invention. Referring to Fig. 2, the apparatus 200 for terminating crawling of a crawler tool includes: an acquisition module 210 configured to obtain a crawling result of the crawler tool; a judging module 220 configured to judge whether the crawling result meets a termination condition, where the termination condition can be configured according to the crawling requirement; and a control module 230 configured to control the crawler tool to end crawling if the crawling result meets the termination condition.
Based on the above embodiment, the judging module is configured to judge whether the crawled data includes target data when the crawling result includes crawled data; if so, judging that the crawling result meets the termination condition; and/or when the crawling result includes crawling parameters, judging whether the value of the crawling parameters reaches a preset value; if so, judging that the crawling result meets the termination condition.
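The three-module structure can be mirrored in code as three small classes wired together. This is only a structural sketch: the `Crawler` stand-in, the class and method names, and the `step` entry point are all assumptions introduced here, and the text does not prescribe any particular software decomposition.

```python
from typing import Callable


class Crawler:
    """Hypothetical stand-in for the crawler tool."""

    def __init__(self) -> None:
        self.items: list[str] = []
        self.running = True

    def stop(self) -> None:
        self.running = False


class AcquisitionModule:
    def obtain(self, crawler: Crawler) -> list[str]:
        """Obtain the crawling result (here: a copy of the crawled items)."""
        return list(crawler.items)


class JudgingModule:
    def __init__(self, condition: Callable[[list[str]], bool]) -> None:
        self.condition = condition  # termination condition configured per crawling requirement

    def judge(self, result: list[str]) -> bool:
        return self.condition(result)


class ControlModule:
    def control(self, crawler: Crawler, satisfied: bool) -> None:
        """End crawling only when the termination condition is met."""
        if satisfied:
            crawler.stop()


class TerminationApparatus:
    def __init__(self, condition: Callable[[list[str]], bool]) -> None:
        self.acquisition = AcquisitionModule()
        self.judging = JudgingModule(condition)
        self.control = ControlModule()

    def step(self, crawler: Crawler) -> bool:
        """Acquire, judge, then control; returns whether the condition was met."""
        result = self.acquisition.obtain(crawler)
        satisfied = self.judging.judge(result)
        self.control.control(crawler, satisfied)
        return satisfied
```

Keeping the condition inside the judging module means the acquisition and control modules stay unchanged when the crawling requirement changes.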
Based on the above embodiment, the judging module is configured to obtain the characteristics of the crawled data and the characteristics of the target data, respectively; determining whether the crawled data comprises target data according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result shows that the characteristics of the crawled data are matched with the characteristics of the target data, determining that the crawled data comprise the target data; and if the comparison result shows that the characteristics of the crawled data are not matched with the characteristics of the target data, determining that the crawled data do not comprise the target data.
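The feature comparison can be sketched with the forum example used earlier. The choice of features (forum name and publication month) and the dictionary layout of a post are assumptions made for illustration; the text does not say which characteristics are compared.

```python
def extract_features(post: dict) -> tuple[str, str]:
    """Reduce a post to the features used for matching: forum and publication month."""
    return post["forum"], post["published"][:7]  # "YYYY-MM" prefix of an ISO date


def includes_target(crawled: list[dict], target_features: tuple[str, str]) -> bool:
    """True if the features of any crawled item match the features of the target data."""
    return any(extract_features(post) == target_features for post in crawled)
```

For a task targeting August 2018 posts on an automobile forum, the target features would be `("auto", "2018-08")` under these assumptions.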
Based on the above embodiment, the judging module is configured to determine whether the crawled data includes target data according to whether preset content is obtained, where the preset content is generated in the current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
Based on the above embodiment, the preset contents include: at least one of a page element and page request data.
Based on the above embodiment, the crawling parameters include at least one of: the number of single-page operations, the number of page turns, the crawling path depth, the number of pages the crawler tool needs to crawl at the next node, and the ratio of the number of nodes after the crawler tool's current node to the total number of nodes.
It should be noted here that: the description of the apparatus embodiments above is similar to that of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present invention, please refer to the description of the embodiments of the method of the present invention.
Based on the same inventive concept, an embodiment of the present invention further provides an electronic device, which may include the apparatus for terminating crawling of a crawler tool in the above embodiment; the electronic device may be a server, a personal computer, or the like. Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. Referring to Fig. 3, the electronic device 300 includes: at least one processor 301; and at least one memory 302 and a bus 303 connected to the processor 301; wherein the processor 301 and the memory 302 communicate with each other through the bus 303. The processor 301 is configured to invoke program instructions in the memory 302 to perform the method of terminating crawling of a crawler tool in one or more of the embodiments described above. That is, the processor 301 is configured to: obtain a crawling result of the crawler tool; judge whether the crawling result meets a termination condition, where the termination condition can be configured according to the crawling requirement; and, if the crawling result meets the termination condition, control the crawler tool to end crawling.
Based on the above embodiment, the processor is configured to determine whether the crawled data includes target data when the crawled result includes crawled data; if so, judging that the crawling result meets the termination condition; and/or when the crawling result comprises crawling parameters, judging whether the value of the crawling parameters reaches a preset value or not; if so, judging that the crawling result meets the termination condition.
Based on the above embodiment, the processor is configured to obtain the characteristics of the crawled data and the characteristics of the target data, respectively; determining whether the crawled data comprises target data according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result shows that the characteristics of the crawled data are matched with the characteristics of the target data, determining that the crawled data comprise the target data; and if the comparison result shows that the characteristics of the crawled data are not matched with the characteristics of the target data, determining that the crawled data do not comprise the target data.
Based on the above embodiment, the processor is configured to determine whether the crawled data includes target data according to whether preset content is obtained, where the preset content is generated in the current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
Based on the above embodiment, the preset contents include: at least one of a page element and page request data.
Based on the above embodiment, the crawling parameters include at least one of: the number of single-page operations, the number of page turns, the crawling path depth, the number of pages the crawler tool needs to crawl at the next node, and the ratio of the number of nodes after the crawler tool's current node to the total number of nodes.
It should be noted here that the description of the electronic device embodiments above is similar to that of the method embodiments above, with similar benefits. For technical details not disclosed in the electronic device embodiments of the present invention, please refer to the description of the method embodiments of the present invention.
Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a method of terminating crawling of crawler tools as in one or more of the embodiments described above.
It should be noted here that the description of the computer-readable storage medium embodiments above is similar to that of the method embodiments above, with similar benefits. For technical details not disclosed in the computer-readable storage medium embodiments of the present invention, please refer to the description of the method embodiments of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (9)

1. A method of terminating crawling of a crawler tool, the method comprising:
obtaining a crawling result of a crawler tool;
judging whether the crawling result meets a termination condition or not, wherein the termination condition can be configured according to crawling requirements;
if the crawling result meets the termination condition, controlling the crawler tool to finish crawling;
wherein, the determining whether the crawling result meets a termination condition includes:
when the crawling result comprises crawled data, judging whether the crawled data comprises target data or not; if the data is included, judging that the crawling result meets the termination condition, wherein the data required by the crawling task form target data;
wherein the determining whether the crawled data includes target data includes:
determining whether the crawled data comprises the target data according to whether preset content is obtained, wherein the preset content is generated in a current page after the crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises the target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
2. The method of claim 1, wherein the determining whether the crawling result satisfies a termination condition comprises:
when the crawling result comprises crawling parameters, judging whether the value of the crawling parameters reaches a preset value or not; and if so, judging that the crawling result meets the termination condition.
3. The method of claim 1, wherein the determining whether the crawled data comprises target data comprises:
respectively obtaining the characteristics of the crawled data and the characteristics of the target data;
determining whether the crawled data comprises the target data according to a comparison result of the characteristics of the crawled data and the characteristics of the target data;
if the comparison result shows that the characteristics of the crawled data are matched with the characteristics of the target data, determining that the crawled data comprise the target data; and if the comparison result shows that the characteristics of the crawled data are not matched with the characteristics of the target data, determining that the crawled data do not comprise the target data.
4. The method of claim 1, wherein the preset content comprises: at least one of a page element and page request data.
5. The method of claim 2, wherein the crawling parameters comprise: at least one of single page operation times, page turning times, crawling path depth, the number of pages to be crawled by the crawler tool at the next node and the ratio of the number of nodes behind the current node of the crawler tool to the total number of nodes.
6. An apparatus for terminating crawling of a crawler tool, the apparatus comprising:
an acquisition module configured to obtain a crawling result of the crawler tool;
the judging module is configured to judge whether the crawling result meets a termination condition or not, and the termination condition can be configured according to crawling requirements;
the control module is configured to control the crawler tool to finish crawling if the crawling result meets the termination condition;
the judging module is configured to judge whether the crawled data comprises target data or not when the crawled result comprises crawled data; if the data is included, judging that the crawling result meets the termination condition, wherein the data required by the crawling task form target data;
the judging module is configured to determine whether the crawled data comprises target data according to whether preset content is obtained, wherein the preset content is generated in a current page after a crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
7. The apparatus according to claim 6, wherein:
the judging module is configured to judge whether the value of the crawling parameter reaches a preset value or not when the crawling result comprises the crawling parameter; if so, judging that the crawling result meets the termination condition; and/or,
the judging module is configured to obtain the characteristics of the crawled data and the characteristics of the target data respectively; determining whether the crawled data comprises target data according to a comparison result of the characteristics of the crawled data and the characteristics of the target data; if the comparison result shows that the characteristics of the crawled data are matched with the characteristics of the target data, determining that the crawled data comprise the target data; if the comparison result shows that the characteristics of the crawled data are not matched with the characteristics of the target data, determining that the crawled data do not comprise the target data; and/or,
the judging module is configured to determine whether the crawled data comprise target data according to whether preset content is obtained, wherein the preset content is generated in a current page after a crawler tool crawls the target data; if the preset content is obtained, determining that the crawled data comprises target data; and if the preset content is not obtained, determining that the crawled data does not comprise the target data.
8. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 5 when executing the program.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 5.
CN201811145418.6A 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool Active CN110968770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811145418.6A CN110968770B (en) 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811145418.6A CN110968770B (en) 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool

Publications (2)

Publication Number Publication Date
CN110968770A CN110968770A (en) 2020-04-07
CN110968770B CN110968770B (en) 2023-09-05

Family

ID=70027161

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811145418.6A Active CN110968770B (en) 2018-09-29 2018-09-29 Method and device for stopping crawling of crawler tool

Country Status (1)

Country Link
CN (1) CN110968770B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650570A (en) * 2020-12-29 2021-04-13 百果园技术(新加坡)有限公司 Dynamically expandable distributed crawler system, data processing method and device
CN113419781A (en) * 2021-07-19 2021-09-21 湖南四方天箭信息科技有限公司 Crawler method and device based on Chrome plug-in, computer equipment and storage medium

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009179286A (en) * 2008-02-01 2009-08-13 Mitsubishi Agricult Mach Co Ltd Working vehicle
US8162410B2 (en) * 2004-12-20 2012-04-24 Tokyo Institute Of Technology Endless elongated member for crawler and crawler unit
CN102646129A (en) * 2012-03-09 2012-08-22 武汉大学 Topic-relative distributed web crawler system
CN104281608A (en) * 2013-07-08 2015-01-14 上海锐英软件技术有限公司 Emergency analyzing method based on microblogs
CN104408195A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Crawler working state judging method and device
CN105279272A (en) * 2015-10-30 2016-01-27 南京未来网络产业创新有限公司 Content aggregation method based on distributed web crawlers
CN105630987A (en) * 2015-12-25 2016-06-01 北京搜狗科技发展有限公司 User agent self-adaption uniform resource locator prefix mining method and device
CN105740460A (en) * 2016-02-24 2016-07-06 中国科学技术信息研究所 Webpage collection recommendation method and device
CN105760508A (en) * 2016-02-23 2016-07-13 北京搜狗科技发展有限公司 Information push method and device and electronic equipment
CN106021257A (en) * 2015-12-31 2016-10-12 广州华多网络科技有限公司 Method, device, and system for crawler to capture data supporting online programming
CN106407218A (en) * 2015-07-31 2017-02-15 北京国双科技有限公司 Navigation webpage detection method and device
CN107025230A (en) * 2016-01-29 2017-08-08 北京国双科技有限公司 The processing method and processing device of web crawlers
CN107885820A (en) * 2017-11-07 2018-04-06 北京小度互娱科技有限公司 Breadth traversal orientation grasping means based on crawler system
CN108415941A (en) * 2018-01-29 2018-08-17 湖北省楚天云有限公司 A kind of spiders method, apparatus and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
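Research on the application of an adaptive genetic algorithm in the search strategy of topic crawlers; Jing Wenpeng et al.; Computer Science (《计算机科学》); pp. 254-257 *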

Also Published As

Publication number Publication date
CN110968770A (en) 2020-04-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant