CN113065055B

CN113065055B - News information capturing method and device, electronic equipment and storage medium

Info

Publication number: CN113065055B
Application number: CN202110432611.3A
Authority: CN
Inventors: 郑德生
Original assignee: Shenzhen Saiante Technology Service Co Ltd
Current assignee: Shenzhen Saiante Technology Service Co Ltd
Priority date: 2021-04-21
Filing date: 2021-04-21
Publication date: 2024-04-02
Anticipated expiration: 2041-04-21
Also published as: CN113065055A

Abstract

The invention relates to the technical field of big data and provides a news information capturing method, a news information capturing device, electronic equipment, and a storage medium. The method comprises the following steps: acquiring a plurality of seed URLs to generate a target news information capture tree; starting a main thread to read the target seed URL of each capture node in the target news information capture tree and the corresponding capture strategy; when a preset number of target seed URLs have been read, starting a plurality of sub-threads and distributing the preset number of target seed URLs among the plurality of sub-threads; controlling each sub-thread to open each target seed URL by using Puppeteer to perform capture processing; and counting the capture results of the plurality of sub-threads through the main thread to obtain the target capture result of the target news information. According to the invention, Puppeteer is used to start a headless browser to open each target seed URL, and a plurality of sub-threads are started to perform capture processing, so that the rendering work of a real browser is reduced and the capture efficiency of the target news information is improved.

Description

News information capturing method and device, electronic equipment and storage medium

Technical Field

The invention relates to the technical field of big data, in particular to a news information grabbing method, a news information grabbing device, electronic equipment and a storage medium.

Background

Traditional news information capturing uses a crawler program to issue the HTTP request corresponding to a website URL and parse the result returned by that request. However, most news web pages now obtain their content through Ajax and render the page content with JavaScript, so a traditional crawler captures no valid data or only part of it. In addition, some programs capture news content by opening a browser and locating DOM elements.

However, because these programs must run on an operating system with a graphical interface, they cannot run on a Linux server, which results in low efficiency and low accuracy of the captured news information.

Therefore, it is necessary to provide a quick and accurate news information capturing method.

Disclosure of Invention

In view of the foregoing, it is necessary to provide a news information capturing method, a device, an electronic apparatus, and a storage medium that use Puppeteer to start a headless browser to open each target seed URL and start a plurality of sub-threads to perform capture processing, so that the rendering work of a real browser is reduced and the capture efficiency of target news information is improved.

A first aspect of the present invention provides a news information capturing method, the method including:

analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;

creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;

starting a main thread to read target seed URLs of each grabbing node in the target news information grabbing tree one by one and corresponding grabbing strategies;

when the main thread is detected to read a preset number of target seed URLs, starting a plurality of sub-threads, and dividing the preset number of target seed URLs read by the main thread into the plurality of sub-threads according to a preset distribution rule;

controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing;

and after detecting that the plurality of sub-threads finish grabbing processing, counting grabbing results of the plurality of sub-threads through the main thread to obtain target grabbing results of the target news information.

Optionally, the creating a crawling policy for each of the seed URLs includes:

analyzing the page content and the page structure in each seed URL to obtain an analysis result;

acquiring grabbing requirements corresponding to each seed URL;

and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.

Optionally, the generating the target news information-capturing tree according to the plurality of seed URLs includes:

converting the grabbing node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information grabbing tree;

converting a reference relationship between grabbing nodes of each seed URL in the plurality of seed URLs into edges between nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree serve as the reference relationship between the nodes of the target news information grabbing tree;

and generating a target news information capture tree according to edges between the nodes of the target news information capture tree and the nodes in the target news information capture tree.

Optionally, the controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing, includes:

starting a headless browser by using Puppeteer to open each target seed URL read by the main thread together with the corresponding grabbing strategy;

jumping to a target page corresponding to the target seed URL;

and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to the target seed URL.

Optionally, the method further comprises:

detecting whether an abnormal event occurs to a child thread;

when detecting that an abnormal event occurs to a child thread, identifying a target grabbing node corresponding to the child thread with the abnormal event;

and verifying the target seed URL and the corresponding target grabbing strategy in the target grabbing node.

Optionally, the verifying the target seed URL and the corresponding target crawling policy in the target crawling node includes:

matching a target seed URL in the target grabbing node with the seed URLs;

when the target seed URL in the target grabbing node is matched with any one seed URL in the plurality of seed URLs, judging whether the target grabbing strategy is the grabbing strategy of the target seed URL;

when the target grabbing strategy is the grabbing strategy of the target seed URL, sending grabbing suggestions to a client; or

when the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and grabbing the target seed URL in the target grabbing node a second time according to the corrected grabbing strategy.

A second aspect of the present invention provides a news information-capturing method, the method including:

creating a grabbing strategy for each seed URL, and associating each seed URL with a corresponding grabbing strategy;

starting a target main thread to create a task queue, reading each associated seed URL and sequentially sending the seed URL to the task queue;

judging whether the same seed URL exists in the task queue;

when the same seed URL does not exist in the task queue, inquiring whether idle sub-threads exist in a plurality of target sub-threads started by the target main thread or not at intervals of a preset period;

when idle sub-threads exist in a plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-threads;

controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and capturing target news information for each seed URL according to a corresponding capturing strategy;

and counting the grabbing results of the idle sub-threads through the target main thread after the idle sub-threads are detected to finish grabbing target news information, so as to obtain grabbing results of the target news information.

A third aspect of the present invention provides a news information-capturing device, the device including:

the analysis module is used for analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;

the generation module is used for creating a grabbing strategy for each seed URL and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;

the reading module is used for starting a main thread to read, one by one, the target seed URL and the corresponding grabbing strategy of each grabbing node in the target news information grabbing tree;

the starting module is used for starting a plurality of sub-threads when detecting that the main thread reads a preset number of target seed URLs, and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;

the grabbing module is used for controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer and carrying out grabbing processing;

and the statistics module is used for counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing results of the target news information after the grabbing processing of the plurality of sub-threads is detected.

A fourth aspect of the present invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the news information-capturing method when executing a computer program stored in the memory.

A fifth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the news information-capturing method.

In summary, the news information capturing method, device, electronic equipment, and storage medium of the present invention provide the following benefits. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target capture node read by the main thread and perform capture processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that use Puppeteer improves the capture efficiency of the target news information. Second, a main thread is started to read, one by one, the target seed URL and the corresponding capture strategy of each target capture node in the target news information capture tree, which avoids omitted or repeated reads and improves the accuracy of reading the target seed URLs. Finally, creating a capture strategy for each seed URL improves the accuracy and efficiency of capturing the target news information, and generating the target news information capture tree from the plurality of seed URLs and capturing from each node of the tree avoids repeated or missed capture of the seed URLs, further improving capture accuracy.

Drawings

Fig. 1 is a flowchart of a news information capturing method according to an embodiment of the invention.

Fig. 2 is a flowchart of a news information capturing method according to a second embodiment of the present invention.

Fig. 3 is a block diagram of a news information-capturing device according to a third embodiment of the present invention.

Fig. 4 is a block diagram of a news information-capturing device according to a fourth embodiment of the present invention.

Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.

Detailed Description

In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present invention and features in the embodiments may be combined with each other.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Example 1

In this embodiment, the news information capturing method may be applied to an electronic device, and for an electronic device that needs to capture news information, the news information capturing function provided by the method of the present invention may be directly integrated on the electronic device, or may be run in the electronic device in the form of a software development kit (Software Development Kit, SDK).

As shown in FIG. 1, the news information capturing method specifically includes the following steps, and the order of the steps in the flowchart may be changed according to different requirements, and some may be omitted.

S11, analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs.

In this embodiment, when the target news information needs to be captured, a capture request is initiated to a server through a client. Specifically, the client may be a mobile phone, an iPad, or another existing device with a sending function, and the server may be a capture subsystem. For example, during capturing, the client sends the capture request to the capture subsystem, and when the server receives the capture request sent by the client, it parses the request.

In this embodiment, the crawling request includes crawling requirements, a seed URL, page content of the seed URL, page structure, and the like corresponding to the crawling target news information.

S12, creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy.

In this embodiment, since the page content, the page structure, and the crawling requirements corresponding to each seed URL are different, a different crawling policy is created for each seed URL.

In an alternative embodiment, said creating a crawling policy for each of said seed URLs comprises:

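analyzing the page content and the page structure in each seed URL to obtain an analysis result;

acquiring grabbing requirements corresponding to each seed URL;

and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.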

In this embodiment, the page content includes items such as page views, unique visits, and time on page, and the page structure refers to the hierarchical relationship among the pages.

For example, if the capture requirement corresponding to the first seed URL is text, the text content and text structure in the first seed URL are analyzed and the corresponding capture strategy is set as follows: starting from the start page of the first seed URL, randomly select one URL to enter, and capture the target text content layer by layer until capture of the first seed URL is complete. If the capture requirement corresponding to the second seed URL is pictures, the picture content and picture structure in the second seed URL are analyzed and the corresponding capture strategy is set as follows: predict the similarity between each URL under the second seed URL and the second seed URL itself, and select the URLs with higher similarity for capture.
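The selection of a strategy from a capture requirement can be sketched as follows. This is only an illustration under stated assumptions: the names `CaptureStrategy` and `create_strategy`, the mode strings, and the example URLs are hypothetical, and the page analysis step is omitted because the text does not specify how it is performed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureStrategy:
    """Hypothetical record tying a seed URL to the way it is captured."""
    seed_url: str
    requirement: str  # "text" or "picture" (illustrative values only)
    mode: str         # how links under the seed URL are followed


def create_strategy(seed_url: str, requirement: str) -> CaptureStrategy:
    """Choose a traversal mode from the capture requirement."""
    if requirement == "text":
        # Enter one link from the start page and capture text layer by layer.
        mode = "layer_by_layer"
    elif requirement == "picture":
        # Capture only the links most similar to the seed URL.
        mode = "most_similar_links"
    else:
        raise ValueError(f"unsupported capture requirement: {requirement}")
    return CaptureStrategy(seed_url, requirement, mode)


strategies = [
    create_strategy("https://news.example.com/a", "text"),
    create_strategy("https://news.example.com/b", "picture"),
]
```

Keeping the strategy as an immutable record makes it straightforward to attach one to each capture node later.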

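In an alternative embodiment, said generating a target news information-crawling tree from said plurality of seed URLs comprises:

converting the grabbing node of each seed URL in the plurality of seed URLs into a node of the corresponding target news information grabbing tree;

converting the reference relationships between the grabbing nodes of the seed URLs into edges between nodes in the target news information grabbing tree, wherein the edges serve as the reference relationships between the nodes of the tree;

and generating the target news information grabbing tree according to the nodes and the edges between the nodes.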

In this embodiment, the target news information capturing tree is generated according to the capturing nodes corresponding to the plurality of seed URLs and the referencing relationship between the capturing nodes of each seed URL, and specifically, the referencing relationship between the nodes of the target news information capturing tree may be preset, for example, may be preset according to the association degree between each seed URL and the target news information, or may be preset according to the association degree between the plurality of seed URLs.

In this embodiment, creating a capture strategy for each seed URL improves the accuracy and efficiency of capturing the target news information. Generating a target news information capture tree from the plurality of seed URLs and capturing from each node of the tree avoids repeated or missed capture of the seed URLs and improves the capture accuracy of the target news information.
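One way to picture the capture tree is the sketch below, which turns seed URLs into nodes and preset reference relationships into parent-to-child edges, then visits every node exactly once. The names `CaptureNode`, `build_capture_tree`, and `iter_nodes` are hypothetical, and the single-root requirement is an assumption made for the illustration rather than something the text states.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class CaptureNode:
    seed_url: str
    strategy: str
    children: List["CaptureNode"] = field(default_factory=list)


def build_capture_tree(
    strategies: Dict[str, str],
    references: List[Tuple[str, str]],
) -> CaptureNode:
    """Build a tree from {seed_url: strategy} and (parent, child) references."""
    nodes = {url: CaptureNode(url, strategy) for url, strategy in strategies.items()}
    referenced = set()
    for parent, child in references:
        nodes[parent].children.append(nodes[child])  # reference becomes an edge
        referenced.add(child)
    roots = [node for url, node in nodes.items() if url not in referenced]
    if len(roots) != 1:
        # Assumption of this sketch: the references describe a single tree.
        raise ValueError("expected exactly one root node")
    return roots[0]


def iter_nodes(root: CaptureNode) -> Iterator[CaptureNode]:
    """Visit each node once, parents before children (pre-order)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))
```

Because each node is yielded exactly once, a reader that walks the tree this way neither skips nor repeats a seed URL.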

S13, starting a main thread to read the target seed URL and the corresponding grabbing strategy of each grabbing node in the target news information grabbing tree one by one.

In this embodiment, when the server receives the crawling request, the main thread is started to read, one by one, the target seed URL and the corresponding crawling policy of each crawling node in the target news information crawling tree. This avoids omitted or repeated reads and improves the accuracy of reading the target seed URLs.

S14, when the main thread is detected to read the target seed URLs with the preset number, starting a plurality of sub-threads, and distributing the target seed URLs with the preset number read by the main thread to the plurality of sub-threads according to a preset distribution rule.

In this embodiment, an allocation rule may be preset, and specifically, the preset allocation rule may be equal division, random allocation, or allocation according to a certain multiple.
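The equal-division and random rules can be sketched as below. The function names are hypothetical, and allocation "according to a certain multiple" is not shown because the text does not define it.

```python
import random
from typing import List, Optional, Sequence


def divide_equally(urls: Sequence[str], n_threads: int) -> List[List[str]]:
    """Round-robin split: bucket sizes differ by at most one."""
    buckets: List[List[str]] = [[] for _ in range(n_threads)]
    for index, url in enumerate(urls):
        buckets[index % n_threads].append(url)
    return buckets


def divide_randomly(
    urls: Sequence[str], n_threads: int, seed: Optional[int] = None
) -> List[List[str]]:
    """Assign each URL to a randomly chosen bucket."""
    rng = random.Random(seed)
    buckets: List[List[str]] = [[] for _ in range(n_threads)]
    for url in urls:
        buckets[rng.randrange(n_threads)].append(url)
    return buckets
```

Either rule assigns every URL to exactly one sub-thread; equal division additionally keeps the load balanced.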

In this embodiment, by starting multiple sub-threads at the same time, the waiting time of the main thread for reading data and starting the sub-threads can be saved, so that the processing efficiency of the server is further improved.

In other alternative embodiments, S14 may also be: each time the main thread is detected to have read the preset number of target seed URLs, one sub-thread is correspondingly started, and that preset number of target seed URLs is distributed to the sub-thread.

In this embodiment, the preset number is a preset threshold for starting a sub-thread.

For example, assuming that the preset number is 100,000, when the main thread has read the target seed URLs from the 1st capture node through the 100,000th capture node, the server starts one sub-thread; then, when the main thread has read the target seed URLs from the 100,001st capture node through the 200,000th capture node, the server starts another sub-thread. That is, each time the server detects that the main thread has read the target seed URLs of a preset number of capture nodes, it starts a sub-thread. By starting a plurality of sub-threads and using each of them to process a corresponding number of target seed URLs, the capture of news information in the target seed URLs can be accelerated to a certain extent.
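A minimal sketch of this threshold-based variant follows: the reader starts one sub-thread each time it has accumulated the preset number of URLs. The name `read_and_dispatch` and the `handle_batch` callback are hypothetical, and starting an extra sub-thread for a final partial batch is an assumption the text does not address.

```python
import threading
from typing import Callable, Iterable, List


def read_and_dispatch(
    urls: Iterable[str],
    preset_number: int,
    handle_batch: Callable[[List[str]], None],
) -> List[threading.Thread]:
    """Start one sub-thread per `preset_number` URLs read by the caller."""
    threads: List[threading.Thread] = []

    def start(batch: List[str]) -> None:
        thread = threading.Thread(target=handle_batch, args=(batch,))
        thread.start()
        threads.append(thread)

    batch: List[str] = []
    for url in urls:
        batch.append(url)
        if len(batch) == preset_number:  # threshold reached
            start(batch)
            batch = []
    if batch:
        # Assumption: leftover URLs below the threshold also get a sub-thread.
        start(batch)
    return threads
```

The caller joins the returned threads before collecting results.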

S15, controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing.

In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. Specifically, Puppeteer runs in headless mode by default, but it can be configured to run in non-headless mode.

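In an alternative embodiment, the controlling each of the sub-threads to open each target seed URL read by the main thread by using Puppeteer, and performing the capture processing, includes:

starting a headless browser by using Puppeteer to open each target seed URL read by the main thread together with the corresponding capture strategy;

jumping to a target page corresponding to the target seed URL;

and calling Puppeteer to capture the target page according to the capture strategy corresponding to the target seed URL.

In this embodiment, because Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call, Puppeteer is used to start a headless browser to open the target seed URL of each capture node.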
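The per-URL flow (start a headless browser, jump to the target page, capture according to the strategy) can be outlined as below. Puppeteer is a Node.js library, so this sketch deliberately does not call it: `Browser` and `Page` are hypothetical interfaces standing in for a headless browser session, and `launch` is a caller-supplied factory. It shows only the order of the steps and the cleanup, not Puppeteer's actual API.

```python
from typing import Callable, Protocol


class Page(Protocol):
    """Hypothetical stand-in for a browser tab."""
    def goto(self, url: str) -> None: ...
    def content(self) -> str: ...


class Browser(Protocol):
    """Hypothetical stand-in for a headless browser session."""
    def new_page(self) -> Page: ...
    def close(self) -> None: ...


def capture_with_headless_browser(
    launch: Callable[[], Browser],
    seed_url: str,
    strategy: Callable[[str], str],
) -> str:
    """Open `seed_url` in a fresh browser and apply the capture strategy."""
    browser = launch()                    # step 1: start the headless browser
    try:
        page = browser.new_page()
        page.goto(seed_url)               # step 2: jump to the target page
        return strategy(page.content())   # step 3: capture per the strategy
    finally:
        browser.close()                   # always release the browser
```

Closing the browser in a `finally` block keeps a failed capture from leaking a browser process inside the sub-thread.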

S16, counting the grabbing results of the plurality of sub-threads through the main thread after the plurality of sub-threads are detected to finish grabbing processing, and obtaining the target grabbing results of the target news information.

In this embodiment, after detecting that the multiple sub-threads complete capturing of the target news information, capturing results of each sub-thread are obtained through the main thread.

In this embodiment, the capturing result may include, but is not limited to: grabbing successful data, grabbing abnormal data, grabbing identified data, grabbing identical picture data and grabbing identical text data.

In this embodiment, each time the main thread obtains the capture result of a sub-thread, it stores the result in a cache of the server. After all the sub-threads have finished capturing, the main thread aggregates the capture results of all the sub-threads to obtain the target capture result.
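The aggregation step might look like the following sketch, in which each sub-thread returns its own tally and the main thread sums the tallies once all sub-threads have finished. The names `capture_all` and `capture`, and the two result categories, are hypothetical simplifications of the result types listed above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def capture_all(
    batches: List[List[str]],
    capture: Callable[[str], str],
) -> Dict[str, int]:
    """Run one sub-thread per batch and total the outcomes in the main thread."""

    def run_batch(batch: List[str]) -> Counter:
        counts: Counter = Counter()
        for url in batch:
            try:
                capture(url)
                counts["success"] += 1
            except Exception:
                counts["abnormal"] += 1
        return counts

    total: Counter = Counter()
    with ThreadPoolExecutor(max_workers=max(1, len(batches))) as pool:
        # map() yields each sub-thread's tally once that sub-thread is done.
        for counts in pool.map(run_batch, batches):
            total += counts
    return dict(total)
```

Summing per-thread tallies in the main thread avoids sharing a mutable counter between sub-threads.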

Further, the method further comprises:

detecting whether an abnormal event occurs in a sub-thread;

when detecting that an abnormal event occurs in a sub-thread, deleting the data captured by the sub-thread in which the abnormal event occurred.

In this embodiment, deleting the data captured by the sub-thread in which the abnormal event occurred ensures the accuracy of the captured target news information.
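A sketch of discarding the output of a failed sub-thread is given below. It treats any exception raised inside a sub-thread as the abnormal event, which is an assumption; the names `capture_batch` and `collect_without_abnormal` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def capture_batch(batch: List[str], capture: Callable[[str], str]) -> List[str]:
    """Capture every URL in the batch; any failure aborts the whole batch."""
    return [capture(url) for url in batch]


def collect_without_abnormal(
    batches: List[List[str]],
    capture: Callable[[str], str],
) -> Tuple[List[str], List[int]]:
    """Return (kept data, indexes of batches whose sub-thread failed)."""
    kept: List[str] = []
    abnormal: List[int] = []
    with ThreadPoolExecutor(max_workers=max(1, len(batches))) as pool:
        futures = [pool.submit(capture_batch, batch, capture) for batch in batches]
        for index, future in enumerate(futures):
            if future.exception() is not None:
                # Partial data from this sub-thread is never merged.
                abnormal.append(index)
            else:
                kept.extend(future.result())
    return kept, abnormal
```

The returned indexes identify which batches, and therefore which capture nodes, need further checking.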

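Further, the method further comprises:

identifying a target capture node corresponding to the sub-thread in which the abnormal event occurred;

and verifying the target seed URL and the corresponding target capture strategy in the target capture node.

In some other optional embodiments, the verifying the target seed URL and the corresponding target capture strategy in the target capture node includes:

matching the target seed URL in the target capture node with the plurality of seed URLs;

when the target seed URL in the target capture node matches any one of the plurality of seed URLs, judging whether the target capture strategy is the capture strategy of the target seed URL;

and when the target capture strategy is the capture strategy of the target seed URL, sending capture suggestions to the client.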

In this embodiment, verifying the target seed URL and the target capture strategy in the target capture node corresponding to the sub-thread in which the abnormal event occurred helps operation and maintenance personnel quickly analyze the cause of the capture anomaly and improves their working efficiency.

In this embodiment, the capture suggestion may be set according to the cause of the capture anomaly. Specifically, the capture suggestion may advise the client to provide the capture requirement again, or advise the client to check whether the provided seed URL is wrong. Sending the capture suggestion to the client helps the client make a decision quickly and improves the client experience and the capture efficiency.

Further, the method further comprises:

when the target seed URL in the target grabbing node is not matched with any one of the seed URLs, correcting the target seed URL in the target grabbing node, and secondarily grabbing the corrected target seed URL.

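Further, the method further comprises:

when the target capture strategy is not the capture strategy of the target seed URL, correcting the capture strategy in the target capture node, and capturing the target seed URL in the target capture node a second time according to the corrected capture strategy.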

In this embodiment, the target seed URL and the capture policy in the abnormal target capture node are verified, and when the verification determines that the target seed URL and/or the capture policy are inconsistent, the target seed URL and/or the capture policy in the abnormal target capture node are corrected and then captured secondarily, thereby improving the integrity of captured target news information.
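The verification branches can be summarized in a small decision function. The `Action` names and `verify_node` are hypothetical, and the sketch only decides which branch applies; how a wrong URL or strategy is corrected is left to the caller because the text does not specify it.

```python
from enum import Enum
from typing import Dict


class Action(Enum):
    SEND_SUGGESTION = "send capture suggestion to the client"
    CORRECT_STRATEGY = "correct the strategy and capture again"
    CORRECT_URL = "correct the seed URL and capture again"


def verify_node(
    node_url: str,
    node_strategy: str,
    expected_strategies: Dict[str, str],
) -> Action:
    """Check a failed node against the original seed URLs and strategies."""
    if node_url not in expected_strategies:
        # The node's URL matches none of the seed URLs.
        return Action.CORRECT_URL
    if node_strategy == expected_strategies[node_url]:
        # URL and strategy are both consistent: the cause lies elsewhere.
        return Action.SEND_SUGGESTION
    return Action.CORRECT_STRATEGY
```

For instance, a node whose URL is known but whose strategy differs from the one created for that URL yields `Action.CORRECT_STRATEGY`.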

In summary, the news information capturing method in this embodiment provides the following benefits. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target capture node read by the main thread and perform capture processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that use Puppeteer improves the capture efficiency of the target news information. Second, a main thread is started to read, one by one, the target seed URL and the corresponding capture strategy of each target capture node in the target news information capture tree, which avoids omitted or repeated reads and improves the accuracy of reading the target seed URLs. Finally, creating a capture strategy for each seed URL improves the accuracy and efficiency of capturing the target news information, and generating the target news information capture tree from the plurality of seed URLs and capturing from each node of the tree avoids repeated or missed capture of the seed URLs, further improving capture accuracy.

Example 2

As shown in FIG. 2, the news information-capturing method specifically includes the following steps, and the order of the steps in the flowchart may be changed according to different requirements, and some may be omitted.

S21, analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs.

S22, creating a grabbing strategy for each seed URL, and associating each seed URL with the corresponding grabbing strategy.

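In an alternative embodiment, the creating a grabbing strategy for each seed URL comprises: analyzing the page content and the page structure in each seed URL to obtain an analysis result; acquiring grabbing requirements corresponding to each seed URL; and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.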

In this embodiment, by creating a capture policy for each seed URL, the capturing accuracy and efficiency of the target news information are improved.

S23, starting the target main thread to create a task queue, reading each associated seed URL, and sequentially sending the seed URL to the task queue.

In this embodiment, because each seed URL has an association relationship with a corresponding crawling policy, when a server receives a crawling request, the server starts a target main thread to create a task queue according to the crawling request, and sends each seed URL after association to the task queue in sequence.

In this embodiment, capturing the target news information through a task queue prevents any seed URL from being captured repeatedly or missed, which improves the accuracy of capturing the target news information.

S24, judging whether the same seed URL exists in the task queue.

In this embodiment, checking whether the same seed URL already exists in the task queue avoids capturing the same seed URL repeatedly, which improves the accuracy and efficiency of capturing the target news information.
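A minimal sketch of a task queue that rejects duplicate seed URLs follows. The class name `TaskQueue` is hypothetical. As an assumption that goes slightly beyond the text, the sketch remembers every URL it has ever accepted, so a URL is also rejected after it has left the queue.

```python
from collections import deque
from typing import Deque, Set, Tuple


class TaskQueue:
    """FIFO queue of (seed_url, strategy) pairs that rejects duplicate URLs."""

    def __init__(self) -> None:
        self._items: Deque[Tuple[str, str]] = deque()
        self._seen: Set[str] = set()

    def push(self, seed_url: str, strategy: str) -> bool:
        """Enqueue the pair; return False if the URL was already accepted."""
        if seed_url in self._seen:
            return False
        self._seen.add(seed_url)
        self._items.append((seed_url, strategy))
        return True

    def pop(self) -> Tuple[str, str]:
        """Remove and return the oldest pair."""
        return self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)
```

The set lookup keeps the duplicate check constant-time even for long queues.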

S25, inquiring whether idle sub-threads exist in a plurality of target sub-threads started by the target main thread every preset period when the same seed URL does not exist in the task queue.

In this embodiment, a preset period may be preset, and specifically, the preset period may be 1 minute or 30 seconds. The idle sub-thread means that the sub-thread is currently free of tasks.

Further, the method further comprises:

and when the same seed URL exists in the task queue, continuing to read each associated seed URL and sequentially sending the seed URL to the task queue.

S26, when idle sub-threads exist in a plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-threads.

In this embodiment, whether an idle sub-thread exists in a target sub-thread started by the target main thread is queried every preset period, and when the idle sub-thread exists in the target sub-thread, each seed URL read by the target main thread is distributed to the idle sub-thread, so that a phenomenon of uneven tasks in the target sub-thread is avoided, and the reading speed of each seed URL of each sub-thread is improved.

Further, the method further comprises:

when the idle sub-threads do not exist in the target sub-threads started by the target main thread, continuously inquiring whether the idle sub-threads exist in the target sub-threads started by the target main thread or not at preset intervals.
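The periodic query for idle sub-threads can be sketched as below. The names `SubThread` and `dispatch` are hypothetical, the preset period is passed in seconds, and recording failed URLs instead of stopping the sub-thread is an assumption of this illustration.

```python
import queue
import threading
import time
from collections import deque
from typing import Callable, Iterable, List


class SubThread(threading.Thread):
    """A sub-thread that captures one seed URL at a time."""

    def __init__(self, capture: Callable[[str], str]) -> None:
        super().__init__(daemon=True)
        self._capture = capture
        self._inbox: queue.Queue = queue.Queue()
        self._idle = threading.Event()
        self._idle.set()  # no task yet, so the sub-thread starts idle
        self.results: List[str] = []
        self.failed: List[str] = []

    def is_idle(self) -> bool:
        return self._idle.is_set()

    def assign(self, seed_url: str) -> None:
        self._idle.clear()  # mark busy before handing over the task
        self._inbox.put(seed_url)

    def stop(self) -> None:
        self._inbox.put(None)  # sentinel that ends run()

    def run(self) -> None:
        while True:
            seed_url = self._inbox.get()
            if seed_url is None:
                return
            try:
                self.results.append(self._capture(seed_url))
            except Exception:
                self.failed.append(seed_url)
            finally:
                self._idle.set()  # idle again, ready for the next URL


def dispatch(seed_urls: Iterable[str], sub_threads: List[SubThread], period_s: float) -> None:
    """Hand each URL to an idle sub-thread, re-querying every `period_s` seconds."""
    pending = deque(seed_urls)
    while pending:
        for sub_thread in sub_threads:
            if pending and sub_thread.is_idle():
                sub_thread.assign(pending.popleft())
        if pending:
            time.sleep(period_s)  # no idle sub-thread left: query again later
    while not all(t.is_idle() for t in sub_threads):
        time.sleep(period_s)  # wait for the last captures to finish
```

The sub-threads must be started before `dispatch` is called and stopped with `stop()` afterwards.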

S27, controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and grabbing target news information on each seed URL according to a corresponding grabbing strategy.

In an optional embodiment, the controlling the idle sub-thread to open each of the seed URLs read by the target main thread by using Puppeteer, and performing target news information crawling on each of the seed URLs according to a corresponding crawling policy, includes:

starting a headless browser by using Puppeteer to open each seed URL read by the target main thread together with the corresponding grabbing strategy;

jumping to a target page corresponding to each seed URL;

and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to each seed URL.

In this embodiment, because Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call, Puppeteer is used to start a headless browser to open each seed URL. Compared with a non-headless browser, this reduces the rendering work of a real browser and improves reading efficiency, and starting a plurality of target sub-threads that use Puppeteer to capture the target news information improves capture efficiency.

S28, after detecting that the idle sub-threads have finished grabbing the target news information, counting the grabbing results of the idle sub-threads through the target main thread to obtain the grabbing results of the target news information.

In this embodiment, after the idle sub-thread is detected to complete capturing of the target news information, a capturing result of each idle sub-thread is obtained through the target main thread.

In this embodiment, the target main thread acquires a grabbing result of an idle sub-thread, and stores the grabbing result in a cache of the server. And after all the idle sub-threads are grabbed, the target main thread counts the grabbing results of all the idle sub-threads, and counts according to all the grabbing results to obtain the target grabbing results.

Further, while the idle sub-threads are grabbing the target news information, the method further comprises the following steps:

detecting whether an abnormal event occurs in an idle sub-thread;

when an abnormal event is detected in an idle sub-thread, deleting the data captured by the idle sub-thread in which the abnormal event occurred.

In this embodiment, deleting the data captured by the idle sub-thread in which the abnormal event occurred ensures the accuracy of the captured target news information.
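One way to picture this clean-up step is sketched below. It assumes that an "abnormal event" surfaces as an exception raised inside the sub-thread and that each sub-thread writes into its own buffer; `grab_and_discard_failures` and `flaky_grab` are hypothetical names introduced only for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def grab_and_discard_failures(
    seed_urls: List[str],
    grab: Callable[[str, List[str]], None],
) -> Tuple[Dict[str, List[str]], List[str]]:
    # One private buffer per seed URL, filled by the sub-thread handling it.
    partial: Dict[str, List[str]] = {url: [] for url in seed_urls}
    failed: List[str] = []

    def worker(url: str) -> None:
        grab(url, partial[url])

    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {url: pool.submit(worker, url) for url in seed_urls}
    # Leaving the "with" block waits for every sub-thread to finish.
    for url, future in futures.items():
        if future.exception() is not None:
            # Abnormal event: delete whatever this sub-thread had grabbed.
            del partial[url]
            failed.append(url)
    return partial, failed


def flaky_grab(url: str, out: List[str]) -> None:
    out.append(f"headline from {url}")
    if "bad" in url:
        raise RuntimeError("page crashed mid-grab")
    out.append(f"body from {url}")


ok_url, bad_url = "https://news.example/ok", "https://news.example/bad"
kept, failed = grab_and_discard_failures([ok_url, bad_url], flaky_grab)
```

Discarding the whole buffer, rather than keeping the partial headline, mirrors the stated goal of keeping only data from sub-threads that completed normally.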

In summary, the news information capturing method of this embodiment has the following effects. First, the idle sub-thread is controlled to use Puppeteer to open each seed URL read by the target main thread and to capture target news information from each seed URL according to the corresponding capturing policy. Because a headless browser is started to open each seed URL, the rendering work of a real (non-headless) browser is reduced and reading is faster, and starting a plurality of target sub-threads that use Puppeteer for capturing improves the capturing efficiency of the target news information. Second, the target main thread is started to create a task queue, and each associated seed URL is read and sent to the task queue in sequence; capturing through a task queue prevents any seed URL from being captured repeatedly or missed, which improves the accuracy of capturing the target news information. Finally, whether idle sub-threads exist among the target sub-threads started by the target main thread is queried every preset period, and when idle sub-threads exist, each seed URL read by the target main thread is distributed to them; this avoids an uneven distribution of tasks among the target sub-threads and improves the speed at which each sub-thread reads its seed URLs.

Example III

In some embodiments, the news information-capturing device 30 may include a plurality of functional modules composed of program code segments. Program code for each program segment in the news information-capturing device 30 may be stored in a memory of the electronic device and executed by the at least one processor to perform the news information-capturing function (described in detail with reference to fig. 1).

In this embodiment, the news information-capturing device 30 may be divided into a plurality of functional modules according to the functions performed by the device. The functional modules may include: a parsing module 301, a generating module 302, a reading module 303, a starting module 304, a grabbing module 305, a statistics module 306 and an identifying module 307. A module referred to in the present invention is a series of computer program segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. In the present embodiment, the functions of the respective modules will be described in detail in the following embodiments.

The parsing module 301 is configured to parse the received crawling request of the target news information to obtain a plurality of seed URLs.

The generating module 302 is configured to create a crawling policy for each seed URL, and to generate a target news information crawling tree according to the plurality of seed URLs, where each crawling node of the target news information crawling tree includes a corresponding crawling policy.

In an alternative embodiment, the generating module 302 creating a crawling policy for each of the seed URLs includes:

acquiring grabbing requirements corresponding to each seed URL;

In an alternative embodiment, the generating module 302 generating the target news information crawling tree according to the plurality of seed URLs includes:

The reading module 303 is configured to start the main thread to read, one by one, the target seed URL and the corresponding crawling policy of each crawling node in the target news information crawling tree.
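A crawling tree whose nodes carry a seed URL together with its crawling policy, and which the main thread reads one node at a time, can be sketched as follows. The text here does not say how the tree is arranged, so grouping seed URLs under one intermediate node per host is purely an assumption of this sketch, as are the names `GrabNode`, `build_tree`, and `read_one_by_one` and the string-valued policies.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass
class GrabNode:
    label: str
    seed_url: Optional[str] = None   # None for grouping nodes
    policy: Optional[str] = None
    children: List["GrabNode"] = field(default_factory=list)


def build_tree(policies: Dict[str, str]) -> GrabNode:
    """Assumed layout: root -> one node per host -> one leaf per seed URL."""
    root = GrabNode("root")
    hosts: Dict[str, GrabNode] = {}
    for url, policy in policies.items():
        host = urlsplit(url).netloc
        if host not in hosts:
            hosts[host] = GrabNode(host)
            root.children.append(hosts[host])
        hosts[host].children.append(GrabNode(url, url, policy))
    return root


def read_one_by_one(node: GrabNode) -> Iterator[Tuple[str, str]]:
    """Pre-order walk: each crawling node is visited exactly once."""
    if node.seed_url is not None and node.policy is not None:
        yield node.seed_url, node.policy
    for child in node.children:
        yield from read_one_by_one(child)


policies = {
    "https://a.example/1": "full_text",
    "https://b.example/x": "headline_only",
    "https://a.example/2": "full_text",
}
reads = list(read_one_by_one(build_tree(policies)))
```

Because the walk visits every node once, no seed URL is read twice or skipped, which is the property the one-by-one reading is meant to provide.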

The starting module 304 is configured to start a plurality of sub-threads when detecting that the main thread has read a preset number of target seed URLs, and to distribute the preset number of target seed URLs read by the main thread among the plurality of sub-threads according to a preset allocation rule.

In other alternative embodiments, the starting module 304 is further configured to start one corresponding sub-thread when detecting that the main thread has read the preset number of target seed URLs, and to allocate the preset number of target seed URLs to that sub-thread.
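The division of a batch of target seed URLs among several sub-threads can be illustrated with a short sketch. The text leaves the "preset allocation rule" open; round-robin is assumed here only as one possible rule, and `split_round_robin`, `run_batch`, and the `grab` callable are hypothetical names.

```python
import threading
from typing import Callable, Dict, List


def split_round_robin(urls: List[str], n_threads: int) -> List[List[str]]:
    """Assumed allocation rule: URL i goes to sub-thread i mod n_threads."""
    return [urls[i::n_threads] for i in range(n_threads)]


def run_batch(urls: List[str], n_threads: int,
              grab: Callable[[str], str]) -> Dict[str, str]:
    results: Dict[str, str] = {}
    lock = threading.Lock()

    def worker(share: List[str]) -> None:
        for url in share:
            value = grab(url)
            with lock:  # protect the shared result dictionary
                results[url] = value

    threads = [threading.Thread(target=worker, args=(share,))
               for share in split_round_robin(urls, n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


urls = [f"https://news.example/{i}" for i in range(5)]
shares = split_round_robin(urls, 2)
results = run_batch(urls, 2, str.upper)
```

With `n_threads` set to 1, the same code covers the alternative in which a single sub-thread receives the whole batch.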

The grabbing module 305 is configured to control each sub-thread to use Puppeteer to open each target seed URL read by the main thread and to perform grabbing processing.

In an alternative embodiment, the grabbing module 305 controlling each of the sub-threads to use Puppeteer to open each target seed URL read by the main thread and to perform grabbing processing includes:

jumping to a target page corresponding to the target seed URL;

The statistics module 306 is configured to, after detecting that the plurality of sub-threads have completed the grabbing processing, count the grabbing results of the plurality of sub-threads through the main thread to obtain a target grabbing result of the target news information.

In this embodiment, each time the main thread obtains the grabbing result of a sub-thread, it stores that result in a cache of the server. After all sub-threads have finished grabbing, the main thread aggregates the grabbing results of all sub-threads to obtain the target grabbing result.

Further, while the plurality of sub-threads are grabbing the target news information, whether an abnormal event occurs in any sub-thread is detected; when an abnormal event is detected in a sub-thread, the data grabbed by that sub-thread is deleted.

Further, the identifying module 307 is configured to identify the target grabbing node corresponding to the sub-thread in which the abnormal event occurred, and to verify the target seed URL and the corresponding target grabbing strategy in the target grabbing node.

matching a target seed URL in the target grabbing node with the seed URLs;

Further, when the target seed URL in the target grabbing node is not matched with any one of the seed URLs, the target seed URL in the target grabbing node is revised, and the revised target seed URL is grabbed secondarily.

Further, when the target grabbing strategy is not the grabbing strategy of the target seed URL, the grabbing strategy in the target grabbing node is modified, and the target seed URL in the target grabbing node is secondarily grabbed according to the modified grabbing strategy.
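The verification of a failed grabbing node can be sketched as a small function that returns a corrected node for the secondary grab. The text does not state how a mismatched seed URL is revised; picking the most similar known seed URL with `difflib.get_close_matches` is an assumption of this sketch, and `GrabNode`, `verify_node`, and the string-valued strategies are hypothetical.

```python
from dataclasses import dataclass
from difflib import get_close_matches
from typing import Dict, Optional


@dataclass
class GrabNode:
    seed_url: str
    strategy: str


def verify_node(node: GrabNode, strategies: Dict[str, str]) -> Optional[GrabNode]:
    """Return a corrected node to grab a second time, or None if consistent."""
    url = node.seed_url
    if url not in strategies:
        # Assumption: revise the URL to the most similar known seed URL.
        close = get_close_matches(url, list(strategies), n=1)
        if not close:
            raise ValueError(f"no seed URL resembles {url!r}")
        url = close[0]
    expected = strategies[url]
    if url == node.seed_url and expected == node.strategy:
        return None
    # Either the URL or the strategy was wrong: grab again with the fix.
    return GrabNode(url, expected)


strategies = {
    "https://news.example/world": "full_text",
    "https://news.example/sports": "headline_only",
}
fixed_url = verify_node(GrabNode("https://news.example/wrold", "full_text"), strategies)
fixed_strategy = verify_node(GrabNode("https://news.example/sports", "full_text"), strategies)
unchanged = verify_node(GrabNode("https://news.example/world", "full_text"), strategies)
```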

In summary, the news information capturing device of this embodiment has the following effects. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target grabbing node read by the main thread and to perform grabbing processing. Compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that use Puppeteer for grabbing improves the grabbing efficiency of the target news information. Second, the main thread is started to read, one by one, the target seed URL and the corresponding grabbing strategy of each target grabbing node in the target news information grabbing tree, which avoids omitted or repeated reads and improves the accuracy with which the target seed URLs are read. Finally, a grabbing strategy is created for each seed URL and the target news information grabbing tree is generated from the plurality of seed URLs; grabbing from each node of the tree avoids repeated or missed grabbing of seed URLs and improves both the accuracy and the efficiency of grabbing the target news information.

Example IV

Fig. 4 is a block diagram of a news information-capturing device according to a fourth embodiment of the present invention.

In some embodiments, the news information-capturing device 40 may include a plurality of functional modules composed of program code segments. Program code for each of the program segments in the news information-capturing device 40 may be stored in a memory of the electronic device and executed by the at least one processor to perform the news information-capturing function (described in detail with reference to fig. 2).

In this embodiment, the news information-capturing device 40 may be divided into a plurality of functional modules according to the functions performed by the device. The functional modules may include: a parsing module 401, a creating module 402, a reading module 403, a judging module 404, a query module 405, a distributing module 406, a grabbing module 407 and a statistics module 408. A module referred to in the present invention is a series of computer program segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. In the present embodiment, the functions of the respective modules will be described in detail in the following embodiments.

The parsing module 401 is configured to parse the received crawling request of the target news information to obtain a plurality of seed URLs.

The creating module 402 is configured to create a crawling policy for each of the seed URLs, and to associate each of the seed URLs with its corresponding crawling policy.

In an alternative embodiment, the creating module 402 creating a crawling policy for each of the seed URLs includes:

acquiring grabbing requirements corresponding to each seed URL;

The reading module 403 is configured to start the target main thread to create a task queue, to read each associated seed URL, and to send each seed URL to the task queue in sequence.

The judging module 404 is configured to judge whether the same seed URL exists in the task queue.

Further, when the same seed URL exists in the task queue, continuing to read each associated seed URL and sequentially sending the seed URL to the task queue.
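The task queue with its duplicate check can be sketched as follows. The sketch assumes that a seed URL already present in the queue is simply skipped before reading continues; `fill_task_queue` and the `(seed URL, policy)` tuple layout are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Set, Tuple


def fill_task_queue(associated: Iterable[Tuple[str, str]]) -> Deque[Tuple[str, str]]:
    """Send (seed URL, policy) pairs to the queue in order, without repeats."""
    task_queue: Deque[Tuple[str, str]] = deque()
    seen: Set[str] = set()
    for url, policy in associated:
        if url in seen:
            # Same seed URL already queued: keep reading the next one.
            continue
        seen.add(url)
        task_queue.append((url, policy))
    return task_queue


pairs = [
    ("https://news.example/a", "full_text"),
    ("https://news.example/b", "headline_only"),
    ("https://news.example/a", "full_text"),  # duplicate read
]
task_queue = fill_task_queue(pairs)
```

Tracking seen URLs in a set keeps the duplicate check at constant time per URL instead of rescanning the queue on every read.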

The query module 405 is configured to query, every preset period, whether idle sub-threads exist among the plurality of target sub-threads started by the target main thread when the same seed URL does not exist in the task queue.

Further, when no idle sub-thread exists among the target sub-threads started by the target main thread, the query continues to be made every preset period.

The distributing module 406 is configured to, when idle sub-threads exist among the plurality of target sub-threads started by the target main thread, distribute each seed URL read by the target main thread to the idle sub-threads.
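The periodic idle query and the distribution of seed URLs to idle sub-threads can be sketched as a polling loop. This is only one possible arrangement: the `SubThread` class, its `idle` flag, the per-thread `inbox`, and the 10 ms default period are assumptions introduced for the example.

```python
import queue
import threading
import time
from typing import Callable, Dict, List


class SubThread(threading.Thread):
    """Hypothetical sub-thread that grabs one seed URL at a time."""

    def __init__(self, name: str, grab: Callable[[str], int],
                 results: Dict[str, int]) -> None:
        super().__init__(name=name, daemon=True)
        self.inbox = queue.Queue()
        self.idle = threading.Event()
        self.idle.set()
        self._grab = grab
        self._results = results

    def run(self) -> None:
        while True:
            url = self.inbox.get()
            if url is None:  # shutdown signal
                break
            self._results[url] = self._grab(url)
            self.idle.set()  # report idle again after finishing


def dispatch(task_urls: List[str], workers: List[SubThread],
             period: float = 0.01) -> None:
    pending = list(task_urls)
    while pending:
        idle = [w for w in workers if w.idle.is_set()]
        if not idle:
            time.sleep(period)  # nobody idle: query again next period
            continue
        for worker in idle:
            if not pending:
                break
            worker.idle.clear()
            worker.inbox.put(pending.pop(0))


results: Dict[str, int] = {}
workers = [SubThread(f"sub-{i}", len, results) for i in range(2)]
for w in workers:
    w.start()
urls = [f"https://news.example/{i}" for i in range(5)]
dispatch(urls, workers)
for w in workers:
    w.inbox.put(None)
for w in workers:
    w.join()
```

Handing work only to sub-threads that report idle keeps any one sub-thread from accumulating a backlog while others wait, which is the load-balancing effect the modules aim for.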

The grabbing module 407 is configured to control the idle sub-thread to use Puppeteer to open each seed URL read by the target main thread, and to grab target news information from each seed URL according to the corresponding grabbing policy.

In an alternative embodiment, the grabbing module 407 controlling the idle sub-thread to use Puppeteer to open each of the seed URLs read by the target main thread, and grabbing target news information from each of the seed URLs according to the corresponding grabbing policy, includes:

jumping to a target page corresponding to each seed URL;

The statistics module 408 is configured to, after detecting that the idle sub-threads have finished grabbing the target news information, count the grabbing results of the idle sub-threads through the target main thread to obtain the grabbing result of the target news information.

Further, while the idle sub-threads are grabbing the target news information, whether an abnormal event occurs in an idle sub-thread is detected; when an abnormal event is detected in an idle sub-thread, the data grabbed by that idle sub-thread is deleted.

In summary, the news information capturing device of this embodiment has the following effects. First, the idle sub-thread is controlled to use Puppeteer to open each seed URL read by the target main thread and to capture target news information from each seed URL according to the corresponding capturing policy. Because a headless browser is started to open each seed URL, the rendering work of a real (non-headless) browser is reduced and reading is faster, and starting a plurality of target sub-threads that use Puppeteer for capturing improves the capturing efficiency of the target news information. Second, the target main thread is started to create a task queue, and each associated seed URL is read and sent to the task queue in sequence; capturing through a task queue prevents any seed URL from being captured repeatedly or missed, which improves the accuracy of capturing the target news information. Finally, whether idle sub-threads exist among the target sub-threads started by the target main thread is queried every preset period, and when idle sub-threads exist, each seed URL read by the target main thread is distributed to them; this avoids an uneven distribution of tasks among the target sub-threads and improves the speed at which each sub-thread reads its seed URLs.

Example V

Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. In a preferred embodiment of the invention, the electronic device 5 comprises a memory 51, at least one processor 52, at least one communication bus 53 and a transceiver 54.

It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 5 does not limit the embodiments of the present invention; either a bus-type configuration or a star-type configuration may be used, and the electronic device 5 may include more or fewer hardware or software components than shown, or a different arrangement of components.

In some embodiments, the electronic device 5 is an electronic device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 5 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client by way of a keyboard, mouse, remote control, touch pad, or voice control device, such as a personal computer, tablet, smart phone, digital camera, etc.

It should be noted that the electronic device 5 is only an example; other existing or future electronic products that are adaptable to the present invention are also included within the scope of protection of the present invention and are incorporated herein by reference.

In some embodiments, the memory 51 is used to store program codes and various data, such as the news information-capturing device 30 or 40 installed in the electronic device 5, and to implement high-speed, automatic access to programs or data during operation of the electronic device 5. The memory 51 includes Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.

In some embodiments, the at least one processor 52 may consist of an integrated circuit, for example a single packaged integrated circuit, or of multiple packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The at least one processor 52 is the control unit of the electronic device 5; it connects the components of the entire electronic device 5 using various interfaces and lines, and executes the various functions of the electronic device 5 and processes data by running or executing the programs or modules stored in the memory 51 and calling the data stored in the memory 51.

In some embodiments, the at least one communication bus 53 is arranged to enable connected communication between the memory 51 and the at least one processor 52 or the like.

Although not shown, the electronic device 5 may further include a power source (such as a battery) for powering the various components. Optionally, the power source may be logically connected to the at least one processor 52 via a power management device, so that functions such as charging, discharging, and power consumption management are performed via the power management device. The power source may also include one or more direct current or alternating current power supplies, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and the like. The electronic device 5 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described herein.

It should be understood that the described embodiments are for illustration only, and the scope of the patent application is not limited to this configuration.

The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor (processor) to perform portions of the methods described in the various embodiments of the invention.

In a further embodiment, in conjunction with fig. 3 or fig. 4, the at least one processor 52 may execute the operating system of the electronic device 5, as well as various installed applications (such as the news information-capturing device 30 or 40), program code, etc., such as the various modules described above.

The memory 51 has program code stored therein, and the at least one processor 52 can invoke the program code stored in the memory 51 to perform related functions. For example, each of the modules described in fig. 3 or fig. 4 is program code stored in the memory 51 and executed by the at least one processor 52, thereby realizing the functions of the modules for the purpose of capturing news information.

In one embodiment of the present invention, the memory 51 stores a plurality of instructions that are executed by the at least one processor 52 to perform the function of news information crawling.

Specifically, the specific implementation method of the above instruction by the at least one processor 52 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1 or fig. 2, which is not repeated herein.

In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.

The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it will be obvious that the term "comprising" does not exclude other elements and that the singular does not exclude a plurality. A plurality of units or means stated in the invention may also be implemented by one unit or means, either in software or in hardware. Terms such as first and second are used to denote names and do not denote any particular order.

Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A news information capturing method, the method comprising:

2. The news information-crawling method of claim 1, wherein said creating a crawling policy for each of said seed URLs comprises:

acquiring grabbing requirements corresponding to each seed URL;

3. The news information-crawling method of claim 1, wherein the generating a target news information-crawling tree from the plurality of seed URLs comprises:

4. The news information-capturing method of claim 1, wherein controlling each of the sub-threads to open each target seed URL read by the main thread using Puppeteer, and performing capturing processing includes:

jumping to a target page corresponding to the target seed URL;

5. The news information-capturing method of claim 1, wherein the method further comprises:

detecting whether an abnormal event occurs to a child thread;

6. The news information-capturing method of claim 5, wherein the verifying the target seed URL and the corresponding target capturing policy in the target capturing node comprises:

matching a target seed URL in the target grabbing node with the seed URLs;

7. A news information capturing method, the method comprising:

judging whether the same seed URL exists in the task queue;

8. A news information-capturing apparatus, the apparatus comprising:

9. An electronic device comprising a processor and a memory, wherein the processor is configured to implement the news information-capturing method according to any one of claims 1 to 6 or claim 7 when executing a computer program stored in the memory.

10. A computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the news information-capturing method according to any one of claims 1 to 6 or claim 7.