CN113065055A - News information capturing method and device, electronic equipment and storage medium


Info

Publication number
CN113065055A
Authority
CN
China
Prior art keywords
target
grabbing
seed
news information
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110432611.3A
Other languages
Chinese (zh)
Other versions
CN113065055B (en)
Inventor
郑德生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Saiante Technology Service Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202110432611.3A
Publication of CN113065055A
Application granted
Publication of CN113065055B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566: URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of big data and provides a news information capturing method and device, electronic equipment and a storage medium. The method comprises: acquiring a plurality of seed URLs to generate a target news information capture tree; starting a main thread to read the target seed URL and the corresponding capture strategy of each capture node in the target news information capture tree; when a preset number of target seed URLs have been read, starting a plurality of sub-threads and distributing the preset number of target seed URLs to the plurality of sub-threads; controlling each sub-thread to open each target seed URL by using Puppeteer to perform grabbing processing; and counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information. Because Puppeteer starts a headless browser to open each target seed URL and a plurality of sub-threads perform the capture, the rendering work of a real browser is reduced and the capture efficiency of the target news information is improved.

Description

News information capturing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to a news information capturing method and device, electronic equipment and a storage medium.
Background
Traditional news information grabbing uses a crawler program to issue an HTTP request for a website URL and parse the returned result. However, most current news information webpages obtain their content through Ajax and render it with JavaScript, so a traditional crawler grabs no valid data or only part of it. Some other programs grab news information content by opening a browser and locating DOM elements by position.
However, because these programs must run on an operating system with a graphical interface, they cannot run on a Linux server, which results in low efficiency and accuracy when capturing news information.
Therefore, it is necessary to provide a fast and accurate method for capturing news information.
Disclosure of Invention
In view of the above, it is necessary to provide a news information capturing method, apparatus, electronic device and storage medium, where Puppeteer is used to start a headless browser to open each target seed URL, and a plurality of sub-threads are started to perform capturing processing, so that rendering work of a real browser is reduced, and capturing efficiency of target news information is improved.
A first aspect of the present invention provides a news information capturing method, including:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
starting a main thread to read the target seed URL of each grabbing node in the target news information grabbing tree and the corresponding grabbing strategy one by one;
when detecting that the main thread reads a preset number of target seed URLs, starting a plurality of sub-threads, and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;
controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing;
and when the fact that the plurality of sub-threads complete grabbing processing is detected, counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information.
Optionally, the creating a grabbing strategy for each seed URL includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring a grabbing requirement corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
Optionally, the generating a target news information grabbing tree according to the plurality of seed URLs includes:
converting the capture node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information capture tree;
converting a reference relationship between the grabbing nodes of each seed URL in the plurality of seed URLs into edges between the nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree are used as the reference relationship between the nodes of the target news information grabbing tree;
and generating the target news information grabbing tree according to the nodes of the target news information grabbing tree and the edges between the nodes in the target news information grabbing tree.
Optionally, the controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing the grabbing processing includes:
starting a headless browser by using Puppeteer to open each target seed URL read by the main thread together with its corresponding grabbing strategy;
navigating to a target page corresponding to the target seed URL;
and calling Puppeteer to perform grabbing processing on the target page according to the grabbing strategy corresponding to the target seed URL.
Optionally, the method further comprises:
detecting whether a sub thread generates an abnormal event or not;
when detecting that a sub-thread generates an abnormal event, identifying a target capture node corresponding to the sub-thread generating the abnormal event;
and checking the target seed URL in the target grabbing node and the corresponding target grabbing strategy.
Optionally, the checking the target seed URL in the target grabbing node and the corresponding target grabbing strategy includes:
matching a target seed URL in the target grabbing node with the plurality of seed URLs;
when a target seed URL in the target grabbing node is matched with any one of the seed URLs, judging whether a target grabbing strategy is a grabbing strategy of the target seed URL;
when the target grabbing strategy is the grabbing strategy of the target seed URL, sending a grabbing suggestion to a client; or
when the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and grabbing the target seed URL in the target grabbing node a second time according to the corrected grabbing strategy.
A second aspect of the present invention provides a news information capturing method, including:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and associating each seed URL with a corresponding grabbing strategy;
starting a target main thread to create a task queue, reading each associated seed URL and sequentially sending the seed URL to the task queue;
judging whether the same seed URL exists in the task queue;
when the same seed URL does not exist in the task queue, querying at every preset period whether an idle sub-thread exists among a plurality of target sub-threads started by the target main thread;
when an idle sub thread exists in a plurality of target sub threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub thread;
controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and capturing target news information of each seed URL according to a corresponding capturing strategy;
and when the idle sub-thread finishes capturing the target news information, counting the capturing result of the idle sub-thread through the target main thread to obtain the capturing result of the target news information.
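The queueing steps of this second aspect can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Task` and `SubThread` shapes and the `enqueue` and `dispatchOnce` functions are hypothetical names, sub-threads are modelled as plain objects rather than real threads, and duplicates are checked only against tasks still waiting in the queue.

```typescript
// Hypothetical shapes; none of these names come from the patent text.
interface Task { seedUrl: string; strategy: string }
interface SubThread { id: number; idle: boolean; current?: Task }

// Add a task only if the same seed URL is not already waiting in the queue.
function enqueue(queue: Task[], task: Task): boolean {
  const duplicate = queue.some((t) => t.seedUrl === task.seedUrl);
  if (!duplicate) queue.push(task);
  return !duplicate;
}

// One polling tick: hand queued tasks to whichever sub-threads are idle.
// A timer such as setInterval could call this once per preset period.
function dispatchOnce(queue: Task[], subThreads: SubThread[]): number {
  let assigned = 0;
  for (const worker of subThreads) {
    if (queue.length === 0) break;
    if (!worker.idle) continue;
    worker.current = queue.shift();
    worker.idle = false;
    assigned += 1;
  }
  return assigned;
}

const taskQueue: Task[] = [];
enqueue(taskQueue, { seedUrl: "https://example.com/a", strategy: "text" });
enqueue(taskQueue, { seedUrl: "https://example.com/a", strategy: "text" }); // rejected as a duplicate
enqueue(taskQueue, { seedUrl: "https://example.com/b", strategy: "picture" });

const subThreads: SubThread[] = [
  { id: 1, idle: false },
  { id: 2, idle: true },
];
const assignedCount = dispatchOnce(taskQueue, subThreads);
console.log(assignedCount, taskQueue.length);
```

Keeping the duplicate check inside `enqueue` means a repeated seed URL never reaches a sub-thread while an identical task is still pending.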
A third aspect of the present invention provides a news information capturing apparatus, comprising:
the analysis module is used for analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
the generating module is used for creating a grabbing strategy for each seed URL and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
the reading module is used for starting a main thread to read the target seed URL of each grabbing node in the target news information grabbing tree and the corresponding grabbing strategy one by one;
the starting module is used for starting a plurality of sub-threads when the main thread reads a preset number of target seed URLs and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;
the grabbing module is used for controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer and carrying out grabbing processing;
and the counting module is used for counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information after the fact that the grabbing processing of the plurality of sub-threads is detected to be completed.
A fourth aspect of the present invention provides an electronic device, which includes a processor and a memory, wherein the processor is configured to implement the news information capturing method when executing a computer program stored in the memory.
A fifth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the news information-capturing method.
In summary, according to the news information capturing method and device, electronic equipment and storage medium of the present invention: first, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target capture node read by the main thread and to perform capture processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that use Puppeteer improves the capture efficiency of the target news information. Second, the main thread is started to read the target seed URL and the corresponding capture strategy of each target capture node in the target news information capture tree one by one, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs. Finally, creating a grabbing strategy for each seed URL improves the grabbing accuracy and efficiency of the target news information, and generating the target news information grabbing tree from the plurality of seed URLs and grabbing each of its nodes avoids repeated or missed grabbing of seed URLs, which improves the grabbing accuracy of the target news information.
Drawings
Fig. 1 is a flowchart of a news information capturing method according to an embodiment of the present invention.
Fig. 2 is a flowchart of a news information capturing method according to a second embodiment of the present invention.
Fig. 3 is a structural diagram of a news information capturing apparatus according to a third embodiment of the present invention.
Fig. 4 is a structural diagram of a news information capturing apparatus according to a fourth embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
In order that the above objects, features and advantages of the present invention can be more clearly understood, a detailed description of the present invention will be given below with reference to the accompanying drawings and specific embodiments. It should be noted that the embodiments of the present invention and features of the embodiments may be combined with each other without conflict.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example one
Fig. 1 is a flowchart of a news information capturing method according to an embodiment of the present invention.
In this embodiment, the method for capturing news information may be applied to an electronic device, and for an electronic device that needs to capture news information, the method of the present invention may be directly integrated with the electronic device to provide the function of capturing news information, or may be operated in the electronic device in the form of a Software Development Kit (SDK).
As shown in fig. 1, the news information capturing method specifically includes the following steps, and the order of the steps in the flowchart may be changed and some steps may be omitted according to different requirements.
S11, analyzing the received target news information capture request to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be captured, a capture request is initiated to the server through the client. Specifically, the client may be a mobile phone, an iPad, or another existing device with a sending function, and the server may be a capture subsystem. During the capture process, for example, the client sends the capture request to the capture subsystem, and when the server receives the capture request sent by the client, it analyzes the request.
In this embodiment, the capture request includes information such as the capture requirement corresponding to the target news information to be captured, the seed URLs, and the page content and page structure of each seed URL.
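As a rough illustration of step S11, the sketch below parses a capture request and extracts its seed URLs. The JSON layout, the `GrabRequest` and `SeedEntry` types, and the `parseGrabRequest` function are assumptions made for this example; the text does not specify a request format, and only a subset of the listed fields is modelled.

```typescript
// Hypothetical request shape, loosely following the fields listed above.
interface SeedEntry {
  seedUrl: string;
  requirement: string; // grabbing requirement, e.g. "text" or "picture"
}
interface GrabRequest { target: string; seeds: SeedEntry[] }

// Parse a raw JSON request and keep only entries whose seed URL is a valid http(s) URL.
function parseGrabRequest(raw: string): SeedEntry[] {
  const request = JSON.parse(raw) as GrabRequest;
  return request.seeds.filter((entry) => {
    try {
      const protocol = new URL(entry.seedUrl).protocol;
      return protocol === "http:" || protocol === "https:";
    } catch {
      return false; // new URL() throws on malformed input
    }
  });
}

const rawRequest = JSON.stringify({
  target: "city news",
  seeds: [
    { seedUrl: "https://news.example.com/", requirement: "text" },
    { seedUrl: "not a url", requirement: "picture" },
  ],
});
const seedEntries = parseGrabRequest(rawRequest);
console.log(seedEntries.map((entry) => entry.seedUrl));
```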
S12, creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy.
In this embodiment, since the page content, the page structure, and the capturing requirements corresponding to each seed URL are different, different capturing strategies are created for each seed URL.
In an optional embodiment, the creating a grabbing strategy for each seed URL includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring a grabbing requirement corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes items such as page views, unique visits, and page dwell time, and the page structure refers to the hierarchical relationship among the pages.
Illustratively, the grabbing requirement corresponding to the first seed URL is text, and the grabbing strategy set by analyzing the text content and text structure in the first seed URL is: starting from the start page of the first seed URL, randomly select a URL to enter and capture the target text content layer by layer until capturing of the first seed URL is completed. The grabbing requirement corresponding to the second seed URL is pictures, and the grabbing strategy set by analyzing the picture content and picture structure in the second seed URL is: predict the similarity between each URL under the second seed URL and the second seed URL itself, and select several URLs with higher similarity for grabbing.
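One way to picture the two example strategies is as a small function from an analysis result and a grabbing requirement to a strategy record. Everything below is a hedged sketch: `PageAnalysis`, `GrabStrategy`, `createStrategy`, and the cap of 10 similar links are hypothetical and are not taken from the patent.

```typescript
// Hypothetical types; the patent does not define a concrete strategy format.
type Requirement = "text" | "picture";
interface PageAnalysis { depth: number; linkCount: number }
type GrabStrategy =
  | { kind: "layerByLayer"; maxDepth: number }
  | { kind: "bySimilarity"; topK: number };

function createStrategy(analysis: PageAnalysis, requirement: Requirement): GrabStrategy {
  if (requirement === "text") {
    // Text: walk the site layer by layer down to the analysed depth.
    return { kind: "layerByLayer", maxDepth: analysis.depth };
  }
  // Pictures: keep only the links judged most similar to the seed URL.
  // Capping at 10 is an arbitrary illustrative choice.
  return { kind: "bySimilarity", topK: Math.min(10, analysis.linkCount) };
}

const textStrategy = createStrategy({ depth: 3, linkCount: 40 }, "text");
const pictureStrategy = createStrategy({ depth: 2, linkCount: 6 }, "picture");
console.log(textStrategy, pictureStrategy);
```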
In an optional embodiment, the generating the target news information grabbing tree according to the plurality of seed URLs includes:
converting the capture node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information capture tree;
converting a reference relationship between the grabbing nodes of each seed URL in the plurality of seed URLs into edges between the nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree are used as the reference relationship between the nodes of the target news information grabbing tree;
and generating the target news information grabbing tree according to the nodes of the target news information grabbing tree and the edges between the nodes in the target news information grabbing tree.
In this embodiment, the target news information grabbing tree is generated according to the grabbing nodes corresponding to the plurality of seed URLs and the reference relationships between those grabbing nodes. Specifically, the reference relationships between the nodes of the target news information grabbing tree may be preset; for example, they may be preset according to the degree of association between each seed URL and the target news information, or the reference relationships among the plurality of seed URLs may be preset.
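The node-and-edge construction described above might look like the following sketch. The `GrabNode` and `Reference` shapes and the `buildGrabTree` function are hypothetical; the sketch also assumes that exactly one seed URL is never referenced by another and therefore acts as the root, and it does not check for cycles.

```typescript
// Hypothetical node and edge shapes for illustration.
interface GrabNode { seedUrl: string; strategy: string; children: GrabNode[] }
interface Reference { from: string; to: string } // "from" references "to"

function buildGrabTree(
  seeds: { seedUrl: string; strategy: string }[],
  references: Reference[],
): GrabNode {
  // Each seed URL becomes one node that carries its own grabbing strategy.
  const nodes = new Map<string, GrabNode>();
  for (const seed of seeds) {
    nodes.set(seed.seedUrl, { ...seed, children: [] });
  }
  // Each reference relationship becomes one parent-to-child edge.
  const hasParent = new Set<string>();
  for (const { from, to } of references) {
    const parent = nodes.get(from);
    const child = nodes.get(to);
    if (!parent || !child || hasParent.has(to)) continue; // skip unknown URLs and second parents
    parent.children.push(child);
    hasParent.add(to);
  }
  // Assumption: exactly one node is never referenced and acts as the root.
  const roots = [...nodes.values()].filter((n) => !hasParent.has(n.seedUrl));
  if (roots.length !== 1) throw new Error(`expected 1 root, found ${roots.length}`);
  return roots[0];
}

const grabTree = buildGrabTree(
  [
    { seedUrl: "https://news.example.com/", strategy: "text" },
    { seedUrl: "https://news.example.com/local", strategy: "text" },
    { seedUrl: "https://news.example.com/photos", strategy: "picture" },
  ],
  [
    { from: "https://news.example.com/", to: "https://news.example.com/local" },
    { from: "https://news.example.com/", to: "https://news.example.com/photos" },
  ],
);
console.log(grabTree.children.map((child) => child.seedUrl));
```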
In this embodiment, creating a grabbing strategy for each seed URL improves the grabbing accuracy and efficiency of the target news information. Generating the target news information grabbing tree according to the seed URLs and grabbing each node of the tree avoids repeated or missed grabbing of seed URLs, which improves the grabbing accuracy of the target news information.
S13, the main thread is started to read the target seed URL and the corresponding capture strategy of each capture node in the target news information capture tree one by one.
In this embodiment, when the server receives the capture request, the main thread is started to read the target seed URL of each capture node in the target news information capture tree and the corresponding capture strategy one by one, so that omitted or repeated reads are avoided while the target seed URLs are being read, and the reading accuracy of the target seed URLs is improved.
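A minimal sketch of reading the nodes one by one is given below. The breadth-first order, the `GrabNode` shape, and the `readOneByOne` function are assumptions for illustration; the text only requires that each node is read once, without omission or repetition.

```typescript
// Hypothetical node shape for illustration.
interface GrabNode { seedUrl: string; strategy: string; children: GrabNode[] }

// Breadth-first walk that returns every node exactly once, in a fixed order.
function readOneByOne(rootNode: GrabNode): { seedUrl: string; strategy: string }[] {
  const visited = new Set<string>();
  const pending: GrabNode[] = [rootNode];
  const read: { seedUrl: string; strategy: string }[] = [];
  while (pending.length > 0) {
    const node = pending.shift()!;
    if (visited.has(node.seedUrl)) continue; // guards against repeated reads
    visited.add(node.seedUrl);
    read.push({ seedUrl: node.seedUrl, strategy: node.strategy });
    pending.push(...node.children);
  }
  return read;
}

// The same leaf is attached twice to show that it is still read only once.
const sharedLeaf: GrabNode = { seedUrl: "c", strategy: "picture", children: [] };
const sampleTree: GrabNode = {
  seedUrl: "a",
  strategy: "text",
  children: [{ seedUrl: "b", strategy: "text", children: [sharedLeaf] }, sharedLeaf],
};
const readOrder = readOneByOne(sampleTree).map((node) => node.seedUrl);
console.log(readOrder);
```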
S14, when detecting that the main thread reads a preset number of target seed URLs, starting a plurality of sub-threads, and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule.
In this embodiment, an allocation rule may be preset, and specifically, the preset allocation rule may be equal division, random allocation, or allocation according to a certain multiple.
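Two of the allocation rules named above, equal division and random allocation, can be sketched as follows. The `distribute` function and `DistributionRule` type are hypothetical, equal division is implemented here as round-robin assignment, and allocation by a multiple is left out because the text does not define it.

```typescript
type DistributionRule = "equal" | "random";

// Split the URLs read by the main thread into one bucket per sub-thread.
function distribute(urls: string[], subThreadCount: number, rule: DistributionRule): string[][] {
  if (!Number.isInteger(subThreadCount) || subThreadCount < 1) {
    throw new RangeError("subThreadCount must be a positive integer");
  }
  const buckets: string[][] = Array.from({ length: subThreadCount }, (): string[] => []);
  urls.forEach((url, index) => {
    const target =
      rule === "equal"
        ? index % subThreadCount // round-robin keeps bucket sizes within one of each other
        : Math.floor(Math.random() * subThreadCount); // uniformly random bucket
    buckets[target].push(url);
  });
  return buckets;
}

const sampleUrls = ["u1", "u2", "u3", "u4", "u5"];
const equalBuckets = distribute(sampleUrls, 2, "equal");
const randomBuckets = distribute(sampleUrls, 2, "random");
console.log(equalBuckets, randomBuckets);
```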
In this embodiment, by simultaneously starting a plurality of sub-threads, the waiting time for the main thread to read data and start the sub-threads can be saved, and the processing efficiency of the server is further improved.
In other alternative embodiments, S14 may also be: each time it is detected that the main thread has read a preset number of target seed URLs, correspondingly starting one sub-thread and distributing the preset number of target seed URLs to that sub-thread.
In this embodiment, the preset number is a preset threshold for starting a sub-thread.
For example, assuming that the preset number is 100,000, when the main thread has read the target seed URLs from the 1st capture node through the 100,000th capture node, the server correspondingly starts a sub-thread; then, when the main thread has read the target seed URLs from the 100,001st capture node through the 200,000th capture node, the server correspondingly starts another sub-thread. That is, the server starts a sub-thread each time it detects that the main thread has read the target seed URLs of the preset number of capture nodes. Starting a plurality of sub-threads and having each process its corresponding share of target seed URLs can speed up the capture of news information from the target seed URLs to a certain extent.
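The threshold behaviour in this alternative can be sketched with a small simulation. `readAndStart` and its `startSubThread` callback are hypothetical names, no real thread is created, and giving a final partial batch its own sub-thread is an assumption, since the text does not say what happens to a remainder.

```typescript
// Simulates the main thread reading URLs one by one and invoking startSubThread
// (a hypothetical callback) each time presetNumber URLs have accumulated.
function readAndStart(
  urls: string[],
  presetNumber: number,
  startSubThread: (batch: string[]) => void,
): number {
  if (!Number.isInteger(presetNumber) || presetNumber < 1) {
    throw new RangeError("presetNumber must be a positive integer");
  }
  let batch: string[] = [];
  let started = 0;
  for (const url of urls) {
    batch.push(url);
    if (batch.length === presetNumber) {
      startSubThread(batch); // threshold reached: hand the batch to a new sub-thread
      started += 1;
      batch = [];
    }
  }
  // Assumption: a final partial batch also gets its own sub-thread.
  if (batch.length > 0) {
    startSubThread(batch);
    started += 1;
  }
  return started;
}

const startedBatches: string[][] = [];
const startedCount = readAndStart(
  ["u1", "u2", "u3", "u4", "u5", "u6", "u7"],
  3,
  (batch) => { startedBatches.push(batch); },
);
console.log(startedCount, startedBatches.map((batch) => batch.length));
```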
S15, controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. Specifically, Puppeteer runs in headless mode by default, but it can be configured to run in non-headless mode.
In an optional embodiment, the controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing the grabbing processing includes:
starting a headless browser by using Puppeteer to open each target seed URL read by the main thread together with its corresponding grabbing strategy;
navigating to a target page corresponding to the target seed URL;
and calling Puppeteer to perform grabbing processing on the target page according to the grabbing strategy corresponding to the target seed URL.
In this embodiment, because Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call, Puppeteer is used to start a headless browser to open the target seed URL of each grabbing node. Compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and performing the grabbing of target news information with Puppeteer in a plurality of sub-threads improves the grabbing efficiency of the target news information.
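The open-navigate-grab sequence could be sketched as below. To keep the example free of external dependencies, it is written against two hypothetical interfaces, `PageLike` and `BrowserLike`, that mirror a small subset of Puppeteer's `Page` and `Browser` methods (`newPage`, `goto`, `title`, `content`, `close`). The `grabPage` function and `GrabResult` shape are assumptions, and collecting the title and HTML length stands in for whatever a real grabbing strategy would extract.

```typescript
// Minimal structural subset of Puppeteer's Page and Browser that this sketch relies on.
interface PageLike {
  goto(
    url: string,
    options?: { waitUntil?: "load" | "domcontentloaded" | "networkidle0" | "networkidle2" },
  ): Promise<unknown>;
  title(): Promise<string>;
  content(): Promise<string>;
  close(): Promise<void>;
}
interface BrowserLike { newPage(): Promise<PageLike> }

interface GrabResult { seedUrl: string; title: string; htmlLength: number }

// Open one target seed URL in a fresh page and collect the rendered result.
async function grabPage(browser: BrowserLike, seedUrl: string): Promise<GrabResult> {
  const page = await browser.newPage();
  try {
    // Wait until network activity has mostly settled so Ajax-loaded content is in the DOM.
    await page.goto(seedUrl, { waitUntil: "networkidle2" });
    const html = await page.content();
    return { seedUrl, title: await page.title(), htmlLength: html.length };
  } finally {
    await page.close(); // always release the page, even if navigation fails
  }
}

// With Puppeteer installed, the browser would typically come from:
//   import puppeteer from "puppeteer";
//   const browser = await puppeteer.launch({ headless: true });
//   const result = await grabPage(browser, "https://news.example.com/");
//   await browser.close();
```

Passing the browser in as a parameter lets several grabbing tasks share one headless browser instance, and it also allows the function to be exercised with a stub when no browser is available.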
S16, when it is detected that the plurality of sub-threads have completed the grabbing processing, counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information.
In this embodiment, when it is detected that the plurality of sub-threads complete the capturing of the target news information, the capturing result of each sub-thread is obtained through the main thread.
In this embodiment, the grabbing result may include, but is not limited to: successfully captured data, abnormally captured data, captured identified data, captured identical picture data, and captured identical text data.
In this embodiment, each time the main thread acquires the grabbing result of one sub-thread, it stores that result in the cache of the server. After all sub-threads have finished grabbing, the main thread compiles statistics over all the cached grabbing results to obtain the target grabbing result.
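The cache-then-count step might be sketched as follows. The `SubThreadResult` and `TargetResult` records and the `summarise` function are hypothetical, and the server cache is modelled as a plain array; only success and abnormality counts are tracked here, not every result category listed above.

```typescript
// Hypothetical per-sub-thread result record.
interface SubThreadResult {
  subThreadId: number;
  succeeded: string[]; // seed URLs grabbed successfully
  abnormal: string[]; // seed URLs whose grab raised an abnormal event
}
interface TargetResult { successCount: number; abnormalCount: number; succeededUrls: string[] }

// The main thread caches each result as it arrives, then summarises once all are in.
function summarise(cache: SubThreadResult[]): TargetResult {
  const succeededUrls = cache.flatMap((entry) => entry.succeeded);
  const abnormalCount = cache.reduce((sum, entry) => sum + entry.abnormal.length, 0);
  return { successCount: succeededUrls.length, abnormalCount, succeededUrls };
}

const resultCache: SubThreadResult[] = [
  { subThreadId: 1, succeeded: ["u1", "u2"], abnormal: [] },
  { subThreadId: 2, succeeded: ["u3"], abnormal: ["u4"] },
];
const targetResult = summarise(resultCache);
console.log(targetResult.successCount, targetResult.abnormalCount);
```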
Further, the method further comprises:
detecting whether a sub thread generates an abnormal event or not;
and when detecting that the sub-thread generates the abnormal event, deleting the data after the capturing processing of the sub-thread generating the abnormal event.
In the embodiment, the accuracy of the captured target news information is ensured by deleting the data after the capturing process of the sub-thread with the abnormal event.
Further, the method further comprises:
identifying a target capture node corresponding to a sub thread with an abnormal event;
and checking the target seed URL in the target grabbing node and the corresponding target grabbing strategy.
In some other optional embodiments, the checking the target seed URL in the target grabbing node and the corresponding target grabbing strategy includes:
matching a target seed URL in the target grabbing node with the plurality of seed URLs;
when a target seed URL in the target grabbing node is matched with any one of the seed URLs, judging whether a target grabbing strategy is a grabbing strategy of the target seed URL;
and when the target grabbing strategy is the grabbing strategy of the target seed URL, sending a grabbing suggestion to the client.
In this embodiment, checking the target seed URL and the target grabbing strategy in the target grabbing node corresponding to the sub-thread with the abnormal event helps operation and maintenance personnel quickly analyze and handle the cause of the grabbing abnormality, which improves their working efficiency.
In this embodiment, the grabbing suggestion may be set according to the cause of the grabbing abnormality. Specifically, the grabbing suggestion may be to suggest that the client provide a new grabbing requirement, or to suggest that the client check whether the provided seed URL is wrong. Sending the grabbing suggestion to the client helps the client make a decision quickly, which improves customer experience and grabbing efficiency.
Further, the method further comprises:
and when the target seed URL in the target grabbing node is not matched with any one of the seed URLs, correcting the target seed URL in the target grabbing node, and carrying out secondary grabbing on the corrected target seed URL.
Further, the method further comprises:
and when the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and carrying out secondary grabbing on the target seed URL in the target grabbing node according to the corrected grabbing strategy.
In the embodiment, the target seed URL and the grabbing strategy in the abnormal target grabbing node are verified, and when the target seed URL and/or the grabbing strategy are determined to be inconsistent through verification, the target seed URL and/or the grabbing strategy in the abnormal target grabbing node are corrected and grabbed for the second time, so that the integrity of the grabbed target news information is improved.
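The verification branches described above reduce to a three-way decision, sketched below. The `TargetNode` and `Action` types and the `verifyAbnormalNode` function are hypothetical, strategies are represented as plain strings, and how a mismatched URL is actually corrected is left open because the text does not specify it.

```typescript
// Hypothetical shapes for the abnormal node and the follow-up action.
interface TargetNode { seedUrl: string; strategy: string }
type Action =
  | { kind: "sendSuggestion" }
  | { kind: "correctUrlAndRegrab" }
  | { kind: "correctStrategyAndRegrab"; strategy: string };

// expected maps each requested seed URL to the strategy created for it.
function verifyAbnormalNode(node: TargetNode, expected: Map<string, string>): Action {
  const expectedStrategy = expected.get(node.seedUrl);
  if (expectedStrategy === undefined) {
    // The URL matches none of the seed URLs: it must be corrected before re-grabbing.
    return { kind: "correctUrlAndRegrab" };
  }
  if (expectedStrategy !== node.strategy) {
    // The URL is known but carries the wrong strategy: re-grab with the right one.
    return { kind: "correctStrategyAndRegrab", strategy: expectedStrategy };
  }
  // URL and strategy are both consistent, so a suggestion goes to the client.
  return { kind: "sendSuggestion" };
}

const expectedStrategies = new Map<string, string>([
  ["https://news.example.com/", "text"],
  ["https://news.example.com/photos", "picture"],
]);
const consistentCase = verifyAbnormalNode(
  { seedUrl: "https://news.example.com/", strategy: "text" },
  expectedStrategies,
);
const wrongStrategyCase = verifyAbnormalNode(
  { seedUrl: "https://news.example.com/photos", strategy: "text" },
  expectedStrategies,
);
const wrongUrlCase = verifyAbnormalNode(
  { seedUrl: "https://news.example.com/phtos", strategy: "picture" },
  expectedStrategies,
);
console.log(consistentCase.kind, wrongStrategyCase.kind, wrongUrlCase.kind);
```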
In summary, in the news information capture method of this embodiment: first, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target capture node read by the main thread and to perform capture processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that use Puppeteer improves the capture efficiency of the target news information. Second, the main thread is started to read the target seed URL and the corresponding capture strategy of each target capture node in the target news information capture tree one by one, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs. Finally, creating a grabbing strategy for each seed URL improves the grabbing accuracy and efficiency of the target news information, and generating the target news information grabbing tree from the plurality of seed URLs and grabbing each of its nodes avoids repeated or missed grabbing of seed URLs, which improves the grabbing accuracy of the target news information.
Example two
Fig. 2 is a flowchart of a news information capturing method according to a second embodiment of the present invention.
In this embodiment, the method for capturing news information may be applied to an electronic device, and for an electronic device that needs to capture news information, the method of the present invention may be directly integrated with the electronic device to provide the function of capturing news information, or may be operated in the electronic device in the form of a Software Development Kit (SDK).
As shown in fig. 2, the news information capturing method specifically includes the following steps, and the order of the steps in the flowchart may be changed and some steps may be omitted according to different requirements.
S21, analyzing the received target news information capture request to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be captured, the client initiates a capture request to the server. Specifically, the client may be a mobile phone, an iPad, or another existing device capable of sending requests, and the server may be a capture subsystem. For example, during capture the client sends the capture request to the capture subsystem, and when the server receives the capture request, it parses the request.
In this embodiment, the capture request includes information, such as a capture demand corresponding to the captured target news information, a seed URL, page content of the seed URL, and a page structure.
S22, creating a grabbing strategy for each seed URL, and associating each seed URL with a corresponding grabbing strategy.
In this embodiment, since the page content, the page structure, and the capturing requirements corresponding to each seed URL are different, different capturing strategies are created for each seed URL.
In an optional embodiment, the creating a crawling policy for each seed URL includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring a grabbing requirement corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes content such as page viewing times, unique access, page dwell time, and the like, and the page structure refers to a hierarchical relationship of each page.
Illustratively, the grabbing requirement corresponding to the first seed URL is text, and the grabbing strategy set by analyzing the text content and text structure in the first seed URL is: starting from the start page of the first seed URL, randomly select a URL to enter and capture the target text content layer by layer until capture of the first seed URL is complete. The grabbing requirement corresponding to the second seed URL is pictures, and the grabbing strategy set by analyzing the picture content and picture structure in the second seed URL is: predict the similarity between each URL in the second seed URL and the second seed URL, and select the URLs with higher similarity for grabbing.
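The strategy creation described above can be sketched roughly as follows. This is only an illustration: the names `PageAnalysis`, `GrabbingStrategy`, `create_strategy` and `associate`, the `depth` field, and the two strategy labels are all hypothetical and do not come from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageAnalysis:
    """Hypothetical analysis result for one seed URL."""
    depth: int  # number of page levels found under the seed URL


@dataclass(frozen=True)
class GrabbingStrategy:
    """Hypothetical grabbing strategy record."""
    kind: str       # "layer_by_layer" or "similarity"
    max_depth: int  # how many page levels to follow


def create_strategy(analysis: PageAnalysis, requirement: str) -> GrabbingStrategy:
    """Combine the analysis result with the grabbing requirement."""
    if requirement == "text":
        # Text: capture layer by layer down to the analysed depth.
        return GrabbingStrategy("layer_by_layer", analysis.depth)
    if requirement == "picture":
        # Pictures: capture only the most similar linked URLs.
        return GrabbingStrategy("similarity", 1)
    raise ValueError(f"unsupported grabbing requirement: {requirement}")


def associate(seeds):
    """Map each seed URL to its strategy; `seeds` maps URL -> (analysis, requirement)."""
    return {url: create_strategy(a, r) for url, (a, r) in seeds.items()}
```

Keeping the strategy as a small immutable record makes it straightforward to associate it with a seed URL and pass both to a sub-thread later.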
In the embodiment, the capturing strategy is established for each seed URL, so that the capturing accuracy and efficiency of the target news information are improved.
S23, starting a target main thread to create a task queue, reading each associated seed URL, and sequentially sending the seed URLs to the task queue.
In this embodiment, because each seed URL has an association relationship with a corresponding fetch policy, when a server receives a fetch request, a target main thread is started according to the fetch request to create a task queue, and each seed URL after association is sequentially sent to the task queue.
In the embodiment, the target news information is captured in the form of creating the task queue, so that the phenomenon that each seed URL is repeatedly captured or captured in a missing mode can be avoided, and the accuracy of capturing the target news information is improved.
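A minimal sketch of step S23, assuming the associated seed URLs arrive as `(seed_url, strategy)` pairs; the function name `build_task_queue` and the pair layout are illustrative assumptions.

```python
from queue import Queue


def build_task_queue(associated_seeds):
    """Create the task queue and enqueue each (seed_url, strategy) pair in read order."""
    tasks = Queue()
    for seed_url, strategy in associated_seeds:
        tasks.put((seed_url, strategy))
    return tasks
```

A first-in, first-out queue preserves the order in which the main thread read the seed URLs.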
S24, judging whether the same seed URL exists in the task queue.
In this embodiment, whether the same seed URL exists in the task queue is determined, so that the same seed URL is prevented from being repeatedly captured, and the capturing accuracy and efficiency of the target news information are improved.
S25, when the same seed URL does not exist in the task queue, querying, at every preset period, whether an idle sub-thread exists among the plurality of target sub-threads started by the target main thread.
In this embodiment, a preset period may be preset, and specifically, the preset period may be 1 minute or 30 seconds. The idle sub thread means that the sub thread does not have a task at present.
Further, the method further comprises:
and when the same seed URL exists in the task queue, continuously reading each associated seed URL and sequentially sending the seed URL to the task queue.
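One possible reading of the duplicate check in S24 is sketched below. It assumes that a seed URL already present in the queue is skipped and that reading simply continues with the next seed URL; the embodiment does not spell out this detail, and the name `enqueue_unique` is hypothetical.

```python
from collections import deque


def enqueue_unique(seed_urls):
    """Enqueue seed URLs in order, skipping any URL that is already queued."""
    task_queue = deque()
    seen = set()
    for url in seed_urls:
        if url in seen:
            continue  # same seed URL already in the queue: keep reading
        seen.add(url)
        task_queue.append(url)
    return task_queue
```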
S26, when an idle sub-thread exists in the plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-thread.
In the embodiment, whether idle sub-threads exist in target sub-threads started by the target main thread is inquired at intervals of a preset period, and when the idle sub-threads exist in the target sub-threads, each seed URL read by the target main thread is distributed to the idle sub-threads, so that the phenomenon of uneven tasks in the target sub-threads is avoided, and the reading speed of each seed URL of each sub-thread is improved.
Further, the method further comprises:
when no idle sub-thread exists in the plurality of target sub-threads started by the target main thread, continuing to query, at every preset period, whether an idle sub-thread exists in the plurality of target sub-threads started by the target main thread.
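The periodic query for an idle sub-thread (S25 and S26) could look roughly like the sketch below. The class `SubThread`, the functions `dispatch` and `capture_all`, and the very short polling period are assumptions chosen for illustration; the handler merely records the URL instead of capturing a page.

```python
import queue
import threading
import time


class SubThread(threading.Thread):
    """Hypothetical sub-thread that processes one seed URL at a time."""

    def __init__(self, handler):
        super().__init__(daemon=True)
        self.handler = handler
        self.inbox = queue.Queue()
        self.idle = threading.Event()
        self.idle.set()  # no task yet, so the sub-thread starts out idle

    def run(self):
        while True:
            url = self.inbox.get()
            if url is None:  # sentinel: stop the sub-thread
                return
            self.handler(url)
            self.idle.set()  # task finished, idle again


def dispatch(seed_urls, sub_threads, period=0.01):
    """Main thread: every `period` seconds, look for an idle sub-thread."""
    for url in seed_urls:
        while True:
            free = next((t for t in sub_threads if t.idle.is_set()), None)
            if free is not None:
                break
            time.sleep(period)  # none idle: query again after the preset period
        free.idle.clear()
        free.inbox.put(url)
    for t in sub_threads:  # wait for the last tasks, then shut down
        t.idle.wait()
        t.inbox.put(None)
        t.join()


def capture_all(seed_urls, n_sub_threads=2):
    """Run the dispatcher with a stand-in handler and return the handled URLs."""
    captured, lock = [], threading.Lock()

    def handler(url):
        with lock:
            captured.append(url)

    sub_threads = [SubThread(handler) for _ in range(n_sub_threads)]
    for t in sub_threads:
        t.start()
    dispatch(seed_urls, sub_threads)
    return captured
```

Handing a seed URL only to a sub-thread that reports itself idle is what keeps the workload even across sub-threads.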
S27, controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and capturing target news information of each seed URL according to a corresponding capturing strategy.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium. By default Puppeteer runs in headless mode, but it can be configured to run in non-headless mode.
In an optional embodiment, the controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and performing target news information crawling on each seed URL according to a corresponding crawling policy includes:
starting a headless browser by using Puppeteer to open each seed URL read by the target main thread together with its corresponding capture strategy;
jumping to a target page corresponding to each seed URL;
and calling Puppeteer to perform grabbing processing on the target page according to the grabbing strategy corresponding to each seed URL.
In this embodiment, Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call. Using Puppeteer to start a headless browser to open each seed URL reduces the rendering work of a real browser compared with a non-headless browser and speeds up reading, and starting a plurality of target sub-threads that use Puppeteer to grab the target news information improves the grabbing efficiency of the target news information.
S28, when it is detected that the idle sub-thread finishes capturing the target news information, the capturing result of the idle sub-thread is counted by the target main thread to obtain the capturing result of the target news information.
In this embodiment, when it is detected that the idle sub-thread finishes capturing the target news information, the capturing result of each idle sub-thread is obtained through the target main thread.
In this embodiment, the grabbing result may include, but is not limited to: successfully captured data, abnormally captured data, captured identified data, captured identical picture data, and captured identical text data.
In this embodiment, the target main thread acquires the fetch result of one idle sub-thread, and stores the fetch result in the cache of the server. And after all the idle sub-threads are completely grabbed, the target main thread counts the grabbing results of all the idle sub-threads, and statistics is carried out according to all the grabbing results to obtain a target grabbing result.
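The statistics step can be illustrated with the sketch below, which assumes that each sub-thread reports its results as a list of `(seed_url, status)` pairs. The status labels and the name `summarize` are hypothetical.

```python
from collections import Counter


def summarize(results_per_sub_thread):
    """Cache each sub-thread's results, then count results by status."""
    cache = []
    for results in results_per_sub_thread:
        cache.extend(results)  # main thread stores each sub-thread's results
    return Counter(status for _url, status in cache)
```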
Further, in the process of detecting the idle sub-thread to capture the target news information, the method further comprises the following steps:
detecting whether an idle sub-thread generates an abnormal event or not;
and when detecting that the idle sub-thread generates an abnormal event, deleting the data after the capturing processing of the idle sub-thread generating the abnormal event.
In the embodiment, the accuracy of the captured target news information is ensured by deleting the data after the capturing process of the idle sub-thread with the abnormal event.
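A minimal sketch of the abnormal-event handling, assuming that an abnormal event surfaces as a Python exception raised by a hypothetical `capture` callable and that all data captured so far by that sub-thread is discarded.

```python
def capture_batch(urls, capture):
    """Capture each URL; on an abnormal event, delete this sub-thread's data.

    Returns (captured_data, ok), where ok is False if an abnormal event occurred.
    """
    captured = {}
    try:
        for url in urls:
            captured[url] = capture(url)
    except Exception:
        captured.clear()  # abnormal event: drop everything captured so far
        return captured, False
    return captured, True
```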
In summary, in the news information capture method according to this embodiment, on one hand, the idle sub-thread is controlled to open each seed URL read by the target main thread by using Puppeteer, and target news information capture is performed on each seed URL according to a corresponding capture strategy, and a headless browser is started by using the Puppeteer to open each seed URL, so that compared with a non-headless browser, rendering work of a real browser is reduced, reading efficiency is improved, and capture processing of target news information is performed by starting a plurality of target sub-threads by using the Puppeteer, so that capture efficiency of the target news information is improved; on the other hand, a target main thread is started to create a task queue, each associated seed URL is read and sequentially sent to the task queue, and target news information is captured in the form of creating the task queue, so that the phenomenon that each seed URL is repeatedly captured or captured in a missing mode can be avoided, and the accuracy of capturing the target news information is improved; and finally, when idle sub-threads exist in a plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-threads, inquiring whether the idle sub-threads exist in the target sub-threads started by the target main thread at preset intervals, and distributing each seed URL read by the target main thread to the idle sub-threads when the idle sub-threads exist in the target sub-threads, so that the phenomenon of uneven tasks in the target sub-threads is avoided, and the reading speed of each seed URL of each sub-thread is improved.
Example three
Fig. 3 is a structural diagram of a news information capturing apparatus according to a third embodiment of the present invention.
In some embodiments, the news information capturing device 30 may include a plurality of functional modules composed of program code segments. The program codes of the various program segments of the news information capture device 30 may be stored in a memory of the electronic device and executed by the at least one processor to perform the news information capture function (described in detail with reference to fig. 1).
In this embodiment, the news information capturing device 30 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a parsing module 301, a generating module 302, a reading module 303, a starting module 304, a grabbing module 305, a counting module 306 and an identifying module 307. A module, as referred to herein, is a series of computer program segments that is stored in memory, can be executed by at least one processor, and performs a fixed function. The functions of the modules are described in detail below.
The parsing module 301 is configured to parse the received capture request of the target news information to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be captured, the client initiates a capture request to the server. Specifically, the client may be a mobile phone, an iPad, or another existing device capable of sending requests, and the server may be a capture subsystem. For example, during capture the client sends the capture request to the capture subsystem, and when the server receives the capture request, it parses the request.
In this embodiment, the capture request includes information, such as a capture demand corresponding to the captured target news information, a seed URL, page content of the seed URL, and a page structure.
A generating module 302, configured to create a capture policy for each seed URL, and generate a target news information capture tree according to the plurality of seed URLs, where each capture node of the target news information capture tree includes a corresponding capture policy.
In this embodiment, since the page content, the page structure, and the capturing requirements corresponding to each seed URL are different, different capturing strategies are created for each seed URL.
In an optional embodiment, the generating module 302 creating a crawling policy for each seed URL includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring a grabbing requirement corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes content such as page viewing times, unique access, page dwell time, and the like, and the page structure refers to a hierarchical relationship of each page.
Illustratively, the grabbing requirement corresponding to the first seed URL is text, and the grabbing strategy set by analyzing the text content and text structure in the first seed URL is: starting from the start page of the first seed URL, randomly select a URL to enter and capture the target text content layer by layer until capture of the first seed URL is complete. The grabbing requirement corresponding to the second seed URL is pictures, and the grabbing strategy set by analyzing the picture content and picture structure in the second seed URL is: predict the similarity between each URL in the second seed URL and the second seed URL, and select the URLs with higher similarity for grabbing.
In an alternative embodiment, the generating module 302 generating the target news information grabbing tree according to the plurality of seed URLs includes:
converting the capture node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information capture tree;
converting a reference relationship between the grabbing nodes of each seed URL in the plurality of seed URLs into edges between the nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree are used as the reference relationship between the nodes of the target news information grabbing tree;
and generating the target news information grabbing tree according to the nodes of the target news information grabbing tree and the edges between the nodes in the target news information grabbing tree.
In this embodiment, the target news information crawling tree is generated according to the crawling nodes corresponding to the plurality of seed URLs and the reference relationship between the crawling nodes of each seed URL, and specifically, the reference relationship between the nodes of the target news information crawling tree may be preset, for example, the reference relationship may be preset according to the association degree between each seed URL and the target news information, or the reference relationship between the plurality of seed URLs may be preset.
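The tree generation can be sketched as follows, assuming the preset reference relationships are given as `(parent_url, child_url)` pairs and that one seed URL is designated as the root. `CaptureNode`, `build_capture_tree` and the `root_url` parameter are illustrative names, not terms from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class CaptureNode:
    """One node of the grabbing tree: a seed URL plus its grabbing strategy."""
    seed_url: str
    strategy: str
    children: list = field(default_factory=list)


def build_capture_tree(strategies, references, root_url):
    """Turn seed URLs into nodes and reference relationships into edges."""
    nodes = {url: CaptureNode(url, s) for url, s in strategies.items()}
    for parent_url, child_url in references:
        nodes[parent_url].children.append(nodes[child_url])
    return nodes[root_url]
```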
In the embodiment, the grabbing strategy is created for each seed URL, the grabbing accuracy and efficiency of the target news information are improved, the target news information grabbing tree is generated according to the seed URLs, each node of the target news information grabbing tree is grabbed, the phenomenon that the seed URLs are grabbed repeatedly or are missed is avoided, and the grabbing accuracy of the target news information is improved.
The reading module 303 is configured to start a main thread to read the target seed URL of each capture node in the target news information capture tree and the corresponding capture policy one by one.
In this embodiment, when the server receives the capture request, the main thread is started to read the target seed URL of each capture node in the target news information capture tree and the corresponding capture strategy one by one, so that missing or repeated reading in the process of reading the target seed URLs is avoided, and the reading accuracy of the target seed URLs is improved.
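Reading the nodes one by one without omission or repetition might be done with a traversal such as the one below. The breadth-first order, the adjacency mapping `children`, and the name `read_nodes` are assumptions; the embodiment does not specify a traversal order.

```python
from collections import deque


def read_nodes(children, strategies, root):
    """Yield (seed_url, strategy) exactly once per node, breadth-first from root."""
    seen = {root}
    pending = deque([root])
    while pending:
        url = pending.popleft()
        yield url, strategies[url]
        for child in children.get(url, []):
            if child not in seen:  # guard against reading a node twice
                seen.add(child)
                pending.append(child)
```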
The starting module 304 is configured to start a plurality of sub-threads when it is detected that the main thread reads a preset number of target seed URLs, and distribute the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule.
In this embodiment, an allocation rule may be preset, and specifically, the preset allocation rule may be equal division, random allocation, or allocation according to a certain multiple.
In this embodiment, by simultaneously starting a plurality of sub-threads, the waiting time for the main thread to read data and start the sub-threads can be saved, and the processing efficiency of the server is further improved.
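Two of the allocation rules mentioned above, equal division and random allocation, can be sketched as follows. The function names and the optional `seed` parameter, used only to make the random split reproducible, are hypothetical.

```python
import random


def allocate_equally(urls, n_sub_threads):
    """Round-robin split, so batch sizes differ by at most one."""
    return [urls[i::n_sub_threads] for i in range(n_sub_threads)]


def allocate_randomly(urls, n_sub_threads, seed=None):
    """Assign each URL to a randomly chosen sub-thread."""
    rng = random.Random(seed)
    batches = [[] for _ in range(n_sub_threads)]
    for url in urls:
        batches[rng.randrange(n_sub_threads)].append(url)
    return batches
```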
In other alternative embodiments, the starting module 304 is further configured to start one sub-thread each time the main thread reads a preset number of target seed URLs, and to distribute that preset number of target seed URLs to the sub-thread.
In this embodiment, the preset number is a preset threshold for starting a sub-thread.
For example, assuming that the preset number is 100,000, when the main thread has read the target seed URLs of capture nodes 1 through 100,000, the server starts a sub-thread; then, when the main thread has read the target seed URLs of capture nodes 100,001 through 200,000, the server starts another sub-thread. That is, the server starts a sub-thread each time it detects that the main thread has read the target seed URLs of the preset number of capture nodes. Starting a plurality of sub-threads, each processing a corresponding number of target seed URLs, can speed up the capture of news information from the target seed URLs to a certain extent.
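The threshold-based start-up can be sketched as below. The names `read_and_start` and `process` are hypothetical, and starting an extra sub-thread for a final partial batch is an assumption that the embodiment does not state.

```python
import threading


def read_and_start(target_urls, preset_number, process):
    """Start one sub-thread each time `preset_number` target seed URLs are read."""
    sub_threads, batch = [], []
    for url in target_urls:  # the main thread reads one URL at a time
        batch.append(url)
        if len(batch) == preset_number:
            t = threading.Thread(target=process, args=(batch,))
            t.start()
            sub_threads.append(t)
            batch = []
    if batch:  # assumption: leftover URLs also get their own sub-thread
        t = threading.Thread(target=process, args=(batch,))
        t.start()
        sub_threads.append(t)
    for t in sub_threads:
        t.join()
    return len(sub_threads)
```

With a preset number of 4 and 10 target seed URLs, this starts three sub-threads handling 4, 4 and 2 URLs respectively.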
The grabbing module 305 is configured to control each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and to perform grabbing processing.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium. By default Puppeteer runs in headless mode, but it can be configured to run in non-headless mode.
In an optional embodiment, the crawling module 305 controls each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing the crawling process includes:
starting a headless browser by using Puppeteer to open each target seed URL read by the main thread together with its corresponding capture strategy;
jumping to a target page corresponding to the target seed URL;
and calling Puppeteer to perform grabbing processing on the target page according to the grabbing strategy corresponding to the target seed URL.
In this embodiment, Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call. Using Puppeteer to start a headless browser to open the target seed URL of each grabbing node reduces the rendering work of a real browser compared with a non-headless browser and speeds up reading, and starting a plurality of sub-threads that use Puppeteer to grab the target news information improves the grabbing efficiency of the target news information.
The counting module 306 is configured to count the grabbing results of the multiple sub-threads through the main thread to obtain a target grabbing result of the target news information after it is detected that the grabbing processing of the multiple sub-threads is completed.
In this embodiment, when it is detected that the plurality of sub-threads complete the capturing of the target news information, the capturing result of each sub-thread is obtained through the main thread.
In this embodiment, the grabbing result may include, but is not limited to: successfully captured data, abnormally captured data, captured identified data, captured identical picture data, and captured identical text data.
In this embodiment, the main thread acquires the grabbing result of one sub-thread, stores the grabbing result in the cache of the server, and after all the sub-threads grab the target, the main thread counts the grabbing results of all the sub-threads and performs statistics according to all the grabbing results to obtain the target grabbing result.
Further, in the process of detecting the target news information capture of the multiple sub-threads, whether the sub-threads generate abnormal events is detected; and when detecting that the sub-thread generates the abnormal event, deleting the data after the capturing processing of the sub-thread generating the abnormal event.
In the embodiment, the accuracy of the captured target news information is ensured by deleting the data after the capturing process of the sub-thread with the abnormal event.
Further, the identifying module 307 is configured to identify a target capture node corresponding to a child thread in which an abnormal event occurs; and checking the target seed URL in the target grabbing node and the corresponding target grabbing strategy.
In some other optional embodiments, the verifying the target seed URL and the corresponding target crawling policy in the target crawling node includes:
matching a target seed URL in the target grabbing node with the plurality of seed URLs;
when a target seed URL in the target grabbing node is matched with any one of the seed URLs, judging whether a target grabbing strategy is a grabbing strategy of the target seed URL;
and when the target grabbing strategy is the grabbing strategy of the target seed URL, sending grabbing suggestions to the client.
In the embodiment, the target seed URL in the target grabbing node corresponding to the sub thread with the abnormal event and the target grabbing strategy are verified, so that the operation and maintenance personnel are assisted to quickly analyze, grab and process the abnormal reason, and the working efficiency of the operation and maintenance personnel is improved.
In this embodiment, the crawling suggestion may be set according to a reason of the crawling abnormality, and specifically, the crawling suggestion may be to suggest that the client newly provides a crawling requirement or suggest that the client checks whether the provided seed URL is wrong. In the embodiment, the grabbing suggestion is sent to the client, so that the client is assisted to make a decision quickly, and the customer experience and the grabbing efficiency are improved.
Further, when the target seed URL in the target capture node is not matched with any one of the seed URLs, the target seed URL in the target capture node is corrected, and secondary capture is performed on the corrected target seed URL.
Further, when the target capture strategy is not the capture strategy of the target seed URL, the capture strategy in the target capture node is corrected, and the target seed URL in the target capture node is captured secondarily according to the corrected capture strategy.
In the embodiment, the target seed URL and the grabbing strategy in the abnormal target grabbing node are verified, and when the target seed URL and/or the grabbing strategy are determined to be inconsistent through verification, the target seed URL and/or the grabbing strategy in the abnormal target grabbing node are corrected and grabbed for the second time, so that the integrity of the grabbed target news information is improved.
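The verification flow for an abnormal target grabbing node can be summarized in the sketch below. The function name `verify_node` and the three action labels are hypothetical; `strategies` stands for the original mapping from seed URLs to their grabbing strategies.

```python
def verify_node(target_url, target_strategy, strategies):
    """Decide what to do with a node whose sub-thread raised an abnormal event."""
    if target_url not in strategies:
        # The target seed URL matches none of the seed URLs.
        return "correct_url_and_recapture"
    if strategies[target_url] != target_strategy:
        # The strategy in the node is not the one created for this seed URL.
        return "correct_strategy_and_recapture"
    # URL and strategy are both consistent: report back to the client.
    return "send_suggestion_to_client"
```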
In summary, the news information capturing apparatus of this embodiment offers three advantages. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target capture node read by the main thread and to perform capture processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that each use Puppeteer improves the capture efficiency of the target news information. Second, the main thread is started to read, one by one, the target seed URL and the corresponding capture strategy of each target capture node in the target news information capture tree, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs. Finally, a grabbing strategy is created for each seed URL, which improves the grabbing accuracy and efficiency of the target news information, and a target news information grabbing tree is generated from the plurality of seed URLs and grabbed node by node, which avoids repeated or missed grabbing of seed URLs and improves the grabbing accuracy of the target news information.
Example four
Fig. 4 is a structural diagram of a news information capturing apparatus according to a fourth embodiment of the present invention.
In some embodiments, the news information capturing device 40 may include a plurality of functional modules composed of program code segments. The program codes of the various program segments in the news information capture device 40 may be stored in a memory of the electronic equipment and executed by the at least one processor to perform the news information capture function (described in detail in fig. 2).
In this embodiment, the news information capturing device 40 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a parsing module 401, a creating module 402, a reading module 403, a judging module 404, a query module 405, a distributing module 406, a capturing module 407 and a statistics module 408. A module, as referred to herein, is a series of computer program segments that is stored in memory, can be executed by at least one processor, and performs a fixed function. The functions of the modules are described in detail below.
The parsing module 401 is configured to parse the received capture request of the target news information to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be captured, the client initiates a capture request to the server. Specifically, the client may be a mobile phone, an iPad, or another existing device capable of sending requests, and the server may be a capture subsystem. For example, during capture the client sends the capture request to the capture subsystem, and when the server receives the capture request, it parses the request.
In this embodiment, the capture request includes information, such as a capture demand corresponding to the captured target news information, a seed URL, page content of the seed URL, and a page structure.
A creating module 402, configured to create a crawling policy for each seed URL, and associate each seed URL with a corresponding crawling policy.
In this embodiment, since the page content, the page structure, and the capturing requirements corresponding to each seed URL are different, different capturing strategies are created for each seed URL.
In an alternative embodiment, the creating module 402 creates a crawling policy for each seed URL including:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring a grabbing requirement corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes content such as page viewing times, unique access, page dwell time, and the like, and the page structure refers to a hierarchical relationship of each page.
Illustratively, the grabbing requirement corresponding to the first seed URL is text, and the grabbing strategy set by analyzing the text content and text structure in the first seed URL is: starting from the start page of the first seed URL, randomly select a URL to enter and capture the target text content layer by layer until capture of the first seed URL is complete. The grabbing requirement corresponding to the second seed URL is pictures, and the grabbing strategy set by analyzing the picture content and picture structure in the second seed URL is: predict the similarity between each URL in the second seed URL and the second seed URL, and select the URLs with higher similarity for grabbing.
In the embodiment, the capturing strategy is established for each seed URL, so that the capturing accuracy and efficiency of the target news information are improved.
The reading module 403 is configured to start a target main thread to create a task queue, and read each associated seed URL and send the seed URL to the task queue in sequence.
In this embodiment, because each seed URL has an association relationship with a corresponding fetch policy, when a server receives a fetch request, a target main thread is started according to the fetch request to create a task queue, and each seed URL after association is sequentially sent to the task queue.
In the embodiment, the target news information is captured in the form of creating the task queue, so that the phenomenon that each seed URL is repeatedly captured or captured in a missing mode can be avoided, and the accuracy of capturing the target news information is improved.
A judging module 404, configured to judge whether the same seed URL exists in the task queue.
In this embodiment, whether the same seed URL exists in the task queue is determined, so that the same seed URL is prevented from being repeatedly captured, and the capturing accuracy and efficiency of the target news information are improved.
Further, when the same seed URL exists in the task queue, each seed URL after association is continuously read and sequentially sent to the task queue.
The query module 405 is configured to query, every preset period, whether an idle sub-thread exists in the plurality of target sub-threads started by the target main thread when the same seed URL does not exist in the task queue.
In this embodiment, a preset period may be preset, and specifically, the preset period may be 1 minute or 30 seconds. The idle sub thread means that the sub thread does not have a task at present.
Further, when no idle sub-thread exists in the plurality of target sub-threads started by the target main thread, whether an idle sub-thread exists in the plurality of target sub-threads started by the target main thread is continuously inquired every preset period.
A distributing module 406, configured to, when there is an idle sub-thread in the multiple target sub-threads started by the target main thread, distribute each seed URL read by the target main thread to the idle sub-thread.
In this embodiment, whether an idle sub-thread exists among the target sub-threads started by the target main thread is queried every preset period, and when one exists, each seed URL read by the target main thread is distributed to it. This avoids an uneven distribution of tasks among the target sub-threads and improves the speed at which the sub-threads read the seed URLs.
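The periodic query and distribution can be sketched as follows. This is only an illustration under stated assumptions: `SubThread`, `dispatchOnce`, and `startPolling` are hypothetical names, and sub-threads are modeled as plain records rather than real threads. One polling round hands queued seed URLs to sub-threads that currently have no task; a timer repeats the round every preset period.

```typescript
// Hypothetical record for a target sub-thread; "idle" means it has no current task.
interface SubThread {
  id: number;
  current: string | null; // seed URL being processed, or null when idle
}

// One polling round: assign queued seed URLs to idle sub-threads.
// Returns the number of seed URLs assigned in this round.
function dispatchOnce(pending: string[], pool: SubThread[]): number {
  let assigned = 0;
  for (const thread of pool) {
    if (pending.length === 0) {
      break; // nothing left to distribute
    }
    if (thread.current === null) {
      thread.current = pending.shift() as string;
      assigned++;
    }
  }
  return assigned;
}

// Repeat the round every `periodMs` milliseconds (for example 30000 or 60000,
// matching the 30 seconds or 1 minute mentioned in the text).
// Returns a function that stops the polling.
function startPolling(pending: string[], pool: SubThread[], periodMs: number): () => void {
  const timer = setInterval(() => dispatchOnce(pending, pool), periodMs);
  return () => clearInterval(timer);
}
```

When no sub-thread is idle, `dispatchOnce` returns 0 and the queue is left untouched, so the next round simply tries again.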
A capturing module 407, configured to control the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and to capture the target news information of each seed URL according to the corresponding capturing policy.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. Specifically, Puppeteer runs in headless mode by default, but may be configured to run in non-headless mode.
In an optional embodiment, the capturing module 407 controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and capturing the target news information of each seed URL according to the corresponding capturing policy, includes:
starting a headless browser by using Puppeteer to open each seed URL read by the target main thread and the corresponding capturing policy;
jumping to a target page corresponding to each seed URL;
and calling Puppeteer to perform capturing processing on the target page according to the capturing policy corresponding to each seed URL.
In this embodiment, Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call. Starting a headless browser with Puppeteer to open each seed URL reduces the rendering work of a real browser compared with a non-headless browser and speeds up reading, and capturing the target news information with Puppeteer in a plurality of target sub-threads improves the capturing efficiency of the target news information.
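The three steps above (start a headless browser, jump to the target page, capture according to the policy) can be sketched in TypeScript. This is a hedged illustration, not the patented code. To keep the block self-contained, it declares a small structural subset of the Puppeteer API (`launch`, `newPage`, `goto`, `$$eval`, `close`) instead of importing the library; in practice the `puppeteer` module would be passed as the `launcher` argument, possibly through a thin adapter or cast. `GrabPolicy` with a single CSS selector is an assumption, since the text does not define the policy format.

```typescript
// Minimal element view used by the extraction callback.
type TextNode = { textContent: string | null };

// Structural subset of the Puppeteer API that this sketch relies on.
interface PageLike {
  goto(url: string, options?: { waitUntil?: "load" | "domcontentloaded" }): Promise<unknown>;
  $$eval(selector: string, fn: (els: TextNode[]) => string[]): Promise<string[]>;
}
interface BrowserLike {
  newPage(): Promise<PageLike>;
  close(): Promise<void>;
}
interface LauncherLike {
  launch(options?: { headless?: boolean }): Promise<BrowserLike>;
}

// Hypothetical capturing policy: which elements hold the news text.
interface GrabPolicy {
  selector: string;
}

async function grabPage(launcher: LauncherLike, url: string, policy: GrabPolicy): Promise<string[]> {
  // Step 1: start a headless browser.
  const browser = await launcher.launch({ headless: true });
  try {
    // Step 2: jump to the target page for this seed URL.
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    // Step 3: capture according to the policy. With real Puppeteer this callback
    // runs inside the page, so it must not reference outer variables.
    return await page.$$eval(policy.selector, (els) =>
      els.map((el) => (el.textContent ?? "").trim()).filter((text) => text.length > 0),
    );
  } finally {
    // Always release the browser, even if navigation or extraction fails.
    await browser.close();
  }
}
```

Passing the launcher as a parameter is a choice of this sketch; it lets the capturing logic be exercised with a stand-in object when no browser is available.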
The counting module 408 is configured to count the grabbing results of the idle sub-thread through the target main thread to obtain the grabbing results of the target news information after it is detected that the idle sub-thread finishes grabbing the target news information.
In this embodiment, when it is detected that the idle sub-thread finishes capturing the target news information, the capturing result of each idle sub-thread is obtained through the target main thread.
In this embodiment, the capturing result may include, but is not limited to: successfully captured data, abnormally captured data, identified data, duplicate picture data, and duplicate text data.
In this embodiment, each time the target main thread acquires the capturing result of an idle sub-thread, it stores the result in the cache of the server. After all the idle sub-threads have finished capturing, the target main thread compiles statistics over all the capturing results to obtain a target capturing result.
Further, while the idle sub-thread is capturing the target news information, whether the idle sub-thread generates an abnormal event is detected; when an abnormal event is detected, the data captured by the idle sub-thread that generated the abnormal event is deleted.
In this embodiment, deleting the data captured by an idle sub-thread in which an abnormal event occurred ensures the accuracy of the captured target news information.
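The statistics step and the handling of abnormal events can be sketched as below. The types `ThreadResult` and `GrabSummary` and the function `summarize` are hypothetical; the text does not specify the result format. The sketch counts every sub-thread result but keeps only the data from sub-threads that finished without an abnormal event.

```typescript
// Hypothetical per-sub-thread result collected by the target main thread.
interface ThreadResult {
  threadId: number;
  url: string;
  outcome: "success" | "abnormal";
  records: string[];
}

// Hypothetical aggregate reported as the target capturing result.
interface GrabSummary {
  succeeded: number;
  abnormal: number;
  records: string[];
}

function summarize(results: ThreadResult[]): GrabSummary {
  const summary: GrabSummary = { succeeded: 0, abnormal: 0, records: [] };
  for (const result of results) {
    if (result.outcome === "abnormal") {
      // Count the abnormal event but discard the data captured by that sub-thread.
      summary.abnormal++;
      continue;
    }
    summary.succeeded++;
    summary.records.push(...result.records);
  }
  return summary;
}
```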
In summary, in the news information capturing apparatus according to this embodiment: on the one hand, the idle sub-thread is controlled to open each seed URL read by the target main thread by using Puppeteer and to capture the target news information of each seed URL according to the corresponding capturing policy; because Puppeteer starts a headless browser to open each seed URL, the rendering work of a real browser is reduced compared with a non-headless browser and reading is faster, and because the capturing is performed by a plurality of target sub-threads, the capturing efficiency of the target news information is improved. On the other hand, a target main thread is started to create a task queue, and each associated seed URL is read and sent to the task queue in sequence; capturing by way of a task queue prevents any seed URL from being captured repeatedly or missed, which improves the accuracy of capturing the target news information. Finally, whether an idle sub-thread exists among the target sub-threads started by the target main thread is queried every preset period, and when one exists, each seed URL read by the target main thread is distributed to it, which avoids an uneven distribution of tasks among the target sub-threads and improves the speed at which the sub-threads read the seed URLs.
EXAMPLE five
Fig. 5 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the electronic device 5 comprises a memory 51, at least one processor 52, at least one communication bus 53 and a transceiver 54.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 5 does not constitute a limitation of the embodiment of the present invention, and may be a bus-type configuration or a star-type configuration, and the electronic device 5 may include more or less hardware or software than those shown, or different component arrangements.
In some embodiments, the electronic device 5 is an electronic device capable of automatically performing numerical calculation and/or information processing according to instructions set or stored in advance, and the hardware thereof includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 5 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client through a keyboard, a mouse, a remote controller, a touch pad, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.
It should be noted that the electronic device 5 is only an example; other existing or future electronic products that can be adapted to the present invention also fall within the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 51 is used for storing program code and various data, such as the news information capturing apparatus 30 or 40 installed in the electronic device 5, and provides high-speed, automatic access to programs or data during the operation of the electronic device 5. The memory 51 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium capable of carrying or storing data.
In some embodiments, the at least one processor 52 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The at least one processor 52 is a Control Unit (Control Unit) of the electronic device 5, connects various components of the electronic device 5 by using various interfaces and lines, and executes various functions and processes data of the electronic device 5 by running or executing programs or modules stored in the memory 51 and calling data stored in the memory 51.
In some embodiments, the at least one communication bus 53 is arranged to enable connection communication between the memory 51 and the at least one processor 52, etc.
Although not shown, the electronic device 5 may further include a power source (such as a battery) for supplying power to each component, and optionally, the power source may be logically connected to the at least one processor 52 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 5 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, an electronic device, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.
In a further embodiment, in conjunction with fig. 3 or fig. 4, the at least one processor 52 may execute the operating device of the electronic device 5 and various installed application programs (such as the news information capturing device 30 or 40), program codes, and the like, for example, the above modules.
The memory 51 has program code stored therein, and the at least one processor 52 can call the program code stored in the memory 51 to perform related functions. For example, the modules shown in fig. 3 or fig. 4 are program codes stored in the memory 51 and executed by the at least one processor 52, so as to implement the functions of the modules for the purpose of capturing news information.
In one embodiment of the present invention, the memory 51 stores a plurality of instructions that are executed by the at least one processor 52 to implement the news information capturing function.
Specifically, the method for implementing the instruction by the at least one processor 52 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 or fig. 2, which is not repeated herein.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the present invention may also be implemented by one unit or means through software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for capturing news information, the method comprising:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
starting a main thread to read the target seed URL of each grabbing node in the target news information grabbing tree and the corresponding grabbing strategy one by one;
when detecting that the main thread reads a preset number of target seed URLs, starting a plurality of sub-threads, and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;
controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing;
and when the fact that the plurality of sub-threads complete grabbing processing is detected, counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information.
2. The method of claim 1, wherein creating a crawling policy for each seed URL comprises:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring a grabbing requirement corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
3. The method of claim 1, wherein the generating a target news information grabbing tree according to the seed URLs comprises:
converting the capture node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information capture tree;
converting a reference relationship between the grabbing nodes of each seed URL in the plurality of seed URLs into edges between the nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree are used as the reference relationship between the nodes of the target news information grabbing tree;
and generating the target news information grabbing tree according to the nodes of the target news information grabbing tree and the edges between the nodes in the target news information grabbing tree.
4. A news information crawling method as claimed in claim 1, wherein the controlling of each of the sub-threads to open each target seed URL read by the main thread using Puppeteer and perform crawling processing includes:
starting a headless browser by using Puppeteer to open each target seed URL read by the main thread and a corresponding capture strategy;
jumping to a target page corresponding to the target seed URL;
and calling Puppeteer to perform grabbing processing on the target page according to the grabbing strategy corresponding to the target seed URL.
5. A method for capturing news information as claimed in claim 1, wherein the method further comprises:
detecting whether a sub thread generates an abnormal event or not;
when detecting that a sub-thread generates an abnormal event, identifying a target capture node corresponding to the sub-thread generating the abnormal event;
and checking the target seed URL in the target grabbing node and the corresponding target grabbing strategy.
6. The method of claim 5, wherein the verifying the target seed URL and the corresponding target crawling policy in the target crawling node comprises:
matching a target seed URL in the target grabbing node with the plurality of seed URLs;
when a target seed URL in the target grabbing node is matched with any one of the seed URLs, judging whether a target grabbing strategy is a grabbing strategy of the target seed URL;
when the target grabbing strategy is the grabbing strategy of the target seed URL, sending grabbing suggestions to a client; or
when the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and carrying out secondary grabbing on the target seed URL in the target grabbing node according to the corrected grabbing strategy.
7. A method for capturing news information, the method comprising:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and associating each seed URL with a corresponding grabbing strategy;
starting a target main thread to create a task queue, reading each associated seed URL and sequentially sending the seed URL to the task queue;
judging whether the same seed URL exists in the task queue;
when the same seed URL does not exist in the task queue, inquiring whether idle sub threads exist in a plurality of target sub threads started by the target main thread every a preset period;
when an idle sub thread exists in a plurality of target sub threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub thread;
controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and capturing target news information of each seed URL according to a corresponding capturing strategy;
and when the idle sub-thread finishes capturing the target news information, counting the capturing result of the idle sub-thread through the target main thread to obtain the capturing result of the target news information.
8. A news information capturing apparatus, comprising:
the analysis module is used for analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
the generating module is used for creating a grabbing strategy for each seed URL and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
the reading module is used for starting a main thread to read the target seed URL of each grabbing node in the target news information grabbing tree and the corresponding grabbing strategy one by one;
the starting module is used for starting a plurality of sub-threads when the main thread reads a preset number of target seed URLs and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;
the grabbing module is used for controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer and carrying out grabbing processing;
and the counting module is used for counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information after the fact that the grabbing processing of the plurality of sub-threads is detected to be completed.
9. An electronic device, characterized in that the electronic device comprises a processor and a memory, the processor being configured to implement the news information crawling method according to any one of claims 1 to 6 or claim 7 when executing the computer program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a news information-crawling method according to any one of claims 1 to 6 or claim 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110432611.3A CN113065055B (en) 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110432611.3A CN113065055B (en) 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113065055A true CN113065055A (en) 2021-07-02
CN113065055B CN113065055B (en) 2024-04-02

Family

ID=76567315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110432611.3A Active CN113065055B (en) 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113065055B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211018

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Shenzhen saiante Technology Service Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.

GR01 Patent grant