CN113065055B - News information capturing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN113065055B
Authority
CN
China
Prior art keywords
target
grabbing
news information
seed
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110432611.3A
Other languages
Chinese (zh)
Other versions
CN113065055A (en)
Inventor
郑德生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Saiante Technology Service Co Ltd
Original Assignee
Shenzhen Saiante Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Saiante Technology Service Co Ltd
Priority to CN202110432611.3A
Publication of CN113065055A
Application granted
Publication of CN113065055B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of big data and provides a news information grabbing method, a news information grabbing device, electronic equipment and a storage medium. The method comprises the following steps: acquiring a plurality of seed URLs to generate a target news information grabbing tree; starting a main thread to read the target seed URL of each grabbing node in the target news information grabbing tree and the corresponding grabbing strategy; when a preset number of target seed URLs have been read, starting a plurality of sub-threads and dividing the preset number of target seed URLs among the plurality of sub-threads; controlling each sub-thread to open each target seed URL by using Puppeteer and carry out grabbing processing; and counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing result of the target news information. According to the invention, Puppeteer is used to start a headless browser to open each target seed URL, and a plurality of sub-threads are started to carry out grabbing processing, so that the rendering work of a real browser is reduced and the grabbing efficiency of target news information is improved.

Description

News information capturing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of big data, in particular to a news information grabbing method, a news information grabbing device, electronic equipment and a storage medium.
Background
Traditional news information grabbing uses a crawler program to issue an HTTP request for the URL of a website and to parse the result returned by the request. However, most news web pages now obtain their content through AJAX and render the page content with JavaScript, so a traditional crawler captures no valid data or only part of it. In addition, some programs grab news content by opening a browser and locating DOM elements.
However, because these programs must run on an operating system with a graphical interface, they cannot run on a Linux server, which results in low efficiency and low accuracy of the captured news information.
Therefore, it is necessary to provide a fast and accurate news information grabbing method.
Disclosure of Invention
In view of the foregoing, it is necessary to provide a news information grabbing method, a device, an electronic apparatus, and a storage medium that use Puppeteer to start a headless browser to open each target seed URL and start a plurality of sub-threads to perform grabbing processing, so that the rendering work of a real browser is reduced and the grabbing efficiency of target news information is improved.
A first aspect of the present invention provides a news information capturing method, the method including:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
starting a main thread to read target seed URLs of each grabbing node in the target news information grabbing tree one by one and corresponding grabbing strategies;
when the main thread is detected to read a preset number of target seed URLs, starting a plurality of sub-threads, and dividing the preset number of target seed URLs read by the main thread into the plurality of sub-threads according to a preset distribution rule;
controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing;
and after detecting that the plurality of sub-threads finish grabbing processing, counting grabbing results of the plurality of sub-threads through the main thread to obtain target grabbing results of the target news information.
Optionally, the creating a crawling policy for each of the seed URLs includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring grabbing requirements corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
Optionally, the generating the target news information-capturing tree according to the plurality of seed URLs includes:
converting the grabbing node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information grabbing tree;
converting a reference relationship between grabbing nodes of each seed URL in the plurality of seed URLs into edges between nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree serve as the reference relationship between the nodes of the target news information grabbing tree;
and generating a target news information capture tree according to edges between the nodes of the target news information capture tree and the nodes in the target news information capture tree.
Optionally, the controlling each sub-thread to use Puppeteer to open each target seed URL read by the main thread and performing the grabbing processing includes:
using Puppeteer to start a headless browser to open each target seed URL read by the main thread together with the corresponding grabbing strategy;
jumping to a target page corresponding to the target seed URL;
and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to the target seed URL.
Optionally, the method further comprises:
detecting whether an abnormal event occurs to a child thread;
when detecting that an abnormal event occurs to a child thread, identifying a target grabbing node corresponding to the child thread with the abnormal event;
and verifying the target seed URL and the corresponding target grabbing strategy in the target grabbing node.
Optionally, the verifying the target seed URL and the corresponding target crawling policy in the target crawling node includes:
matching a target seed URL in the target grabbing node with the seed URLs;
when the target seed URL in the target grabbing node is matched with any one seed URL in the plurality of seed URLs, judging whether the target grabbing strategy is the grabbing strategy of the target seed URL;
when the target grabbing strategy is the grabbing strategy of the target seed URL, sending grabbing suggestions to a client; or
when the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and secondarily grabbing the target seed URL in the target grabbing node according to the corrected grabbing strategy.
A second aspect of the present invention provides a news information-capturing method, the method including:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and associating each seed URL with a corresponding grabbing strategy;
starting a target main thread to create a task queue, reading each associated seed URL and sequentially sending the seed URL to the task queue;
judging whether the same seed URL exists in the task queue;
when the same seed URL does not exist in the task queue, inquiring whether idle sub-threads exist in a plurality of target sub-threads started by the target main thread or not at intervals of a preset period;
when idle sub-threads exist in a plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-threads;
controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and grabbing target news information for each seed URL according to the corresponding grabbing strategy;
and counting the grabbing results of the idle sub-threads through the target main thread after the idle sub-threads are detected to finish grabbing target news information, so as to obtain grabbing results of the target news information.
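The queue-based flow of this second aspect can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the patented implementation: the names `run_with_task_queue`, `grab`, `workers` and `period` are hypothetical, a seed URL that is already queued is simply skipped, and an "idle sub-thread" is modeled as a worker slot whose thread has finished.

```python
import queue
import threading
import time

def run_with_task_queue(seed_policies, grab, workers=2, period=0.01):
    """Main-thread loop: queue seed URLs, then hand them to idle sub-threads."""
    tasks, seen = queue.Queue(), set()
    for url, policy in seed_policies:
        if url not in seen:              # assumption: skip a seed URL already queued
            seen.add(url)
            tasks.put((url, policy))

    results, lock = {}, threading.Lock()

    def work(url, policy):
        outcome = grab(url, policy)      # the sub-thread grabs per its policy
        with lock:
            results[url] = outcome

    slots = [None] * workers             # one slot per sub-thread
    while not tasks.empty() or any(t is not None and t.is_alive() for t in slots):
        for i, t in enumerate(slots):
            idle = t is None or not t.is_alive()
            if idle and not tasks.empty():
                url, policy = tasks.get()
                slots[i] = threading.Thread(target=work, args=(url, policy))
                slots[i].start()
        time.sleep(period)               # query for idle sub-threads every period
    return results                       # the main thread gathers all results

seeds = [
    ("https://news.example.com/a", "text"),
    ("https://news.example.com/b", "picture"),
    ("https://news.example.com/a", "text"),   # duplicate, queued only once
    ("https://news.example.com/c", "text"),
]
calls = []

def fake_grab(url, policy):
    calls.append(url)
    return f"{policy} from {url}"

results = run_with_task_queue(seeds, fake_grab)
print(len(results), len(calls))
```

Polling at a fixed period keeps the main thread simple; a blocking worker pool would avoid the polling delay but would not mirror the periodic idle check described above.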
A third aspect of the present invention provides a news information-capturing device, the device including:
the analysis module is used for analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
the generation module is used for creating a grabbing strategy for each seed URL and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
the reading module is used for starting the main thread to read, one by one, the target seed URL and the corresponding grabbing strategy of each grabbing node in the target news information grabbing tree;
the starting module is used for starting a plurality of sub-threads when detecting that the main thread reads a preset number of target seed URLs, and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;
the grabbing module is used for controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer and carrying out grabbing processing;
and the statistics module is used for counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing results of the target news information after the grabbing processing of the plurality of sub-threads is detected.
A fourth aspect of the present invention provides an electronic device comprising a processor and a memory, the processor being configured to implement the news information-capturing method when executing a computer program stored in the memory.
A fifth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the news information-capturing method.
In summary, the news information grabbing method, device, electronic equipment and storage medium of the present invention offer three benefits. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target grabbing node read by the main thread and to perform grabbing processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that grab target news information with Puppeteer improves the grabbing efficiency of the target news information. Second, the main thread is started to read, one by one, the target seed URL and the corresponding grabbing strategy of each target grabbing node in the target news information grabbing tree, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs. Finally, creating a grabbing strategy for each seed URL improves the grabbing accuracy and efficiency of the target news information, and generating the target news information grabbing tree from the plurality of seed URLs and grabbing from each of its nodes avoids repeated or missed grabbing of seed URLs and improves the grabbing accuracy of the target news information.
Drawings
Fig. 1 is a flowchart of a news information capturing method according to a first embodiment of the present invention.
Fig. 2 is a flowchart of a news information capturing method according to a second embodiment of the present invention.
Fig. 3 is a block diagram of a news information-capturing device according to a third embodiment of the present invention.
Fig. 4 is a block diagram of a news information-capturing device according to a fourth embodiment of the present invention.
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present invention and features in the embodiments may be combined with each other.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example 1
Fig. 1 is a flowchart of a news information capturing method according to a first embodiment of the present invention.
In this embodiment, the news information capturing method may be applied to an electronic device, and for an electronic device that needs to capture news information, the news information capturing function provided by the method of the present invention may be directly integrated on the electronic device, or may be run in the electronic device in the form of a software development kit (Software Development Kit, SDK).
As shown in FIG. 1, the news information capturing method specifically includes the following steps, and the order of the steps in the flowchart may be changed according to different requirements, and some may be omitted.
S11, analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be grabbed, a grabbing request is initiated to a server through a client. The client may be a mobile phone, an iPad or another existing device with a sending function, and the server may be a grabbing subsystem. For example, the client sends the grabbing request to the grabbing subsystem, and when the server receives the grabbing request sent by the client, it analyzes the request.
In this embodiment, the grabbing request includes the grabbing requirements for the target news information, the seed URLs, and the page content, page structure and the like of each seed URL.
S12, creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy.
In this embodiment, since the page content, the page structure, and the crawling requirements corresponding to each seed URL are different, a different crawling policy is created for each seed URL.
In an alternative embodiment, said creating a crawling policy for each of said seed URLs comprises:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring grabbing requirements corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes items such as page view counts, unique visits, and page dwell time, and the page structure refers to the hierarchical relationship between pages.
For example, suppose the grabbing requirement for the first seed URL is text. The text content and text structure of the first seed URL are analyzed, and the corresponding grabbing strategy is set as follows: starting from the start page of the first seed URL, randomly select one URL to enter and grab the target text content layer by layer until grabbing of the first seed URL is complete. Suppose the grabbing requirement for the second seed URL is pictures. The picture content and picture structure of the second seed URL are analyzed, and the corresponding grabbing strategy is set as follows: predict the similarity between each URL under the second seed URL and the second seed URL itself, and select the URLs with the highest similarity for grabbing.
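The two example strategies above can be illustrated with a short Python sketch. It is a hedged illustration only: the `GrabPolicy` fields, the strategy names, the `link_count` key of the analysis result and the cap of five similar links are all assumptions introduced here, not details given by the patent.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GrabPolicy:
    seed_url: str
    requirement: str   # "text" or "picture"
    strategy: str      # hypothetical name of the traversal strategy
    max_links: int     # how many linked URLs to follow per page

def create_policy(seed_url: str, requirement: str, analysis: dict) -> GrabPolicy:
    """Combine the page analysis result with the grabbing requirement."""
    if requirement == "text":
        # Text: enter one link at random and grab layer by layer.
        return GrabPolicy(seed_url, requirement, "layer_by_layer", max_links=1)
    if requirement == "picture":
        # Picture: follow only the links most similar to the seed URL.
        top_k = min(analysis.get("link_count", 0), 5)
        return GrabPolicy(seed_url, requirement, "most_similar", max_links=top_k)
    raise ValueError(f"unsupported grabbing requirement: {requirement}")

policies = [
    create_policy("https://news.example.com/a", "text", {"link_count": 40}),
    create_policy("https://news.example.com/b", "picture", {"link_count": 3}),
]
for p in policies:
    print(p.seed_url, p.strategy, p.max_links)
```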
In an alternative embodiment, said generating a target news information-crawling tree from said plurality of seed URLs comprises:
converting the grabbing node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information grabbing tree;
converting a reference relationship between grabbing nodes of each seed URL in the plurality of seed URLs into edges between nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree serve as the reference relationship between the nodes of the target news information grabbing tree;
and generating a target news information grabbing tree according to the nodes of the target news information grabbing tree and the edges between those nodes.
In this embodiment, the target news information grabbing tree is generated from the grabbing nodes corresponding to the plurality of seed URLs and the reference relationships between those grabbing nodes. The reference relationships between the nodes of the tree may be preset, for example according to the degree of association between each seed URL and the target news information, or according to the degree of association among the plurality of seed URLs.
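As a rough illustration of turning seed URLs into nodes and reference relationships into edges, consider the Python sketch below. The `GrabNode` type, the `build_grab_tree` helper, the virtual `<root>` node and the rule of keeping only the first parent of a node are assumptions made for this sketch; the reference pairs are also assumed to be acyclic.

```python
from dataclasses import dataclass, field

@dataclass
class GrabNode:
    seed_url: str
    policy: str
    children: list["GrabNode"] = field(default_factory=list)

def build_grab_tree(policies: dict[str, str],
                    references: list[tuple[str, str]]) -> GrabNode:
    """Nodes are seed URLs; edges are (parent_url, child_url) references."""
    nodes = {url: GrabNode(url, policy) for url, policy in policies.items()}
    has_parent: set[str] = set()
    for parent, child in references:
        if child in has_parent:
            continue  # keep a single parent so the structure stays a tree
        nodes[parent].children.append(nodes[child])
        has_parent.add(child)
    # A virtual root collects every seed URL that nothing else refers to.
    root = GrabNode("<root>", "none")
    root.children = [n for url, n in nodes.items() if url not in has_parent]
    return root

tree = build_grab_tree(
    {
        "https://news.example.com/": "text",
        "https://news.example.com/world": "text",
        "https://news.example.com/photos": "picture",
    },
    [
        ("https://news.example.com/", "https://news.example.com/world"),
        ("https://news.example.com/", "https://news.example.com/photos"),
    ],
)
print([child.seed_url for child in tree.children[0].children])
```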
In this embodiment, creating a grabbing strategy for each seed URL improves the grabbing accuracy and efficiency of the target news information. Generating a target news information grabbing tree from the plurality of seed URLs and grabbing from each node of that tree avoids repeated or missed grabbing of seed URLs and improves the grabbing accuracy of the target news information.
S13, starting a main thread to read, one by one, the target seed URL and the corresponding grabbing strategy of each grabbing node in the target news information grabbing tree.
In this embodiment, when the server receives the grabbing request, the main thread is started to read, one by one, the target seed URL and the corresponding grabbing strategy of each grabbing node in the target news information grabbing tree. This avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs.
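One way to read every grabbing node exactly once is a breadth-first walk that remembers which seed URLs it has already yielded. The sketch below assumes a node layout of `(seed_url, policy, children)` tuples and a hypothetical `read_nodes` helper; the patent does not prescribe a traversal order.

```python
from collections import deque
from typing import Iterator

def read_nodes(root) -> Iterator[tuple[str, str]]:
    """Yield (seed_url, policy) for each node once, in breadth-first order."""
    seen = set()
    pending = deque([root])
    while pending:
        url, policy, children = pending.popleft()
        if url in seen:
            continue          # guards against reading the same seed URL twice
        seen.add(url)
        yield url, policy
        pending.extend(children)

asia = ("https://news.example.com/world/asia", "text", [])
tree = ("https://news.example.com/", "text", [
    ("https://news.example.com/world", "text", [asia]),
    ("https://news.example.com/photos", "picture", [asia]),  # second reference
])
read = list(read_nodes(tree))
print(len(read))
```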
S14, when the main thread is detected to read the target seed URLs with the preset number, starting a plurality of sub-threads, and distributing the target seed URLs with the preset number read by the main thread to the plurality of sub-threads according to a preset distribution rule.
In this embodiment, an allocation rule may be preset, and specifically, the preset allocation rule may be equal division, random allocation, or allocation according to a certain multiple.
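The three allocation rules named above might look like the following in Python. The function names and the interpretation of "allocation according to a certain multiple" as proportional integer weights are assumptions of this sketch.

```python
import random

def divide_equally(urls, n):
    """Round-robin split: bucket sizes differ by at most one."""
    return [urls[i::n] for i in range(n)]

def divide_randomly(urls, n, seed=None):
    """Assign each URL to a randomly chosen sub-thread."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(n)]
    for url in urls:
        buckets[rng.randrange(n)].append(url)
    return buckets

def divide_by_multiple(urls, weights):
    """Give sub-thread i a share proportional to weights[i]."""
    total = sum(weights)
    buckets, start = [], 0
    for i, w in enumerate(weights):
        # The last bucket takes the remainder so that no URL is dropped.
        end = len(urls) if i == len(weights) - 1 else start + len(urls) * w // total
        buckets.append(urls[start:end])
        start = end
    return buckets

urls = [f"https://news.example.com/{i}" for i in range(12)]
equal = divide_equally(urls, 3)
rand = divide_randomly(urls, 3, seed=0)
weighted = divide_by_multiple(urls, [1, 2, 3])
print([len(b) for b in equal], [len(b) for b in weighted])
```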
In this embodiment, starting multiple sub-threads at the same time saves the time the main thread would otherwise spend waiting between reading data and starting sub-threads, which further improves the processing efficiency of the server.
In other alternative embodiments, S14 may also be: each time the main thread is detected to have read the preset number of target seed URLs, one sub-thread is correspondingly started, and that preset number of target seed URLs is distributed to the sub-thread.
In this embodiment, the preset number is a preset threshold for starting a sub-thread.
For example, assuming that the preset number is 100,000, when the main thread has read the target seed URLs from the 1st grabbing node to the 100,000th grabbing node, the server starts a sub-thread; then, when the main thread has read the target seed URLs from the 100,001st grabbing node to the 200,000th grabbing node, the server starts another sub-thread. That is, each time the server detects that the main thread has read the target seed URLs of the preset number of grabbing nodes, a sub-thread is started. Starting a plurality of sub-threads, each processing a corresponding number of target seed URLs, speeds up the grabbing of news information from the target seed URLs to a certain extent.
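The threshold-based variant in this example can be sketched with Python's standard `threading` module. The helper name `read_and_dispatch` and the `grab` callback are hypothetical, and handing any leftover URLs below the threshold to one final sub-thread is an assumption, since the text does not say how a partial batch is handled.

```python
import threading

def read_and_dispatch(seed_urls, preset_number, grab):
    """Start one sub-thread each time `preset_number` URLs have been read."""
    threads, batch = [], []
    for url in seed_urls:              # the main thread reads URLs one by one
        batch.append(url)
        if len(batch) == preset_number:
            t = threading.Thread(target=grab, args=(batch,))
            t.start()
            threads.append(t)
            batch = []
    if batch:                          # assumption: leftovers get their own sub-thread
        t = threading.Thread(target=grab, args=(batch,))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()                       # wait until every sub-thread has finished
    return len(threads)

sizes, lock = [], threading.Lock()

def record(batch):
    with lock:
        sizes.append(len(batch))

urls = [f"https://news.example.com/{i}" for i in range(25)]
started = read_and_dispatch(urls, 10, record)
print(started, sorted(sizes))
```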
S15, controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. Puppeteer runs in headless mode by default, but it can be configured to run in non-headless mode.
In an alternative embodiment, the controlling each of the sub-threads to open each target seed URL read by the main thread using Puppeteer, and performing the grabbing processing includes:
using Puppeteer to start a headless browser to open each target seed URL read by the main thread together with the corresponding grabbing strategy;
jumping to a target page corresponding to the target seed URL;
and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to the target seed URL.
In this embodiment, Puppeteer can run on a Linux server and provides a series of APIs that are convenient to call, so Puppeteer is used to start a headless browser to open the target seed URL of each grabbing node.
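The control flow of the three sub-steps (start a headless browser, jump to the target page, grab according to the strategy) is sketched below in Python with the browser hidden behind a hypothetical `Page` interface. This interface is a stand-in invented for the sketch and is not Puppeteer's API; in Node.js the corresponding Puppeteer calls would be `puppeteer.launch()`, `browser.newPage()` and `page.goto(url)`. The `FakePage` class and the `selector` key of the policy are likewise assumptions that let the example run without a browser.

```python
from typing import Callable, Protocol

class Page(Protocol):
    def goto(self, url: str) -> None: ...
    def query_all(self, selector: str) -> list[str]: ...

class FakePage:
    """In-memory stand-in so the sketch runs without a real browser."""
    def __init__(self, site: dict[str, dict[str, list[str]]]):
        self.site, self.current = site, ""

    def goto(self, url: str) -> None:
        self.current = url

    def query_all(self, selector: str) -> list[str]:
        return self.site.get(self.current, {}).get(selector, [])

def grab(open_headless_page: Callable[[], Page], seed_url: str, policy: dict) -> list[str]:
    page = open_headless_page()                 # step 1: start the headless browser
    page.goto(seed_url)                         # step 2: jump to the target page
    return page.query_all(policy["selector"])   # step 3: grab per the strategy

site = {"https://news.example.com/a": {"h1": ["Headline A"], "img": ["a.png"]}}
headlines = grab(lambda: FakePage(site), "https://news.example.com/a", {"selector": "h1"})
pictures = grab(lambda: FakePage(site), "https://news.example.com/a", {"selector": "img"})
print(headlines, pictures)
```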
S16, counting the grabbing results of the plurality of sub-threads through the main thread after the plurality of sub-threads are detected to finish grabbing processing, and obtaining the target grabbing results of the target news information.
In this embodiment, after detecting that the plurality of sub-threads have finished grabbing the target news information, the main thread obtains the grabbing result of each sub-thread.
In this embodiment, the grabbing result may include, but is not limited to: successfully grabbed data, abnormally grabbed data, grabbed identified data, grabbed identical picture data, and grabbed identical text data.
In this embodiment, each time the main thread obtains the grabbing result of a sub-thread, it stores that result in a cache of the server. After all sub-threads have finished grabbing, the main thread gathers the grabbing results of all sub-threads and computes the target grabbing result from them.
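A minimal Python sketch of this collect-then-count step follows, using the standard `concurrent.futures` thread pool. The `grab_batch` stand-in, the two result categories and the rule that a URL ending in `/bad` is abnormal are purely illustrative assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def grab_batch(urls):
    """Stand-in for a sub-thread: label each URL with a result category."""
    return [(u, "abnormal" if u.endswith("/bad") else "success") for u in urls]

def collect_results(batches):
    cache = []                                   # one entry per sub-thread
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        for result in pool.map(grab_batch, batches):
            cache.append(result)                 # the main thread caches each result
    # All sub-threads are done once the pool exits; now compute the totals.
    return Counter(status for result in cache for _, status in result)

batches = [
    ["https://news.example.com/1", "https://news.example.com/bad"],
    ["https://news.example.com/2", "https://news.example.com/3"],
]
totals = collect_results(batches)
print(totals["success"], totals["abnormal"])
```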
Further, the method further comprises:
detecting whether an abnormal event occurs to a child thread;
when detecting that an abnormal event occurs in a sub-thread, deleting the data grabbed by the sub-thread in which the abnormal event occurred.
In this embodiment, deleting the data grabbed by the sub-thread in which the abnormal event occurred ensures the accuracy of the grabbed target news information.
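The sketch below shows one possible reading of this step in Python, where an "abnormal event" is assumed to be an uncaught exception in the sub-thread. The helper `run_sub_threads` and the `flaky_grab` stand-in are hypothetical names; the index of each failed sub-thread is kept so that its grabbing node can be examined afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sub_threads(batches, grab):
    """Return the data to keep and the indexes of sub-threads that failed."""
    kept, abnormal = [], []
    with ThreadPoolExecutor(max_workers=len(batches)) as pool:
        futures = [pool.submit(grab, batch) for batch in batches]
    # Leaving the `with` block waits for every sub-thread to finish.
    for index, future in enumerate(futures):
        if future.exception() is not None:
            abnormal.append(index)        # data from this sub-thread is discarded
        else:
            kept.extend(future.result())
    return kept, abnormal

def flaky_grab(batch):
    if any("timeout" in url for url in batch):
        raise RuntimeError("page timed out")
    return [f"content of {url}" for url in batch]

kept, abnormal = run_sub_threads(
    [
        ["https://news.example.com/1"],
        ["https://news.example.com/timeout"],
        ["https://news.example.com/2"],
    ],
    flaky_grab,
)
print(kept, abnormal)
```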
Further, the method further comprises:
identifying a target grabbing node corresponding to the child thread with the abnormal event;
and verifying the target seed URL and the corresponding target grabbing strategy in the target grabbing node.
In some other optional embodiments, the verifying the target seed URL and the corresponding target crawling policy in the target crawling node includes:
matching a target seed URL in the target grabbing node with the seed URLs;
when the target seed URL in the target grabbing node is matched with any one seed URL in the plurality of seed URLs, judging whether the target grabbing strategy is the grabbing strategy of the target seed URL;
and when the target crawling policy is the crawling policy of the target seed URL, sending crawling suggestions to the client.
In this embodiment, verifying the target seed URL and the target grabbing strategy in the target grabbing node corresponding to the sub-thread in which the abnormal event occurred helps operation and maintenance personnel quickly analyze the cause of the grabbing anomaly and improves their working efficiency.
In this embodiment, the grabbing suggestion may be set according to the cause of the grabbing anomaly. For example, the grabbing suggestion may advise the client to provide the grabbing requirement again, or advise the client to check whether the provided seed URL is wrong. Sending the grabbing suggestion to the client helps the client make a decision quickly and improves both the client experience and the grabbing efficiency.
Further, the method further comprises:
when the target seed URL in the target grabbing node is not matched with any one of the seed URLs, correcting the target seed URL in the target grabbing node, and secondarily grabbing the corrected target seed URL.
Further, the method further comprises:
when the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and secondarily grabbing the target seed URL in the target grabbing node according to the corrected grabbing strategy.
In this embodiment, the target seed URL and the grabbing strategy in the abnormal target grabbing node are verified. When verification finds that the target seed URL and/or the grabbing strategy is inconsistent, it is corrected and the node is grabbed a second time, which improves the completeness of the grabbed target news information.
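The verification branches described above can be summarized as a small decision function. The Python sketch below is illustrative only: the dictionary layout of a node, the `verify_node` name and the returned action labels are assumptions, and how a mismatched seed URL is actually corrected is left to the caller because the text does not specify it.

```python
def verify_node(node, seed_policies):
    """Decide what to do with a grabbing node whose sub-thread failed.

    `node` is {"seed_url": ..., "policy": ...}; `seed_policies` maps each
    original seed URL to the grabbing strategy created for it.
    """
    url, policy = node["seed_url"], node["policy"]
    if url not in seed_policies:
        # The URL matches none of the seed URLs: correct it, then grab again.
        return "correct_url_and_regrab"
    if policy != seed_policies[url]:
        node["policy"] = seed_policies[url]   # correct the strategy in place
        return "regrab_with_corrected_policy"
    # URL and strategy are both consistent: advise the client instead.
    return "send_suggestion_to_client"

seed_policies = {
    "https://news.example.com/a": "text",
    "https://news.example.com/b": "picture",
}
ok_node = {"seed_url": "https://news.example.com/a", "policy": "text"}
bad_policy_node = {"seed_url": "https://news.example.com/b", "policy": "text"}
bad_url_node = {"seed_url": "https://news.example.com/x", "policy": "text"}
actions = [verify_node(n, seed_policies) for n in (ok_node, bad_policy_node, bad_url_node)]
print(actions)
```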
In summary, the news information grabbing method in this embodiment offers three benefits. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target grabbing node read by the main thread and to perform grabbing processing; compared with a non-headless browser, this reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that grab target news information with Puppeteer improves the grabbing efficiency of the target news information. Second, the main thread is started to read, one by one, the target seed URL and the corresponding grabbing strategy of each target grabbing node in the target news information grabbing tree, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs. Finally, creating a grabbing strategy for each seed URL improves the grabbing accuracy and efficiency of the target news information, and generating the target news information grabbing tree from the plurality of seed URLs and grabbing from each of its nodes avoids repeated or missed grabbing of seed URLs and improves the grabbing accuracy of the target news information.
Example 2
Fig. 2 is a flowchart of a news information capturing method according to a second embodiment of the present invention.
In this embodiment, the news information capturing method may be applied to an electronic device, and for an electronic device that needs to capture news information, the news information capturing function provided by the method of the present invention may be directly integrated on the electronic device, or may be run in the electronic device in the form of a software development kit (Software Development Kit, SDK).
As shown in FIG. 2, the news information-capturing method specifically includes the following steps, and the order of the steps in the flowchart may be changed according to different requirements, and some may be omitted.
S21, analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be grabbed, a grabbing request is initiated to a server through a client. The client may be a mobile phone, an iPad or another existing device with a sending function, and the server may be a grabbing subsystem. For example, the client sends the grabbing request to the grabbing subsystem, and when the server receives the grabbing request sent by the client, it analyzes the request.
In this embodiment, the grabbing request includes the grabbing requirements for the target news information, the seed URLs, and the page content, page structure and the like of each seed URL.
S22, creating a grabbing strategy for each seed URL, and associating each seed URL with the corresponding grabbing strategy.
In this embodiment, since the page content, the page structure, and the crawling requirements corresponding to each seed URL are different, a different crawling policy is created for each seed URL.
In an alternative embodiment, said creating a crawling policy for each of said seed URLs comprises:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring grabbing requirements corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes items such as the number of page views, the number of unique visits, and the page dwell time, and the page structure refers to the hierarchical relationship among the pages.
For example, suppose the grabbing requirement corresponding to the first seed URL is text. The text content and text structure in the first seed URL are analyzed, and the corresponding grabbing strategy is set as follows: starting from the start page of the first seed URL, randomly select one URL to enter, and grab the target text content layer by layer until grabbing of the first seed URL is completed. Suppose the grabbing requirement corresponding to the second seed URL is pictures. The picture content and picture structure in the second seed URL are analyzed, and the corresponding grabbing strategy is set as follows: predict the similarity between the second seed URL and each URL within it, and select the URLs with higher similarity for grabbing.
In this embodiment, by creating a capture policy for each seed URL, the capturing accuracy and efficiency of the target news information are improved.
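The association between seed URLs and grabbing strategies in S22 can be illustrated with a minimal sketch. Everything named here is hypothetical: the function names, the requirement labels "text" and "picture", and in particular the use of plain URL string similarity (via `difflib`) as a stand-in for the similarity prediction, which the embodiment does not specify.

```python
import random
from difflib import SequenceMatcher
from typing import Callable, Dict, List

# A strategy maps (seed URL, links found on the page) to the links to grab next.
Strategy = Callable[[str, List[str]], List[str]]


def layer_by_layer(seed_url: str, links: List[str]) -> List[str]:
    """Text requirement: randomly select one link to enter at each layer."""
    return [random.choice(links)] if links else []


def most_similar(seed_url: str, links: List[str], top_k: int = 2) -> List[str]:
    """Picture requirement: keep the links most similar to the seed URL.

    Assumption: similarity is approximated by string similarity of the URLs.
    """
    def score(link: str) -> float:
        return SequenceMatcher(None, seed_url, link).ratio()

    return sorted(links, key=score, reverse=True)[:top_k]


# Hypothetical mapping from grabbing requirement to strategy.
STRATEGY_BY_REQUIREMENT: Dict[str, Strategy] = {
    "text": layer_by_layer,
    "picture": most_similar,
}


def create_strategies(requirements: Dict[str, str]) -> Dict[str, Strategy]:
    """Associate each seed URL with the strategy for its grabbing requirement."""
    return {url: STRATEGY_BY_REQUIREMENT[req] for url, req in requirements.items()}
```

Keeping the association in a dictionary keyed by seed URL means the later steps can look up the strategy for a URL without re-analyzing the page.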
S23, starting the target main thread to create a task queue, reading each associated seed URL, and sequentially sending the seed URL to the task queue.
In this embodiment, because each seed URL has an association relationship with a corresponding crawling policy, when a server receives a crawling request, the server starts a target main thread to create a task queue according to the crawling request, and sends each seed URL after association to the task queue in sequence.
In this embodiment, capturing the target news information through a task queue prevents any seed URL from being captured repeatedly or omitted, which improves the accuracy of capturing the target news information.
S24, judging whether the same seed URL exists in the task queue.
In this embodiment, judging whether the same seed URL already exists in the task queue avoids repeated grabbing of the same seed URL, thereby improving the grabbing accuracy and efficiency of the target news information.
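The task queue of S23 and the duplicate check of S24 might be combined as in the following sketch. The class name `TaskQueue` and its methods are illustrative assumptions; the embodiment does not prescribe a data structure, and here the check is performed at enqueue time.

```python
from collections import deque
from typing import Deque, Set, Tuple


class TaskQueue:
    """FIFO queue of (seed URL, strategy name) pairs that rejects duplicate seed URLs."""

    def __init__(self) -> None:
        self._items: Deque[Tuple[str, str]] = deque()
        self._seen: Set[str] = set()

    def contains(self, seed_url: str) -> bool:
        """Return True if this seed URL has already been sent to the queue."""
        return seed_url in self._seen

    def put(self, seed_url: str, strategy: str) -> bool:
        """Enqueue the pair; return False (and enqueue nothing) for a duplicate."""
        if self.contains(seed_url):
            return False
        self._seen.add(seed_url)
        self._items.append((seed_url, strategy))
        return True

    def get(self) -> Tuple[str, str]:
        """Remove and return the oldest pair."""
        return self._items.popleft()

    def __len__(self) -> int:
        return len(self._items)
```

In this sketch the set of seen URLs is kept after a pair is dequeued, so a seed URL that was already handed out is not accepted a second time.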
S25, when the same seed URL does not exist in the task queue, querying at every preset period whether an idle sub-thread exists among the plurality of target sub-threads started by the target main thread.
In this embodiment, the period may be set in advance; for example, the preset period may be 1 minute or 30 seconds. An idle sub-thread is a sub-thread that currently has no task.
Further, the method further comprises:
and when the same seed URL exists in the task queue, continuing to read each associated seed URL and sequentially sending the seed URL to the task queue.
S26, when idle sub-threads exist in a plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-threads.
In this embodiment, whether an idle sub-thread exists among the target sub-threads started by the target main thread is queried at every preset period, and when an idle sub-thread exists, each seed URL read by the target main thread is distributed to it. This avoids uneven task distribution among the target sub-threads and improves the speed at which the sub-threads read the seed URLs.
Further, the method further comprises:
when the idle sub-threads do not exist in the target sub-threads started by the target main thread, continuously inquiring whether the idle sub-threads exist in the target sub-threads started by the target main thread or not at preset intervals.
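The periodic idle query of S25 and the distribution of S26 can be sketched with standard threads. The names `SubThread` and `dispatch` are hypothetical, and the handler passed in stands for whatever grabbing work a sub-thread performs; this is one possible arrangement, not the arrangement of the embodiment.

```python
import queue
import threading
import time
from collections import deque
from typing import Callable, Deque, Iterable, List, Optional


class SubThread:
    """A long-lived sub-thread that handles one seed URL at a time."""

    def __init__(self, handler: Callable[[str], None]) -> None:
        self._handler = handler
        self._inbox: "queue.Queue[Optional[str]]" = queue.Queue()
        self._busy = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def is_idle(self) -> bool:
        return not self._busy.is_set()

    def assign(self, seed_url: str) -> None:
        self._busy.set()  # mark busy before the task is handed over
        self._inbox.put(seed_url)

    def stop(self) -> None:
        self._inbox.put(None)  # sentinel: exit once earlier tasks are done
        self._thread.join()

    def _run(self) -> None:
        while True:
            seed_url = self._inbox.get()
            if seed_url is None:
                return
            try:
                self._handler(seed_url)
            except Exception:
                pass  # error handling is out of scope for this sketch
            finally:
                self._busy.clear()  # the sub-thread is idle again


def dispatch(seed_urls: Iterable[str], workers: List[SubThread], period: float) -> None:
    """Main-thread loop: at every period, hand pending seed URLs to idle sub-threads."""
    pending: Deque[str] = deque(seed_urls)
    while pending:
        for worker in workers:
            if pending and worker.is_idle():
                worker.assign(pending.popleft())
        if pending:
            time.sleep(period)  # no idle sub-thread left: query again after one period
```

A call such as `dispatch(urls, workers, period=30.0)` would correspond to the 30-second period mentioned above; at least one worker must be supplied, otherwise the loop never finishes.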
S27, controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and grabbing target news information on each seed URL according to the corresponding grabbing strategy.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. By default Puppeteer runs in headless mode, but it may be configured to run in non-headless mode.
In an optional embodiment, the controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and grabbing target news information on each seed URL according to the corresponding grabbing strategy, includes:
starting a headless browser by using Puppeteer to open each seed URL and the corresponding grabbing strategy read by the target main thread;
jumping to a target page corresponding to each seed URL;
and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to each seed URL.
In this embodiment, because Puppeteer can run on a Linux server and provides a series of APIs, it is convenient to call. Using Puppeteer to start a headless browser to open each seed URL reduces the rendering work of a real browser compared with a non-headless browser and improves reading efficiency, and starting a plurality of target sub-threads that each use Puppeteer to grab target news information improves the grabbing efficiency of the target news information.
S28, after detecting that the idle sub-threads have finished grabbing the target news information, aggregating the grabbing results of the idle sub-threads through the target main thread to obtain the target grabbing result of the target news information.
In this embodiment, after the idle sub-thread is detected to complete capturing of the target news information, a capturing result of each idle sub-thread is obtained through the target main thread.
In this embodiment, the capturing result may include, but is not limited to: grabbing successful data, grabbing abnormal data, grabbing identified data, grabbing identical picture data and grabbing identical text data.
In this embodiment, each time the target main thread acquires the grabbing result of an idle sub-thread, it stores the result in a cache of the server. After all the idle sub-threads have finished grabbing, the target main thread aggregates the grabbing results of all the idle sub-threads to obtain the target grabbing result.
Further, while the idle sub-thread is grabbing the target news information, the method further comprises the following steps:
detecting whether an abnormal event occurs in the idle sub-thread;
when an abnormal event is detected in the idle sub-thread, deleting the data grabbed by the idle sub-thread in which the abnormal event occurred.
In this embodiment, deleting the data grabbed by the idle sub-thread in which the abnormal event occurred ensures the accuracy of the grabbed target news information.
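The aggregation in S28 together with the deletion of data from an abnormal sub-thread can be sketched as follows. The function `grab_all`, the shape of the returned summary, and the choice of `ThreadPoolExecutor` are assumptions made for illustration; an exception raised by the grabbing callable plays the role of the abnormal event.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List


def grab_all(
    seed_urls: List[str],
    grab: Callable[[str, List[str]], None],
    max_workers: int = 4,
) -> Dict[str, object]:
    """Run grab(url, out) in sub-threads; the main thread collects the results.

    Each sub-thread appends grabbed items to its own list in the cache.
    Assumes seed_urls contains no duplicates.
    """
    cache: Dict[str, List[str]] = {url: [] for url in seed_urls}
    abnormal: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(grab, url, cache[url]): url for url in seed_urls}
        for future in as_completed(futures):
            url = futures[future]
            if future.exception() is not None:
                # Abnormal event: delete whatever this sub-thread had grabbed so far.
                del cache[url]
                abnormal.append(url)
    return {
        "results": cache,
        "abnormal": sorted(abnormal),
        "item_count": sum(len(items) for items in cache.values()),
    }
```

Because each sub-thread writes only to its own list, partial data from a failed sub-thread can be removed without touching the results of the others.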
In summary, the news information capturing method of this embodiment offers three benefits. First, the idle sub-thread is controlled to use Puppeteer to open each seed URL read by the target main thread and to grab target news information from each seed URL according to the corresponding grabbing strategy; opening each seed URL in a headless browser reduces the rendering work of a real browser compared with a non-headless browser and improves reading efficiency, and starting a plurality of target sub-threads that each use Puppeteer improves the grabbing efficiency of the target news information. Second, the target main thread is started to create a task queue, and each associated seed URL is read and sent to the task queue in sequence; grabbing through a task queue prevents any seed URL from being grabbed repeatedly or omitted, which improves the accuracy of grabbing the target news information. Finally, whether an idle sub-thread exists among the target sub-threads started by the target main thread is queried at every preset period, and when one exists, each seed URL read by the target main thread is distributed to it; this avoids uneven task distribution among the target sub-threads and improves the speed at which the sub-threads read the seed URLs.
Example three
Fig. 3 is a block diagram of a news information-capturing device according to a third embodiment of the present invention.
In some embodiments, the news information-capturing device 30 may include a plurality of functional modules composed of program code segments. Program code for each program segment in the news information-capturing device 30 may be stored in a memory of the electronic device and executed by the at least one processor to perform the news information-capturing function (described in detail with reference to fig. 1).
In this embodiment, the news information-capturing device 30 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a parsing module 301, a generating module 302, a reading module 303, a starting module 304, a grabbing module 305, a statistics module 306, and an identifying module 307. A module in the present invention refers to a series of computer program segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. The functions of the respective modules are described in detail below.
The parsing module 301 is configured to parse the received crawling request of the target news information to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be captured, a client initiates a capturing request to a server. The client may be a mobile phone, an iPad, or another existing device with a sending function, and the server may be a capturing subsystem. For example, the client sends the capturing request to the capturing subsystem, and the server parses the capturing request upon receiving it.
In this embodiment, the crawling request includes, for the target news information to be crawled, the crawling requirements, the seed URLs, the page content and page structure of each seed URL, and the like.
The generating module 302 is configured to create a crawling policy for each seed URL, and generate a target news information crawling tree according to the plurality of seed URLs, where each crawling node of the target news information crawling tree includes a corresponding crawling policy.
In this embodiment, since the page content, the page structure, and the crawling requirements corresponding to each seed URL are different, a different crawling policy is created for each seed URL.
In an alternative embodiment, the generating module 302 creating a crawling policy for each of the seed URLs includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring grabbing requirements corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes items such as the number of page views, the number of unique visits, and the page dwell time, and the page structure refers to the hierarchical relationship among the pages.
For example, suppose the grabbing requirement corresponding to the first seed URL is text. The text content and text structure in the first seed URL are analyzed, and the corresponding grabbing strategy is set as follows: starting from the start page of the first seed URL, randomly select one URL to enter, and grab the target text content layer by layer until grabbing of the first seed URL is completed. Suppose the grabbing requirement corresponding to the second seed URL is pictures. The picture content and picture structure in the second seed URL are analyzed, and the corresponding grabbing strategy is set as follows: predict the similarity between the second seed URL and each URL within it, and select the URLs with higher similarity for grabbing.
In an alternative embodiment, the generating module 302 generating the target news information-crawling tree according to the plurality of seed URLs includes:
converting the grabbing node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information grabbing tree;
converting a reference relationship between grabbing nodes of each seed URL in the plurality of seed URLs into edges between nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree serve as the reference relationship between the nodes of the target news information grabbing tree;
and generating a target news information capture tree according to edges between the nodes of the target news information capture tree and the nodes in the target news information capture tree.
In this embodiment, the target news information capturing tree is generated according to the capturing nodes corresponding to the plurality of seed URLs and the reference relationships between the capturing nodes of the seed URLs. The reference relationships between the nodes of the target news information capturing tree may be preset, for example according to the degree of association between each seed URL and the target news information, or according to the degree of association among the plurality of seed URLs.
In this embodiment, creating a crawling policy for each seed URL improves the crawling accuracy and efficiency of the target news information. Generating a target news information crawling tree from the plurality of seed URLs and crawling from each node of the tree avoids repeated or missed crawling of seed URLs and improves the crawling accuracy of the target news information.
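The conversion of seed URLs into nodes and of reference relationships into edges can be sketched as below. The names `GrabNode`, `build_tree`, and `walk` are hypothetical, strategies are represented by plain strings for brevity, and the sketch assumes the caller designates which seed URL is the root and that the references form a tree.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class GrabNode:
    """One node of the grabbing tree: a seed URL plus its grabbing strategy."""
    seed_url: str
    strategy: str
    children: List["GrabNode"] = field(default_factory=list)


def build_tree(
    strategies: Dict[str, str],
    references: List[Tuple[str, str]],
    root_url: str,
) -> GrabNode:
    """Turn seed URLs into nodes and (parent, child) references into edges."""
    nodes = {url: GrabNode(url, strategy) for url, strategy in strategies.items()}
    for parent_url, child_url in references:
        nodes[parent_url].children.append(nodes[child_url])
    return nodes[root_url]


def walk(root: GrabNode) -> Iterator[GrabNode]:
    """Visit every node exactly once, parents before children (pre-order)."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(reversed(node.children))
```

Reading the nodes through `walk` is one way a main thread could obtain each target seed URL and its strategy one by one without skipping or repeating a node.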
The reading module 303 is configured to start the main thread to read, one by one, the target seed URL and the corresponding crawling policy of each crawling node in the target news information crawling tree.
In this embodiment, when the server receives the crawling request, the main thread is started to read, one by one, the target seed URL and the corresponding crawling policy of each crawling node in the target news information crawling tree, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs.
The starting module 304 is configured to start a plurality of sub-threads when detecting that the main thread has read a preset number of target seed URLs, and to distribute the preset number of target seed URLs read by the main thread among the plurality of sub-threads according to a preset allocation rule.
In this embodiment, an allocation rule may be preset, and specifically, the preset allocation rule may be equal division, random allocation, or allocation according to a certain multiple.
In this embodiment, by starting multiple sub-threads at the same time, the waiting time of the main thread for reading data and starting the sub-threads can be saved, so that the processing efficiency of the server is further improved.
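Two of the allocation rules mentioned above, equal division and random allocation, can be sketched as follows. The function names are illustrative assumptions, and allocation by a certain multiple is left out because the embodiment does not define it further.

```python
import random
from typing import List, Optional


def split_equally(urls: List[str], n: int) -> List[List[str]]:
    """Equal division: deal the URLs round-robin into n shares."""
    return [urls[i::n] for i in range(n)]


def split_randomly(urls: List[str], n: int, seed: Optional[int] = None) -> List[List[str]]:
    """Random allocation: put each URL into a randomly chosen share."""
    rng = random.Random(seed)
    shares: List[List[str]] = [[] for _ in range(n)]
    for url in urls:
        rng.choice(shares).append(url)
    return shares
```

With equal division the share sizes differ by at most one, whereas random allocation may leave some sub-threads with noticeably more URLs than others.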
In other alternative embodiments, the starting module 304 is further configured to start one sub-thread each time the main thread is detected to have read the preset number of target seed URLs, and to distribute that preset number of target seed URLs to the sub-thread.
In this embodiment, the preset number is a preset threshold for starting a sub-thread.
For example, assume the preset number is 100,000. When the main thread has read the target seed URLs from the 1st grabbing node through the 100,000th grabbing node, the server starts one sub-thread; then, when the main thread has read the target seed URLs from the 100,001st grabbing node through the 200,000th grabbing node, the server starts another sub-thread. That is, each time the server detects that the main thread has read the target seed URLs of the preset number of grabbing nodes, it starts a sub-thread. Starting a plurality of sub-threads and having each process its share of target seed URLs speeds up, to some extent, the grabbing of news information from the target seed URLs.
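The threshold behaviour in this example, one sub-thread per preset number of URLs read, can be sketched as below. The function name `read_and_spawn` and the callable `grab_batch` are hypothetical, and starting an extra sub-thread for a final batch smaller than the preset number is an assumption of this sketch rather than something the embodiment states.

```python
import threading
from typing import Callable, Iterable, List


def read_and_spawn(
    seed_urls: Iterable[str],
    preset_number: int,
    grab_batch: Callable[[List[str]], None],
) -> List[threading.Thread]:
    """Read seed URLs one by one; start a sub-thread for every full batch."""
    threads: List[threading.Thread] = []
    batch: List[str] = []

    def start(urls: List[str]) -> None:
        thread = threading.Thread(target=grab_batch, args=(urls,))
        thread.start()
        threads.append(thread)

    for url in seed_urls:
        batch.append(url)
        if len(batch) == preset_number:  # threshold reached: start one sub-thread
            start(batch)
            batch = []
    if batch:  # assumption: leftover URLs also get their own sub-thread
        start(batch)
    return threads
```

The caller joins the returned threads before aggregating results, so the main thread keeps reading while earlier batches are already being grabbed.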
The grabbing module 305 is configured to control each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and to perform grabbing processing.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. By default Puppeteer runs in headless mode, but it may be configured to run in non-headless mode.
In an alternative embodiment, the grabbing module 305 controlling each of the sub-threads to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing, includes:
starting a headless browser by using Puppeteer to open each target seed URL and the corresponding grabbing strategy read by the main thread;
jumping to a target page corresponding to the target seed URL;
and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to the target seed URL.
In this embodiment, because Puppeteer can run on a Linux server and provides a series of APIs, it is convenient to call; Puppeteer is used to start a headless browser that opens the target seed URL of each grabbing node.
The statistics module 306 is configured to, after detecting that the plurality of sub-threads have completed the grabbing processing, aggregate the grabbing results of the plurality of sub-threads through the main thread to obtain a target grabbing result of the target news information.
In this embodiment, after detecting that the multiple sub-threads complete capturing of the target news information, capturing results of each sub-thread are obtained through the main thread.
In this embodiment, the capturing result may include, but is not limited to: grabbing successful data, grabbing abnormal data, grabbing identified data, grabbing identical picture data and grabbing identical text data.
In this embodiment, each time the main thread acquires the grabbing result of a sub-thread, it stores the result in a cache of the server. After all the sub-threads have finished grabbing, the main thread aggregates the grabbing results of all the sub-threads to obtain the target grabbing result.
Further, while the plurality of sub-threads are grabbing the target news information, whether an abnormal event occurs in any sub-thread is detected; when an abnormal event is detected in a sub-thread, the data grabbed by the sub-thread in which the abnormal event occurred is deleted.
In this embodiment, deleting the data grabbed by the sub-thread in which the abnormal event occurred ensures the accuracy of the grabbed target news information.
Further, an identifying module 307 is configured to identify a target grabbing node corresponding to the child thread that has an abnormal event; and verifying the target seed URL and the corresponding target grabbing strategy in the target grabbing node.
In some other optional embodiments, the verifying the target seed URL and the corresponding target crawling policy in the target crawling node includes:
matching a target seed URL in the target grabbing node with the seed URLs;
when the target seed URL in the target grabbing node is matched with any one seed URL in the plurality of seed URLs, judging whether the target grabbing strategy is the grabbing strategy of the target seed URL;
and when the target crawling policy is the crawling policy of the target seed URL, sending crawling suggestions to the client.
In this embodiment, verifying the target seed URL and the target grabbing strategy in the target grabbing node corresponding to the sub-thread in which the abnormal event occurred helps operation and maintenance personnel quickly analyze the cause of the grabbing anomaly and improves their working efficiency.
In this embodiment, the crawling suggestion may be set according to the cause of the crawling anomaly. For example, the crawling suggestion may advise the client to provide the crawling requirement again, or to check whether the provided seed URL is wrong. Sending the crawling suggestion to the client helps the client make a decision quickly, which improves the customer experience and the crawling efficiency.
Further, when the target seed URL in the target grabbing node is not matched with any one of the seed URLs, the target seed URL in the target grabbing node is revised, and the revised target seed URL is grabbed secondarily.
Further, when the target grabbing strategy is not the grabbing strategy of the target seed URL, the grabbing strategy in the target grabbing node is modified, and the target seed URL in the target grabbing node is secondarily grabbed according to the modified grabbing strategy.
In this embodiment, the target seed URL and the capture policy in the abnormal target capture node are verified, and when the verification determines that the target seed URL and/or the capture policy are inconsistent, the target seed URL and/or the capture policy in the abnormal target capture node are corrected and then captured secondarily, thereby improving the integrity of captured target news information.
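The verification branches for an abnormal target grabbing node can be summarized in a short decision function. The function name and the returned action labels are hypothetical; they only name the three outcomes described above and do not perform the correction or the secondary grab themselves.

```python
from typing import Dict, Set


def verify_node(
    node_url: str,
    node_strategy: str,
    seed_urls: Set[str],
    strategy_of: Dict[str, str],
) -> str:
    """Decide what to do with a grabbing node whose sub-thread hit an abnormal event."""
    if node_url not in seed_urls:
        # The node's URL matches none of the seed URLs: correct it, then grab again.
        return "fix_url_and_regrab"
    if strategy_of[node_url] != node_strategy:
        # The URL is right but the strategy is not the one created for it.
        return "fix_strategy_and_regrab"
    # Both check out: the cause lies elsewhere, so send a suggestion to the client.
    return "send_suggestion_to_client"
```

The URL is checked before the strategy because the strategy lookup is only meaningful for a URL that is actually among the seed URLs.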
In summary, the news information capturing device of this embodiment offers three benefits. First, each sub-thread is controlled to use Puppeteer to open the target seed URL of each target grabbing node read by the main thread and to perform the grabbing processing; compared with a non-headless browser, the headless browser reduces the rendering work of a real browser and speeds up reading, and starting a plurality of sub-threads that each use Puppeteer improves the grabbing efficiency of the target news information. Second, the main thread is started to read, one by one, the target seed URL and the corresponding grabbing strategy of each target grabbing node in the target news information grabbing tree, which avoids omitted or repeated reads and improves the reading accuracy of the target seed URLs. Finally, a grabbing strategy is created for each seed URL, which improves the grabbing accuracy and efficiency of the target news information; a target news information grabbing tree is generated from the plurality of seed URLs and grabbing proceeds from each node of the tree, which avoids repeated or missed grabbing of seed URLs and improves the grabbing accuracy of the target news information.
Example four
Fig. 4 is a block diagram of a news information-capturing device according to a fourth embodiment of the present invention.
In some embodiments, the news information-capturing device 40 may include a plurality of functional modules composed of program code segments. Program code for each of the program segments in the news information-capturing device 40 may be stored in a memory of the electronic device and executed by the at least one processor to perform the news information-capturing function (described in detail with reference to fig. 2).
In this embodiment, the news information-capturing device 40 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a parsing module 401, a creating module 402, a reading module 403, a judging module 404, a query module 405, a distributing module 406, a grabbing module 407, and a statistics module 408. A module in the present invention refers to a series of computer program segments that are stored in a memory, can be executed by at least one processor, and perform a fixed function. The functions of the respective modules are described in detail below.
The parsing module 401 is configured to parse the received crawling request of the target news information to obtain a plurality of seed URLs.
In this embodiment, when target news information needs to be captured, a client initiates a capturing request to a server. The client may be a mobile phone, an iPad, or another existing device with a sending function, and the server may be a capturing subsystem. For example, the client sends the capturing request to the capturing subsystem, and the server parses the capturing request upon receiving it.
In this embodiment, the crawling request includes, for the target news information to be crawled, the crawling requirements, the seed URLs, the page content and page structure of each seed URL, and the like.
A creating module 402, configured to create a crawling policy for each of the seed URLs, and associate each of the seed URLs with a corresponding crawling policy.
In this embodiment, since the page content, the page structure, and the crawling requirements corresponding to each seed URL are different, a different crawling policy is created for each seed URL.
In an alternative embodiment, the creating module 402 creating a crawling policy for each of the seed URLs includes:
analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring grabbing requirements corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
In this embodiment, the page content includes items such as the number of page views, the number of unique visits, and the page dwell time, and the page structure refers to the hierarchical relationship among the pages.
For example, suppose the grabbing requirement corresponding to the first seed URL is text. The text content and text structure in the first seed URL are analyzed, and the corresponding grabbing strategy is set as follows: starting from the start page of the first seed URL, randomly select one URL to enter, and grab the target text content layer by layer until grabbing of the first seed URL is completed. Suppose the grabbing requirement corresponding to the second seed URL is pictures. The picture content and picture structure in the second seed URL are analyzed, and the corresponding grabbing strategy is set as follows: predict the similarity between the second seed URL and each URL within it, and select the URLs with higher similarity for grabbing.
In this embodiment, by creating a capture policy for each seed URL, the capturing accuracy and efficiency of the target news information are improved.
The reading module 403 is configured to start the target main thread to create a task queue, read each associated seed URL, and send each seed URL to the task queue in sequence.
In this embodiment, because each seed URL has an association relationship with a corresponding crawling policy, when a server receives a crawling request, the server starts a target main thread to create a task queue according to the crawling request, and sends each seed URL after association to the task queue in sequence.
In this embodiment, capturing the target news information through a task queue prevents any seed URL from being captured repeatedly or omitted, which improves the accuracy of capturing the target news information.
The judging module 404 is configured to judge whether the same seed URL exists in the task queue.
In this embodiment, judging whether the same seed URL already exists in the task queue avoids repeated grabbing of the same seed URL, thereby improving the grabbing accuracy and efficiency of the target news information.
Further, when the same seed URL exists in the task queue, continuing to read each associated seed URL and sequentially sending the seed URL to the task queue.
The query module 405 is configured to query, at every preset period when the same seed URL does not exist in the task queue, whether an idle sub-thread exists among the plurality of target sub-threads started by the target main thread.
In this embodiment, the period may be set in advance; for example, the preset period may be 1 minute or 30 seconds. An idle sub-thread is a sub-thread that currently has no task.
Further, when no idle sub-thread exists in the target sub-threads started by the target main thread, continuously inquiring whether the idle sub-thread exists in the target sub-threads started by the target main thread or not at preset intervals.
And the distributing module 406 is configured to, when there are idle sub-threads in the multiple target sub-threads started by the target main thread, distribute each seed URL read by the target main thread to the idle sub-threads.
In this embodiment, whether an idle sub-thread exists in a target sub-thread started by the target main thread is queried every preset period, and when the idle sub-thread exists in the target sub-thread, each seed URL read by the target main thread is distributed to the idle sub-thread, so that a phenomenon of uneven tasks in the target sub-thread is avoided, and the reading speed of each seed URL of each sub-thread is improved.
The grabbing module 407 is configured to control the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and to grab target news information for each seed URL according to a corresponding grabbing policy.
In this embodiment, Puppeteer is a Node.js library that provides a high-level API to control Chrome or Chromium. By default Puppeteer runs in headless mode, but it may be configured to run in non-headless mode.
In an alternative embodiment, the grabbing module 407 controlling the idle sub-thread to open each of the seed URLs read by the target main thread by using Puppeteer, and grabbing target news information for each of the seed URLs according to a corresponding grabbing policy, includes:
starting a headless browser by using Puppeteer to open each seed URL read by the target main thread, together with its corresponding grabbing policy;
jumping to a target page corresponding to each seed URL;
and calling Puppeteer to grab the target page according to the grabbing policy corresponding to each seed URL.
In this embodiment, Puppeteer can run on a Linux server and provides a series of APIs, which makes the APIs convenient to call. Using Puppeteer to start a headless browser to open each seed URL reduces the rendering work that a real (non-headless) browser would perform and thus improves reading efficiency, and starting a plurality of target sub-threads that use Puppeteer to grab the target news information improves the grabbing efficiency.
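The three steps above (open the seed URL, reach the target page, grab according to the policy) can be illustrated with a small sketch. Puppeteer itself is a Node.js library and is not called here; the browser step is replaced by an injected callable `open_page` that returns the page HTML. The function `capture`, the helper class, and the representation of a grabbing policy as a mapping from field name to HTML tag are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class _TagTextExtractor(HTMLParser):
    """Collects the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            self.chunks.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1] += data


def capture(seed_url, policy, open_page):
    """Open the seed URL, then extract each field named in the policy.

    policy    -- hypothetical format: {field_name: html_tag}
    open_page -- stand-in for the headless-browser step; returns HTML text
    """
    html = open_page(seed_url)
    result = {"url": seed_url}
    for field, tag in policy.items():
        parser = _TagTextExtractor(tag)
        parser.feed(html)
        result[field] = [c.strip() for c in parser.chunks if c.strip()]
    return result
```

Keeping the policy as data that travels with the seed URL means one generic capture routine can serve pages with different structures, which appears to be the reason the embodiment associates a policy with every seed URL.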
The statistics module 408 is configured to, after detecting that the idle sub-thread has finished grabbing the target news information, count the grabbing results of the idle sub-thread through the target main thread to obtain the grabbing result of the target news information.
In this embodiment, after it is detected that the idle sub-threads have finished grabbing the target news information, the grabbing result of each idle sub-thread is obtained through the target main thread.
In this embodiment, the grabbing result may include, but is not limited to: successfully grabbed data, abnormally grabbed data, grabbed identified data, grabbed identical picture data, and grabbed identical text data.
In this embodiment, the target main thread acquires the grabbing result of each idle sub-thread and stores it in a cache of the server. After all the idle sub-threads have finished grabbing, the target main thread collects the grabbing results of all the idle sub-threads and computes statistics over them to obtain the target grabbing result.
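A minimal sketch of this collect-then-count step is given below. The category keys are English labels chosen here for the result types listed above, and `collect` and `summarize` are hypothetical names; the cache is modeled as a plain in-memory list rather than a server cache.

```python
from collections import Counter

# Assumed labels for the result types named in the description.
CATEGORIES = ("success", "abnormal", "identified", "same_picture", "same_text")


def collect(cache, sub_thread_result):
    """Main thread stores one sub-thread's result in the cache."""
    cache.append(sub_thread_result)


def summarize(cache):
    """After all sub-threads finish, count the items in each category."""
    totals = Counter({category: 0 for category in CATEGORIES})
    for result in cache:
        for category, items in result.items():
            totals[category] += len(items)
    return dict(totals)
```

For example, two sub-thread results with three successes, one abnormal item, and one identical-text item between them yield a summary in which the remaining categories are reported as zero.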
Further, while the idle sub-thread is grabbing the target news information, whether an abnormal event occurs in the idle sub-thread is detected; when an abnormal event is detected in an idle sub-thread, the data already grabbed by that idle sub-thread is deleted.
In this embodiment, deleting the data grabbed by an idle sub-thread in which an abnormal event occurred ensures the accuracy of the grabbed target news information.
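The discard-on-failure behavior can be sketched as follows, assuming that an abnormal event surfaces as a raised exception. The names `run_batch` and `grab` are hypothetical, and a thread pool from the standard library stands in for the target sub-threads.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(seed_urls, grab, max_workers=4):
    """Run grab(url, sink) per seed URL; drop partial data when grab raises.

    grab appends captured items to sink and may raise to signal an
    abnormal event (an assumption made for this sketch).
    """
    def task(url):
        partial = []
        try:
            grab(url, partial)
        except Exception:
            partial.clear()          # abnormal event: delete what was captured
            return url, None
        return url, partial

    kept, failed = {}, []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, data in pool.map(task, seed_urls):
            if data is None:
                failed.append(url)
            else:
                kept[url] = data
    return kept, failed
```

Returning the failed seed URLs separately keeps half-captured pages out of the statistics while still leaving a record of which URLs may need a second attempt.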
In summary, the news information capturing device of this embodiment provides the following. First, the idle sub-thread is controlled to use Puppeteer to open each seed URL read by the target main thread and to capture target news information for each seed URL according to the corresponding capturing policy. Starting a headless browser to open each seed URL reduces the rendering work that a non-headless browser would perform and speeds up reading, and starting a plurality of target sub-threads that use Puppeteer for capturing improves the efficiency of capturing the target news information. Second, the target main thread is started to create a task queue, and each associated seed URL is read and sent to the task queue in sequence; capturing through a task queue prevents a seed URL from being captured repeatedly or missed, which improves the accuracy of capturing the target news information. Finally, whether an idle sub-thread exists among the target sub-threads started by the target main thread is queried at every preset period, and when idle sub-threads exist, each seed URL read by the target main thread is distributed to them, which avoids an uneven distribution of tasks among the target sub-threads and improves the speed at which the sub-threads read the seed URLs.
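The task queue with its duplicate check can be illustrated with the sketch below. The class `TaskQueue` and its methods are hypothetical, and rejecting a duplicate at enqueue time is an assumption: the description only states that the queue is checked for the same seed URL before distribution.

```python
from collections import deque


class TaskQueue:
    """FIFO queue of (seed_url, policy) pairs that refuses repeated seed URLs."""

    def __init__(self):
        self._items = deque()
        self._seen = set()

    def contains(self, seed_url):
        """Judge whether the same seed URL has already been queued."""
        return seed_url in self._seen

    def push(self, seed_url, policy):
        """Queue the pair in arrival order; return False for a duplicate."""
        if seed_url in self._seen:
            return False
        self._seen.add(seed_url)
        self._items.append((seed_url, policy))
        return True

    def pop(self):
        """Return the oldest queued pair."""
        return self._items.popleft()

    def __len__(self):
        return len(self._items)
```

Every accepted seed URL is popped exactly once and in the order it was sent, so under these assumptions no URL is captured twice and none is skipped.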
Example five
Fig. 5 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present invention. In a preferred embodiment of the invention, the electronic device 5 comprises a memory 51, at least one processor 52, at least one communication bus 53 and a transceiver 54.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in fig. 5 is not limiting of the embodiments of the present invention, and that either a bus-type configuration or a star-type configuration may be used, and that the electronic device 5 may include more or less other hardware or software than that shown, or a different arrangement of components.
In some embodiments, the electronic device 5 is an electronic device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 5 may also include a client device, which includes, but is not limited to, any electronic product that can interact with a client by way of a keyboard, mouse, remote control, touch pad, or voice control device, such as a personal computer, tablet, smart phone, digital camera, etc.
It should be noted that the electronic device 5 is only an example; other existing or future electronic products that are adaptable to the present invention are also included in the scope of protection of the present invention and are incorporated herein by reference.
In some embodiments, the memory 51 is used to store program code and various data, such as the news information capturing device 30 or 40 installed in the electronic device 5, and to provide high-speed, automatic access to programs or data during operation of the electronic device 5. The memory 51 includes Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-time Programmable Read-Only Memory (OTPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc memory, magnetic tape memory, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 52 may consist of an integrated circuit, such as a single packaged integrated circuit, or of multiple packaged integrated circuits with the same or different functions, including one or more Central Processing Units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The at least one processor 52 is the control unit of the electronic device 5; it connects the components of the entire electronic device 5 through various interfaces and lines, and executes the functions of the electronic device 5 and processes data by running or executing the programs or modules stored in the memory 51 and calling the data stored in the memory 51.
In some embodiments, the at least one communication bus 53 is arranged to enable connected communication between the memory 51 and the at least one processor 52 or the like.
Although not shown, the electronic device 5 may further include a power source (such as a battery) for powering the various components, and optionally, the power source may be logically connected to the at least one processor 52 via a power management device, thereby performing functions such as managing charging, discharging, and power consumption via the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 5 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
It should be understood that the embodiments described are for illustrative purposes only, and that the scope of the patent application is not limited to this configuration.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor (processor) to perform portions of the methods described in the various embodiments of the invention.
In a further embodiment, in conjunction with fig. 3 or fig. 4, the at least one processor 52 may execute the operating device of the electronic device 5, as well as various installed applications (such as the news information-capturing device 30 or 40), program code, etc., such as the various modules described above.
The memory 51 has stored therein program code, and the at least one processor 52 can invoke the program code stored in the memory 51 to perform related functions. For example, each of the modules described in fig. 3 or fig. 4 is a program code stored in the memory 51 and executed by the at least one processor 52, thereby realizing the functions of each of the modules for the purpose of capturing news information.
In one embodiment of the present invention, the memory 51 stores a plurality of instructions that are executed by the at least one processor 52 to perform the function of news information crawling.
Specifically, the specific implementation method of the above instruction by the at least one processor 52 may refer to the description of the relevant steps in the corresponding embodiment of fig. 1 or fig. 2, which is not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it will be obvious that the term "comprising" does not exclude other elements or that the singular does not exclude a plurality. The units or means stated in the invention may also be implemented by one unit or means, either by software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A news information capturing method, the method comprising:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
starting a main thread to read target seed URLs of each grabbing node in the target news information grabbing tree one by one and corresponding grabbing strategies;
when the main thread is detected to read a preset number of target seed URLs, starting a plurality of sub-threads, and dividing the preset number of target seed URLs read by the main thread into the plurality of sub-threads according to a preset distribution rule;
controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing;
and after detecting that the plurality of sub-threads finish grabbing processing, counting grabbing results of the plurality of sub-threads through the main thread to obtain target grabbing results of the target news information.
2. The news information-crawling method of claim 1, wherein said creating a crawling policy for each of said seed URLs comprises:
Analyzing the page content and the page structure in each seed URL to obtain an analysis result;
acquiring grabbing requirements corresponding to each seed URL;
and creating a grabbing strategy for each seed URL according to the analysis result of each seed URL and the corresponding grabbing requirement.
3. The news information-crawling method of claim 1, wherein the generating a target news information-crawling tree from the plurality of seed URLs comprises:
converting the grabbing node of each seed URL in the plurality of seed URLs into a node of a corresponding target news information grabbing tree;
converting a reference relationship between grabbing nodes of each seed URL in the plurality of seed URLs into edges between nodes in a corresponding target news information grabbing tree, wherein the edges between the nodes in the target news information grabbing tree serve as the reference relationship between the nodes of the target news information grabbing tree;
and generating a target news information capture tree according to edges between the nodes of the target news information capture tree and the nodes in the target news information capture tree.
4. The news information-capturing method of claim 1, wherein controlling each of the sub-threads to open each target seed URL read by the main thread by using Puppeteer, and performing grabbing processing, includes:
starting a headless browser by using Puppeteer to open each target seed URL read by the main thread, together with its corresponding grabbing strategy;
jumping to a target page corresponding to the target seed URL;
and calling Puppeteer to grab the target page according to the grabbing strategy corresponding to the target seed URL.
5. The news information-capturing method of claim 1, wherein the method further comprises:
detecting whether an abnormal event occurs to a child thread;
when detecting that an abnormal event occurs to a child thread, identifying a target grabbing node corresponding to the child thread with the abnormal event;
and verifying the target seed URL and the corresponding target grabbing strategy in the target grabbing node.
6. The news information-capturing method of claim 5, wherein the verifying the target seed URL and the corresponding target capturing policy in the target capturing node comprises:
matching a target seed URL in the target grabbing node with the seed URLs;
when the target seed URL in the target grabbing node is matched with any one seed URL in the plurality of seed URLs, judging whether the target grabbing strategy is the grabbing strategy of the target seed URL;
When the target grabbing strategy is the grabbing strategy of the target seed URL, sending grabbing suggestions to a client; or alternatively
When the target grabbing strategy is not the grabbing strategy of the target seed URL, correcting the grabbing strategy in the target grabbing node, and secondarily grabbing the target seed URL in the target grabbing node according to the corrected grabbing strategy.
7. A news information capturing method, the method comprising:
analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
creating a grabbing strategy for each seed URL, and associating each seed URL with a corresponding grabbing strategy;
starting a target main thread to create a task queue, reading each associated seed URL and sequentially sending the seed URL to the task queue;
judging whether the same seed URL exists in the task queue;
when the same seed URL does not exist in the task queue, inquiring whether idle sub-threads exist in a plurality of target sub-threads started by the target main thread or not at intervals of a preset period;
when idle sub-threads exist in a plurality of target sub-threads started by the target main thread, distributing each seed URL read by the target main thread to the idle sub-threads;
controlling the idle sub-thread to open each seed URL read by the target main thread by using Puppeteer, and capturing target news information for each seed URL according to a corresponding capturing strategy;
and counting the grabbing results of the idle sub-threads through the target main thread after the idle sub-threads are detected to finish grabbing target news information, so as to obtain grabbing results of the target news information.
8. A news information-capturing apparatus, the apparatus comprising:
the analysis module is used for analyzing the received grabbing request of the target news information to obtain a plurality of seed URLs;
the generation module is used for creating a grabbing strategy for each seed URL and generating a target news information grabbing tree according to the seed URLs, wherein each grabbing node of the target news information grabbing tree comprises a corresponding grabbing strategy;
the reading module is used for starting a main thread to read, one by one, the target seed URL and the corresponding grabbing strategy of each grabbing node in the target news information grabbing tree;
the starting module is used for starting a plurality of sub-threads when detecting that the main thread reads a preset number of target seed URLs, and distributing the preset number of target seed URLs read by the main thread to the plurality of sub-threads according to a preset distribution rule;
the grabbing module is used for controlling each sub-thread to open each target seed URL read by the main thread by using Puppeteer and carrying out grabbing processing;
and the statistics module is used for counting the grabbing results of the plurality of sub-threads through the main thread to obtain the target grabbing results of the target news information after the grabbing processing of the plurality of sub-threads is detected.
9. An electronic device comprising a processor and a memory, wherein the processor is configured to implement the news information-capturing method according to any one of claims 1 to 6 or claim 7 when executing a computer program stored in the memory.
10. A computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the news information-capturing method according to any one of claims 1 to 6 or claim 7.
CN202110432611.3A 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium Active CN113065055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110432611.3A CN113065055B (en) 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110432611.3A CN113065055B (en) 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113065055A CN113065055A (en) 2021-07-02
CN113065055B true CN113065055B (en) 2024-04-02

Family

ID=76567315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110432611.3A Active CN113065055B (en) 2021-04-21 2021-04-21 News information capturing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113065055B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116594756B (en) * 2023-07-17 2023-11-03 深圳市豪斯莱科技有限公司 Task processing method, device, terminal equipment and storage medium

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6351755B1 (en) * 1999-11-02 2002-02-26 Alta Vista Company System and method for associating an extensible set of data with documents downloaded by a web crawler
KR100875636B1 (en) * 2007-09-19 2008-12-26 한국과학기술정보연구원 Web crawler system based on grid computing, and method thereof
CN101689176A (en) * 2007-05-29 2010-03-31 怡斯福乐株式会社 Method for grasping information of web site through analyzing structure of web page
CN101715004A (en) * 2009-11-12 2010-05-26 中国科学院计算技术研究所 Internet video-oriented distributed acquisition method and system
CN102184227A (en) * 2011-05-10 2011-09-14 北京邮电大学 General crawler engine system used for WEB service and working method thereof
CN103902732A (en) * 2014-04-18 2014-07-02 北京大学 Construction and network resource collection method of self-adaption network resource collection system
CN104050037A (en) * 2014-06-13 2014-09-17 淮阴工学院 Implementation method for directional crawler based on assigned e-commerce website
CN104731971A (en) * 2015-04-11 2015-06-24 淮阴工学院 Campus personalized palm service and user behavior habit analysis achieving method
CN105893583A (en) * 2016-04-01 2016-08-24 北京鼎泰智源科技有限公司 Data acquisition method and system based on artificial intelligence
CN109254908A (en) * 2018-08-03 2019-01-22 北京达佳互联信息技术有限公司 Visualize regression testing method, device, terminal device and readable storage medium storing program for executing
CN110569414A (en) * 2019-08-21 2019-12-13 时趣互动(北京)科技有限公司 puppeteeer-based website data collection method
CN110851682A (en) * 2019-10-17 2020-02-28 上海易点时空网络有限公司 Text anti-crawler method, server and display terminal
CN110851681A (en) * 2019-10-12 2020-02-28 平安科技(深圳)有限公司 Crawler processing method and device, server and computer readable storage medium
CN111552854A (en) * 2020-04-24 2020-08-18 北京明略软件***有限公司 Webpage data capturing method and device, storage medium and equipment
CN112068824A (en) * 2020-09-16 2020-12-11 杭州海康威视数字技术股份有限公司 Webpage development preview method and device and electronic equipment
CN112256984A (en) * 2020-10-22 2021-01-22 上海悦易网络信息技术有限公司 Method and device for acquiring interface background screenshot corresponding to webpage
CN112637361A (en) * 2020-12-29 2021-04-09 北京天融信网络安全技术有限公司 Page proxy method, device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494446B2 (en) * 2019-09-23 2022-11-08 Arizona Board Of Regents On Behalf Of Arizona State University Method and apparatus for collecting, detecting and visualizing fake news

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
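Design and Implementation of a Multi-threaded Concurrent Web Crawler; 邵晓文 (Shao Xiaowen); Modern Computer (Professional Edition) (No. 1); 97-100 *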

Also Published As

Publication number Publication date
CN113065055A (en) 2021-07-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211018

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Shenzhen saiante Technology Service Co.,Ltd.

Address before: 1-34 / F, Qianhai free trade building, 3048 Xinghai Avenue, Mawan, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong 518000

Applicant before: Ping An International Smart City Technology Co.,Ltd.

GR01 Patent grant