CN116955869A - News link extraction method and device, storage medium and electronic equipment - Google Patents

Info

Publication number
CN116955869A
Authority
CN
China
Prior art keywords
news
homepage
link
extraction
links
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310931750.XA
Other languages
Chinese (zh)
Inventor
巩朋贤
李顺
吴若冰
裘骐
李茁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Relations, University of
Beijing Huaying Ansheng Technology Development Co ltd
Original Assignee
International Relations, University of
Beijing Huaying Ansheng Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Relations, University of, Beijing Huaying Ansheng Technology Development Co ltd filed Critical International Relations, University of
Priority to CN202310931750.XA priority Critical patent/CN116955869A/en
Publication of CN116955869A publication Critical patent/CN116955869A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9035Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/958Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The disclosure provides a news link extraction method, a news link extraction device, electronic equipment and a storage medium, and relates to the technical field of computers. The method comprises the following steps: acquiring a homepage link of a news website homepage; extracting link data according to the homepage links to obtain homepage news link extraction results, homepage non-news link extraction results and homepage news link update speed of the homepage of the news website; judging whether a news plate page exists in the homepage of the news website according to the homepage non-news link extraction result and the homepage news link updating speed; extracting a plate page news link extraction result in a news plate page under the condition that the news plate page exists; and determining the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result. The method can continuously, efficiently and comprehensively output all website news link extraction results in the news website homepage.

Description

News link extraction method and device, storage medium and electronic equipment
Technical Field
The disclosure relates to the field of computer technology, and in particular, to a news link extraction method and device, a storage medium and electronic equipment.
Background
With the development of computer technology and internet technology, browsing information through news websites has become an important way to acquire news.
In the related art, news links in a news website can be extracted by analyzing the web page structure in combination with manual labeling. Because different websites have completely different organization structures, the position information manually labeled on one news website cannot be migrated to other news websites, so this approach has poor reusability; moreover, the manual labeling process is time-consuming and labor-intensive, and its accuracy is low.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a news link extraction method, apparatus, electronic device, and storage medium, so as to continuously, efficiently and comprehensively output all website news link extraction results in a news website homepage.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a news link extraction method, including: acquiring a homepage link of a news website homepage; extracting link data according to the homepage links to obtain homepage news link extraction results, homepage non-news link extraction results and homepage news link update speed of the homepage of the news website; judging whether a news plate page exists in the homepage of the news website according to the homepage non-news link extraction result and the homepage news link updating speed; extracting a plate page news link extraction result in a news plate page under the condition that the news plate page exists; and determining the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
In one embodiment of the present disclosure, determining whether a news plate page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed includes: extracting link data according to each homepage non-news link in the homepage non-news link extraction result to obtain the sub-page news update speed of the sub-page pointed to by each homepage non-news link; and, if the sub-pages include a target sub-page for which the ratio of the sub-page news update speed to the homepage news link update speed is greater than an update threshold, determining that a news plate page exists in the news website homepage and determining the target sub-page as the news plate page.
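The threshold test in this embodiment can be sketched as follows. This is only a minimal illustration: the function name `find_plate_pages`, the example URLs, the threshold value 0.5 and the handling of a zero homepage update speed are all assumptions that do not come from the disclosure.

```python
def find_plate_pages(sub_page_speeds, homepage_speed, update_threshold):
    """Return the sub-pages whose update speed, relative to the homepage, exceeds the threshold.

    sub_page_speeds maps a sub-page URL to its news update speed.
    """
    if homepage_speed <= 0:
        # Assumption: with no homepage updates the ratio is undefined, so report nothing.
        return []
    return [
        url
        for url, speed in sub_page_speeds.items()
        if speed / homepage_speed > update_threshold
    ]


# Hypothetical speeds, in newly added links per extraction.
speeds = {
    "https://news.example.com/world": 4.0,
    "https://news.example.com/about": 0.0,
}
plate_pages = find_plate_pages(speeds, homepage_speed=5.0, update_threshold=0.5)
```

Here only the first sub-page has a ratio (0.8) above the assumed threshold, so it alone would be treated as a news plate page.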
In one embodiment of the present disclosure, extracting link data according to the homepage link to obtain the homepage news link extraction result of the news website homepage includes: determining a plurality of extraction times based on a preset extraction frequency, and obtaining a same domain name link set of the news website homepage at each extraction time according to the homepage link; at the current extraction time, obtaining a first preset number of same domain name link sets by using a preset news extraction sliding window, where the first preset number of same domain name link sets correspond to the current extraction time and the extraction times before the current extraction time; obtaining the newly added homepage news links at the current extraction time according to the first preset number of same domain name link sets; and outputting the newly added homepage news links at the current extraction time as the homepage news link extraction result of the news website homepage.
In one embodiment of the present disclosure, the first preset number is 2; the first preset number of same domain name link sets include a current same domain name link set at the current extraction time and a last same domain name link set at the extraction time immediately preceding the current extraction time; and obtaining the newly added homepage news links at the current extraction time according to the first preset number of same domain name link sets includes: determining a difference set between the current same domain name link set and the last same domain name link set; and determining the newly added homepage news links at the current extraction time according to the links in the difference set.
In one embodiment of the present disclosure, extracting link data according to the homepage link to obtain the homepage non-news link extraction result of the news website homepage includes: at the current extraction time, obtaining a second preset number of same domain name link sets by using a preset non-news extraction sliding window, where the second preset number of same domain name link sets correspond to the current extraction time and the extraction times before the current extraction time; obtaining the homepage non-news links at the current extraction time according to the second preset number of same domain name link sets; and outputting the homepage non-news links at the current extraction time as the homepage non-news link extraction result of the news website homepage.
In one embodiment of the present disclosure, the second preset number is greater than the first preset number; and obtaining the homepage non-news links at the current extraction time according to the second preset number of same domain name link sets includes: determining the intersection of the second preset number of same domain name link sets; and determining the homepage non-news links at the current extraction time according to the links in the intersection.
In one embodiment of the present disclosure, extracting link data according to the homepage link to obtain the homepage news link update speed of the news website homepage includes: acquiring a preset period containing the current extraction time, and determining the total number of extractions, that is, the number of extraction times within the preset period; determining the total quantity of newly added homepage news links according to the newly added homepage news links at all extraction times within the preset period; and determining the homepage news link update speed of the news website homepage within the preset period according to the total quantity of newly added homepage news links and the total number of extractions.
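The disclosure states only that the update speed is determined from the total quantity of newly added links and the total number of extractions. One natural reading, assumed here, is the average number of newly added links per extraction. The function name `homepage_update_speed` and the sample data are hypothetical.

```python
def homepage_update_speed(new_links_per_extraction):
    """Average number of newly added homepage news links per extraction.

    new_links_per_extraction holds one collection of newly added links
    for each extraction time in the preset period.
    """
    total_extractions = len(new_links_per_extraction)
    if total_extractions == 0:
        # Assumption: an empty period has an update speed of zero.
        return 0.0
    total_new_links = sum(len(links) for links in new_links_per_extraction)
    return total_new_links / total_extractions


# Four extraction times yielding 0, 2, 1 and 3 newly added links.
period = [set(), {"a", "b"}, {"c"}, {"d", "e", "f"}]
speed = homepage_update_speed(period)
```

With six newly added links over four extractions, this sketch yields an update speed of 1.5 links per extraction.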
According to another aspect of the present disclosure, there is provided a news link extraction apparatus including: the acquisition module is used for acquiring homepage links of the homepage of the news website; the homepage link data extraction module is used for extracting link data according to homepage links to obtain homepage news link extraction results, homepage non-news link extraction results and homepage news link update speed of the homepage of the news website; the judging module is used for judging whether a news plate page exists in the homepage of the news website according to the homepage non-news link extraction result and the homepage news link updating speed; the plate link extraction module is used for extracting plate page news link extraction results in the news plate pages under the condition that the news plate pages exist; and the determining module is used for determining the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
According to yet another aspect of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the news link extraction method described above.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the news link extraction method described above via execution of the executable instructions.
According to the news link extraction method provided by the embodiments of the present disclosure, link data extraction is first performed on the news website homepage to obtain the homepage news link extraction result, the homepage non-news link extraction result and the homepage news link update speed. Whether a news plate page exists in the news website homepage is then judged according to the homepage non-news link extraction result and the homepage news link update speed, and, where a news plate page exists, the plate page news link extraction result in the news plate page is extracted. Finally, the homepage news link extraction result and the plate page news link extraction result are output together as the website news link extraction result of the news website homepage. The method can automatically and efficiently obtain the newly added homepage news links at each extraction time, quickly judge whether news plate pages exist in the news website homepage, and then automatically and efficiently obtain the plate page news link extraction results in the news plate pages. The method therefore not only obtains the news links in the news website homepage quickly, but also mines the news plate pages of the news website homepage and obtains the news links in those pages, so that all website news link extraction results of the news website homepage can be output continuously, efficiently and comprehensively.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which a news link extraction method of embodiments of the present disclosure may be applied;
FIG. 2 illustrates a flow diagram of a method of news link extraction in accordance with one embodiment of the present disclosure;
FIG. 3 illustrates a flowchart of obtaining homepage news link extraction results in a news link extraction method according to one embodiment of the present disclosure;
FIG. 4 illustrates a flowchart of obtaining homepage non-news link extraction results in a news link extraction method according to one embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of obtaining a homepage news link update speed in a news link extraction method according to one embodiment of the present disclosure;
FIG. 6 illustrates a flowchart of determining whether a news plate page exists in a news website homepage in a news link extraction method according to one embodiment of the present disclosure;
FIG. 7 illustrates a schematic diagram of a news link extraction method of one embodiment of the present disclosure;
FIG. 8 illustrates a schematic diagram of a news link extractor in a news link extraction method according to one embodiment of the present disclosure;
FIG. 9 illustrates a schematic diagram of a plate link extractor in a news link extraction method according to one embodiment of the present disclosure;
FIG. 10 illustrates a block diagram of a news link extraction device of one embodiment of the present disclosure; and
FIG. 11 illustrates a block diagram of a news link extraction computer device of one embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present disclosure, the meaning of "a plurality" is at least two, such as two, three, etc., unless explicitly specified otherwise.
FIG. 1 illustrates a schematic diagram of an exemplary system architecture to which the news link extraction method of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture may include a server 101, a network 102, and a client 103. Network 102 is the medium used to provide communication links between clients 103 and server 101. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
In an exemplary embodiment, the client 103 that exchanges data with the server 101 may include, but is not limited to, a smart phone, a desktop computer, a tablet computer, a notebook computer, a smart speaker, a digital assistant, an AR (Augmented Reality) device, a VR (Virtual Reality) device, a smart wearable device, and the like. Optionally, the operating system running on the electronic device may include, but is not limited to, Android, iOS, Linux, Windows, and the like.
The server 101 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. In some practical applications, the server 101 may also be a server of a network platform, and the network platform may be, for example, a transaction platform, a live broadcast platform, a social platform, or a music platform, which is not limited in the embodiments of the present disclosure. The server may be one server or may be a cluster formed by a plurality of servers, and the specific architecture of the server is not limited in this disclosure.
In an exemplary embodiment, an input interface for receiving a homepage link of a news website homepage may be installed in the client 103, and a system capable of implementing the news link extraction method of the embodiments of the present disclosure may be deployed in the server 101; after the client 103 transmits the received homepage link input by the user to the server 101, the server 101 may start processing the homepage link, implementing the news link extraction method of the embodiment of the present disclosure.
In an exemplary embodiment, the procedure by which the server 101 implements the news link extraction method may be as follows: the server 101 acquires a homepage link of a news website homepage; the server 101 extracts link data according to the homepage link to obtain the homepage news link extraction result, the homepage non-news link extraction result and the homepage news link update speed of the news website homepage; the server 101 judges whether a news plate page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed; the server 101 extracts the plate page news link extraction result in the news plate page in the case that a news plate page exists; and the server 101 determines the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
In addition, it should be noted that, fig. 1 is only one application environment of the news link extraction method provided by the present disclosure. The number of servers 101, networks 102, and clients 103 in FIG. 1 is merely illustrative, and any number of clients, networks, and servers may be provided as desired.
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the following describes in more detail each step of the news link extraction method in the exemplary embodiment of the present disclosure with reference to the accompanying drawings and embodiments.
Fig. 2 shows a flowchart of a news link extraction method according to an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be performed by the server 101 or the client 103 as shown in fig. 1, but the present disclosure is not limited thereto.
In the following explanation, the server 101 is exemplified as an execution subject.
As shown in fig. 2, the news link extraction method provided by the embodiment of the present disclosure may include the following steps.
Step S201, acquiring a homepage link of a news website homepage.
The news website homepage may be the front page of the news website, and the homepage link may be a hyperlink that jumps to the news website homepage when clicked; that is, the page pointed to by the homepage link is the news website homepage.
Step S203, link data extraction is performed according to the homepage links, and the homepage news link extraction result, the homepage non-news link extraction result and the homepage news link update speed of the homepage of the news website are obtained.
The homepage link can be used for accessing the homepage of the news website to obtain homepage data, and then related link data is extracted from the homepage data, so that homepage news link extraction results, homepage non-news link extraction results and homepage news link update speed are obtained.
Fig. 3 is a flowchart showing a method for obtaining a homepage news link extraction result in the news link extraction method according to an embodiment of the present disclosure. As shown in fig. 3, in some embodiments, "link data extraction according to homepage links, obtaining homepage news link extraction results of a news website homepage" in step S203 may include the following steps.
Step S301, a plurality of extraction moments are determined based on a preset extraction frequency, and a same domain name link set of a news website homepage at each extraction moment is obtained according to homepage links.
In this step, the preset extraction frequency may be set according to actual conditions, for example once every 1 hour or once every 6 hours. If extraction is performed once every 1 hour, the extraction times within a day may be 0:00, 1:00, 2:00, ..., 22:00 and 23:00; if extraction is performed once every 6 hours, the extraction times within a day may be 0:00, 6:00, 12:00 and 18:00.
The homepage link can be processed with crawler technology to crawl the homepage data of the news website homepage; the link data in the homepage data can then be determined, and same domain name links can be screened from the link data based on the domain name of the homepage link. That is, the links in the link data whose domain name is the same as that of the homepage link may be determined as same domain name links, thereby obtaining the same domain name link set.
In some practical applications, a website contains "friendly links" whose domain names differ from the website's own domain name, and these "friendly links" are not same domain name links. For example, at the bottom of the homepage of China News Network (chinanews.com) there is a series of links to other domain names different from "chinanews.com"; after the links to these other domain names are excluded, the same domain name link set of the news website homepage can be obtained.
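The relationship between the extraction frequency and the extraction times within a day can be illustrated with a short sketch. The function name `extraction_times` and the year 2023 in the example date are assumptions made only for illustration.

```python
from datetime import datetime, timedelta


def extraction_times(day_start, interval_hours):
    """List the extraction times within one day for a fixed extraction interval."""
    step = timedelta(hours=interval_hours)
    day_end = day_start + timedelta(days=1)
    times = []
    current = day_start
    while current < day_end:
        times.append(current)
        current += step
    return times


# Hypothetical day; only the hour of each extraction time matters here.
hourly = extraction_times(datetime(2023, 6, 25), interval_hours=1)
six_hourly = extraction_times(datetime(2023, 6, 25), interval_hours=6)
```

An interval of 1 hour gives 24 extraction times per day, and an interval of 6 hours gives the four times 0:00, 6:00, 12:00 and 18:00.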
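The same domain name screening described above could be sketched as follows using only the Python standard library. The disclosure does not specify how domain names are compared, so this sketch assumes that a leading "www." is ignored and that subdomains of the homepage domain count as the same domain name. The names `same_domain_links` and `_base_host` and the example URLs are hypothetical, and the HTML is supplied directly rather than crawled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def _base_host(url):
    """Host name of a URL with a leading 'www.' removed (an assumed normalization)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def same_domain_links(homepage_url, html):
    """Return the absolute links in html that share the homepage's domain name."""
    home = _base_host(homepage_url)
    if not home:
        return set()
    collector = _AnchorCollector()
    collector.feed(html)
    links = set()
    for href in collector.hrefs:
        absolute = urljoin(homepage_url, href)  # resolve relative links
        host = _base_host(absolute)
        if host == home or host.endswith("." + home):
            links.add(absolute)
    return links


sample_html = """
<a href="/a/1.html">Story</a>
<a href="https://finance.example.com/x">Finance</a>
<a href="https://partner.example.org/">Friendly link</a>
"""
links = same_domain_links("https://www.example.com/", sample_html)
```

In this example the link to `partner.example.org` plays the role of a "friendly link" and is excluded, while the relative link and the subdomain link are kept.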
Step S303, at the current extraction time, obtaining a first preset number of same domain name link sets by using a preset news extraction sliding window; the first preset number of same domain name link sets correspond to the current extraction time and the extraction times before the current extraction time.
The first preset number may be the number of elements that can be processed by the news extraction sliding window at a time, that is, the first preset number may be the window size (window size) of the news extraction sliding window. Further, assuming that the current point in time is exactly one extraction time, the current extraction time may be understood as the current time.
For example, if the first preset number is 2, a total of two same domain name link sets, one for the current extraction time and one for the extraction time before it, can be obtained by using the news extraction sliding window at the current extraction time.
When a plurality of consecutive extraction times are processed, the news extraction sliding window can slide over the consecutive same domain name link sets with a sliding step of 1. For example, assuming that link data extraction is performed once every 1 hour, at 15:00 on June 25 (i.e., the current extraction time is 15:00 on June 25), the same domain name link set extracted at 14:00 on June 25 and the same domain name link set extracted at 15:00 may be put into the news extraction sliding window; at the next extraction time, 16:00 on June 25 (i.e., 16:00 on June 25 is the new current extraction time), the same domain name link set extracted at 15:00 on June 25 and the same domain name link set extracted at 16:00 may be put into the news extraction sliding window; and so on. At each current extraction time, a total of two same domain name link sets, one for the current extraction time and one for the last extraction time, are thus obtained for subsequent processing.
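A sliding window with a step of 1 can be sketched with a bounded double-ended queue, which discards the oldest link set whenever a new one is pushed. The class name `LinkSetWindow` and the sample link sets are hypothetical; this is one possible realization rather than the implementation of the disclosure.

```python
from collections import deque


class LinkSetWindow:
    """Keep the most recent `size` same domain name link sets."""

    def __init__(self, size):
        self._sets = deque(maxlen=size)  # the oldest set is dropped automatically

    def push(self, link_set):
        """Add the link set of the current extraction time and return the window contents."""
        self._sets.append(frozenset(link_set))
        return list(self._sets)

    def is_full(self):
        return len(self._sets) == self._sets.maxlen


window = LinkSetWindow(size=2)
window.push({"a", "b", "c"})            # extracted at 14:00
window.push({"a", "c", "d", "e"})       # extracted at 15:00
current = window.push({"c", "d", "f"})  # extracted at 16:00
```

After the third push the window holds only the 15:00 and 16:00 link sets, matching the example above.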
Step S305, obtaining the newly added homepage news links at the current extraction time according to the first preset number of same domain name link sets.
In some embodiments, the first preset number may be 2; the first preset number of same domain name link sets may include a current same domain name link set at the current extraction time and a last same domain name link set at the extraction time immediately preceding the current extraction time. Based on this, step S305 may further include: determining a difference set between the current same domain name link set and the last same domain name link set; and determining the newly added homepage news links at the current extraction time according to the links in the difference set.
In this step, the difference set between two adjacent same domain name link sets can be calculated, and the newly added homepage news links are determined for output according to the difference set. The difference set can be understood as follows: the difference set between a set A and a set B is the set of elements that are in set A but not in set B. In this embodiment, the difference set between the current same domain name link set and the last same domain name link set consists of the links that are in the current same domain name link set but not in the last same domain name link set. For example, if the links in the last same domain name link set are (a, b, c) and the links in the current same domain name link set are (a, c, d, e), then links d and e exist in the current same domain name link set but not in the last one, so links d and e are obtained as the newly added homepage news links at the current extraction time.
The difference set between the current same domain name link set and the last same domain name link set can be regarded as the links newly added to the news website homepage in the period from the last extraction time to the current extraction time. In a news website, a newly added same domain name link can be regarded as a newly added news link that jumps to a news page, so the newly added homepage news links at the current extraction time can be determined according to the difference set.
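The difference set calculation maps directly onto the set subtraction operator in Python. The sketch below reuses the example link sets (a, b, c) and (a, c, d, e) from the description; the function name `newly_added_links` is hypothetical.

```python
def newly_added_links(current_links, previous_links):
    """Links present in the current link set but absent from the previous one."""
    return set(current_links) - set(previous_links)


previous = {"a", "b", "c"}     # last same domain name link set
current = {"a", "c", "d", "e"}  # current same domain name link set
added = newly_added_links(current, previous)
```

Link b, which disappeared from the homepage, plays no part in the result; only d and e are reported as newly added.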
Step S307, the newly added homepage news links at the current extraction time are output as homepage news link extraction results of the homepage of the news website.
The newly added homepage news links can be calculated and output at each extraction time; that is, as link data extraction is repeatedly performed according to the homepage link, the homepage news link extraction result of the news website homepage can be output continuously, with each result corresponding to one link data extraction and consisting of the newly added homepage news links obtained from that extraction.
In some practical applications, a time period for statistics may be preset, and then the newly added homepage news links at all the extraction moments in the time period are used as the homepage news link extraction results of the homepage of the news website together for output.
For example, assuming that extraction is performed once every 6 hours and the time period for statistics is 7 days, there are 28 extraction times in the 7 days, and a same domain name link set is extracted at each of the 28 extraction times. Assuming that the first preset number is 2, 27 difference sets can be calculated for the 28 extraction times; that is, except for the first extraction time in the 7 days, which has no newly added homepage news links, each of the remaining 27 extraction times corresponds to one calculated result of newly added homepage news links. The 27 results are merged to obtain and output the homepage news link extraction result of the news website homepage for the 7 days.
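The counting in this example (28 extraction times, 27 difference sets) and the merging of the results can be checked with a small sketch. The function name `period_news_links` and the synthetic link names are assumptions; each synthetic snapshot contains one fixed link and one link unique to that extraction time.

```python
def period_news_links(link_sets):
    """Merge the difference sets of adjacent link sets over a statistics period.

    Returns the merged newly added links and the number of difference sets computed.
    """
    merged = set()
    diff_count = 0
    for previous, current in zip(link_sets, link_sets[1:]):
        merged |= current - previous
        diff_count += 1
    return merged, diff_count


extractions_per_day = 24 // 6  # one extraction every 6 hours
snapshots = [{"home", f"news-{i}"} for i in range(7 * extractions_per_day)]
links, diffs = period_news_links(snapshots)
```

The first snapshot has no predecessor, so 28 snapshots produce 27 difference sets, and the fixed link `home` never appears in the merged result.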
Therefore, by the method in the embodiment of the disclosure, newly added homepage news links at each extraction time can be automatically and efficiently obtained, so that homepage news link extraction results of the homepage of the news website can be continuously output.
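The difference-set computation of steps S303 to S307 and the combination over a statistics period can be sketched as follows. This is a minimal illustration under stated assumptions: the function names `new_links` and `new_links_over_period` and the example links are hypothetical, and the first preset number is taken to be 2.

```python
from typing import List, Set


def new_links(current: Set[str], previous: Set[str]) -> Set[str]:
    """Links present at the current extraction time but not at the last one."""
    return current - previous


def new_links_over_period(snapshots: List[Set[str]]) -> Set[str]:
    """Combine the difference sets of all consecutive snapshot pairs.

    For n snapshots this evaluates n - 1 difference sets, e.g. 27 for the
    28 extraction times of a 7-day period with one extraction every 6 hours.
    """
    combined: Set[str] = set()
    for previous, current in zip(snapshots, snapshots[1:]):
        combined |= new_links(current, previous)
    return combined


# Hypothetical same-domain-name link sets at three extraction times.
snapshots = [
    {"/about", "/a1"},
    {"/about", "/a1", "/a2"},
    {"/about", "/a2", "/a3"},
]
print(sorted(new_links(snapshots[1], snapshots[0])))  # ['/a2']
print(sorted(new_links_over_period(snapshots)))       # ['/a2', '/a3']
```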
In some embodiments, after the same-domain-name link set at each extraction time is obtained, the homepage non-news link extraction result of the news website homepage may further be obtained according to these sets.
Fig. 4 is a flowchart showing a method for obtaining a homepage non-news link extraction result in the news link extraction method according to an embodiment of the present disclosure. As shown in fig. 4, in some embodiments, "link extraction according to homepage links, obtaining homepage non-news link extraction results of a news website homepage" in step S203 may include the following steps.
Step S401, determining a plurality of extraction moments based on a preset extraction frequency, and obtaining a same domain name link set of a news website homepage at each extraction moment according to homepage links.
This step is executed in a manner similar to step S301; alternatively, it may be the same step as step S301, that is, it may reuse the results obtained in step S301 (i.e., the same-domain-name link set of the news website homepage at each extraction time) for the subsequent steps.
Step S403, at the current extraction time, obtaining a second preset number of same-domain-name link sets by using a preset non-news extraction sliding window; the second preset number of same-domain-name link sets correspond to the current extraction time and the extraction times before it.
The second preset number may be the window size of the non-news extraction sliding window. In some embodiments, the second preset number may be greater than the first preset number; e.g., if the first preset number is 2, the second preset number may be 32, 36, 40, etc. Taking extraction of the same-domain-name link set every 6 hours as an example, a second preset number of 36 means that the same-domain-name link sets at all extraction times within the past 9 days (36 extraction times) are processed at once.
Step S405, obtaining the homepage non-news links at the current extraction time according to the second preset number of same-domain-name link sets.
In some embodiments, step S405 may further include: determining the intersection of the second preset number of same-domain-name link sets; and determining the homepage non-news links at the current extraction time according to the links in the intersection.
The links obtained by intersecting the second preset number (e.g., 36) of same-domain-name link sets can be considered as same-domain-name links that were present on the news website homepage at every extraction time in the period covered by the second preset number of extraction times. Links that were never removed during this period are updated slowly and lack the timeliness of news, so they can be regarded as the homepage non-news links of the news website homepage.
Step S407, outputting the homepage non-news links at the current extraction time as the homepage non-news link extraction result of the news website homepage.
The homepage non-news links at each extraction time may be calculated and output at that extraction time. That is, as link data extraction proceeds according to the homepage link, the homepage non-news link extraction result of the news website homepage can be output continuously; each result corresponds to one link data extraction and consists of the homepage non-news links obtained by processing after that extraction.
Therefore, by the method in the embodiment of the disclosure, the homepage non-news links determined at each extraction moment can be automatically and efficiently obtained, so that the homepage non-news link extraction result of the homepage of the news website can be continuously output.
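The intersection in steps S403 to S407 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `non_news_links` and the example links are hypothetical, and returning an empty set for an empty window is an assumption made here.

```python
from typing import List, Set


def non_news_links(window: List[Set[str]]) -> Set[str]:
    """Intersect all same-domain-name link sets inside the sliding window.

    `window` holds the link sets of the most recent extraction times,
    including the current one (at most the second preset number of sets).
    """
    if not window:
        # Assumption: with no snapshots there is nothing to report.
        return set()
    return set.intersection(*window)


# Hypothetical window of three extraction times.
window = [
    {"/about", "/contact", "/a1"},
    {"/about", "/contact", "/a2"},
    {"/about", "/a3"},
]
print(non_news_links(window))  # {'/about'}
```

Only `/about` appears in every snapshot, so it is the only link treated as a non-news link in this example.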
Fig. 5 shows a flowchart for obtaining a homepage news link update speed in a news link extraction method according to an embodiment of the present disclosure. As shown in fig. 5, in some embodiments, "link data extraction according to homepage links, obtaining the homepage news link update speed of the homepage of the news website" in step S203 may include the following steps.
Step S501, acquiring a preset period, and determining the total number of extractions, i.e. the number of extraction times within the preset period.
The preset period may be configured based on actual requirements, or may be a fixed configuration value, which is not limited in this disclosure. For example, the preset period may be the total duration of operation of the system configured with the present method, or may be a fixed 180 days, 60 days, or the like.
For example, with one extraction every 6 hours and assuming that the system has been operating for 180 days, the total number of extractions is 720. As the system keeps operating, a new link data extraction takes place each time the next extraction time is reached, so the total number of extractions may keep increasing over time.
Step S503, determining the total number of newly added homepage news links according to the newly added homepage news links at all the extraction times in the preset period.
Step S505, determining the homepage news link update speed of the news website homepage in the preset period according to the total number of newly added homepage news links and the total number of extractions.
After the preset period is determined, the average number of newly added homepage news links per extraction in the period can be calculated and output as the homepage news link update speed. That is, as link data extraction proceeds according to the homepage link, the homepage news link update speed of the news website homepage can be output continuously.
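As a worked example of steps S501 to S505, the update speed is the total number of newly added homepage news links divided by the total number of extractions. The sketch below is illustrative only; the function name `update_speed` and the sample counts are hypothetical, and rejecting a non-positive extraction count is an assumption made here.

```python
from typing import List


def update_speed(new_link_counts: List[int], total_extractions: int) -> float:
    """Average number of newly added news links per extraction."""
    if total_extractions <= 0:
        # Assumption: the speed is undefined before the first extraction.
        raise ValueError("total_extractions must be positive")
    return sum(new_link_counts) / total_extractions


# Hypothetical counts of newly added links at four extraction times;
# the first extraction time has no predecessor and contributes 0.
counts = [0, 12, 8, 20]
print(update_speed(counts, total_extractions=len(counts)))  # 10.0
```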
Step S205, judging whether a news plate page exists in the homepage of the news website according to the homepage non-news link extraction result and the homepage news link update speed.
The links in the homepage non-news link extraction result may be considered as links that themselves seldom change on the homepage. In practical applications, such a link may be a news plate link, whose target page is updated frequently, or another link whose target page is updated infrequently (such as a bulletin page link). The homepage non-news link extraction result may therefore be examined to determine whether a news plate page exists in the news website homepage.
In some practical applications, a news link extractor may be configured to extract link data according to the homepage links, and obtain homepage news link extraction results, homepage non-news link extraction results, and homepage news link update speed of the homepage of the news website.
Fig. 6 is a flowchart of determining whether a news plate page exists in a news website homepage in the news link extraction method according to an embodiment of the present disclosure. As shown in fig. 6, in some embodiments, step S205 may include the following steps.
Step S601, extracting link data according to each homepage non-news link in the homepage non-news link extraction result, and obtaining the sub-page news update speed of the sub-page pointed by each homepage non-news link.
The sub-page news update speed of the sub-page pointed to by each homepage non-news link may be obtained in a manner similar to that used for the homepage news link update speed; the news link extractor may also be used to output the news update speed of each sub-page, which is not described in detail here.
Step S603, if the sub-pages include a target sub-page for which the ratio of the sub-page news update speed to the homepage news link update speed is greater than the update threshold, determining that a news plate page exists in the news website homepage, and determining the target sub-page as a news plate page.
The update threshold may be set based on actual situations or actual demands, may be obtained from labels by machine learning, or may be obtained by statistical calculation from known homepage news link update speeds and news link update speeds of news plate pages, which is not limited in the present disclosure. For example, for a plurality of news websites, the ratio of the news link update speed of each sub-page to that of the homepage can be calculated; the sub-pages are then labelled as news plate pages or not, and a boundary value is determined from these data as the update threshold.
The meaning of the update threshold may be: if the ratio is above this value, the update speed of the news links in the corresponding sub-page is considered to have reached a level comparable to the homepage news link update speed, and the sub-page should also be considered a news plate page.
In some practical applications, the update threshold may be 10%, which may be the result of balancing recall and precision in practice. Recall reflects whether the plate pages are found completely, i.e. whether any are missed; precision reflects the proportion of the links determined to be news plate pages that actually are news plate pages.
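The threshold test of step S603 can be sketched as follows. This is a minimal illustration, not the implementation of the disclosure: the function name `find_plate_pages`, the example links and speeds are hypothetical, and returning no plate pages when the homepage update speed is not positive is an assumption made here to avoid division by zero.

```python
from typing import Dict, List


def find_plate_pages(
    sub_page_speeds: Dict[str, float],
    homepage_speed: float,
    threshold: float = 0.10,
) -> List[str]:
    """Return sub-page links whose speed ratio to the homepage exceeds the threshold."""
    if homepage_speed <= 0:
        # Assumption: without a positive homepage speed the ratio is undefined.
        return []
    return [
        link
        for link, speed in sub_page_speeds.items()
        if speed / homepage_speed > threshold
    ]


# Hypothetical update speeds (new news links per extraction).
speeds = {"/sports": 12.0, "/about": 0.0, "/notice": 2.0}
print(find_plate_pages(speeds, homepage_speed=40.0))  # ['/sports']
```

With a homepage speed of 40, the ratios are 0.30, 0.00 and 0.05, so only `/sports` exceeds the 10% threshold.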
Step S207, in the case where a news plate page exists, extracting the plate page news link extraction result from the news plate page.
The plate page news link extraction result may be extracted from the news plate page in a manner similar to that used for the homepage news link extraction result; the news link extractor may also be used to output the plate page news link extraction result of the news plate page, which is not described in detail here.
In some practical applications, a plate link extractor may be configured to determine whether a news plate page exists in a news website homepage and, if so, to output the links of the news plate pages automatically; these links can then be input automatically to the news link extractor, which outputs the plate page news link extraction result of the news plate pages.
Step S209, determining the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
The obtained homepage news link extraction result and the plate page news link extraction result can be automatically output together as a final news link extraction result.
According to the news link extraction method provided by the disclosure, link data extraction can be firstly carried out on a news website homepage to obtain a homepage news link extraction result, a homepage non-news link extraction result and a homepage news link update speed of the homepage, then whether a news plate page exists in the news website homepage is further judged according to the homepage non-news link extraction result and the homepage news link update speed, and then the plate page news link extraction result in the news plate page is extracted under the condition that the news plate page exists, and further the homepage news link extraction result and the plate page news link extraction result are used together as a final website news link extraction result of the news website homepage to be output. The method can automatically and efficiently obtain the news links of newly added homepage at each extraction moment, rapidly judge whether news plate pages exist in the homepage of the news website, and then automatically and efficiently obtain the news link extraction results of plate pages in the news plate pages; therefore, the method not only can quickly obtain the news links in the news website homepage, but also can mine the news plate pages in the news website homepage and quickly obtain the news links in the news plate pages, so that all website news link extraction results in the news website homepage can be continuously, efficiently and comprehensively output.
In some practical applications, two components, "news link extractor" and "plate link extractor", may be set in the news link extraction method provided in the present disclosure, and the news link extraction method is implemented by using the two components.
Fig. 7 shows a schematic diagram of a news link extraction method according to an embodiment of the present disclosure. As shown in fig. 7, a news link extractor 701 and a plate link extractor 702 are included.
The input of the news link extractor 701 may be a page link 700 pointing to a page that contains news links (e.g., a homepage link or a plate link); the outputs of the news link extractor 701 may be news links (i.e., the homepage news link extraction result), non-news links (i.e., the homepage non-news link extraction result), and a news link update speed (i.e., the homepage news link update speed).
The inputs of the plate link extractor 702 may be the non-news links and the homepage news link update speed, and the output of the plate link extractor 702 may be the determined plate links (i.e., links to news plate pages).
In the initial state, the homepage link of the news website (i.e., the homepage link of the news website homepage) can be obtained first as the initial input of the whole news link extraction flow. The homepage link may first be input to the news link extractor 701, which outputs the homepage news links 703, the non-news links 704, and the news link update speed 705 of the homepage.
The non-news links 704 and the news link update speed 705 may be input to the plate link extractor 702; after the plate links 706 are extracted, they may be input to the news link extractor 701 again, which outputs the plate page news links 707 found in the pages to which the plate links point.
Finally, the homepage news links 703 and the plate page news links 707 may together be used as the final website news link extraction result 708 of the news website.
Fig. 8 shows a schematic diagram of a news link extractor in a news link extraction method according to an embodiment of the present disclosure. As shown in fig. 8, the news link extractor 800 may have one input parameter and three output parameters. The input parameter may be a page link 801 to be collected, such as a homepage link or a plate page link, where the page pointed to by the link contains a large number of news links. Of the three output parameters, the first is the news links 802, which can be a component of the final website news link extraction result; the second is the non-news links 803; and the third is the news link update speed 804.
The design of the news link extractor 800 uses two sliding windows. One is a news extraction sliding window 805 of size 2, which can be used to extract news links; the other is a non-news extraction sliding window 806 whose size is a larger value β (e.g., 36), which can be used to extract non-news links. For an input page link 801, all links on the page may be collected at a fixed interval (e.g., every 6 hours) and links to other domain names filtered out, which yields one sequence element (i.e., one same-domain-name link set).
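The filtering that produces one sequence element can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `same_domain_links` and the example URLs are hypothetical, fetching and parsing the page are omitted, and "same domain name" is interpreted here as an exact hostname match, which the disclosure does not specify.

```python
from typing import Iterable, Set
from urllib.parse import urljoin, urlsplit


def same_domain_links(page_url: str, hrefs: Iterable[str]) -> Set[str]:
    """Resolve each href against the page URL and keep links on the same host."""
    host = urlsplit(page_url).hostname
    kept: Set[str] = set()
    for href in hrefs:
        absolute = urljoin(page_url, href)
        # Assumption: same domain name means an identical hostname.
        if urlsplit(absolute).hostname == host:
            kept.add(absolute)
    return kept


# Hypothetical hrefs collected from a homepage.
hrefs = [
    "/a/1.html",
    "https://news.example.com/b/2.html",
    "https://ads.example.net/x",
    "mailto:editor@example.com",
]
print(sorted(same_domain_links("https://news.example.com/", hrefs)))
# ['https://news.example.com/a/1.html', 'https://news.example.com/b/2.html']
```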
As shown in fig. 8, M is the current time. The news extraction sliding window 805 slides along the sequence as time passes and processes the two most recent sequence elements, including the one at the current extraction time; specifically, the processing operation calculates the difference set between the two sequence elements and outputs it as the newly added news links 802.
The non-news extraction sliding window 806 likewise slides along the sequence as time passes and processes the β most recent sequence elements, including the one at the current extraction time; the intersection of all the link sets in the non-news extraction sliding window 806 can be taken and output as the non-news links 803.
In this embodiment, two variables are also maintained in the news link extractor 800 for each page: the total number of news links collected (i.e., the total number of newly added homepage news links) and the total number of collection rounds (i.e., the total number of extractions). Dividing the total number of collected news links by the number of collection rounds gives the news link update speed 804 of the page, which is output.
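The extractor of fig. 8 can be sketched as a small stateful class that is fed one same-domain-name link set per extraction time. This is a hedged sketch, not the implementation of the disclosure: the class name `NewsLinkExtractor`, the method name `feed` and the example links are hypothetical. Two simplifying assumptions are made: a single bounded queue of size β serves both windows (its last two elements act as the news extraction window), and the first round counts toward the number of rounds although it yields no news links.

```python
from collections import deque
from typing import Deque, Set, Tuple


class NewsLinkExtractor:
    def __init__(self, beta: int = 36) -> None:
        # Non-news window of size beta; its two newest entries form the news window.
        self.window: Deque[Set[str]] = deque(maxlen=beta)
        self.total_new_links = 0  # total news links collected so far
        self.rounds = 0           # total collection rounds so far

    def feed(self, links: Set[str]) -> Tuple[Set[str], Set[str], float]:
        """Process one sequence element; return (news, non_news, update_speed)."""
        previous = self.window[-1] if self.window else None
        current = set(links)
        self.window.append(current)
        self.rounds += 1
        # Difference set over the size-2 window; empty on the first round.
        news = current - previous if previous is not None else set()
        self.total_new_links += len(news)
        # Intersection over everything currently held in the size-beta window.
        non_news = set.intersection(*self.window)
        return news, non_news, self.total_new_links / self.rounds


extractor = NewsLinkExtractor(beta=3)
extractor.feed({"/about", "/a1"})
extractor.feed({"/about", "/a1", "/a2"})
news, non_news, speed = extractor.feed({"/about", "/a2", "/a3"})
print(news, non_news)  # {'/a3'} {'/about'}
```

Until β sequence elements have been collected, the intersection covers a shorter period than intended, so early non-news results in this sketch should be treated as provisional.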
Fig. 9 shows a schematic diagram of a plate link extractor in a news link extraction method according to an embodiment of the present disclosure. As shown in fig. 9, the plate link extractor 900 may have two input parameters and one output parameter. The input parameters may be the non-news links 901 of the homepage and the news link update speed 902 of the homepage, and the output parameter may be the plate links 903. The operation flow is as follows:
All non-news links 901 are input to a news link extractor, which outputs the news link update speed 904 (i.e., the sub-page news update speed) of the page to which each non-news link 901 points. The page news link update speed 904 is compared with the news link update speed 902 of the input homepage; if the ratio of the former to the latter exceeds the update threshold (e.g., 10%), the page can be considered a plate page and the link to the page can be output as a plate link 903.
It is noted that the above-described figures are only schematic illustrations of processes involved in a method according to an exemplary embodiment of the invention, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
Fig. 10 shows a block diagram of a news link extraction device 1000 according to an embodiment of the present disclosure. As shown in fig. 10, the device includes: an acquisition module 1001, configured to acquire a homepage link of a news website homepage; a homepage link data extraction module 1002, configured to extract link data according to the homepage link to obtain a homepage news link extraction result, a homepage non-news link extraction result, and a homepage news link update speed of the news website homepage; a judging module 1003, configured to judge whether a news plate page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed; a plate link extraction module 1004, configured to extract a plate page news link extraction result from a news plate page when the news plate page exists; and a determining module 1005, configured to determine a website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
Through the news link extraction device provided by the disclosure, link data extraction can be firstly carried out on a news website homepage to obtain a homepage news link extraction result, a homepage non-news link extraction result and a homepage news link update speed of the homepage, then whether a news plate page exists in the news website homepage is further judged according to the homepage non-news link extraction result and the homepage news link update speed, and the plate page news link extraction result in the news plate page is extracted again under the condition that the news plate page exists, so that the homepage news link extraction result and the plate page news link extraction result are used together as a final website news link extraction result of the news website homepage to be output. The method can automatically and efficiently obtain the news links of newly added homepage at each extraction moment, rapidly judge whether news plate pages exist in the homepage of the news website, and then automatically and efficiently obtain the news link extraction results of plate pages in the news plate pages; therefore, the method not only can quickly obtain the news links in the news website homepage, but also can mine the news plate pages in the news website homepage and quickly obtain the news links in the news plate pages, so that all website news link extraction results in the news website homepage can be continuously, efficiently and comprehensively output.
In some embodiments, the determining module 1003 determines whether a news slab page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed, including: extracting link data according to each homepage non-news link in the homepage non-news link extraction result to obtain the sub-page news update speed of the sub-page pointed by each homepage non-news link; if the target sub page with the ratio of the sub page news update speed to the homepage news link update speed being larger than the update threshold exists in the sub pages, determining that a news plate page exists in the homepage of the news website, and determining the target sub page as the news plate page.
In some embodiments, the homepage link data extraction module 1002 performing link data extraction according to the homepage link to obtain the homepage news link extraction result of the news website homepage includes: determining a plurality of extraction times based on a preset extraction frequency, and obtaining a same-domain-name link set of the news website homepage at each extraction time according to the homepage link; at the current extraction time, obtaining a first preset number of same-domain-name link sets by using a preset news extraction sliding window, the first preset number of same-domain-name link sets corresponding to the current extraction time and the extraction times before it; obtaining the newly added homepage news links at the current extraction time according to the first preset number of same-domain-name link sets; and outputting the newly added homepage news links at the current extraction time as the homepage news link extraction result of the news website homepage.
In some embodiments, the first preset number is 2; the first preset number of same-domain-name link sets comprise the current same-domain-name link set at the current extraction time and the last same-domain-name link set at the extraction time immediately before the current extraction time; the homepage link data extraction module 1002 obtaining the newly added homepage news links at the current extraction time according to the first preset number of same-domain-name link sets includes: determining the difference set between the current same-domain-name link set and the last same-domain-name link set; and determining the newly added homepage news links at the current extraction time according to the links in the difference set.
In some embodiments, the homepage link data extraction module 1002 performing link data extraction according to the homepage link to obtain the homepage non-news link extraction result of the news website homepage includes: at the current extraction time, obtaining a second preset number of same-domain-name link sets by using a preset non-news extraction sliding window, the second preset number of same-domain-name link sets corresponding to the current extraction time and the extraction times before it; obtaining the homepage non-news links at the current extraction time according to the second preset number of same-domain-name link sets; and outputting the homepage non-news links at the current extraction time as the homepage non-news link extraction result of the news website homepage.
In some embodiments, the second preset number is greater than the first preset number; the homepage link data extraction module 1002 obtaining the homepage non-news links at the current extraction time according to the second preset number of same-domain-name link sets includes: determining the intersection of the second preset number of same-domain-name link sets; and determining the homepage non-news links at the current extraction time according to the links in the intersection.
In some embodiments, the homepage link data extraction module 1002 performing link data extraction according to the homepage link to obtain the homepage news link update speed of the news website homepage includes: acquiring a preset period containing the current extraction time, and determining the total number of extractions, i.e. the number of extraction times within the preset period; determining the total number of newly added homepage news links according to the newly added homepage news links at all the extraction times in the preset period; and determining the homepage news link update speed of the news website homepage in the preset period according to the total number of newly added homepage news links and the total number of extractions.
For other details of the embodiment of fig. 10, reference may be made to the other embodiments described above.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein collectively as a "circuit," "module," or "system."
Fig. 11 shows a block diagram of a news link extraction computer device in an embodiment of the disclosure. It should be noted that the illustrated electronic device is only an example, and should not impose any limitation on the functions and application scope of the embodiments of the present invention.
An electronic device 1100 according to this embodiment of the invention is described below with reference to fig. 11. The electronic device 1100 shown in fig. 11 is merely an example, and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 11, the electronic device 1100 is embodied in the form of a general purpose computing device. Components of the electronic device 1100 may include, but are not limited to: at least one processing unit 1110, at least one storage unit 1120, and a bus 1130 connecting the different system components (including the storage unit 1120 and the processing unit 1110).
Wherein the storage unit stores program code that is executable by the processing unit 1110 such that the processing unit 1110 performs steps according to various exemplary embodiments of the present invention described in the above-described "exemplary methods" section of the present specification. For example, the processing unit 1110 may perform the method as shown in fig. 2.
The storage unit 1120 may include a readable medium in the form of a volatile storage unit, such as a Random Access Memory (RAM) 11201 and/or a cache memory 11202, and may further include a Read Only Memory (ROM) 11203.
The storage unit 1120 may also include a program/utility 11204 having a set (at least one) of program modules 11205, such program modules 11205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus 1130 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 1100 may also communicate with one or more external devices 1200 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 1100, and/or any devices (e.g., routers, modems, etc.) that enable the electronic device 1100 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 1150. Also, electronic device 1100 can communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 1160. As shown, network adapter 1160 communicates with other modules of electronic device 1100 via bus 1130. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 1100, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible embodiments, the various aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the "exemplary methods" section of this specification, when said program product is run on the terminal device.
A program product for implementing the above-described method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM) and include program code, and may be run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a mobile terminal, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
According to one aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations of the above-described embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.

Claims (10)

1. A method for extracting a news link, comprising:
acquiring a homepage link of a news website homepage;
performing link data extraction according to the homepage link to obtain a homepage news link extraction result, a homepage non-news link extraction result, and a homepage news link update speed of the news website homepage;
judging whether a news plate page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed;
in a case where the news plate page exists, extracting a plate page news link extraction result from the news plate page;
and determining the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
2. The method of claim 1, wherein judging whether a news plate page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed comprises:
performing link data extraction according to each homepage non-news link in the homepage non-news link extraction result to obtain the sub-page news update speed of the sub-page pointed to by each homepage non-news link;
if, among the sub-pages, there is a target sub-page for which the ratio of the sub-page news update speed to the homepage news link update speed is greater than an update threshold, determining that a news plate page exists in the news website homepage, and determining the target sub-page as the news plate page.
3. The method according to claim 1 or 2, wherein performing link data extraction according to the homepage link to obtain the homepage news link extraction result of the news website homepage comprises:
determining a plurality of extraction moments based on a preset extraction frequency, and obtaining a same domain name link set of the news website homepage at each extraction moment according to the homepage link;
at the current extraction moment, obtaining a first preset number of same domain name link sets by using a preset news extraction sliding window, wherein the first preset number of same domain name link sets correspond to the current extraction moment and the extraction moments before the current extraction moment;
obtaining the newly added homepage news links at the current extraction moment according to the first preset number of same domain name link sets;
and outputting the newly added homepage news links at the current extraction moment as the homepage news link extraction result of the news website homepage.
4. The method according to claim 3, wherein the first preset number is 2; the first preset number of same domain name link sets comprise a current same domain name link set at the current extraction moment and a previous same domain name link set at the extraction moment immediately before the current extraction moment;
and obtaining the newly added homepage news links at the current extraction moment according to the first preset number of same domain name link sets comprises:
determining a difference set between the current same domain name link set and the previous same domain name link set;
and determining the newly added homepage news links at the current extraction moment according to the links in the difference set.
5. The method according to claim 3, wherein performing link data extraction according to the homepage link to obtain the homepage non-news link extraction result of the news website homepage comprises:
at the current extraction moment, obtaining a second preset number of same domain name link sets by using a preset non-news extraction sliding window, wherein the second preset number of same domain name link sets correspond to the current extraction moment and the extraction moments before the current extraction moment;
obtaining the homepage non-news links at the current extraction moment according to the second preset number of same domain name link sets;
and outputting the homepage non-news links at the current extraction moment as the homepage non-news link extraction result of the news website homepage.
6. The method of claim 5, wherein the second preset number is greater than the first preset number;
and obtaining the homepage non-news links at the current extraction moment according to the second preset number of same domain name link sets comprises:
determining the intersection of the second preset number of same domain name link sets;
and determining the homepage non-news links at the current extraction moment according to the links in the intersection.
7. The method according to claim 3, wherein performing link data extraction according to the homepage link to obtain the homepage news link update speed of the news website homepage comprises:
acquiring a preset period containing the current extraction moment, and determining the total number of extractions, that is, the number of extraction moments within the preset period;
determining the total number of newly added homepage news links according to the newly added homepage news links at all the extraction moments within the preset period;
and determining the homepage news link update speed of the news website homepage within the preset period according to the total number of newly added homepage news links and the total number of extractions.
8. A news link extraction apparatus, comprising:
an acquisition module, configured to acquire a homepage link of a news website homepage;
a homepage link data extraction module, configured to perform link data extraction according to the homepage link to obtain a homepage news link extraction result, a homepage non-news link extraction result, and a homepage news link update speed of the news website homepage;
a judging module, configured to judge whether a news plate page exists in the news website homepage according to the homepage non-news link extraction result and the homepage news link update speed;
a plate link extraction module, configured to extract a plate page news link extraction result from the news plate page in a case where the news plate page exists;
and a determining module, configured to determine the website news link extraction result of the news website homepage according to the homepage news link extraction result and the plate page news link extraction result.
9. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the news link extraction method according to any one of claims 1 to 7.
10. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which when executed by the one or more processors cause the one or more processors to implement the news link extraction method of any one of claims 1 to 7.
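The set operations recited in claims 2, 4, 6 and 7 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes each same domain name link set has already been collected as a Python set of URL strings, one per extraction moment, oldest first. The function names, the default window size of 3, and the reading of "update speed" as newly added links divided by the number of extractions are all assumptions made for illustration; the claims only say the speed is determined "according to" those two quantities.

```python
from typing import Dict, List, Set


def new_news_links(snapshots: List[Set[str]]) -> Set[str]:
    """Claim 4 (window of 2): links in the current set but not in the previous one."""
    if len(snapshots) < 2:
        return set()
    return snapshots[-1] - snapshots[-2]


def non_news_links(snapshots: List[Set[str]], window: int = 3) -> Set[str]:
    """Claim 6: links present in every one of the last `window` sets.

    `window` plays the role of the second preset number (assumed 3 here,
    which is greater than the first preset number of 2).
    """
    if len(snapshots) < window:
        return set()
    return set.intersection(*snapshots[-window:])


def update_speed(snapshots: List[Set[str]]) -> float:
    """Claim 7, under the assumption that speed = total new links / extractions.

    The first extraction has no predecessor, so it contributes no new links.
    """
    if not snapshots:
        return 0.0
    total_new = sum(
        len(cur - prev) for prev, cur in zip(snapshots, snapshots[1:])
    )
    return total_new / len(snapshots)


def find_plate_pages(
    homepage_speed: float, subpage_speeds: Dict[str, float], threshold: float
) -> List[str]:
    """Claim 2: sub-pages whose speed ratio to the homepage exceeds the threshold.

    Written as speed > threshold * homepage_speed, which is equivalent to the
    ratio test for a positive homepage speed and avoids dividing by zero.
    """
    return sorted(
        url
        for url, speed in subpage_speeds.items()
        if speed > threshold * homepage_speed
    )


# Hypothetical link sets from three consecutive extraction moments.
snaps = [
    {"/", "/about", "/a1"},
    {"/", "/about", "/a1", "/a2"},
    {"/", "/about", "/a2", "/a3", "/a4"},
]
print(new_news_links(snaps))
print(non_news_links(snaps))
print(update_speed(snaps))
print(find_plate_pages(update_speed(snaps), {"/sports": 0.8, "/about": 0.0}, 0.5))
```

In this reading, links that persist across the longer window are treated as navigation (non-news) links, and each of them is then probed as a candidate plate page by comparing its own update speed with that of the homepage.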
Application CN202310931750.XA, priority date 2023-07-26, filing date 2023-07-26: News link extraction method and device, storage medium and electronic equipment. Status: Pending. Publication: CN116955869A (en)

Publications (1)

Publication Number Publication Date
CN116955869A (en) 2023-10-27

Family

ID=88448686

Country Status (1)

Country Link
CN (1) CN116955869A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination