CN108241674B - Method and device for extracting webpage release time - Google Patents

Info

Publication number
CN108241674B
Authority
CN
China
Prior art keywords
time
updated
time set
webpage
earliest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611219056.1A
Other languages
Chinese (zh)
Other versions
CN108241674A (en
Inventor
赵凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201611219056.1A priority Critical patent/CN108241674B/en
Publication of CN108241674A publication Critical patent/CN108241674A/en
Application granted granted Critical
Publication of CN108241674B publication Critical patent/CN108241674B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a method and a device for extracting webpage release time, wherein the method comprises the following steps: extracting a first time set from a webpage address of a target webpage of which the release time needs to be extracted, wherein the first time set comprises the time matched from the webpage address; acquiring the earliest reprinting time of the webpage content of the target webpage; extracting a second time set from the webpage content of the target webpage, wherein the second time set comprises the time matched from the webpage content; determining a publication time of the target web page based on the first time set, the earliest reprint time, and the second time set.

Description

Method and device for extracting webpage release time
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for extracting webpage release time.
Background
In the big data era, massive amounts of data need to be acquired, and as the transparency of government policy information improves, efficiently and accurately acquiring valuable data from government webpages becomes very important. Obtaining the release time or update time of government website content is one important item.
In the prior art, after webpage content is obtained by a web crawler, regular-expression matching of time formats is performed on the webpage content, the matched character strings are formatted as dates, the dates that have a keyword such as "release time" or "update time" within a certain range in front of the string are screened out, and the release time of the webpage content is determined from the screened dates.
However, in the time extraction scheme in the prior art, if there is no matched date in the web page content, or there is no keyword indicating that the matched date is "release time" or "update time" before or after the matched date, or there are multiple release dates screened out, it is difficult to accurately determine the release time of the web page.
Disclosure of Invention
In view of the above, the present application is proposed to provide a method for overcoming or at least partially solving the technical problem of low accuracy of publication time extracted in the prior art.
The application provides a method for extracting webpage release time, which comprises the following steps:
extracting a first time set from a webpage address of a target webpage of which the release time needs to be extracted, wherein the first time set comprises the time matched from the webpage address;
acquiring the earliest reprinting time of the webpage content of the target webpage;
extracting a second time set from the webpage content of the target webpage, wherein the second time set comprises the time matched from the webpage content;
determining a publication time of the target web page based on the first time set, the earliest reprint time, and the second time set.
Preferably, the determining the publishing time of the target webpage based on the first time set, the earliest reprinting time and the second time set includes:
deleting the time later than the earliest reprinting time in the first time set to obtain an updated first time set;
deleting the time later than the earliest reprinting time in the second time set to obtain an updated second time set;
and if the updated first time set and the updated second time set are not empty, determining the release time of the target webpage based on the time in the updated first time set and the time in the updated second time set.
Preferably, the determining the publishing time of the target webpage based on the updated time in the first time set and the updated time in the second time set includes:
judging whether time matched with the time in the updated second time set exists in the updated first time set or not;
if so, deleting the time which is not matched with the time in the updated second time set in the updated first time set to obtain a third time set;
and sorting the times in the third time set based on the preset attribute priority of time and the attributes of the times in the third time set, and determining the time with the highest attribute priority in the third time set as the publishing time of the target webpage.
Preferably, the determining the publishing time of the target webpage based on the updated time in the first time set and the updated time in the second time set further includes:
and if no time matching a time in the updated second time set exists in the updated first time set, sorting the times in the updated first time set in chronological order, and taking the first-ranked time as the release time of the target webpage.
Preferably, the method determines the publishing time of the target webpage based on the first time set, the earliest reprinting time and the second time set, and further includes:
and if the updated first time set is empty and the updated second time set is not empty, sequencing the time in the updated second time set based on the attribute priority of the preset time and the attribute of the time in the updated second time set, and determining the time with the highest attribute priority in the updated second time set as the release time of the target webpage.
Preferably, the method determines the publishing time of the target webpage based on the first time set, the earliest reprinting time and the second time set, and further includes:
and if the updated first time set and the updated second time set are both empty, taking the earliest reprinting time as the release time of the target webpage.
Preferably, the method determines the publishing time of the target webpage based on the first time set, the earliest reprinting time and the second time set, and further includes:
and if the updated first time set is not empty and the updated second time set is empty, sorting the times in the updated first time set in chronological order, and taking the first-ranked time as the release time of the target webpage.
The application also provides a device for extracting the webpage release time, which comprises:
a first extraction unit, configured to extract a first time set from a webpage address of a target webpage of which the release time needs to be extracted, wherein the first time set comprises the time matched from the webpage address;
a reprint obtaining unit, configured to obtain the earliest reprint time of the web page content of the target web page;
the second extraction unit is used for extracting a second time set from the webpage content of the target webpage, wherein the second time set comprises the time matched from the webpage content;
a time determining unit, configured to determine a publishing time of the target webpage based on the first time set, the earliest reprinting time, and the second time set.
The above apparatus, preferably, the time determination unit includes:
a first deleting subunit, configured to delete a time later than the earliest reprinting time in the first time set, and obtain an updated first time set;
a second deleting subunit, configured to delete a time later than the earliest reprinting time in the second time set, to obtain an updated second time set;
and the comprehensive determination subunit is configured to determine, if neither the updated first time set nor the updated second time set is empty, the publishing time of the target webpage based on the time in the updated first time set and the time in the updated second time set.
The above apparatus, preferably, the comprehensive determination subunit includes:
the matching judgment module is used for judging whether time matched with the time in the updated second time set exists in the updated first time set or not, and if so, triggering the first determination module;
a first determining module, configured to delete a time in the updated first time set that does not match a time in the updated second time set, obtain a third time set, sort the times in the third time set based on a preset attribute priority of the time and an attribute of the time in the third time set, and determine a time with a highest attribute priority of the time in the third time set as a publishing time of the target web page.
The above apparatus, preferably, the comprehensive determination subunit further includes:
and a second determining module, configured to sort the times in the updated first time set in chronological order if the matching judgment module determines that there is no time in the updated first time set that matches a time in the updated second time set, and use the first-ranked time as the publishing time of the target webpage.
Preferably, the above apparatus, wherein the time determination unit further includes:
and the attribute determining subunit is configured to, if the updated first time set is empty and the updated second time set is not empty, sort the times in the updated second time set based on the attribute priority of the preset time and the attribute of the time in the updated second time set, and determine the time with the highest attribute priority in the updated second time set as the publishing time of the target web page.
Preferably, the above apparatus, wherein the time determination unit further includes:
and the reprint determining subunit is configured to, if the updated first time set and the updated second time set are both empty, use the earliest reprint time as the publishing time of the target webpage.
Preferably, the above apparatus, wherein the time determination unit further includes:
and the front and back determining subunit is configured to sort the time in the updated first time set according to a time sequence if the updated first time set is not empty and the updated second time set is empty, and use the time at the top of the sort as the publishing time of the target webpage.
By means of the technical scheme, the method and the device for extracting the webpage release time determine the release time of the target webpage by combining the time extracted from the webpage address, the time extracted from the webpage content and the earliest reprinting time, time interference information in the webpage address is less because the time in the webpage address is usually used as a request parameter, and the time extracted from the webpage content is accurately screened by combining the time in the webpage address, so that the accuracy rate of the obtained release time is far higher than that of the time extracted from the webpage content alone, and the purpose of the application is achieved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a flowchart of a method for extracting a webpage release time according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an apparatus for extracting webpage publishing time according to a second embodiment of the present application;
fig. 3 is a schematic partial structural diagram of an apparatus for extracting webpage publishing time according to a second embodiment of the present application;
fig. 4 is another partial structural schematic diagram of the second embodiment of the present application.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Referring to fig. 1, a flowchart of a method for extracting webpage publishing time provided in an embodiment of the present application is suitable for extracting publishing time of a target webpage. In this embodiment, the method may include the steps of:
Step 101: a target webpage whose release time needs to be extracted is determined.
In this embodiment, a user may input a URL (Uniform Resource Locator) of a target webpage through a client, and the present embodiment determines the target webpage based on the webpage address.
The target web page may be various news web pages or notification web pages, etc., such as a "notification about a hygiene item check" web page, etc. The web pages can be published by government websites and can also be published by other application websites, that is, the embodiment is suitable for extracting the publishing time of the web pages on various websites.
Step 102: a first time set is extracted from a webpage address of a target webpage.
The first time set comprises at least one time, and the times are matched from the webpage addresses of the target webpage.
It should be noted that, in this embodiment, the webpage address, that is, the webpage URL, may be obtained from the target webpage; character strings representing times are then matched in the URL by using a regular expression; the matched strings are converted into time-format strings, which are the times referred to in this application; and finally these times are put into the first time set, denoted for example UrlDate.
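The URL matching described above can be sketched as follows. This is a minimal illustration, not the patent's own implementation: the regular expression, the function name `extract_url_dates`, and the example URLs are all assumptions, since the text does not give a concrete pattern.

```python
import re
from datetime import date

# Hypothetical pattern (not from the patent): a 4-digit year starting with 19 or 20,
# then a 2-digit month and day, with optional '/', '-' or '_' separators,
# e.g. 2016/12/26, 2016-12-26 or 20161226.
URL_DATE_RE = re.compile(
    r"(?<!\d)(19\d{2}|20\d{2})[/\-_]?(\d{2})[/\-_]?(\d{2})(?!\d)"
)


def extract_url_dates(url: str) -> list[date]:
    """Return every valid calendar date matched in the URL (the UrlDate set)."""
    url_dates = []
    for year, month, day in URL_DATE_RE.findall(url):
        try:
            url_dates.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Digits that fit the pattern but are not a real date (e.g. month 13).
            continue
    return url_dates


url_date = extract_url_dates("http://www.example.gov.cn/art/2016/12/26/art_1234.html")
```

Numeric strings that happen to fit the pattern but are not release times are not filtered here; in the method described, that is the job of the later comparison with the earliest reprinting time.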
Step 103: the earliest reprinting time of the webpage content of the target webpage is obtained from webpage data on the Internet.
In this embodiment, a web crawler may first be used to find occurrences of the target webpage on the major mainstream search websites on the Internet, and then to obtain the earliest, that is, the smallest, reprinting time of the webpage content of the target webpage on these websites, which may be denoted lastDate.
Step 104: the times later than the earliest reprinting time are deleted from the first time set to obtain an updated first time set.
In this embodiment, the earliest reprinting time is used to screen the times in the first time set. After the target webpage is released it is reprinted by various websites, and any reprinting time is necessarily later than, that is, greater than, the actual release time of the target webpage. Therefore, in this embodiment the times in the first time set that are later than the earliest reprinting time are deleted, leaving only the times earlier than lastDate. The first time set at this point is denoted RightDate; that is, the first time set changes from UrlDate, which contains all the times in the URL, to RightDate, which retains only the times earlier than the earliest reprinting time.
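The screening in step 104 amounts to a simple filter. The sketch below assumes times are held as `datetime.date` values; the function name `drop_times_after` is hypothetical. It also assumes that a time equal to lastDate is kept, because the step only deletes times later than the earliest reprinting time; the text does not say explicitly how ties are handled.

```python
from datetime import date


def drop_times_after(times: list[date], last_date: date) -> list[date]:
    """Delete the times later than the earliest reprinting time (lastDate)."""
    # Assumption: a time equal to last_date is not "later" and is therefore kept.
    return [t for t in times if t <= last_date]


url_date = [date(2016, 12, 26), date(2017, 3, 1)]
last_date = date(2016, 12, 28)
right_date = drop_times_after(url_date, last_date)  # the RightDate set
```

The same function can be applied unchanged to the times extracted from the webpage content in step 106.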
Step 105: a second set of times is extracted from the web page content of the target web page.
The second time set comprises at least one time, and these times are matched from the webpage content of the target webpage.
It should be noted that, in this embodiment, a regular expression may be used to match against all the characters in the webpage content of the target webpage, and the matched character strings are converted into time-format strings, for example year-month-day strings.
Step 106: the times later than the earliest reprinting time are deleted from the second time set to obtain an updated second time set.
In this embodiment, the earliest reprinting time is used to screen the time in the second time set, and the time later than the earliest reprinting time lastDate is eliminated, and only the time earlier than the lastDate is left.
Step 107: the respective times in the updated second set of times are sorted with respect to attribute priority.
Specifically, in this embodiment, the times in the second time set are sorted according to the weight priority of each attribute of the time.
In this embodiment, the time attribute may be understood as: the region attribute of the character string corresponding to the time in the character string region located in the web page content of the target web page, where the character string region may be a character string region in a certain range before the character string corresponding to the time, and the certain range may be a range of 10 characters or 5 character strings before the character string corresponding to the time.
Accordingly, the region attribute may include: the time keyword character string and the non-time character string contained in the character string area, and the position parameter of the character string area in the webpage content. That is, the attributes of time include: the time keyword character string and the non-time character string which are contained in the character string area range of the character string corresponding to the time in the webpage content, and the position of the character string area range of the character string corresponding to the time in the webpage content in the whole webpage content.
For example, the time keyword strings (keywords) may be keywords such as "release time", "update date" and "release date". If one or more keywords of this type are contained in the 10 characters in front of the character string corresponding to the time, the attribute value is yes (true); otherwise it is no (false).
The non-time strings (banwords) may be obvious non-time keywords such as "number", "address" and "serial number". If such a keyword appears in the 10 characters in front of the character string corresponding to the time, the attribute value is yes (true); otherwise it is no (false).
The position parameter (position) refers to where the character string corresponding to the time is located within the whole body text of the webpage content, for example in the first 20% or the last 20% of the body. An attribute value of 0 indicates that the string appears within the first 20% of the characters of the body, 2 indicates that it appears within the last 20%, and 1 indicates that it appears in the middle. In most cases, the release time of the target webpage appears under the body heading or at the end of the body.
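The three attributes described above could be computed along the following lines. This is a sketch under stated assumptions: the date pattern, the Chinese keyword and banword lists, the `TimeCandidate` type and the function names are illustrative choices, not taken from the patent; only the 10-character window and the 20% thresholds come from the text.

```python
import re
from dataclasses import dataclass
from datetime import date

# Hypothetical pattern and word lists; the patent does not give concrete ones.
DATE_RE = re.compile(r"(?<!\d)(\d{4})[-/.年](\d{1,2})[-/.月](\d{1,2})(?!\d)")
KEYWORDS = ("发布时间", "更新时间", "发布日期")  # release time, update time, release date
BANWORDS = ("编号", "地址", "序号")  # number, address, serial number


@dataclass(frozen=True)
class TimeCandidate:
    when: date
    keywords: bool  # a time keyword appears in the window before the match
    banwords: bool  # an obvious non-time word appears in the window before the match
    position: int   # 0 = first 20% of the text, 2 = last 20%, 1 = middle


def position_of(index: int, length: int) -> int:
    """Map a character offset to the position attribute 0, 1 or 2."""
    ratio = index / max(length, 1)
    if ratio < 0.2:
        return 0
    if ratio >= 0.8:
        return 2
    return 1


def extract_candidates(text: str, window: int = 10) -> list[TimeCandidate]:
    """Match dates in the body text and attach the three attributes to each."""
    candidates = []
    for m in DATE_RE.finditer(text):
        try:
            when = date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        except ValueError:
            continue  # fits the pattern but is not a valid calendar date
        before = text[max(0, m.start() - window):m.start()]
        candidates.append(TimeCandidate(
            when=when,
            keywords=any(k in before for k in KEYWORDS),
            banwords=any(b in before for b in BANWORDS),
            position=position_of(m.start(), len(text)),
        ))
    return candidates
```

The window is a parameter because 10 characters suits Chinese keywords but would be too short for most English equivalents.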
In the above attributes, the attribute priority of the non-time character string is higher than the attribute priority of the time keyword character string, and the attribute priority of the time keyword character string is higher than the attribute priority of the position parameter;
for the non-time character string, an attribute value of no has a higher priority than an attribute value of yes; for the time keyword character string, an attribute value of yes has a higher priority than an attribute value of no; for the position parameter, a low parameter value has a higher priority than a high parameter value.
For example, the above three attributes are set for each time string in the second time set, the time strings are put into the AllDate set, and the times in the second time set are screened according to the following priority:
1. banwords: false, keywords: true, position: 0;
2. banwords: false, keywords: true, position: 2;
3. banwords: false, keywords: false, position: 0;
4. banwords: false, keywords: false, position: 2;
5. banwords: true, keywords: true, position: 0;
6. banwords: true, keywords: true, position: 2;
7. banwords: true, keywords: false, position: 0;
8. banwords: true, keywords: false, position: 2;
that is, in this embodiment, the times in the second time set may be sorted based on the following filtering manner:
First, the times whose banwords value is false are selected, and the times whose banwords value is true are ranked last. Within each of these groups, the times whose keywords value is true are ranked ahead of those whose keywords value is false. Within each resulting group, the times with position 0 or position 2 are selected and the times with position 1 are ranked last; of the times with position 0 and position 2, those with position 0 are ranked first and those with position 2 after them.
Finally, the ordering 1 to 8 above is obtained, and the sorted second time set may be denoted perFinalDate.
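One way to realise this ordering is a composite sort key. The sketch below is an illustration only: the `Candidate` type and function names are hypothetical, and it assumes the screening rule just described, in which position 1 ranks after both position 0 and position 2.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    text: str       # the matched time string
    banwords: bool
    keywords: bool
    position: int   # 0, 1 or 2


# Assumed rank of the position attribute: 0 first, then 2, and 1 last.
_POSITION_RANK = {0: 0, 2: 1, 1: 2}


def priority_key(c: Candidate) -> tuple:
    # False sorts before True, so banwords=False and keywords=True come first.
    return (c.banwords, not c.keywords, _POSITION_RANK[c.position])


def sort_by_attribute_priority(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates ordered by attribute priority (perFinalDate)."""
    return sorted(candidates, key=priority_key)


all_date = [
    Candidate("A", banwords=True, keywords=True, position=0),
    Candidate("B", banwords=False, keywords=False, position=2),
    Candidate("C", banwords=False, keywords=True, position=1),
    Candidate("D", banwords=False, keywords=True, position=2),
    Candidate("E", banwords=False, keywords=True, position=0),
]
per_final_date = sort_by_attribute_priority(all_date)
```

Because Python's sort is stable, candidates with identical attributes keep the order in which they were found in the text.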
Then, whether the updated first time set and the updated second time set are empty is judged, and the release time of the target webpage is further determined in a corresponding mode, wherein the method comprises the following steps:
Step 108: if neither the updated first time set nor the updated second time set is empty, the release time of the target webpage is determined based on the times in the updated first time set and the times in the updated second time set.
Specifically, it is first determined whether a time matching the time in the updated second time set exists in the updated first time set.
If so, the times in the updated first time set that do not match any time in the updated second time set are deleted to obtain a third time set; that is, the times in the updated first time set that match times in the updated second time set are retained as the third time set. Since the times in the third time set have already been ordered by attribute priority, the time with the highest attribute priority can then be determined as the release time of the target webpage. In other words, in this embodiment the times in the updated first time set are verified against the times in the updated second time set in order to determine the release time of the target webpage. Of course, step 107 may also be omitted, so that the times in the second time set are not sorted by attribute priority in advance; instead, after the third time set, that is, the set of times in the first time set that match the second time set, is obtained, the times in the third time set are sorted based on the preset attribute priority and the attributes of those times, and the time with the highest attribute priority in the third time set is determined as the release time of the target webpage.
If no time in the updated first time set matches a time in the updated second time set, the times in the updated first time set are sorted chronologically, and the first-ranked time is taken as the release time of the target webpage. That is, in this embodiment, when the times in the updated first time set are verified against the updated second time set and the verification fails, the earliest time in the first time set, from which the times later than the earliest reprinting time have been removed, is selected as the release time.
Or, in this embodiment, the time in the updated first time set may be matched with each time in the updated second time set one by one, if there is a time in the second time set that matches the time in the first time set, the matched time is screened out to form a third time set, then, the earliest time in the third time set is determined as the publishing time of the target web page, and if there is no time in the second time set that matches the time in the first time set, the earliest time in the first time set is selected as the publishing time of the target web page.
It should be noted that the operation of sorting the time according to the attribute priority may be performed by sorting the time in the second time set according to the attribute priority before the time matching is performed on the first time set and the second time set, as shown in fig. 1, or may be performed by only sorting the time in the third time set obtained by matching according to the attribute priority after the time matching is performed on the first time set and the time in the second time set, where the implementation of the technical scheme in this embodiment is not affected before and after the sorting operation, and both implementations are within the protection scope of the present application.
For example: if RightDate is not empty and perFinalDate is not empty, the character strings of perFinalDate are matched against each item in RightDate to obtain a number of time strings. These are the strings in the second time set that match some time in the first time set, and they remain sorted by attribute priority; the time of the first-ranked string among them is determined as the release time of the target webpage.
Step 109: if the updated first time set is empty and the updated second time set is not empty, the first-ranked time in the updated second time set is determined as the release time of the target webpage.
Step 110: and if the updated first time set and the updated second time set are both empty, taking the earliest reprinting time as the release time of the target webpage.
Step 111: and if the updated first time set is not empty and the updated second time set is empty, selecting the earliest time in the updated first time set as the publishing time of the target webpage.
It should be noted that, if the updated first time set is not empty and the updated second time set is empty, no time was matched from the webpage content and times were matched only from the webpage address. In this case there is usually only one time matched from the webpage address, and that time in the first time set is taken directly as the release time of the target webpage. When two or more times exist in the updated first time set, the earliest of them is taken as the release time of the target webpage.
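Steps 108 to 111 can be summarised as a single decision function. The sketch below is a hedged reading of those steps, not a definitive implementation: the function name `choose_release_time` is hypothetical, times are assumed to be `datetime.date` values, and `per_final_date` is assumed to be already sorted by attribute priority as in step 107.

```python
from datetime import date


def choose_release_time(
    right_date: list[date],       # updated first time set (from the URL)
    per_final_date: list[date],   # updated second time set, sorted by attribute priority
    last_date: date,              # earliest reprinting time
) -> date:
    """Pick the release time following the four cases of steps 108 to 111."""
    if right_date and per_final_date:
        # Step 108: keep the content times confirmed by the URL (third time set).
        url_times = set(right_date)
        matched = [t for t in per_final_date if t in url_times]
        if matched:
            return matched[0]      # highest attribute priority among the matches
        return min(right_date)     # no match: earliest URL time
    if per_final_date:
        return per_final_date[0]   # Step 109: only content times are available
    if right_date:
        return min(right_date)     # Step 111: only URL times are available
    return last_date               # Step 110: fall back to the earliest reprinting time
```

For instance, with URL times of 2016-12-26 and 2016-12-01 and a content list led by 2016-12-26, the function returns 2016-12-26; if none of the content times occurs in the URL, it returns 2016-12-01.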
Examples of the present embodiments in particular implementations are described below:
Step 1: use a regular expression to match character strings in the given URL of the target webpage, convert them into time strings of the form "xxxx year xx month xx day", and place them in the set UrlDate;
Step 2: use a web crawler to obtain the earliest reprinting time lastDate of the target webpage on the mainstream search websites, compare it with the times in UrlDate, and place the times earlier than lastDate in RightDate;
Step 3: use a regular expression to match against all characters in the webpage content, convert the matches into time-format strings, and eliminate the times later than lastDate;
Step 4: set the following attributes for each time string obtained in Step 3: keywords: bool, banwords: bool, position: {0,1,2}; and put the time strings into the AllDate set;
Step 5: screen AllDate, selecting times in the following priority order:
keywords:true,position:0
keywords:true,position:2
keywords:false,position:0
keywords:false,position:2
if banwords is true, continue screening downward; put the screened character strings into perFinalDate [n1, n2, n3...];
Step 6: if RightDate is not empty and perFinalDate is not empty, match each item in RightDate against the character strings of perFinalDate to obtain [a1, a2, a3... b1, b2, b3...]; a1 is then the final release time of the webpage. If RightDate is empty, n1 is the release time; if perFinalDate is also empty, lastDate may be taken as the release time.
It should be noted that, in Step2, the time character string in the url request is in most cases the publishing time of the web page; if the url request in Step1 contains a numeric character string that merely happens to conform to the rule, that string is removed by comparison with the earliest reprinting time, so the publishing time of the web page can largely be determined.
In Step3, all time character strings in the webpage text are matched according to the rule. First, these times serve as a fallback for time extraction when the url contains no time; second, they can be used to verify the accuracy of the time extracted from the url.
In Step4, keywords indicates whether keywords such as release time, update date or release date appear within a certain range before the time character string: true if they do, false if not. banwords indicates whether obvious non-time keywords, such as a number, address or serial number, appear within a certain range before the time character string: true if they do, false if not. position takes a value in {0, 1, 2}: 0 means the time character string appears in the first 20% of the characters of the text, 2 means it appears in the last 20%, and 1 means it appears in the middle. In most cases the release time appears under the title or at the end of the text, and keywords and banwords are likewise markers that greatly improve the screening of times.
In Step5, the data later than the earliest reprinting time are removed, and the remaining data are then screened and sorted by the 4 given priorities according to the attributes set earlier.
Step6 verifies the release time obtained in Step5 against the time obtained from the url. If the two are consistent, that time has the highest weight and is taken as the release time. If they are not consistent, the time in the url is selected as the release time, since as a rule that time is correct in more than 90% of cases. If the url contains no time, the time with the highest priority in Step5 is the release time, and if that is also unavailable, the earliest reprinting time can be taken as the release time.
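The three Step4 attributes can be illustrated with a minimal sketch. The keyword lists, the 20-character look-behind window and the use of the match's start offset to decide the position are assumptions made for illustration; the source only names the attributes and gives examples of each kind of keyword.

```python
from dataclasses import dataclass

# Hypothetical keyword lists; the source only gives examples of each kind.
TIME_KEYWORDS = ("release time", "update date", "release date")
BAN_WORDS = ("number", "address", "serial number")


@dataclass
class TimeCandidate:
    text: str       # the matched time character string
    keywords: bool  # a time keyword appears shortly before the match
    banwords: bool  # an obvious non-time keyword appears shortly before the match
    position: int   # 0 = first 20% of the text, 2 = last 20%, 1 = middle


def tag_candidate(body: str, start: int, end: int, window: int = 20) -> TimeCandidate:
    """Set the three Step4 attributes for the time string body[start:end]."""
    # Only the `window` characters in front of the match are inspected (assumed size).
    before = body[max(0, start - window):start].lower()
    ratio = start / max(len(body), 1)
    position = 0 if ratio < 0.2 else 2 if ratio >= 0.8 else 1
    return TimeCandidate(
        text=body[start:end],
        keywords=any(k in before for k in TIME_KEYWORDS),
        banwords=any(b in before for b in BAN_WORDS),
        position=position,
    )
```

A candidate tagged this way carries everything the later screening and sorting in Step5 needs, so the text does not have to be rescanned.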
This embodiment combines the url with the earliest reprinting time. A time screened from the url using the earliest reprinting time is far more accurate than a time extracted from the text, because using the time as a request parameter in the url is common practice and the url contains little interfering time information, which further improves accuracy.
In addition, each time extracted from the text is further screened by setting the keywords, banwords and position attributes; this screening is more accurate than that of the original scheme, and it can also be used to verify the accuracy of the time extracted from the url.
Further, accuracy can be improved again by ranking the text times by weight according to the three attributes.
Meanwhile, this embodiment judges and verifies the url time and the text time against each other: the url time is preferred, then the time with the highest weight in the text, and if neither exists, the earliest reprinting time is selected.
Referring to fig. 2, which shows a schematic structural diagram of an apparatus for extracting webpage publishing time according to a second embodiment of the present application, the apparatus in this embodiment may include the following structures:
a first extracting unit 201, configured to extract a first time set from a webpage address of a target webpage for which publication time needs to be extracted, where the first time set includes time matched from the webpage address;
a reprint obtaining unit 202, configured to obtain the earliest reprint time of the web page content of the target web page;
a second extracting unit 203, configured to extract a second time set from the web content of the target web page, where the second time set includes a time matched from the web content;
a time determining unit 204, configured to determine a publishing time of the target webpage based on the first time set, the earliest reprinting time, and the second time set.
According to the above scheme, the device for extracting the webpage release time provided by the second embodiment of the present application determines the release time of the target webpage by combining the time extracted from the webpage address, the time extracted from the webpage content, and the earliest reprinting time. Because the time in a webpage address is usually used as a request parameter, the address contains little interfering time information, and the time extracted from the webpage content is screened accurately in combination with the time in the webpage address. The accuracy of the resulting release time is therefore far higher than that of a time extracted from the webpage text content alone, which achieves the purpose of the present embodiment.
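As a rough illustration of what the first extracting unit 201 does, the sketch below matches date-like strings in a webpage address. The regular expression (a four-digit year starting with 20, then month and day, with an optional "/", "-" or "_" separator) is an assumption chosen for the example and is not the matching rule of the application; the function name `extract_url_times` is likewise hypothetical.

```python
import re
from datetime import date

# Assumed pattern: yyyy, mm, dd with an optional "/", "-" or "_" separator,
# not embedded in a longer run of digits.
URL_DATE = re.compile(r"(?<!\d)(20\d{2})[/_-]?(\d{2})[/_-]?(\d{2})(?!\d)")


def extract_url_times(url: str) -> list[date]:
    """Return the calendar dates matched in a webpage address (the first time set)."""
    found = []
    for year, month, day in URL_DATE.findall(url):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Digits that fit the pattern but are not a real date are skipped.
            continue
    return found
```

Numeric strings that happen to form a valid date are still returned here; in the scheme described above they are filtered later by comparison with the earliest reprinting time.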
Referring to fig. 3, a schematic structural diagram of an apparatus for extracting a webpage publishing time according to a second embodiment of the present application is provided, where the time determining unit 204 may be implemented by the following structure:
a first deleting subunit 301, configured to delete a time later than the earliest reprinting time in the first time set, to obtain an updated first time set;
a second deleting subunit 302, configured to delete a time later than the earliest reprinting time in the second time set, and obtain an updated second time set;
a comprehensive determination subunit 303, configured to determine, if neither the updated first time set nor the updated second time set is empty, the publishing time of the target web page based on the time in the updated first time set and the time in the updated second time set.
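The two deleting subunits apply the same operation to different sets. A minimal sketch, assuming the times have already been parsed into `datetime` values (the function name `drop_later_than` is hypothetical):

```python
from datetime import datetime


def drop_later_than(times: list[datetime], earliest_reprint: datetime) -> list[datetime]:
    """Keep only the times that are not later than the earliest reprinting time."""
    return [t for t in times if t <= earliest_reprint]
```

Calling it once on the first time set and once on the second time set yields the updated first and second time sets; either result may be empty.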
As shown in fig. 4, the comprehensive determination subunit 303 may specifically include the following structure:
a matching judgment module 401, configured to judge whether a time matching the time in the updated second time set exists in the updated first time set, if so, trigger a first determination module 402, and otherwise, trigger a second determination module 403;
a first determining module 402, configured to delete a time in the updated first time set that does not match a time in the updated second time set, obtain a third time set, sort the times in the third time set based on a preset attribute priority of the time and an attribute of the time in the third time set, and determine a time with a highest attribute priority of the time in the third time set as the publishing time of the target web page.
A second determining module 403, configured to sort the times in the updated first time set in chronological order, and use the time ranked first as the publishing time of the target webpage.
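The interplay of modules 401 to 403 can be sketched as follows. Several points are assumptions made for illustration: two times "match" when they are the same calendar date, the text times are passed in already ordered by attribute priority (highest first), and the chronological sort is ascending so that the earliest time is ranked first.

```python
from datetime import date


def determine_combined(url_days: list[date], ranked_text_days: list[date]) -> date:
    """Pick the publishing time when both updated time sets are non-empty.

    ranked_text_days is assumed to be ordered by attribute priority, best first.
    """
    url_set = set(url_days)
    # Third time set: the times that appear in both the url and the text,
    # kept in attribute-priority order.
    third = [d for d in ranked_text_days if d in url_set]
    if third:
        return third[0]   # first determining module 402
    return min(url_days)  # second determining module 403 (earliest url time, assumed)
```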
An attribute determining subunit 304, configured to, if the updated first time set is empty and the updated second time set is not empty, sort the times in the updated second time set based on the attribute priority of the preset time and the attribute of the time in the updated second time set, and determine the time with the highest attribute priority in the updated second time set as the publishing time of the target web page.
The attribute of time refers to an area attribute of a character string area where a character string corresponding to time is located in the web page content of the target web page, and the area attribute includes: time keyword character strings and non-time character strings contained in the character string area, and position parameters of the character string area in the webpage content;
it should be noted that the priority of the non-time character string is higher than the priority of the time keyword character string, and the priority of the time keyword character string is higher than the priority of the position parameter;
for the non-time character string, an attribute value of no has higher priority than an attribute value of yes; for the time keyword character string, an attribute value of yes has higher priority than an attribute value of no; for the position parameter, a low parameter value has higher priority than a high parameter value.
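The priority order just described maps naturally onto a lexicographic sort key. The sketch below is one way to express it; the `Candidate` type and its field names are hypothetical and follow the keywords, banwords and position attributes used in the first embodiment.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    text: str       # the time character string
    banwords: bool  # non-time character string present in the area
    keywords: bool  # time keyword character string present in the area
    position: int   # position parameter: 0, 1 or 2


def priority_key(c: Candidate) -> tuple:
    # Compared left to right: banwords first (False wins), then keywords
    # (True wins, hence the negation), then position (lower wins).
    return (c.banwords, not c.keywords, c.position)


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates ordered from highest to lowest attribute priority."""
    return sorted(candidates, key=priority_key)
```

Because Python compares tuples element by element, a candidate without a non-time string always outranks one with it, regardless of the other two attributes.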
A reprint determining subunit 305, configured to, if the updated first time set and the updated second time set are both empty, use the earliest reprint time as the publishing time of the target webpage.
A front-back determining subunit 306, configured to sort the times in the updated first time set according to a time sequence order if the updated first time set is not empty and the updated second time set is empty, and use the time at the top of the sort as the publishing time of the target webpage.
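Taken together, subunits 303 to 306 cover the four combinations of empty and non-empty updated time sets. A minimal sketch of that dispatch, with a hypothetical `Branch` enumeration standing in for the subunits:

```python
from enum import Enum


class Branch(Enum):
    COMBINED = "comprehensive determination subunit 303"
    TEXT_ATTRIBUTES = "attribute determining subunit 304"
    EARLIEST_REPRINT = "reprint determining subunit 305"
    URL_ORDER = "front-back determining subunit 306"


def select_branch(first_updated: list, second_updated: list) -> Branch:
    """Choose the subunit from which of the two updated time sets are non-empty."""
    if first_updated and second_updated:
        return Branch.COMBINED
    if second_updated:
        return Branch.TEXT_ATTRIBUTES
    if first_updated:
        return Branch.URL_ORDER
    return Branch.EARLIEST_REPRINT
```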
The device for extracting the webpage release time in this embodiment may include a processor and a memory, wherein the first extraction unit, the second extraction unit, the reprint obtaining unit, the time determination unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize the corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of acquiring the webpage release time is improved by adjusting kernel parameters.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The device for extracting the webpage release time in this embodiment combines the url with the earliest reprinting time. A time screened from the url using the earliest reprinting time is far more accurate than a time extracted from the text, because using the time as a request parameter in the url is common practice and the url contains little interfering time information, which further improves accuracy. In addition, each time extracted from the text is further screened by the attributes set for it; this screening is more accurate than that of the original scheme, and it can also be used to verify the accuracy of the time extracted from the url. Further, accuracy can be improved again by ranking the text times by weight according to the three attributes. Meanwhile, this embodiment judges and verifies the url time and the text time against each other: the url time is preferred, then the time with the highest weight in the text, and if neither exists, the earliest reprinting time is selected.
The present application further provides a computer program product adapted to perform program code for initializing the following method steps when executed on a data processing device: extracting a first time set from a webpage address of a target webpage of which the release time needs to be extracted, wherein the first time set comprises time matched from the webpage address; acquiring the earliest reprinting time of the webpage content of the target webpage; extracting a second time set from the webpage content of the target webpage, wherein the second time set comprises the time matched from the webpage content; and determining the release time of the target webpage based on the first time set, the earliest reprinting time and the second time set.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (8)

1. A method for extracting webpage release time is characterized by comprising the following steps:
extracting a first time set from a webpage address of a target webpage of which the release time needs to be extracted, wherein the first time set comprises the time matched from the webpage address;
acquiring the earliest reprinting time of the webpage content of the target webpage;
extracting a second time set from the webpage content of the target webpage, wherein the second time set comprises the time matched from the webpage content;
determining a publication time of the target web page based on the first time set, the earliest reprinting time, and the second time set, including:
deleting the time later than the earliest reprinting time in the first time set to obtain an updated first time set; deleting the time later than the earliest reprinting time in the second time set to obtain an updated second time set; and if the updated first time set and the updated second time set are not empty, determining the release time of the target webpage based on the time in the updated first time set and the time in the updated second time set.
2. The method of claim 1, wherein determining the publication time of the target web page based on the updated time in the first set of times and the updated time in the second set of times comprises:
judging whether time matched with the time in the updated second time set exists in the updated first time set or not;
if so, deleting the time which is not matched with the time in the updated second time set in the updated first time set to obtain a third time set;
and sorting the time in the third time set based on the attribute priority of preset time and the attribute of the time in the third time set, and determining the time with the highest attribute priority of the time in the third time set as the publishing time of the target webpage.
3. The method of claim 2, wherein determining the publication time of the target web page based on the updated time in the first set of times and the updated time in the second set of times further comprises:
and if the time matched with the time in the updated second time set does not exist in the updated first time set, sequencing the time in the updated first time set according to the time sequence, and taking the time with the highest sequencing as the release time of the target webpage.
4. The method of claim 1, wherein determining the publication time of the target web page based on the first set of times, the earliest reprint time, and the second set of times, further comprises:
and if the updated first time set is empty and the updated second time set is not empty, sequencing the time in the updated second time set based on the attribute priority of the preset time and the attribute of the time in the updated second time set, and determining the time with the highest attribute priority in the updated second time set as the release time of the target webpage.
5. The method of claim 1, wherein determining the publication time of the target web page based on the first set of times, the earliest reprint time, and the second set of times, further comprises:
and if the updated first time set and the updated second time set are both empty, taking the earliest reprinting time as the release time of the target webpage.
6. The method of claim 1, wherein determining the publication time of the target web page based on the first set of times, the earliest reprint time, and the second set of times, further comprises:
and if the updated first time set is not empty and the updated second time set is empty, sequencing the time in the updated first time set according to the time sequence, and taking the time with the highest sequencing as the release time of the target webpage.
7. An apparatus for extracting a web page publishing time, comprising:
a first extraction unit, configured to extract a first time set from a webpage address of a target webpage of which the release time needs to be extracted, wherein the first time set comprises the time matched from the webpage address;
a reprint obtaining unit, configured to obtain the earliest reprint time of the web page content of the target web page;
the second extraction unit is used for extracting a second time set from the webpage content of the target webpage, wherein the second time set comprises the time matched from the webpage content;
a time determining unit, configured to determine a publishing time of the target webpage based on the first time set, the earliest reprinting time, and the second time set;
the time determination unit includes:
a first deleting subunit, configured to delete a time later than the earliest reprinting time in the first time set, and obtain an updated first time set;
a second deleting subunit, configured to delete a time later than the earliest reprinting time in the second time set, to obtain an updated second time set;
and the comprehensive determination subunit is configured to determine, if neither the updated first time set nor the updated second time set is empty, the publishing time of the target webpage based on the time in the updated first time set and the time in the updated second time set.
8. The apparatus of claim 7, wherein the comprehensive determination subunit comprises:
the matching judgment module is used for judging whether time matched with the time in the updated second time set exists in the updated first time set or not, and if so, triggering the first determination module;
a first determining module, configured to delete a time in the updated first time set that does not match a time in the updated second time set, obtain a third time set, sort the times in the third time set based on a preset attribute priority of the time and an attribute of the time in the third time set, and determine a time with a highest attribute priority of the time in the third time set as a publishing time of the target web page.
CN201611219056.1A 2016-12-26 2016-12-26 Method and device for extracting webpage release time Active CN108241674B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611219056.1A CN108241674B (en) 2016-12-26 2016-12-26 Method and device for extracting webpage release time

Publications (2)

Publication Number Publication Date
CN108241674A CN108241674A (en) 2018-07-03
CN108241674B true CN108241674B (en) 2021-11-02

Family

ID=62701431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611219056.1A Active CN108241674B (en) 2016-12-26 2016-12-26 Method and device for extracting webpage release time

Country Status (1)

Country Link
CN (1) CN108241674B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119484B (en) * 2019-03-27 2021-04-06 湖南星汉数智科技有限公司 Webpage release time extraction method and device, computer device and computer readable storage medium
CN114547497A (en) * 2022-02-24 2022-05-27 马上消费金融股份有限公司 Method and device for determining webpage release time, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070136305A1 (en) * 2005-12-14 2007-06-14 International Business Machines Corporation Method for synchronizing and updating bookmarks on multiple computer devices
CN104008213A (en) * 2014-06-24 2014-08-27 电子科技大学 Method and device for finding and counting webpage information updating
CN104182548A (en) * 2014-09-10 2014-12-03 北京国双科技有限公司 Webpage updating and processing method and device
CN104462151A (en) * 2013-09-25 2015-03-25 腾讯科技(深圳)有限公司 Method for evaluating web page publishing time and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
互联网上信息报道的最早发布时间检测;黄连恩等;《计算机科学与探索》;20090612;第51-59页 *

Also Published As

Publication number Publication date
CN108241674A (en) 2018-07-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant