CN106250512A - A kind of subject network information collecting method taking time intention into account - Google Patents

Info

Publication number
CN106250512A
Authority
CN
China
Prior art keywords
theme
time
web page
url
page contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610630419.4A
Other languages
Chinese (zh)
Other versions
CN106250512B (en)
Inventor
陈军
武昊
侯东阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NATIONAL GEOMATICS CENTER OF CHINA
Original Assignee
NATIONAL GEOMATICS CENTER OF CHINA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NATIONAL GEOMATICS CENTER OF CHINA
Priority to CN201610630419.4A
Publication of CN106250512A
Application granted
Publication of CN106250512B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/951 - Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A topic-focused web information collection method that takes temporal intent into account, used to order the collection of Internet web page information for a topic event, comprises the following steps. Step A: use prior data to determine the start time of the topic event and quantify its time distribution, obtaining a quantized value of the time distribution. Step B: represent the temporal intent and the general keywords of the topic with separate representations, and compute the time relevance and the general keyword relevance separately. Step C: based on the time relevance and general keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, obtain a URL priority assignment formula based on that quantized value, and compute the final URL priority. The method substantially increases the number of web pages discovered and the precision.

Description

A topic-focused web information collection method that takes temporal intent into account
Technical field
The present invention relates to the field of Internet web page search, in particular to topic crawling methods for obtaining web pages with specific content from the Internet, and more specifically to a topic-focused web information collection method that takes temporal intent into account.
Background
Topic crawling is a key technique for obtaining domain-specific web pages from the Internet; it aims to download as many web pages relevant to a specified topic as possible. Guided by the user-specified topic, it uses topic relevance computation and URL priority assignment as its main crawling strategies to continually obtain relevant web pages from ubiquitous network resources.
URL priority assignment based on web page content is the conventional approach in traditional topic crawling. It relies mainly on two relevance values: (1) the topic relevance of the parent page content: the higher this value, the higher the priority of the URLs the parent page contains; (2) the topic relevance of the anchor text: the relevance between the topic and the anchor text, the anchor text context, the URL string, and similar information, where the anchor text is usually a summary description of the page the URL points to.
In content-based URL priority assignment, the parent-page topic relevance and the anchor-text topic relevance are usually computed with the cosine formula. For example, if the parent-page topic relevance of a URL is sim(VDk,VTk) and its anchor-text topic relevance is sim(VAk,VTk), the priority Priority(URL) of the URL can be computed as follows:
Priority(URL) = θ × sim(VDk,VTk) + γ × sim(VAk,VTk) (1-1)
In the formula above, θ and γ are the decay factors (weights) of the parent-page topic relevance and the anchor-text topic relevance, respectively, and satisfy θ + γ = 1.
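As a minimal sketch of formula (1-1), the weighted sum can be written as a small function. The function name and the example relevance values are hypothetical; the two relevance values are assumed to have been computed elsewhere.

```python
def content_priority(sim_parent: float, sim_anchor: float,
                     theta: float, gamma: float) -> float:
    """Formula (1-1): weighted sum of parent-page and anchor-text topic relevance."""
    # The text requires the two decay factors to sum to 1.
    if abs(theta + gamma - 1.0) > 1e-9:
        raise ValueError("theta and gamma must sum to 1")
    return theta * sim_parent + gamma * sim_anchor


# Hypothetical relevance values, for illustration only.
priority = content_priority(sim_parent=0.8, sim_anchor=0.5, theta=0.4, gamma=0.6)
```

Rejecting weights that do not sum to 1 keeps the result in the same range as the two relevance values.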
When topic crawling is used to collect time-sensitive information about emergency events, temporal intent often acts as a constraint on the topic. According to the ISO 19100 series of standards (2002), temporal objects are divided into "instants" and "periods": an instant is a point in time space, while a period corresponds to a line in time space and has attributes such as a start point, an end point, and a length. In general, information about an emergency event is published on the web mainly after the event occurs, so the publication time of a report should be later than the start time of the event. Moreover, an emergency event goes through an evolution of emergence, development, change, and decline, and public attention to the event differs across these stages. Preferentially downloading information from periods of higher attention meets the needs of most people, and this attention pattern reflects, to some extent, the time distribution of the event. In other words, when collecting web information by topic, temporal intent (such as the start time and the time distribution) plays a clear role in judging information relevance and in prioritizing information discovery.
Although setting a start time can by itself filter out part of the irrelevant information when topic crawling is used to collect web information, and the time distribution can affect the priority order of information discovery, traditional web information collection methods still attend only to the ordinary semantics of the topic. They do not analyze or exploit the temporal intent of the topic and effectively treat the time distribution as uniform, which leads to low precision. Specifically:
(1) No representation for temporal intent: the traditional single-vector topic representation expresses only the keywords of the topic and provides no representation for its temporal intent;
(2) The role of the topic start time is weakened: the traditional topic relevance computation judges relevance to the topic from web page content alone, which weakens the role of the topic start time;
(3) The effect of the topic time distribution on the priority order of information discovery is ignored: traditional URL priority assignment methods currently use mainly web page content, anchor text and its context, the URL string, link relationships, and even the update time of web pages, but ignore the effect of the topic time distribution.
Summary of the invention
The technical problem to be solved by the present invention is to provide a topic-focused web information collection method that takes temporal intent into account, so as to reduce or avoid the problems described above.
To solve this technical problem, the invention provides a topic-focused web information collection method that takes temporal intent into account, used to order the collection of Internet web page information for a topic event. It comprises the following steps:
Step A: use prior data to determine the start time of the topic event and quantify its time distribution, obtaining a quantized value of the time distribution;
Step B: represent the temporal intent and the general keywords of the topic with separate representations, and compute the time relevance and the general keyword relevance separately;
Step C: based on the time relevance and general keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, and integrate it into the content-based URL priority assignment method, thereby obtaining a URL priority assignment formula based on the quantized time distribution and computing the final URL priority, so that URLs from periods of high attention receive higher priority.
Preferably, the prior data in step A is Google Trends data.
Preferably, in step B, the temporal intent in the topic is represented as follows:
Overall formal representation of the topic and web page content: given a topic T and web page content D, they are represented as follows.
T = <VTk, TST, TTD>
D = <VDk, TPT>
Here VTk, TST, and TTD denote the general keyword vector of the topic, the start and end times of the topic, and its time distribution, respectively; VDk and TPT denote the general keyword vector of the web page content and its publication time, respectively.
Formal representation of the topic: its general vector VTk, start and end times TST, and time distribution TTD are expressed by the following formulas.
VTk = {(k1,wTk1), (k2,wTk2), ..., (ks,wTks)}
TST = [tSTs, tSTe]
TTD = {<[tTDs1,tTDe1], λ1>, ..., <[tTDsr,tTDer], λr>}
Here ki denotes the i-th general keyword in the topic; wTki denotes the weight of general keyword ki; s denotes the number of general keywords in the topic; tSTs denotes the start time of the topic and tSTe its end time; <[tTDsi,tTDei], λi> denotes the i-th <period, search volume index> pair in the time distribution; tTDsi and tTDei are the start time and end time of the i-th period, and λi is the search volume index of the i-th period;
Formal representation of the web page content: its general vector VDk and publication time TPT are represented by the following formulas.
VDk = {(k1,wDk1), (k2,wDk2), ..., (ks,wDks)}
TPT = tPT
Here ki denotes the i-th general keyword in the web page content; wDki denotes the weight of general keyword ki; tPT denotes the publication time of the web page.
Preferably, in step B, the time relevance and the general keyword relevance are computed as follows:
The time relevance between the topic and the web page content is computed as follows:
$$\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 0, & t_{PT} < t_{STs} \\ 1, & t_{STs} \le t_{PT} \le t_{STe} \end{cases}$$
where sim(TPT,TST) denotes the time relevance value between the topic and the web page content;
The general topic relevance between the topic and the web page content is computed as follows:
$$\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Tki} \times w_{Dki}}{\sqrt{\sum_{i=1}^{s} w_{Tki}^{2}} \times \sqrt{\sum_{i=1}^{s} w_{Dki}^{2}}}$$
where sim(VDk,VTk) denotes the general topic relevance value between topic T and web page content D.
Preferably, the URL priority assignment formula in step C is:
where PriorityT(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing content-based URL priority assignment method; Pr(t/T) is the normalized value of the time distribution quantized value, which also represents the probability that a web page published at time t is relevant to topic T; the threshold takes a value in the interval from 0 to 1.
Preferably, the threshold is set to 0.4.
Preferably, the priority Priority(URL) obtained by the content-based URL priority assignment method is computed as:
Priority(URL) = θ × sim(VDk,VTk) + γ × sim(VAk,VTk)
where θ and γ are the decay factors of the parent-page topic relevance and the anchor-text topic relevance, respectively, and satisfy θ + γ = 1.
Preferably, the decay factor θ is set to 0.4 and γ is set to 0.6.
The method provided by the invention quantifies the start time and time distribution of the topic, formally represents temporal intent based on the international standard for time, and forms a multi-part representation composed independently of temporal intent and general keywords (non-temporal words). It then computes the time relevance and the general keyword relevance in separate steps, and finally integrates the quantized time distribution, as the variable of an increasing function, into the URL priority assignment method to compute the URL priority, which substantially increases the number of web pages discovered and the precision.
Detailed description
To make the technical features, purpose, and effects of the invention clearer, specific embodiments are now described.
The invention provides a topic-focused web information collection method that takes temporal intent into account, used to order the collection of Internet web page information for a topic event. It comprises the following steps:
Step A: use prior data to determine the start time of the topic event and quantify its time distribution, obtaining a quantized value of the time distribution;
The temporal intent of a topic is the set of temporal characteristics contained in the topic. The invention divides temporal intent into explicit temporal intent and latent temporal intent. Explicit temporal intent means the topic states a time limit outright; for example, the topic "earthquake in 2008" explicitly asks for earthquake information from 2008. Latent temporal intent means the topic specifies no time limit, but the event it describes implies one; for example, the topic "Wenchuan earthquake" implies the start time of the Wenchuan earthquake, May 12, 2008.
In topic-focused web information collection, the start time and the time distribution of the topic event play different roles. The temporal intent recognition of the invention therefore consists of two parts: identifying the start time of the topic event and identifying its time distribution.
In existing temporal information retrieval, the temporal intent of a query is identified mainly with the help of prior data, such as user search logs and annotated news corpora. On this basis, the invention also identifies the temporal intent of a topic with prior data. In one specific embodiment, the prior data used is Google Trends data.
Google Trends data is the search volume index of a query term over a past period. It is not the raw search volume but a value normalized against total search volume. After normalization, Google Trends values range from 0 to 100, and a larger value indicates a larger search volume. Google Trends data has been widely used in areas such as disease prediction, conservation biology, and online public opinion. The reason is that Google Trends data reflects how much attention users pay to the content a query term refers to: the larger the search volume, the more people are paying attention, and the more people are paying attention, the more likely it is that an event related to that content has occurred. The invention uses this property of Google Trends data to identify the temporal intent of land cover topic events, in two main steps:
(1) Identify the start time of the topic event, based mainly on the change of the search volume index in Google Trends data from absent to present. Following the evolution of an event through emergence, development, change, and decline, few users pay attention to a topic before its event occurs, so its search volume does not reach the level at which Google Trends records statistics. In practice, the Google Trends-based start time identification method only handles topics whose search volume index in the initial period is 0. There are two reasons. First, not every topic has a definite start time (for example, the topic "earthquake" does not refer to one specific event and has no specific start time), and the initial search volume index of such topics is not 0. Second, Google Trends data itself is limited: its statistics begin in January 2004, so the initial search volume index of topics that arose before 2004 and continued into 2004 is not 0. The identified topic start time is the first moment at which the search volume index in Google Trends data exceeds 0.
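The start time rule in item (1) can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name, the (period start, index) input layout, and the sample series are all hypothetical, and no real Google Trends API is called.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def identify_start_time(series: Sequence[Tuple[date, int]]) -> Optional[date]:
    """Return the first period whose search volume index exceeds 0.

    `series` is assumed to be chronologically ordered (period_start, index) pairs.
    Topics whose initial index is not 0 are not handled and yield None.
    """
    if not series or series[0][1] != 0:
        return None
    for period_start, index in series:
        if index > 0:
            return period_start
    return None  # the index never rose above 0


# Made-up monthly series for illustration only.
sample = [
    (date(2008, 3, 1), 0),
    (date(2008, 4, 1), 0),
    (date(2008, 5, 1), 100),
    (date(2008, 6, 1), 40),
]
start = identify_start_time(sample)
```

Returning None for a series that does not begin at 0 mirrors the restriction that the method only handles topics with a definite start inside the data's coverage.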
(2) Quantify the time distribution of the topic event, represented directly by the change of the search volume index in Google Trends data; that is, the search volume index is used to quantify the time distribution. This works because Google Trends data inherently reflects how attention to the topic on the Internet changes across periods, which is the time distribution of the topic event.
The start time identification method yields the corresponding start time; temporal intent recognition based on Google Trends data can identify the start time of a topic event approximately. For example, the topic "Wenchuan earthquake" received heavy user attention from May 2008 to December 2008 and received attention again in May 2009, the anniversary month, which is consistent with the evolution of the event. This shows that quantifying the time distribution of a topic event directly from Google Trends data is reasonable.
In addition, the Baidu Index can also serve as prior data for identifying temporal intent. Like Google Trends data, it is built on the query logs of the general search engine Baidu and reflects user attention and media attention to different topic query terms over a past period. The temporal intent recognition method based on the Baidu Index is similar to the one based on Google Trends data and is not repeated here.
Step B, topic representation and relevance computation that take temporal intent into account: represent the temporal intent and the general keywords of the topic with separate representations, and compute the time relevance and the general keyword relevance separately;
Existing topic-focused web information collection usually represents a topic that contains temporal intent with the traditional single vector, which cannot capture the start time or the time distribution. The method of the invention therefore uses different forms to represent the general keywords of the topic, the start and end times of the topic, the time distribution of the topic, and the general keywords and publication time of the web page content. Specifically:
(1) General keywords are represented with a single vector: the general keywords of the topic and the web page content are represented as <keyword, weight> pairs. The dimension of the vector depends on the number of topic keywords; as long as the topic is unchanged, the dimension is fixed.
(2) Temporal intent is represented based on the international standard for time, which divides time into instants and periods. The start time of the topic and the publication time of the web page content are each a point in time and are represented as instants. For ease of computation, the invention uses a period to represent the start time and end time of the topic (the start and end times). The time distribution reflects how attention to the event changes across time ranges, so it is represented as <period, search volume index> pairs, where the period corresponds to the time range and the search volume index corresponds to the attention value of the topic event. In particular, to save storage space, moments whose search volume index is 0 are not represented.
Their formal representations are as follows:
(1) Overall formal representation of the topic and web page content: given a topic T and web page content D, they can be represented by the following formulas.
T=< VTk,TST,TTD> (1-2)
D=< VDk,TPT> (1-3)
In the formulas, VTk, TST, and TTD denote the general keyword vector of the topic, the start and end times of the topic, and its time distribution, respectively; VDk and TPT denote the general keyword vector of the web page content and its publication time, respectively.
(2) Formal representation of the topic: its general vector VTk, start and end times TST, and time distribution TTD can be expressed by the following formulas.
VTk = {(k1,wTk1), (k2,wTk2), ..., (ks,wTks)} (1-4)
TST = [tSTs, tSTe] (1-5)
TTD = {<[tTDs1,tTDe1], λ1>, ..., <[tTDsr,tTDer], λr>} (1-6)
In the formulas, ki denotes the i-th general keyword in the topic; wTki denotes the weight of general keyword ki; s denotes the number of general keywords in the topic; tSTs denotes the start time of the topic, specified by the user or identified by the method in step A; tSTe denotes the end time of the topic, specified by the user or defaulting to infinity; <[tTDsi,tTDei], λi> denotes the i-th <period, search volume index> pair in the time distribution; tTDsi and tTDei are the start time and end time of the i-th period, and λi is the search volume index of the i-th period. These parameters can be obtained from the prior data used in step A (for example Google Trends data), omitting periods whose search volume index is 0;
(3) Formal representation of the web page content: its general vector VDk and publication time TPT are represented by the following formulas.
VDk = {(k1,wDk1), (k2,wDk2), ..., (ks,wDks)} (1-7)
TPT = tPT (1-8)
In the formulas, ki denotes the i-th general keyword in the web page content; wDki denotes the weight of general keyword ki; tPT denotes the publication time of the web page.
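The representations in formulas (1-2) to (1-8) map naturally onto a few record types. The sketch below is one possible encoding, not the patent's data model: the class names, field names, keyword weights, and index values are hypothetical, and an end time of None stands in for the default of infinity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Tuple


@dataclass
class Period:
    start: date   # tTDsi
    end: date     # tTDei
    index: int    # search volume index λi


@dataclass
class Topic:
    keywords: Dict[str, float]                 # VTk as keyword -> weight
    start: date                                # tSTs
    end: Optional[date] = None                 # tSTe; None stands for infinity
    distribution: List[Period] = field(default_factory=list)  # TTD


@dataclass
class Page:
    keywords: Dict[str, float]   # VDk as keyword -> weight
    published: date              # tPT


def build_distribution(raw: List[Tuple[date, date, int]]) -> List[Period]:
    """Build TTD, omitting periods whose search volume index is 0."""
    return [Period(s, e, lam) for s, e, lam in raw if lam != 0]


# Made-up values for illustration only.
raw = [
    (date(2008, 4, 1), date(2008, 4, 30), 0),
    (date(2008, 5, 1), date(2008, 5, 31), 100),
    (date(2008, 6, 1), date(2008, 6, 30), 40),
]
topic = Topic(keywords={"earthquake": 1.0}, start=date(2008, 5, 12),
              distribution=build_distribution(raw))
```

Keeping the time fields separate from the keyword vector reflects the point of the representation: time can be compared directly without passing through keyword weighting.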
The weights of general keywords in the topic and the web page content can be computed with existing techniques, for example the method provided in "Wu H, Chen J, et al. A Focused Crawler for Borderlands Situation Information with Geographical Properties of Place Names [J]. Sustainability, 2014, 6(10): 6529-6552."
As described in the background, traditional topic relevance computation judges whether a page is relevant to the topic from web page content alone. This weakens the ability of the topic start time to filter out part of the irrelevant information on its own, easily leads to misjudging some information, and reduces the precision of topic crawling. Building on the traditional vector space model, the invention starts from two aspects, the start time and the general keywords, and uses a two-step method to judge the relevance between web page content and the topic, providing a new topic relevance computation strategy that takes the start time into account. The computation has two main steps:
(1) Compute the time relevance between the topic and the web page content. Because the topic start time can by itself filter out part of the irrelevant information, comparing the publication time of the web page content with the start and end times of the topic gives a preliminary judgment of whether the page is relevant. The time relevance can therefore be computed with the following formula.
$$\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 0, & t_{PT} < t_{STs} \\ 1, & t_{STs} \le t_{PT} \le t_{STe} \end{cases} \tag{1-9}$$
In the formula, sim(TPT,TST) denotes the time relevance value between the topic and the web page content; the other parameters are as defined above. A time relevance of 0 means the web page content is irrelevant to the topic, and the page should be discarded during crawling. A time relevance of 1 means the web page content may be relevant to the topic, and its final relevance must be determined further from the page content. Therefore, the general topic relevance is computed only when the time relevance is 1.
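A minimal sketch of the time relevance check follows. It reads the second case of the formula as the publication time falling inside the topic's start and end times. The formula does not say what happens when a page is published after the end time; returning 0 in that case is an assumption of this sketch, as are the function and parameter names.

```python
from datetime import date
from typing import Optional


def time_relevance(published: date, start: date, end: Optional[date] = None) -> int:
    """Return 1 if the page may be relevant by time, 0 if it should be discarded.

    `end=None` stands for the default end time of infinity.
    """
    if published < start:
        return 0  # published before the event started
    if end is not None and published > end:
        return 0  # assumption: pages after the end time are treated as irrelevant
    return 1


# Illustration: a page published a week after the topic start time.
rel = time_relevance(date(2008, 5, 20), start=date(2008, 5, 12))
```

Because this check is a pair of date comparisons, it is cheap enough to run before any keyword computation.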
(2) Compute the general topic relevance between the topic and the web page content. The general keywords of the topic and the web page content are still represented with a single vector, and their relevance can be computed with the traditional cosine formula, as shown below.
$$\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Tki} \times w_{Dki}}{\sqrt{\sum_{i=1}^{s} w_{Tki}^{2}} \times \sqrt{\sum_{i=1}^{s} w_{Dki}^{2}}} \tag{1-10}$$
In the formula, sim(VDk,VTk) denotes the general topic relevance value between topic T and web page content D; the other parameters are as defined above. If sim(VDk,VTk) is greater than or equal to a given threshold, the web page content is judged relevant to the topic; otherwise it is judged irrelevant and the page is discarded.
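The two-step judgment can be sketched as a cosine computation gated by the time relevance. The function names, the dictionary encoding of the vectors, and the sample weights are hypothetical; the page vector is evaluated over the topic's s keywords, so a keyword missing from the page counts as weight 0.

```python
import math
from typing import Dict


def cosine_relevance(topic_w: Dict[str, float], page_w: Dict[str, float]) -> float:
    """Cosine similarity over the topic's keywords (formula 1-10)."""
    dot = sum(w * page_w.get(k, 0.0) for k, w in topic_w.items())
    norm_t = math.sqrt(sum(w * w for w in topic_w.values()))
    norm_d = math.sqrt(sum(page_w.get(k, 0.0) ** 2 for k in topic_w))
    if norm_t == 0.0 or norm_d == 0.0:
        return 0.0  # no shared keywords or an empty vector
    return dot / (norm_t * norm_d)


def is_relevant(time_rel: int, topic_w: Dict[str, float],
                page_w: Dict[str, float], threshold: float) -> bool:
    """Step 1: time filter. Step 2: keyword relevance against the threshold."""
    if time_rel == 0:
        return False  # discarded without computing keyword relevance
    return cosine_relevance(topic_w, page_w) >= threshold


# Made-up weights for illustration only.
topic_w = {"wenchuan": 1.0, "earthquake": 1.0}
page_w = {"earthquake": 1.0}
score = cosine_relevance(topic_w, page_w)
```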
In the topic relevance computation strategy that takes the start time into account, the time relevance is computed first because it is relatively simple to compute.
Step C: based on the time relevance and general keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, and integrate it into the content-based URL priority assignment method, thereby obtaining a URL priority assignment formula based on the quantized time distribution, so that URLs from periods of high attention receive higher priority and the problem of treating the time distribution as uniform is solved.
In topic-focused web information collection, the time distribution of the topic affects the priority order of information discovery. Specifically, if more relevant web pages exist at the publication time t of the web page content corresponding to a URL, then, given the topic T, the probability Pr(t/T) that web page content published at time t is relevant to topic T is larger, and URLs from that time should have higher priority. Existing URL priority assignment methods do not take this property into account.
To solve this problem, the invention provides a URL priority assignment method based on the quantized value of the time distribution (the search volume index in the Google Trends data described above). The process is:
First, construct an increasing function with the quantized value as the independent variable. The quantized value of the time distribution reflects, to some extent, the number of relevant web pages published in a period, and the quantized value tends to be proportional to the number of relevant pages: the larger the quantized value, the more relevant pages were published. An increasing function captures exactly this property. The invention therefore constructs an exponential function with the natural constant e as the base and the quantized value of the time distribution as the exponent (the natural exponential function).
Then, fuse the increasing function with the content-based URL priority assignment method. Before fusion, the method first computes the content priority with the content-based URL priority assignment method, and fuses only when this value is greater than or equal to a given threshold. This ensures that the time distribution affects only the discovery order of URLs of relevant web pages and does not raise the discovery order of URLs of irrelevant web pages. In the fusion, the invention multiplies the content priority by the increasing function.
Finally, the formula for URL priority assignment based on the quantized time distribution is as follows.
In the formula, PriorityT(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing content-based URL priority assignment method, which can be computed with formula (1-1) given in the background; Pr(t/T) is the normalized value of the time distribution quantized value, which also represents the probability that a web page published at time t is relevant to topic T. The threshold in this formula takes a value in the interval from 0 to 1: when it is 1, the URL priority is always computed in the traditional way; when it is 0, the URL priority is always computed with the method that incorporates the time distribution.
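The displayed priority formula does not survive in this text, so the sketch below reconstructs it from the surrounding prose only: the content priority is multiplied by the natural exponential of Pr(t/T) when it reaches the threshold, and is left unchanged otherwise. The exact form is therefore an assumption, and the function and parameter names are hypothetical.

```python
import math


def final_priority(content_priority: float, pr_t: float, threshold: float) -> float:
    """Fuse the content priority with the increasing function e**Pr(t/T).

    Assumed form, reconstructed from the prose description:
      content_priority * e**pr_t   if content_priority >= threshold
      content_priority             otherwise
    """
    if content_priority >= threshold:
        return content_priority * math.exp(pr_t)
    return content_priority


# With the preferred threshold of 0.4, a URL below the threshold is not boosted.
low = final_priority(0.3, pr_t=0.9, threshold=0.4)
high = final_priority(0.5, pr_t=0.9, threshold=0.4)
```

Because e**x is at least 1 for x in [0, 1], the fusion never lowers the priority of a URL that passes the threshold; it only reorders such URLs in favor of periods of high attention.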
In a preferred embodiment, the calculation of the URL priority assignment method based on the time-distribution quantized value is divided into six main steps, as follows:
(1) Quantify the time distribution of the theme. The time distribution of a theme can be obtained from Google Trends data; its quantized value is the search volume index in the Google Trends data.
(2) Estimate the publication time t of the web page content corresponding to the URL to be downloaded.
During information discovery, the publication time of the web page content corresponding to a URL to be downloaded is unknown. In the present invention, it is estimated mainly in two ways:
1) Estimation from the URL string: when the URL string to be downloaded itself contains time information (for example, "20080905" in "http://news.sohu.com/20080905/n259388056.shtml" is the publication time of the web page the URL points to), the time is extracted with a suitable time regular expression and taken as the publication time of the web page content corresponding to the URL to be downloaded;
2) Estimation from the time of the parent page content: when the URL string to be downloaded contains no time information, the publication time of the parent page of the URL is used as the publication time of the corresponding web page content. On the one hand, the publication time of the parent page is usually slightly later than or equal to the publication time of the web page content the URL points to, and each period in the Google Trends data spans a relatively long interval. On the other hand, this assumption does not affect the relevance value between the corresponding web page and the theme; it only affects the discovery order of the URL.
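The two estimation rules in step (2) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `estimate_publish_date`, the regular expression (an 8-digit YYYYMMDD path segment, modelled only on the sohu.com example), and the `parent_date` parameter are all assumptions.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: a YYYYMMDD path segment such as ".../20080905/...".
# Real sites use many other date layouts; this covers only the example above.
_DATE_IN_URL = re.compile(r"/((?:19|20)\d{2})(\d{2})(\d{2})(?=/|$)")


def estimate_publish_date(url: str, parent_date: Optional[date]) -> Optional[date]:
    """Estimate the publication date of the page behind `url`.

    Rule 1: use a date embedded in the URL string if one is found.
    Rule 2: otherwise fall back to the parent page's publication date.
    """
    for year, month, day in _DATE_IN_URL.findall(url):
        try:
            return date(int(year), int(month), int(day))
        except ValueError:
            # Eight digits that are not a valid calendar date; keep looking.
            continue
    return parent_date
```

For the example URL in the text the sketch yields 5 September 2008; for a URL without an embedded date it returns whatever parent date the caller supplies.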
(3) Normalize the time-distribution quantized value Pr(t/T). As described above, it suffices to obtain the search volume index of the period that contains time t and normalize it, as shown in the following formula.
The parameters in the formula are as described above.
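Step (3) can be illustrated with a short sketch. The lookup of the period containing t follows the text; the normalization itself (division by the largest index in the distribution) is an assumption made here for illustration, and the names `Period` and `normalized_time_weight` are hypothetical.

```python
from datetime import date
from typing import List, Tuple

# One <period, search volume index> pair: (start, end, index).
Period = Tuple[date, date, float]


def normalized_time_weight(t: date, distribution: List[Period]) -> float:
    """Return a value in [0, 1] for the period that contains `t`.

    Assumption: normalization divides by the peak index of the distribution.
    """
    peak = max((index for _, _, index in distribution), default=0.0)
    if peak <= 0:
        return 0.0
    for start, end, index in distribution:
        if start <= t <= end:
            return index / peak
    # `t` lies outside every period of the distribution.
    return 0.0
```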
(4) Compute the anchor-text theme relevance value sim(VAk,VTk) of the URL to be downloaded. The anchor-text vector (composed of the anchor text, its context, and the URL string information) is given by the following formula:
VAk={ (k1,wAk1),(k2,wAk2),...,(ks,wAks)} (1-13)
The anchor-text theme relevance value is given by the following formula.
$$\mathrm{sim}(V_{Ak}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Tki} \times w_{Aki}}{\sqrt{\sum_{i=1}^{s} w_{Tki}^{2} \times \sum_{i=1}^{s} w_{Aki}^{2}}} \qquad (1\text{-}14)$$
In the formula, VAk denotes the anchor-text vector; wAki denotes the weight of general keyword ki in the anchor text; the other parameters are as described above.
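Assuming formula (1-14) is the usual cosine similarity over the s theme keywords, the anchor-text relevance can be sketched as below. The dictionary representation of the weighted keyword vectors and the function name `anchor_theme_relevance` are illustrative assumptions.

```python
import math
from typing import Dict


def anchor_theme_relevance(anchor: Dict[str, float], theme: Dict[str, float]) -> float:
    """Cosine similarity between anchor-text and theme keyword weights.

    Both sums run over the theme's keywords; a keyword that is missing
    from the anchor text contributes a weight of 0.
    """
    dot = sum(w_t * anchor.get(k, 0.0) for k, w_t in theme.items())
    theme_sq = sum(w_t * w_t for w_t in theme.values())
    anchor_sq = sum(anchor.get(k, 0.0) ** 2 for k in theme)
    denom = math.sqrt(theme_sq * anchor_sq)
    return dot / denom if denom else 0.0
```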
(5) Compute the content priority Priority(URL) of the URL to be downloaded, using the formula given in the background art. Because the anchor text is the parent page's direct description of the URL to be downloaded, it is more important than the content of the parent page; the present invention therefore sets the decay factors θ and γ in the formula to 0.4 and 0.6, respectively.
(6) Compute the final priority of the URL to be downloaded using formula (1-11). Based on experimental analysis, the present invention sets the threshold in formula (1-11) to 0.4.
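Steps (5) and (6) can be sketched as follows. The weighted sum and the values θ = 0.4, γ = 0.6, and threshold = 0.4 come from the text; the increasing function is deliberately left as a caller-supplied parameter, and the `1 + p` boost in the usage lines is a hypothetical placeholder, not the function used by the invention.

```python
from typing import Callable

THETA = 0.4      # decay factor for parent-page content relevance
GAMMA = 0.6      # decay factor for anchor-text relevance
THRESHOLD = 0.4  # content-priority threshold for fusion


def content_priority(sim_parent: float, sim_anchor: float,
                     theta: float = THETA, gamma: float = GAMMA) -> float:
    """Content-based priority: theta * sim(VDk, VTk) + gamma * sim(VAk, VTk)."""
    return theta * sim_parent + gamma * sim_anchor


def final_priority(priority: float, pr_t: float,
                   increasing_fn: Callable[[float], float],
                   threshold: float = THRESHOLD) -> float:
    """Multiply by the increasing function only when the content priority
    reaches the threshold; otherwise keep the content priority unchanged."""
    if priority >= threshold:
        return priority * increasing_fn(pr_t)
    return priority


# Usage with a placeholder increasing function (an assumption for illustration).
boost = lambda p: 1.0 + p
relevant = final_priority(content_priority(0.5, 0.8), 0.5, boost)
irrelevant = final_priority(content_priority(0.2, 0.3), 0.5, boost)
```

The gate keeps the time distribution from promoting URLs whose content priority is already below the threshold, which is the behaviour the fusion step is meant to guarantee.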
In a specific embodiment, the aim is to discover as much network change information with temporal characteristics as possible while downloading as little irrelevant information as possible. The basic procedure may comprise the following five steps:
(1) Preparation: the user supplies the content theme and the initial URLs relevant to the theme. The time intention recognition method based on Google Trends data is then used to determine the start time of the theme and to quantify its time distribution.
(2) Page request and parsing: using the HTTP protocol, request from the Internet the initial URL or the highest-priority URL in the URL priority queue to obtain the web page content corresponding to that URL. Then, according to the Document Object Model (DOM) of the page, parse out the page's title, body text, publication time, URLs to be downloaded, and anchor text information.
(3) Theme relevance calculation: first, from the theme start time and the web page publication time obtained in steps (1) and (2), use formulas (1-2) to (1-6) to represent the theme's start and end time, general keywords, and time distribution, and the web page content's general keywords and publication time; next, use formula (1-9) to compute their time relevance and filter out web page content that stands in a Before temporal relation to the theme; then use formula (1-10) to compute the general theme relevance value. If the relevance value is greater than or equal to a given threshold, the page is saved to the web page repository; otherwise the page is judged irrelevant to the theme and discarded.
(4) URL priority assignment: compute the URL priority according to formulas (1-11) to (1-14), then place the URL in the URL priority queue according to that priority value.
(5) Repeat steps (2), (3), and (4) until the URL priority queue is empty or a given stopping condition is reached.
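The five steps above can be sketched as a priority-queue loop. This is an outline under stated assumptions: `fetch_and_parse`, `is_relevant`, and `score_url` are hypothetical callables standing in for steps (2), (3), and (4); the stopping condition is a simple page budget; and links found on discarded pages are not enqueued, which the text does not specify.

```python
import heapq
from typing import Callable, List, Set, Tuple


def crawl(seed_urls: List[str],
          fetch_and_parse: Callable[[str], Tuple[object, List[str]]],
          is_relevant: Callable[[object], bool],
          score_url: Callable[[str, object], float],
          max_pages: int = 100) -> List[str]:
    """Return the URLs of pages judged relevant, in discovery order."""
    # heapq is a min-heap, so priorities are negated to pop the highest first.
    queue: List[Tuple[float, str]] = [(-1.0, url) for url in seed_urls]
    heapq.heapify(queue)
    seen: Set[str] = set(seed_urls)
    saved: List[str] = []
    fetched = 0
    while queue and fetched < max_pages:        # step (5): loop condition
        _, url = heapq.heappop(queue)
        page, links = fetch_and_parse(url)      # step (2): request and parse
        fetched += 1
        if not is_relevant(page):               # step (3): relevance filter
            continue                            # assumption: drop its links too
        saved.append(url)
        for link in links:                      # step (4): priority assignment
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-score_url(link, page), link))
    return saved
```

Injecting the three callables keeps the loop independent of any particular HTTP client, DOM parser, or relevance formula.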
With identical hardware and network bandwidth, the method provided by the present invention captures 10%-30% more web pages than existing subject network information collection methods and improves precision by about 10%.
The subject network information collection method that takes time intention into account, as provided by the present invention, quantifies the start time and time distribution of the theme, formally expresses the time intention on the basis of an international time standard, and forms a multi-component representation in which the time intention and the general keywords (non-temporal words) are expressed independently. It then computes the time relevance and the general keyword relevance in separate steps, and finally incorporates the quantized time distribution, as the variable of an increasing function, into the URL priority assignment method to compute the URL priority, which substantially increases the number of web pages discovered and the precision.
Those skilled in the art should understand that, although the present invention is described in terms of multiple embodiments, not every embodiment contains only one independent technical solution. The description is written this way only for clarity; those skilled in the art should treat the description as a whole and regard the technical solutions in the embodiments as combinable with one another into different embodiments when understanding the scope of protection of the present invention.
The foregoing describes only illustrative specific embodiments of the present invention and is not intended to limit its scope. Any equivalent changes, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (8)

1. A subject network information collection method that takes time intention into account, used to gather Internet web page information on subject events, characterized in that it comprises the following steps:
Step A: use prior data to determine the start time of the subject event and quantify its time distribution, obtaining a quantized value of the time distribution;
Step B: use different representation methods to represent the time intention and the general keywords in the theme separately, and compute the time relevance and the general keyword relevance separately;
Step C: based on the time relevance and general keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, and incorporate it into the URL priority assignment method based on web page content, thereby obtaining a URL priority assignment formula based on the time-distribution quantized value and computing the final URL priority, so that URLs from the moments that attract attention obtain a higher priority.
2. The method according to claim 1, characterized in that the prior data in step A is Google Trends data.
3. The method according to claim 1, characterized in that, in step B, the time intention in the theme is expressed as follows:
Overall formal expression of the theme and the web page content: given a theme T and web page content D, they are represented as follows.
T=< VTk,TST,TTD>
D=< VDk,TPT>
where VTk, TST, and TTD denote the general vector of the theme, the start and end time of the theme, and its time distribution, respectively; VDk and TPT denote the general vector of the web page content and its publication time, respectively.
Formal expression of the theme: its general vector VTk, start and end time TST, and time distribution TTD are expressed by the following formulas.
VTk={ (k1,wTk1),(k2,wTk2),...,(ks,wTks)}
TST=[tSTs,tSTe]
TTD={ < [tTDs1,tTDe1], λ1>,...,<[tTDsr,tTDer], λr>}
where ki denotes the i-th general keyword in the theme; wTki denotes the weight of general keyword ki; s denotes the number of general keywords in the theme; tSTs denotes the start time of the theme and tSTe denotes its end time; <[tTDsi,tTDei], λi> denotes the i-th <period, search volume index> pair in the time distribution; tTDsi and tTDei are the start time and end time of the i-th period, respectively, and λi is the search volume index value of the i-th period;
Formal expression of the web page content: its general vector VDk and publication time TPT are represented by the following formulas.
VDk={ (k1,wDk1),(k2,wDk2),...,(ks,wDks)}
TPT=tPT
where ki denotes the i-th general keyword in the web page content; wDki denotes the weight of general keyword ki; tPT denotes the publication time of the web page.
4. The method according to any one of claims 1-3, characterized in that, in step B, the formulas for computing the time relevance and the general keyword relevance are as follows:
The time relevance between the theme and the web page content is computed as follows:
where sim(TPT,TST) denotes the time relevance value between the theme and the web page content;
The general theme relevance between the theme and the web page content is computed as follows:
where sim(VDk,VTk) denotes the general theme relevance value between theme T and web page content D.
5. The method according to claim 1, characterized in that the URL priority assignment formula in step C is:
where PriorityT(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing URL priority assignment method based on web page content; Pr(t/T) is the normalized value of the time-distribution quantized value and also represents the probability that a web page published at time t is relevant to theme T; and the threshold takes a value in the interval 0 to 1.
6. The method according to any one of claims 1-5, characterized in that the threshold is set to 0.4.
7. The method according to any one of claims 1-5, characterized in that the priority Priority(URL) obtained by the URL priority assignment method based on web page content is computed as:
Priority (URL)=θ × sim (VDk,VTk)+γ×sim(VAk,VTk)
where θ and γ denote the decay factors of the parent-page content theme relevance and the anchor-text theme relevance, respectively, and satisfy θ+γ=1.
8. The method according to claim 7, characterized in that the decay factor θ is set to 0.4 and γ is set to 0.6.
CN201610630419.4A 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account Active CN106250512B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610630419.4A CN106250512B (en) 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610630419.4A CN106250512B (en) 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account

Publications (2)

Publication Number Publication Date
CN106250512A true CN106250512A (en) 2016-12-21
CN106250512B CN106250512B (en) 2019-07-26

Family

ID=57605946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610630419.4A Active CN106250512B (en) 2016-08-04 2016-08-04 A kind of subject network information collecting method for taking time intention into account

Country Status (1)

Country Link
CN (1) CN106250512B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114417200A (en) * 2022-01-04 2022-04-29 马上消费金融股份有限公司 Network data acquisition method and device and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231640A (en) * 2007-01-22 2008-07-30 北大方正集团有限公司 Method and system for automatically computing subject evolution trend in the internet
CN102073730A (en) * 2011-01-14 2011-05-25 哈尔滨工程大学 Method for constructing topic web crawler system
CN103631856A (en) * 2013-10-17 2014-03-12 四川大学 Subject visualization method for Chinese document set
WO2015195846A1 (en) * 2014-06-19 2015-12-23 Quixey, Inc. Techniques for focused crawling
CN105528422A (en) * 2015-12-07 2016-04-27 中国建设银行股份有限公司 Focused crawler processing method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
余浩: "基于网络信息检索的网页文本抽取和处理的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Also Published As

Publication number Publication date
CN106250512B (en) 2019-07-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant