CN106250512A

CN106250512A - A topic-focused web information collection method that takes temporal intent into account

Info

Publication number: CN106250512A
Application number: CN201610630419.4A
Authority: CN
Inventors: 陈军; 武昊; 侯东阳
Original assignee: NATIONAL GEOMATICS CENTER OF CHINA
Current assignee: NATIONAL GEOMATICS CENTER OF CHINA
Priority date: 2016-08-04
Filing date: 2016-08-04
Publication date: 2016-12-21
Anticipated expiration: 2036-08-04
Also published as: CN106250512B

Abstract

A topic-focused web information collection method that takes temporal intent into account, used to collect and rank Internet web page information for a topic event, comprises the following steps. Step A: use prior data to determine the start time of the topic event and quantify its temporal distribution, obtaining a quantized value of the temporal distribution. Step B: represent the temporal intent and the ordinary keywords of the topic separately, using different representation methods, and compute the time relevance and the ordinary keyword relevance separately. Step C: based on the time relevance and ordinary keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the temporal distribution obtained in step A, thereby obtaining a URL priority assignment formula based on the quantized temporal distribution, and compute the final URL priority. The method provided by the invention greatly increases the number of web pages discovered and the precision.

Description

A topic-focused web information collection method that takes temporal intent into account

Technical field

The present invention relates to the field of Internet web page search, in particular to topic crawling methods for obtaining web pages with specific content from the Internet, and specifically to a topic-focused web information collection method that takes temporal intent into account.

Background technology

Topic crawling is a key technique for obtaining domain-specific web pages from the Internet; its aim is to download as many pages relevant to a specified topic as possible. Working from a topic specified by the user, it uses topic relevance computation and URL priority assignment as its main crawling strategies and continuously obtains information on relevant pages from ubiquitous network resources.

URL priority assignment based on web page content is the method commonly used in traditional topic crawling. It relies mainly on two relevance values: (1) the topic relevance of the parent page content: the higher this value, the higher the priority of the URLs contained in the parent page; (2) the topic relevance of the anchor text: the relevance between the topic and information such as the anchor text, the anchor text context and the URL string, where the anchor text is usually a summary description of the page content that the URL points to.

In URL priority assignment based on web page content, the parent page topic relevance and the anchor text topic relevance are usually computed with the cosine formula. For example, if the parent page topic relevance of a URL is sim(V_Dk,V_Tk) and its anchor text topic relevance is sim(V_Ak,V_Tk), the priority Priority(URL) of this URL can be computed as follows:

Priority (URL)=θ × sim (V_Dk,V_Tk)+γ×sim(V_Ak,V_Tk) (1-1)

In the formula above, θ and γ are the decay factors of the parent page topic relevance and the anchor text topic relevance respectively, and satisfy θ+γ=1.

When a topic crawling method is used to collect time-sensitive emergency information, temporal intent often serves as a constraining element of the topic. According to the ISO 19100 series of standards (2002), temporal objects can be divided into "instants" and "periods": an instant represents a point in time space, while a period corresponds to a line in time space and has attributes such as a start point, an end point and a length. In general, information about an emergency is published on the network mainly after the event occurs, so the publication time of a report should be later than the start time of the emergency. In addition, an emergency goes through an evolution of occurrence, development, change and extinction, and the attention people pay to it differs from stage to stage. Preferentially downloading information from the periods of higher attention meets the needs of most people, and this attention reflects, to some extent, the temporal distribution of the event. In other words, when information is collected by topic, temporal intent (such as the start time and the temporal distribution) plays a clear role both in judging the relevance of information and in assigning the priority order in which information is discovered.

Although setting a start time can by itself filter out part of the irrelevant information during topic crawling, and although the temporal distribution affects the priority order of information discovery, traditional collection methods still attend only to the ordinary semantics of the topic. They do not analyze or use the topic's temporal intent and in effect treat the temporal distribution as uniform, which lowers their precision. Specifically:

(1) No representation of temporal intent: the traditional single-vector topic representation expresses only the keywords of the topic and provides no way to represent its temporal intent;

(2) The role of the topic start time is weakened: the traditional topic relevance strategy judges relevance to the topic from web page content alone, which weakens the role of the topic start time;

(3) The effect of the topic's temporal distribution on the priority order of information discovery is ignored: traditional URL priority assignment methods currently rely mainly on web page content, anchor text and its context, the URL string, link relationships and even the page update time, but ignore the effect of the topic's temporal distribution.

Summary of the invention

The technical problem to be solved by the present invention is to provide a topic-focused web information collection method that takes temporal intent into account, so as to reduce or avoid the problems described above.

To solve the above technical problem, the invention provides a topic-focused web information collection method that takes temporal intent into account. It is used to collect and rank Internet web page information for a topic event and comprises the following steps:

Step A: use prior data to determine the start time of the topic event and quantify its temporal distribution, obtaining a quantized value of the temporal distribution;

Step B: represent the temporal intent and the ordinary keywords of the topic separately, using different representation methods, and compute the time relevance and the ordinary keyword relevance separately;

Step C: based on the time relevance and ordinary keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the temporal distribution obtained in step A, and merge it into the URL priority assignment method based on web page content. This yields a URL priority assignment formula based on the quantized temporal distribution, from which the final URL priority is computed, so that URLs from periods of high attention receive higher priority.

Preferably, the prior data in step A is Google Trends data.

Preferably, in step B, the temporal intent of the topic is expressed as follows.

General formal expression of the topic and the web page content: given a topic T and web page content D, they are represented as follows.

T=< V_Tk,T_ST,T_TD>

D=< V_Dk,T_PT>

where V_Tk, T_ST and T_TD denote the ordinary keyword vector of the topic, the start and end times of the topic, and its temporal distribution respectively; V_Dk and T_PT denote the ordinary keyword vector of the web page content and its publication time respectively.

Formal expression of the topic: its ordinary keyword vector V_Tk, start and end times T_ST and temporal distribution T_TD are expressed by the following formulas.

V_Tk={ (k₁,w_Tk1),(k₂,w_Tk2),...,(k_s,w_Tks)}

T_ST=[t_STs,t_STe]

T_TD={ < [t_TDs1,t_TDe1], λ₁>,...,<[t_TDsr,t_TDer], λ_r>}

where k_i denotes the i-th ordinary keyword in the topic; w_Tki denotes the weight of ordinary keyword k_i; s denotes the number of ordinary keywords in the topic; t_STs denotes the start time of the topic and t_STe its end time; <[t_TDsi,t_TDei], λ_i> denotes the i-th <period, search volume index> pair in the temporal distribution; t_TDsi and t_TDei are the start time and end time of the i-th period, and λ_i is the search volume index of the i-th period;

Formal expression of the web page content: its ordinary keyword vector V_Dk and publication time T_PT are expressed by the following formulas.

V_Dk={ (k₁,w_Dk1),(k₂,w_Dk2),...,(k_s,w_Dks)}

T_PT=t_PT

where k_i denotes the i-th ordinary keyword in the web page content; w_Dki denotes the weight of ordinary keyword k_i; t_PT denotes the publication time of the web page.

Preferably, in step B, the time relevance and the ordinary keyword relevance are computed as follows.

The time relevance of the topic and the web page content is computed as:

\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 0, & t_{PT} < t_{STs} \\ 1, & t_{STs} \leq t_{PT} \leq t_{STe} \end{cases}

where sim(T_PT,T_ST) denotes the time relevance value of the topic and the web page content.

The ordinary topic relevance of the topic and the web page content is computed as:

\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Tki} \times w_{Dki}}{\sqrt{\sum_{i=1}^{s} w_{Tki}^{2} \times \sum_{i=1}^{s} w_{Dki}^{2}}}

where sim(V_Dk,V_Tk) denotes the ordinary topic relevance value of topic T and web page content D.

Preferably, the URL priority assignment formula in step C is:

where Priority_T(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing URL priority assignment method based on web page content; Pr(t/T) is the normalized quantized value of the temporal distribution and also represents the probability that a web page published at time t is relevant to topic T; the threshold takes a value in the interval from 0 to 1.

Preferably, the threshold is set to 0.4.

Preferably, the priority Priority(URL) obtained by the URL priority assignment method based on web page content is computed as:

Priority (URL)=θ × sim (V_Dk,V_Tk)+γ×sim(V_Ak,V_Tk)

where θ and γ are the decay factors of the parent page topic relevance and the anchor text topic relevance respectively, and satisfy θ+γ=1.

Preferably, the decay factor θ is set to 0.4 and γ is set to 0.6.

By quantifying the start time and temporal distribution of the topic and formally expressing temporal intent according to the international standard for time, the method provided by the invention forms a multi-part representation in which temporal intent and ordinary keywords (non-temporal words) are separate components. It then computes the time relevance and the ordinary keyword relevance in separate steps, and finally merges the quantized temporal distribution, as the variable of an increasing function, into the URL priority assignment method to compute the URL priority. This greatly increases the number of web pages discovered and the precision.

Detailed description of the invention

For a clearer understanding of the technical features, purposes and effects of the present invention, specific embodiments of the invention are now described.

The invention provides a topic-focused web information collection method that takes temporal intent into account. It is used to collect and rank Internet web page information for a topic event and comprises the following steps:

The temporal intent of a topic refers to the temporal features contained in the topic. The invention divides it into explicit temporal intent and latent temporal intent. Explicit temporal intent means that the topic itself states a time limit; for example, the topic "2008 earthquake" explicitly indicates that earthquake information from 2008 is wanted. Latent temporal intent means that the topic states no limiting temporal feature but the event it describes implies one; for example, the topic "Wenchuan earthquake" implies the earthquake's start time of May 12, 2008.

During topic-focused information collection, the start time of the topic event and its temporal distribution play different roles. Temporal intent identification in the invention therefore has two parts: identifying the start time of the topic event and identifying its temporal distribution.

In existing temporal information retrieval, identifying the temporal intent of a query relies mainly on prior data such as user search logs and annotated news corpora. Building on this, the invention also uses prior data to identify the temporal intent of a topic. In a specific embodiment, the prior data used is Google Trends data.

Google Trends data is the search volume index of a query over a past period. It is not the raw search volume but a value normalized against total search volume. After normalization, Google Trends values range from 0 to 100, and a larger value indicates a larger search volume. Google Trends data is already widely used in areas such as disease prediction, conservation biology and online public opinion. The main reason is that it reflects how much attention users pay to the content a query refers to: the larger the search volume, the more people are paying attention, and the more people are paying attention, the more likely it is that an event related to that content has occurred. The invention uses this property of Google Trends data to identify the temporal intent of land-cover topic events, in two main steps:

(1) Identify the start time of the topic event, based mainly on the change of the search volume index in Google Trends data from zero to non-zero. Following the evolution of an event (occurrence, development, change and extinction), few users pay attention to a topic before its event occurs, so its search volume does not reach the level that Google Trends records. In practice, the start-time identification method based on Google Trends data handles only topics whose search volume index in the first period is 0. There are two reasons. First, not every topic has a definite start time (the topic "earthquake", for example, does not refer to one specific event and has no particular start time), and the initial search volume index of such topics is not 0. Second, Google Trends data is itself limited: its statistics begin in January 2004, so a topic that arose before 2004 and continued into 2004 does not have an initial index of 0. The identified start time of the topic is the first moment at which the search volume index in Google Trends data exceeds 0.

(2) Quantify the temporal distribution of the topic event, which is represented directly by the change of the search volume index in Google Trends data; that is, the search volume index is used to quantify the temporal distribution. This works because Google Trends data inherently reflects how the popularity of the topic on the Internet changes across periods, which is the temporal distribution of the topic event.
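The two identification steps above can be sketched in Python. This is a minimal illustration, not the patented implementation: it assumes the trend series has already been obtained as a chronologically ordered list of (period start, period end, index) records, and the names `TrendRecord`, `identify_start_time` and `quantify_distribution` are hypothetical. Dropping zero-index periods follows the storage rule stated for the temporal distribution elsewhere in the description.

```python
from datetime import date
from typing import List, Optional, Tuple

# Hypothetical record layout: (period_start, period_end, search_volume_index 0..100).
TrendRecord = Tuple[date, date, int]


def identify_start_time(series: List[TrendRecord]) -> Optional[date]:
    """Return the start of the first period whose index exceeds 0.

    Only topics whose first period has index 0 are handled, as in the text;
    for any other series no start time is identified and None is returned.
    """
    if not series or series[0][2] != 0:
        return None
    for period_start, _period_end, index in series:
        if index > 0:
            return period_start
    return None  # the index never rises above 0


def quantify_distribution(series: List[TrendRecord]) -> List[TrendRecord]:
    """Use the index itself as the quantized value, omitting zero-index periods."""
    return [record for record in series if record[2] > 0]
```

With an invented monthly series whose index is 0 for March and April 2008 and non-zero from May 2008, `identify_start_time` would return the first day of the May period, and `quantify_distribution` would keep only the periods from May onward.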

With the start-time identification method, temporal intent identification based on Google Trends data can roughly identify the start time of a topic event. For example, the topic "Wenchuan earthquake" received heavy user attention from May 2008 to December 2008 and received attention again in the anniversary month of May 2009, which is consistent with the evolution of the event. This shows that quantifying the temporal distribution of a topic event directly with Google Trends data is reasonable.

In addition, Baidu Index can also serve as prior data for identifying temporal intent. Like Google Trends data, it is based on the query logs of a general-purpose search engine, Baidu, and reflects the user attention and media attention that different topic queries received over a past period. The temporal intent identification method based on Baidu Index is similar to the one based on Google Trends data and is not described again here.

Step B, topic representation and relevance computation that take temporal intent into account: represent the temporal intent and the ordinary keywords of the topic separately, using different representation methods, and compute the time relevance and the ordinary keyword relevance separately;

Existing topic-focused collection usually represents a topic that contains temporal intent with the traditional single vector, which cannot express its start time or temporal distribution. The method provided by the invention therefore uses different forms to represent the ordinary keywords of the topic, the start and end times of the topic, the temporal distribution of the topic, and the ordinary keywords and publication time of the web page content. Specifically:

(1) Ordinary keywords are represented with the single-vector method: the ordinary keywords of the topic and of the web page content are expressed as <keyword, weight> pairs. The dimension of the vector depends on the number of keywords in the topic and stays fixed as long as the topic does not change.

(2) Temporal intent is represented according to the international standard for time, which divides time into instants and periods. The start time of the topic and the publication time of the web page content are normally single time points and are represented as instants. For ease of computation, the invention uses a period to represent the start time and end time of the topic (its start and end times). The temporal distribution reflects how attention to the event changes across time ranges, so it is expressed as <period, search volume index> pairs, where the period corresponds to the time range and the search volume index corresponds to the popularity of the topic event. To save storage space, periods whose search volume index is 0 are not represented.

Their formal expressions are as follows:

(1) General formal expression of the topic and the web page content: given a topic T and web page content D, they can be represented by the following formulas.

T=< V_Tk,T_ST,T_TD> (1-2)

D=< V_Dk,T_PT> (1-3)

In these formulas, V_Tk, T_ST and T_TD denote the ordinary keyword vector of the topic, the start and end times of the topic, and its temporal distribution respectively; V_Dk and T_PT denote the ordinary keyword vector of the web page content and its publication time respectively.

(2) Formal expression of the topic: its ordinary keyword vector V_Tk, start and end times T_ST and temporal distribution T_TD can be expressed by the following formulas.

V_Tk={ (k₁,w_Tk1),(k₂,w_Tk2),...,(k_s,w_Tks)} (1-4)

T_ST=[t_STs,t_STe] (1-5)

T_TD={ < [t_TDs1,t_TDe1], λ₁>,...,<[t_TDsr,t_TDer], λ_r>} (1-6)

In these formulas, k_i denotes the i-th ordinary keyword in the topic; w_Tki denotes the weight of ordinary keyword k_i; s denotes the number of ordinary keywords in the topic; t_STs denotes the start time of the topic, specified by the user or identified by the method in step A; t_STe denotes the end time of the topic, specified by the user or defaulting to infinity; <[t_TDsi,t_TDei], λ_i> denotes the i-th <period, search volume index> pair in the temporal distribution; t_TDsi and t_TDei are the start time and end time of the i-th period, and λ_i is the search volume index of the i-th period. These parameters can be obtained from the prior data used in step A (for example Google Trends data), omitting periods whose search volume index is 0;

(3) Formal expression of the web page content: its ordinary keyword vector V_Dk and publication time T_PT are expressed by the following formulas.

V_Dk={ (k₁,w_Dk1),(k₂,w_Dk2),...,(k_s,w_Dks)} (1-7)

T_PT=t_PT (1-8)

In these formulas, k_i denotes the i-th ordinary keyword in the web page content; w_Dki denotes the weight of ordinary keyword k_i; t_PT denotes the publication time of the web page.

The weights of ordinary keywords in the topic and the web page content can be computed with existing techniques, for example the method given in "Wu H, Chen J, et al. A Focused Crawler for Borderlands Situation Information with Geographical Properties of Place Names[J]. Sustainability, 2014, 6(10): 6529-6552."
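The formal expressions (1-2) to (1-8) map naturally onto small record types. The following Python sketch is one possible encoding and is only illustrative: the class and field names (`Period`, `Topic`, `Page`, `index_at`) are hypothetical, keyword vectors are stored as keyword-to-weight dictionaries, and an end time of `None` is used here to stand for the default of infinity.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Period:
    """One <period, search volume index> pair of T_TD."""
    start: date   # t_TDsi
    end: date     # t_TDei
    index: int    # lambda_i; zero-index periods are not stored


@dataclass
class Topic:
    """T = <V_Tk, T_ST, T_TD>."""
    keywords: Dict[str, float]          # V_Tk: keyword -> weight
    start: date                         # t_STs
    end: Optional[date] = None          # t_STe; None stands for infinity
    distribution: List[Period] = field(default_factory=list)  # T_TD

    def index_at(self, t: date) -> int:
        """Index of the period containing t, or 0 if that period was omitted."""
        for period in self.distribution:
            if period.start <= t <= period.end:
                return period.index
        return 0


@dataclass
class Page:
    """D = <V_Dk, T_PT>."""
    keywords: Dict[str, float]   # V_Dk: keyword -> weight
    published: date              # t_PT
```

Because zero-index periods are omitted from storage, `index_at` returns 0 for any time not covered by a stored period, which keeps the lookup consistent with the storage rule.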

As described in the background, traditional topic relevance computation judges whether a page is relevant to the topic from web page content alone. This weakens the ability of the topic start time to filter out part of the irrelevant information on its own, easily leads to misjudging some information, and lowers the precision of topic crawling. Building on the traditional vector space model, the invention considers both the start time and the ordinary keywords and judges the relevance between web page content and topic in two steps, giving a new topic relevance strategy that takes the start time into account. The computation has two main steps:

(1) Compute the time relevance of the topic and the web page content. Because the topic start time can by itself filter out part of the irrelevant information, comparing the publication time of the web page content with the start and end times of the topic is enough for a preliminary judgment of whether the page is relevant. The time relevance can therefore be computed with the following formula.

\mathrm{sim}(T_{PT}, T_{ST}) = \begin{cases} 0, & t_{PT} < t_{STs} \\ 1, & t_{STs} \leq t_{PT} \leq t_{STe} \end{cases} \qquad (1\text{-}9)

In this formula, sim(T_PT,T_ST) denotes the time relevance value of the topic and the web page content; the other parameters are as defined above. A time relevance value of 0 means the web page content is irrelevant to the topic, and the page should be discarded during crawling. A value of 1 means the web page content may be relevant to the topic, and its final relevance must be determined from the page content; the ordinary topic relevance is therefore computed only when the time relevance value is 1.

(2) Compute the ordinary topic relevance of the topic and the web page content. The ordinary keywords of the topic and the web page content are still represented as single vectors, and their relevance can be computed with the traditional cosine formula, as follows.

\mathrm{sim}(V_{Dk}, V_{Tk}) = \frac{\sum_{i=1}^{s} w_{Tki} \times w_{Dki}}{\sqrt{\sum_{i=1}^{s} w_{Tki}^{2} \times \sum_{i=1}^{s} w_{Dki}^{2}}} \qquad (1\text{-}10)

In this formula, sim(V_Dk,V_Tk) denotes the ordinary topic relevance value of topic T and web page content D; the other parameters are as defined above. If sim(V_Dk,V_Tk) is greater than or equal to a given threshold, the web page content is judged relevant to the topic; otherwise it is judged irrelevant and the page is discarded.

In this start-time-aware relevance strategy, the time relevance is computed first because it is comparatively simple to compute.
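The two-step judgment can be sketched as follows. This is an illustrative Python reading of formulas (1-9) and (1-10) under stated assumptions: keyword vectors are dictionaries, the sums run over the topic's keywords (the vector dimension is fixed by the topic), an end time of `None` stands for infinity, and a page published after the end time is treated as irrelevant, a case the formula does not spell out. The function names and the default relevance threshold of 0.5 are hypothetical; the text only says "a given threshold".

```python
import math
from datetime import date
from typing import Dict, Optional


def time_relevance(t_pt: date, t_sts: date, t_ste: Optional[date] = None) -> int:
    """Formula (1-9): 0 before the topic start, 1 inside [t_STs, t_STe]."""
    if t_pt < t_sts:
        return 0
    if t_ste is None or t_pt <= t_ste:
        return 1
    return 0  # assumption: pages published after the end time are filtered too


def cosine(v_d: Dict[str, float], v_t: Dict[str, float]) -> float:
    """Formula (1-10), summing over the topic's keywords k_1..k_s."""
    dot = sum(w_t * v_d.get(k, 0.0) for k, w_t in v_t.items())
    norm_t = sum(w_t * w_t for w_t in v_t.values())
    norm_d = sum(v_d.get(k, 0.0) ** 2 for k in v_t)
    denom = math.sqrt(norm_t * norm_d)
    return dot / denom if denom else 0.0


def is_relevant(page_kw: Dict[str, float], page_time: date,
                topic_kw: Dict[str, float], t_sts: date,
                t_ste: Optional[date] = None, threshold: float = 0.5) -> bool:
    """Cheap time check first; cosine similarity only if the time check passes."""
    if time_relevance(page_time, t_sts, t_ste) == 0:
        return False
    return cosine(page_kw, topic_kw) >= threshold
```

Running the time check first means pages published before the topic start are rejected without computing any keyword similarity, which matches the ordering the description motivates.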

Step C: based on the time relevance and ordinary keyword relevance computed in step B, construct an increasing function whose variable is the quantized value of the temporal distribution obtained in step A, and merge it into the URL priority assignment method based on web page content. This yields a URL priority assignment formula based on the quantized temporal distribution, so that URLs from periods of high attention receive higher priority, which resolves the problem of treating the temporal distribution as uniform.

During topic-focused information collection, the temporal distribution of the topic affects the priority order of information discovery. Specifically, if more relevant pages exist around the publication time t of the web page content corresponding to a URL, then, for a given topic T, the probability Pr(t/T) that web page content published at time t is relevant to T is larger, and URLs from that time should have higher priority. Existing URL priority assignment methods do not take this property into account.

To solve this problem, the invention provides a URL priority assignment method based on the quantized value of the temporal distribution (the search volume index in the Google Trends data described above). The process is as follows:

First, construct an increasing function with the quantized value as its independent variable. The quantized value of the temporal distribution reflects, to some extent, the number of relevant pages published in a given period, and it is roughly proportional to that number: the larger the quantized value, the more relevant pages were published. An increasing function captures exactly this property. The invention therefore constructs an exponential function (the natural exponential function) with the quantized value of the temporal distribution as the exponent and the natural constant e as the base.

Then, merge the increasing function with the URL priority assignment method based on web page content. Before merging, the method first computes the content priority with the content-based URL priority assignment method, and merges only when that value is greater than or equal to a given threshold. This ensures that the temporal distribution affects only the discovery order of URLs for relevant pages and does not raise the discovery order of URLs for irrelevant pages. The merge itself multiplies the content priority by the increasing function.

Finally, the URL priority assignment formula based on the quantized value of the temporal distribution is as follows.

In formula, Priority_T(URL) final URL priority is represented；

Priority (URL) is the priority that existing URL priority distribution method based on web page contents obtains, its meter Calculate the formula (1-1) that formula can be provided by background technology；Pr (t/T) is the standardized value of Annual distribution quantized value, also illustrates that Issuing time is the webpage probability with theme T-phase pass of t；Threshold value in this formula is in 0 to 1 interval value, when it is 1, table Show that URL priority the most traditionally calculates；When it is 0, represent that URL priority is always according to the side incorporating Annual distribution Method calculates.
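The display for formula (1-11) does not appear in this text. Read together, the description (a natural exponential of the quantized value, multiplied by the content priority only when that priority reaches the threshold) suggests a form like the one below. This is a reconstruction under that reading rather than the formula as filed, and the symbol δ for the threshold is introduced here only for illustration.

```latex
\mathrm{Priority}_T(\mathrm{URL}) =
\begin{cases}
\mathrm{Priority}(\mathrm{URL}) \times e^{\Pr(t/T)}, & \mathrm{Priority}(\mathrm{URL}) \geq \delta \\
\mathrm{Priority}(\mathrm{URL}), & \mathrm{Priority}(\mathrm{URL}) < \delta
\end{cases}
```

This form agrees with the two limiting cases stated in the text: with δ = 1 the exponential factor applies only when the content priority equals 1, so nearly every URL keeps its traditional priority, and with δ = 0 every URL is scaled by the exponential factor.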

In a preferred embodiment, computing the URL priority based on the quantized value of the temporal distribution has six main steps:

(1) Quantify the temporal distribution of the topic. The temporal distribution can be obtained from Google Trends data, and its quantized value is the search volume index in that data.

(2) Estimate the publication time t of the web page content corresponding to the URL to be downloaded.

During information discovery, the publication time of the web page content corresponding to a URL to be downloaded is unknown. The invention estimates it mainly in two ways:

1) Estimation from the URL string: when the URL string to be downloaded itself contains time information (for example, "20080905" in "http://news.sohu.com/20080905/n259388056.shtml" is the publication time of the corresponding web page), a suitable time regular expression extracts this time, which is used as the publication time of the web page content corresponding to the URL to be downloaded;

2) Estimation from the parent page time: when the URL string to be downloaded contains no time information, the publication time of its parent page content is used as the publication time of the corresponding web page content. On the one hand, the publication time of the parent page content is usually only slightly later than or equal to that of the web page content corresponding to the URL to be downloaded, and the interval of each period in Google Trends data is comparatively large. On the other hand, this assumption does not affect the relevance between the corresponding web page and the topic; it affects only the discovery order of the URL.

(3) Normalize the quantized value of the temporal distribution to obtain Pr(t/T). As noted above, it is only necessary to take the search volume index of the period containing time t and normalize it, as shown in the following formula.

The parameters in the formula are as defined above.

(4) Compute the anchor text topic relevance value sim(V_Ak,V_Tk) of the URL to be downloaded. The anchor text vector (composed of the anchor text, its context and the URL string information) is given by the following formula,

V_Ak={ (k₁,w_Ak1),(k₂,w_Ak2),...,(k_s,w_Aks)} (1-13)

and the anchor text topic relevance value is given by the following formula.

s i m (V_{A k}, V_{T k}) = \frac{Σ_{i = 1}^{s} w_{T k i} \times w_{A k i}}{\sqrt{Σ_{i = 1}^{s} w_{T k i}^{2} \times Σ_{i = 1}^{s} w_{A k i}^{2}}} - - - (1 - 14)

In formula, V_AkRepresent Anchor Text vector；w_AkiRepresent general keyword k in Anchor Text_iWeight；Other parameter is the same Described.

(5) the content prioritization Priority (URL) of URL to be downloaded is calculated: its computing formula is as stated in the Background Art.Cause Being the webpage direct description to URL to be downloaded for Anchor Text, for the content of father's webpage, Anchor Text is more important, so In the present invention decay factor θ in formula and γ are respectively set to 0.4 and 0.6.

(6) Calculate the final priority of the URL to be downloaded: its calculation formula is given in (1-11). Based on experimental analysis, the present invention sets the threshold in formula (1-11) to 0.4.
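As an illustration only, the publication time fallback of item 2), the relevance value of formula (1-14), and the content priority Priority(URL) = θ × sim(V_Dk, V_Tk) + γ × sim(V_Ak, V_Tk) can be sketched in Python as follows. This is a minimal sketch and not the implementation of the invention: the function names, the date pattern searched for in the URL string, and the representation of vectors as keyword-to-weight dictionaries are all assumptions made for the example. Formula (1-11) is not reproduced here.

```python
import math
import re
from datetime import date
from typing import Dict

# Hypothetical pattern (an assumption): dates such as 2016/08/04, 2016-08-04 or 20160804.
_DATE_IN_URL = re.compile(
    r"(?<!\d)((?:19|20)\d{2})[/_-]?(0[1-9]|1[0-2])[/_-]?(0[1-9]|[12]\d|3[01])(?!\d)"
)


def estimate_publish_date(url: str, parent_publish_date: date) -> date:
    """Use a date found in the URL string, else fall back to the parent page's date."""
    match = _DATE_IN_URL.search(url)
    if match:
        year, month, day = (int(part) for part in match.groups())
        try:
            return date(year, month, day)
        except ValueError:
            pass  # impossible calendar date such as 2016-02-31: ignore it
    return parent_publish_date


def cosine_similarity(theme: Dict[str, float], other: Dict[str, float]) -> float:
    """Formula (1-14): cosine of two weight vectors over the theme's s general keywords."""
    dot = theme_sq = other_sq = 0.0
    for keyword, w_t in theme.items():
        w_o = other.get(keyword, 0.0)  # a keyword absent from the text has weight 0
        dot += w_t * w_o
        theme_sq += w_t * w_t
        other_sq += w_o * w_o
    norm = math.sqrt(theme_sq * other_sq)
    return dot / norm if norm > 0 else 0.0


def content_priority(sim_parent: float, sim_anchor: float,
                     theta: float = 0.4, gamma: float = 0.6) -> float:
    """Priority(URL) as a weighted sum of parent-content and anchor-text relevance."""
    if abs(theta + gamma - 1.0) > 1e-9:
        raise ValueError("theta + gamma must equal 1")
    return theta * sim_parent + gamma * sim_anchor
```

With the default factors, a URL whose parent page relevance is 0.5 and whose anchor text relevance is 1.0 receives a content priority of about 0.8, which reflects the greater weight given to the anchor text.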

In a specific embodiment, the present invention aims to discover as much network change information with temporal characteristics as possible from the network while downloading as little irrelevant information as possible. Its basic procedure may include the following five steps:

(1) Preparation: the user provides the content theme and the initial URLs relevant to the theme. Then, the time intention recognition method based on Google trend data is used to determine the start time of the theme and to quantify its time distribution.

(2) Requesting and parsing web pages: first, the HTTP protocol is used to request from the Internet the initial URL or the highest-priority URL in the URL priority queue, in order to obtain the web page content corresponding to this URL. Next, according to the Document Object Model (DOM) of the web page, the title, body text, publication time, URLs to be downloaded, and anchor text information of the web page are parsed out.

(3) Theme relevance calculation: first, according to the theme start time and the web page content publication time obtained in steps (1) and (2), formulas (1-2) to (1-6) are used to represent the start and end times, general keywords, and time distribution of the theme, as well as the general keywords and publication time of the web page content. Then formula (1-9) is used to calculate their time correlation degree, and web page content that has a Before temporal relationship with the theme is filtered out. Next, formula (1-10) is used to calculate the general theme relevance value. When the relevance value is greater than or equal to a certain threshold, the web page is saved in the web page resource repository; otherwise, the web page is judged to be irrelevant to the theme and is discarded.

(4) URL priority assignment: the URL priority is calculated according to formulas (1-11) to (1-14), and the URL is then stored in the URL priority queue according to this priority value.

(5) Repeat steps (2), (3) and (4) until the URL priority queue is empty or a given loop termination condition is reached.
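The loop formed by steps (2) to (5) can be sketched as a best-first traversal driven by a priority queue. The Python sketch below is illustrative only and rests on several assumptions: the names `Page`, `crawl`, `fetch`, `is_relevant` and `url_priority` are hypothetical, the relevance test of step (3) and the priority of step (4) are supplied by the caller rather than implemented from formulas (1-9) to (1-11), links are enqueued whether or not their parent page was saved, and a page budget `max_pages` stands in for the unspecified termination condition.

```python
import heapq
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Set, Tuple


@dataclass
class Page:
    url: str
    text: str
    # Each link is (URL to be downloaded, anchor text).
    links: List[Tuple[str, str]] = field(default_factory=list)


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], Optional[Page]],
          is_relevant: Callable[[Page], bool],
          url_priority: Callable[[Page, str, str], float],
          max_pages: int = 100) -> List[Page]:
    """Best-first crawl: always request the highest-priority URL next."""
    queue: List[Tuple[float, int, str]] = []  # (-priority, insertion order, url)
    seen: Set[str] = set()
    order = 0
    for url in seed_urls:  # step (1): initial URLs enter with top priority
        if url not in seen:
            seen.add(url)
            heapq.heappush(queue, (-1.0, order, url))
            order += 1

    saved: List[Page] = []
    requested = 0
    while queue and requested < max_pages:  # step (5): loop until empty or budget hit
        _, _, url = heapq.heappop(queue)    # step (2): highest-priority URL
        page = fetch(url)
        requested += 1
        if page is None:
            continue
        if is_relevant(page):               # step (3): keep relevant pages only
            saved.append(page)
        for child_url, anchor in page.links:  # step (4): prioritise outgoing URLs
            if child_url not in seen:
                seen.add(child_url)
                priority = url_priority(page, child_url, anchor)
                heapq.heappush(queue, (-priority, order, child_url))
                order += 1
    return saved
```

Because `heapq` is a min-heap, priorities are stored negated so that the largest priority is popped first; the insertion counter breaks ties so that URLs of equal priority are requested in discovery order.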

Under identical hardware conditions and network bandwidth, the method provided by the present invention captures 10%-30% more web pages than existing subject network information collecting methods and can improve the precision ratio by about 10%.

Those skilled in the art will appreciate that, although the present invention is described by way of multiple embodiments, not every embodiment contains only one independent technical solution. The description is written in this way only for the sake of clarity. Those skilled in the art should treat the description as a whole, and should understand the scope of protection of the present invention by regarding the technical solutions involved in the embodiments as capable of being combined with one another into different embodiments.

The foregoing is merely an illustrative specific embodiment of the present invention and is not intended to limit the scope of the present invention. Any equivalent changes, modifications and combinations made by any person skilled in the art without departing from the concept and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A subject network information collecting method taking time intention into account, used to sort Internet web page information for subject events, characterized in that it comprises the following steps:

Step A: use prior data to determine the start time of the subject event and quantify its time distribution, obtaining a quantized value of the time distribution;

Step B: use different representation methods to represent the time intention and the general keywords in the theme respectively, and calculate the time correlation degree and the general keyword correlation degree respectively;

Step C: according to the time correlation degree and the general keyword correlation degree calculated in step B, construct an increasing function whose variable is the quantized value of the time distribution obtained in step A, and incorporate it into the URL priority assignment method based on web page content, thereby obtaining a URL priority assignment formula based on the quantized value of the time distribution and calculating the final URL priority, so that URLs at moments of high attention obtain a higher priority.

2. The method according to claim 1, characterized in that the prior data in step A is Google trend data.

3. The method according to claim 1, characterized in that, in step B, the time intention in the theme is represented as follows:

General formal representation of the theme and the web page content: given a theme T and web page content D, they are represented as follows.

T=< V_Tk,T_ST,T_TD>

D=< V_Dk,T_PT>

Wherein, V_Tk, T_ST and T_TD denote the general vector of the theme, the start and end times of the theme, and its time distribution respectively; V_Dk and T_PT denote the general vector of the web page content and its publication time respectively.

Formal representation of the theme: its general vector V_Tk, start and end times T_ST, and time distribution T_TD are expressed according to the following formulas.

V_Tk={ (k₁,w_Tk1),(k₂,w_Tk2),...,(k_s,w_Tks)}

T_ST=[t_STs,t_STe]

T_TD={ < [t_TDs1,t_TDe1], λ₁>,...,<[t_TDsr,t_TDer], λ_r>}

Wherein, k_i denotes the i-th general keyword in the theme; w_Tki denotes the weight of the general keyword k_i; s denotes the number of general keywords in the theme; t_STs denotes the start time of the theme and t_STe denotes the end time of the theme; <[t_TDsi, t_TDei], λ_i> denotes the i-th <period, search volume index> pair in the time distribution; t_TDsi and t_TDei are the start time and end time of the i-th period respectively, and λ_i is the search volume index value of the i-th period;

V_Dk={ (k₁,w_Dk1),(k₂,w_Dk2),...,(k_s,w_Dks)}

T_PT=t_PT

Wherein, k_i denotes the i-th general keyword in the web page content; w_Dki denotes the weight of the general keyword k_i; t_PT denotes the publication time of the web page.

4. The method according to claims 1-3, characterized in that, in step B, the formulas for calculating the time correlation degree and the general keyword correlation degree are as follows:

5. The method according to claim 1, characterized in that the URL priority assignment formula in step C is:

Wherein, Priority_T(URL) denotes the final URL priority; Priority(URL) is the priority obtained by the existing URL priority assignment method based on web page content; Pr(t/T) is the standardized value of the quantized value of the time distribution, and also represents the probability that a web page whose publication time is t is relevant to the theme T; the threshold takes a value in the interval from 0 to 1.

6. The method according to claims 1-5, characterized in that the threshold is set to 0.4.

7. The method according to claims 1-5, characterized in that the calculation formula of the priority Priority(URL) obtained by the URL priority assignment method based on web page content is:

Priority (URL)=θ × sim (V_Dk,V_Tk)+γ×sim(V_Ak,V_Tk)

Wherein, θ and γ denote the decay factors of the parent web page content theme relevance and of the anchor text theme relevance respectively, and satisfy θ+γ=1.

8. The method according to claim 7, characterized in that the decay factor θ is set to 0.4 and γ is set to 0.6.