Embodiment
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings.
In an embodiment of the present invention, each information webpage source is treated as a voter, each reprinted information item as the subject of the vote, and the popularity of each information webpage source as the weight of its vote. The vote score of each reprinted item is calculated comprehensively; items with high scores are regarded as hot information and ranked first. In addition, because news takes time to spread, the publication time of a reprinted item can be used as a correction factor to adjust the vote score, yielding the final hotness ranking.
Fig. 1 is a schematic flowchart of the hot information mining method according to an embodiment of the present invention.
As shown in Fig. 1, the method comprises:
Step 101: calculate the relative hotness values among the information webpage sources according to the access counts of the information webpage sources.
Here, the access hotness of each information webpage source can be calculated from the access log of hot information and the access log of other news in that source. For example, the access count recorded in the access log of hot information and the access count recorded in the access log of other news are added together to give the access count of the information webpage source.
Preferably, the information webpage sources can be news websites of various types.
The access hotness of an information webpage source can be calculated in many ways. The principle is: the more often an information webpage source is accessed, the higher its relative hotness value should be. For example:
For the k-th information webpage source, its relative hotness value SiteHotness_k is calculated as:
SiteHotness_k = norm * (log(AccessCount_k) / log(Σ_{j∈K} AccessCount_j))
where norm is a normalization coefficient, AccessCount_k is the access count of the k-th information webpage source, and K is the set of all information webpage sources.
For example, suppose a search engine has collected news from three information webpage sources A, B, and C, and the access counts (AccessCount) of these three sources in the search engine are 50, 20, and 30 respectively. Then:
The hotness of website A: SiteHotness_A = norm * (log(50) / log(50+20+30));
The hotness of website B: SiteHotness_B = norm * (log(20) / log(50+20+30));
The hotness of website C: SiteHotness_C = norm * (log(30) / log(50+20+30)).
The base of the logarithm can be 10 or e. This guarantees that the hotness SiteHotness_A of website A is greater than the hotness SiteHotness_C of website C, and that SiteHotness_C is greater than the hotness SiteHotness_B of website B.
The specific value of norm can be varied or adjusted according to practical experience in the application.
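The calculation in Step 101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the definitive implementation: the function name site_hotness, the default norm = 1.0, and the input validation are hypothetical choices. Because the formula is a ratio of two logarithms, the base cancels, so the natural logarithm is used here.

```python
import math


def site_hotness(access_counts, norm=1.0):
    """Return norm * log(AccessCount_k) / log(sum of all AccessCount) per source.

    access_counts maps a source name to its access count.
    norm=1.0 is an illustrative default, not a value given in the text.
    """
    total = sum(access_counts.values())
    if total <= 1:
        # log(total) must be positive for the ratio to be meaningful.
        raise ValueError("total access count must be greater than 1")
    if any(count <= 0 for count in access_counts.values()):
        raise ValueError("every access count must be positive")
    log_total = math.log(total)
    return {
        site: norm * math.log(count) / log_total
        for site, count in access_counts.items()
    }


# The three-source example from the text: A=50, B=20, C=30.
hotness = site_hotness({"A": 50, "B": 20, "C": 30})
```

With these inputs the ordering SiteHotness_A > SiteHotness_C > SiteHotness_B described above holds. Note that a source with an access count of 1 receives a hotness of 0 under this formula.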
Step 102: according to the relative hotness values of the information webpage sources, calculate the reprint weight of each reprinted item in each information webpage source that has reprinted it.
Here, the reprinted information can be determined from the information webpage sources by a similarity algorithm based on text features. That is, the similarity algorithm identifies the reprint relationships among news items, namely which news items are reprints of the same news article.
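The text does not specify which text-feature similarity algorithm is used. As one possible illustration, the sketch below groups articles by Jaccard similarity over word trigrams. The choice of Jaccard similarity, the function names, the greedy grouping strategy, and the 0.6 threshold are all assumptions made for this example.

```python
def shingles(text, n=3):
    """Return the set of word n-grams of a text (lower-cased)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_reprints(articles, threshold=0.6):
    """Greedily group articles whose similarity to a group's first member
    reaches the threshold.

    articles is a list of (site, text) pairs; the result is a list of
    groups, each a list of indices into articles.
    """
    groups = []  # each entry: (shingle set of first member, member indices)
    for index, (_site, text) in enumerate(articles):
        current = shingles(text)
        for representative, members in groups:
            if jaccard(current, representative) >= threshold:
                members.append(index)
                break
        else:
            groups.append((current, [index]))
    return [members for _representative, members in groups]


articles = [
    ("A", "China software market expected to reach 71.5 billion yuan in 2015"),
    ("B", "China software market expected to reach 71.5 billion yuan in 2015 report says"),
    ("C", "Local football team wins the national championship final"),
]
reprint_groups = group_reprints(articles)
```

Each resulting group corresponds to one reprinted item, and the sites of its members form the set K of sources that have reprinted it.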
Preferably, a time factor can further be determined according to the publication time of each reprinted item, and this time factor is used to correct each information hotness value. By way of example, the time at which the item was reprinted can also be used as the time factor.
For example, for the i-th reprinted item, its information hotness value NewsHotness_i is calculated, where:
CitationHotness_k = g(SiteHotness_k);
where K is the set of all information webpage sources that have reprinted the i-th reprinted item; PublishTime is the publication time of the i-th reprinted item; f(PublishTime) is a time weighting function of PublishTime; CitationHotness_k is the reprint weight of the i-th reprinted item in the k-th information webpage source that has reprinted it; and g(SiteHotness_k) is a hotness weighting function of SiteHotness_k.
The time weighting function f(PublishTime) is used to ensure the timeliness of the information hotness value NewsHotness_i. Generally, the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) should be.
The time weighting function f(PublishTime) can take many specific forms and can be linear or nonlinear. As long as the basic principle is satisfied that the closer PublishTime is to the current time, the larger the value of f(PublishTime) (and hence the larger the value of NewsHotness_i), the embodiments of the present invention place no limitation on the specific functional form of f(PublishTime).
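Since the functional form of f(PublishTime) is left open, the following is only one nonlinear candidate that satisfies the stated principle: an exponential decay with a half-life. The function name time_weight, the half-life parameter, and its 24-hour default are assumptions for illustration.

```python
from datetime import datetime, timedelta


def time_weight(publish_time, now, half_life_hours=24.0):
    """Exponential-decay time weighting: 1.0 for an item published now,
    halving every half_life_hours. Publication times in the future are
    treated as age zero."""
    age_seconds = max((now - publish_time).total_seconds(), 0.0)
    age_hours = age_seconds / 3600.0
    return 0.5 ** (age_hours / half_life_hours)


# Hypothetical reference time used only for this example.
now = datetime(2024, 1, 2, 12, 0)
fresh = time_weight(now - timedelta(hours=1), now)
stale = time_weight(now - timedelta(hours=48), now)
```

A more recent item (fresh) receives a larger weight than an older one (stale), which is the only property the text requires. A linear form that decreases with age and is clamped at zero would satisfy the same principle.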
g(SiteHotness_k) is the hotness weighting function, used to ensure that the reprint weight CitationHotness_k serves as a quality indicator. Generally, the higher the relative hotness value SiteHotness_k of a website, the larger its reprint weight CitationHotness_k should be.
Similarly, the hotness weighting function g(SiteHotness_k) can take many specific forms and can be linear or nonlinear. In fact, as long as the basic principle is satisfied that the higher the relative hotness value SiteHotness_k of a website, the larger the value of the reprint weight CitationHotness_k, the embodiments of the present invention place no limitation on the specific functional form of g(SiteHotness_k).
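The sketch below shows how the reprint weights and the time factor might be combined into NewsHotness_i. It rests on two assumptions that the text does not confirm: the time factor is applied multiplicatively to the sum of the reprint weights, and g defaults to the identity function (a linear form). The function name news_hotness and the sample hotness values are hypothetical.

```python
def news_hotness(reprinting_sites, site_hotness, time_factor, g=lambda h: h):
    """Assumed form: f(PublishTime) * sum over k in K of g(SiteHotness_k).

    reprinting_sites is the set K of sources carrying the item,
    site_hotness maps each source to SiteHotness_k, and time_factor is the
    already-evaluated value of f(PublishTime).
    """
    citation_hotness = [g(site_hotness[site]) for site in reprinting_sites]
    return time_factor * sum(citation_hotness)


# Illustrative hotness values, not taken from the text.
sample_hotness = {"A": 0.85, "B": 0.65, "C": 0.74}

linear_score = news_hotness(["A", "C"], sample_hotness, time_factor=1.0)
squared_score = news_hotness(
    ["A", "C"], sample_hotness, time_factor=1.0, g=lambda h: h ** 2
)
```

Any monotonically increasing g keeps the property that a reprint in a hotter source contributes a larger CitationHotness_k; squaring, as in squared_score, simply widens the gap between strong and weak sources.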
Step 103: sum the reprint weights of each reprinted item over the information webpage sources to calculate the information hotness value of each reprinted item, and determine hot information from the reprinted items in order of information hotness value.
Here, the reprint weights of each reprinted item are summed to give its information hotness score. After the items are sorted by score, a suitable number of news items can be selected for display.
For example, the system can be preset to display 10 hot information items. After the reprinted items are sorted by information hotness score, the 10 highest-scoring news items are selected for display as hot information.
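Selecting the highest-scoring items can be sketched as follows, assuming the information hotness values have already been computed and are held in a dictionary keyed by a hypothetical item identifier. The function name top_hot_items is likewise an assumption.

```python
import heapq


def top_hot_items(hotness_by_item, n=10):
    """Return the n (item, score) pairs with the highest scores,
    ordered from highest to lowest."""
    return heapq.nlargest(n, hotness_by_item.items(), key=lambda pair: pair[1])


# Illustrative scores only.
scores = {"news-1": 1.2, "news-2": 3.4, "news-3": 0.5}
top_two = top_hot_items(scores, n=2)
```

heapq.nlargest avoids sorting the full collection when only a few items are displayed, which may matter when scores are computed over a very large number of news items.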
Preferably, in an embodiment of the present invention, all news can first be classified, for example into domestic, international, entertainment, and so on, and the embodiment is then applied within each news category to mine the hot information of that category.
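Per-category mining can be sketched as a grouping step followed by the same ranking inside each category. How news is classified is not described in the text, so the category labels are assumed to be given; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def top_by_category(items, n=10):
    """Group (item_id, category, hotness) tuples by category and keep the
    n highest-scoring items in each category."""
    buckets = defaultdict(list)
    for item_id, category, hotness in items:
        buckets[category].append((item_id, hotness))
    return {
        category: sorted(members, key=lambda m: m[1], reverse=True)[:n]
        for category, members in buckets.items()
    }


# Illustrative data only.
ranked = top_by_category(
    [
        ("n1", "domestic", 1.5),
        ("n2", "international", 0.9),
        ("n3", "domestic", 2.1),
        ("n4", "entertainment", 0.4),
    ],
    n=1,
)
```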
Based on the above analysis, an embodiment of the present invention also provides a hot information mining system.
Fig. 2 is a schematic diagram of the hot information mining system according to an embodiment of the present invention.
As shown in Fig. 2, the system comprises a relative hotness value computing unit 201, a reprint weight calculation unit 202, and a hot information determining unit 203.
Specifically:
The relative hotness value computing unit 201 is configured to calculate the relative hotness values among the information webpage sources according to the access counts of the information webpage sources;
The reprint weight calculation unit 202 is configured to calculate, according to the relative hotness values of the information webpage sources, the reprint weight of each reprinted item in each information webpage source that has reprinted it;
The hot information determining unit 203 is configured to sum the reprint weights of each reprinted item over the information webpage sources, calculate the information hotness value of each reprinted item, and determine hot information from the reprinted items in order of information hotness value.
Preferably, the hot information determining unit 203 is further configured to determine a time factor according to the publication time of each reprinted item and to correct each information hotness value using the time factor.
Preferably, the reprint weight calculation unit 202 is further configured to determine the reprinted information from the information webpage sources by a similarity algorithm based on text features.
In one embodiment, the relative hotness value computing unit 201 is configured to calculate, for the k-th information webpage source, its relative hotness value SiteHotness_k as:
SiteHotness_k = norm * (log(AccessCount_k) / log(Σ_{j∈K} AccessCount_j))
where norm is a normalization coefficient, AccessCount_k is the access count of the k-th information webpage source, and K is the set of all information webpage sources.
In one embodiment, the reprint weight calculation unit 202 is configured to calculate, for the i-th reprinted item, its information hotness value NewsHotness_i, where:
CitationHotness_k = g(SiteHotness_k);
where K is the set of all information webpage sources that have reprinted the i-th reprinted item; PublishTime is the publication time of the i-th reprinted item; f(PublishTime) is a time weighting function of PublishTime; CitationHotness_k is the reprint weight of the i-th reprinted item in the k-th information webpage source that has reprinted it; and g(SiteHotness_k) is a hotness weighting function of SiteHotness_k.
Similarly, the time weighting function f(PublishTime) is used to ensure the timeliness of the information hotness value NewsHotness_i. Generally, the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) should be.
The time weighting function f(PublishTime) can take many specific forms and can be linear or nonlinear. As long as the basic principle is satisfied that the closer PublishTime is to the current time, the larger the value of f(PublishTime), the embodiments of the present invention place no limitation on the specific functional form of f(PublishTime).
g(SiteHotness_k) is the hotness weighting function, used to ensure that the reprint weight CitationHotness_k serves as a quality indicator. Generally, the higher the relative hotness value SiteHotness_k of a website, the larger its reprint weight CitationHotness_k should be.
Similarly, the hotness weighting function g(SiteHotness_k) can take many specific forms and can be linear or nonlinear. In fact, as long as the basic principle is satisfied that the higher the relative hotness value SiteHotness_k of a website, the larger the value of the reprint weight CitationHotness_k, the embodiments of the present invention place no limitation on the specific functional form of g(SiteHotness_k).
In one embodiment, the system further comprises a hot information display unit 204. The hot information display unit 204 is configured to display the hot information determined from the reprinted items. For example, the hot information display unit 204 can be preset to display 10 hot information items; after the reprinted items are sorted by information hotness score, the 10 highest-scoring news items are selected for display as hot information.
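The division of work among units 201 to 204 can be sketched as a small pipeline. This is an architectural illustration only: the class names, the ReprintedItem record, the mine function, the identity form of g, and the multiplicative use of the time factor are all assumptions rather than details given in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class ReprintedItem:
    item_id: str
    sites: list          # sources that have reprinted this item (the set K)
    time_factor: float   # precomputed value of f(PublishTime)


class RelativeHotnessUnit:
    """Counterpart of unit 201: access counts -> SiteHotness per source."""

    def __init__(self, norm=1.0):
        self.norm = norm

    def compute(self, access_counts):
        log_total = math.log(sum(access_counts.values()))
        return {s: self.norm * math.log(c) / log_total
                for s, c in access_counts.items()}


class ReprintWeightUnit:
    """Counterpart of unit 202: reprint weights of one item (g = identity)."""

    def compute(self, item, site_hotness):
        return [site_hotness[site] for site in item.sites]


class HotInformationUnit:
    """Counterpart of unit 203: sum weights, apply time factor, rank."""

    def rank(self, items, weights_by_item):
        scores = {it.item_id: it.time_factor * sum(weights_by_item[it.item_id])
                  for it in items}
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


class DisplayUnit:
    """Counterpart of unit 204: keep the preset number of items."""

    def __init__(self, limit=10):
        self.limit = limit

    def select(self, ranked):
        return ranked[: self.limit]


def mine(items, access_counts, limit=10):
    site_hotness = RelativeHotnessUnit().compute(access_counts)
    weight_unit = ReprintWeightUnit()
    weights = {it.item_id: weight_unit.compute(it, site_hotness) for it in items}
    ranked = HotInformationUnit().rank(items, weights)
    return DisplayUnit(limit).select(ranked)


items = [
    ReprintedItem("x", ["A", "C"], 1.0),
    ReprintedItem("y", ["B"], 1.0),
]
result = mine(items, {"A": 50, "B": 20, "C": 30}, limit=1)
```

Keeping each unit behind its own small interface lets the weighting functions or the display limit be changed without touching the other units.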
According to an embodiment of the present invention, hot news can be mined from the numerous news sources on the internet. Based on the above detailed analysis, Fig. 3 is a schematic diagram of an exemplary hot news mining process according to an embodiment of the present invention.
As shown in Fig. 3, at processing block 1, a large volume of news is crawled from numerous news sources on the internet (such as news websites), and the reprint relationships among the news items are identified, namely which news items are reprints of the same news article.
For example, the identification technique here can be a similarity calculation based on text features.
By way of example, Fig. 4 is a schematic diagram of a reprinted news identification result according to an embodiment of the present invention.
The news items titled "China's Software Market Expected to Reach 71.5 Billion Yuan in 2015" from the different sources shown in Fig. 4 are in fact reprints of the same news article.
In processing block 2, the relative hotness value (namely the access hotness) of each news website is calculated from the hot news access logs and the access logs of other news of the numerous news websites.
The relative hotness value of each website is calculated as follows:
SiteHotness_k = norm * (log(AccessCount_k) / log(Σ_{j∈K} AccessCount_j))
where K is the set of all websites, norm is a normalization coefficient, and AccessCount is the access count of each news website.
In processing block 3, the news hotness value is calculated by combining the reprint identification result of processing block 1, the publication time of the reprinted news, and the relative hotness value of each news website calculated in processing block 2.
For example, for the i-th reprinted news item, its news hotness value NewsHotness_i is calculated, where:
CitationHotness_k = g(SiteHotness_k);
where K is the set of all news websites that have reprinted the i-th reprinted news item; PublishTime is the publication time of the i-th reprinted news item; f(PublishTime) is a time weighting function of PublishTime; CitationHotness_k is the reprint weight of the i-th reprinted news item in the k-th news website that has reprinted it; and g(SiteHotness_k) is a hotness weighting function of SiteHotness_k.
The time weighting function f(PublishTime) is used to ensure the timeliness of the news hotness value NewsHotness_i. Generally, the closer the publication time PublishTime is to the current time, the larger the value of f(PublishTime) should be.
The time weighting function f(PublishTime) can take many specific forms and can be linear or nonlinear. As long as the basic principle is satisfied that the closer PublishTime is to the current time, the larger the value of f(PublishTime), the embodiments of the present invention place no limitation on the specific functional form of f(PublishTime).
In processing block 4, hot news is determined according to the calculation result of processing block 3 and is pushed to users through various channels such as microblog, webpage, and email. After the hot news is determined, it can be stored in the hot news access log so that users can retrieve it at any time.
For example, Fig. 5 is a schematic diagram of hot information display according to an embodiment of the present invention. Preferably, the pushed result also shows the specific source of each hot information item.
In an embodiment of the present invention, the relative hotness values among the information webpage sources are first calculated according to the access counts of the information webpage sources; the reprint weight of each reprinted item in each information webpage source that has reprinted it is then calculated according to the relative hotness values; the reprint weights of each reprinted item are summed to calculate its information hotness value; and hot information is determined from the reprinted items in order of information hotness value. It follows that, with the embodiments of the present invention, hot information can be generated automatically based on the information hotness values of items reprinted across the whole internet, which saves labor and reduces cost.
Moreover, embodiments of the present invention can meet the need to display any number of hot news items and support calculation over news from the entire internet. Through automatic mining of website authority, low-quality websites can be dynamically eliminated and high-quality websites reinforced, so that mining quality is continuously optimized.
The above are only preferred embodiments of the present invention and are not intended to limit the protection scope of the present invention. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.