CN1822000A - Method for automatic detecting news event - Google Patents

Publication number
CN1822000A
Authority
CN
China
Prior art keywords
report
incident
event
news
similarity
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200610007219.XA
Other languages
Chinese (zh)
Other versions
CN100461177C (en)
Inventor
路斌
杨霙
杨建武
万小军
吴於茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Peking University Founder Research and Development Center
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, and Peking University Founder Group Co Ltd
Priority to CNB200610007219XA (patent CN100461177C)
Publication of CN1822000A
Application granted
Publication of CN100461177C
Current legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention proposes a practical method for detecting news events. It introduces event ranking, event merging and adjustment, elimination of news reports, and news event description, which markedly improves news event detection results and makes the method more practical. The method can be widely applied in intelligent information processing.

Description

A method for automatically detecting news events
Technical field
The invention belongs to the field of intelligent information processing technology and specifically relates to a method for automatically detecting news events.
Background art
With the rapid development of the Internet, the volume of news information is growing explosively. How to obtain information about newly emerging hot news events in a timely manner from a continuous stream of news reports, and how to keep tracking the news events one is interested in, has become a research focus in recent years. Topic detection and tracking technology attempts to solve exactly this problem.
Research on topic detection and tracking (TDT) began in 1996. In "Topic Detection and Tracking (TDT) Pilot Study Final Report", James Allan and others, who initiated and took part in that research, defined the specific tasks and performance evaluation metrics of TDT and gave some experimental results from that time. The three main tasks of TDT are:
(1) News report segmentation task: divide the speech or text transcript of continuously broadcast radio and TV news programs into separate reports;
(2) Event detection task: identify events unknown to the system, and also identify the reports related to them;
(3) Event tracking task: monitor the stream of news reports to find new reports related to a given known event.
In addition, the report notes that the current research emphasis in TDT is the detection and tracking of events. A topic is a broader concept than an event, and one topic can contain multiple related events.
In essence, event detection clusters a stream of news reports by event: reports that discuss the same event must be grouped into one class (James Allan, 2002). Compared with ordinary text clustering, event detection is distinctive in two respects. First, the object processed by event detection is a stream of news reports that arrive in chronological order and change dynamically over time, rather than a static, closed text collection. Second, event detection clusters reports by the event they discuss rather than by subject category, so the information granularity it relies on is relatively fine, and event detection should therefore yield a larger number of classes. Nevertheless, text clustering remains the basis of event detection technology.
Depending on the detection scenario, event detection can be subdivided into retrospective detection and online detection. The purpose of retrospective detection is to find previously unidentified news topics in an existing collection of news reports; the system is required to output information about the news topics and to describe the association between news reports and topics. Online detection focuses on identifying new topics in a real-time stream of news reports in a timely manner, that is, identifying a news topic at the moment a report about the new topic appears.
Over the past few years, event detection researchers have tried many different text clustering methods, such as single-pass clustering, k-means clustering, hierarchical agglomerative clustering, and probabilistic models. Several major existing event detection methods are introduced below:
(1) The CMU method
Researchers at CMU (Yiming Yang et al.) mainly use a single-pass clustering algorithm with a time window for event detection. They represent every report and every event as a vector in a vector space. The similarity between a report vector and an event vector is computed mainly as the cosine of the angle between the vectors, but it is adjusted for time using a window, for which two strategies are available. The first strategy considers only the events that occur within the time window. The second strategy holds that the similarity between the current report s and an event c should be reduced as the number of reports between them increases.
In addition, at SIGKDD 2002, Yiming Yang et al. proposed a topic-based event detection method in the paper "Topic-conditioned novelty detection": a supervised learning algorithm first assigns the online document stream to predefined, broader subject categories, and new event detection is then performed on the document stream using the features of each topic.
(2) The University of Massachusetts method
Researchers at the University of Massachusetts (James Allan et al.) represent news reports with a vector model, and the core algorithm is again single-pass clustering. When computing the similarity between a report and an event, they use a time-based threshold model in which a linear function adjusts the clustering threshold, so that the further a news report is from an event in time, the harder it is for the report to join that event. When determining the event closest to the current report, they add a nearest-neighbor comparison strategy to the original centroid comparison strategy.
The centroid comparison strategy uses two thresholds, θmatch and θcertain. If the similarity between the current report and the centroid of an event is higher than θmatch, the report is assigned to that event. However, the centroid of the event, that is, the vector representation of the event, is adjusted with the current report only when the similarity between them is higher than θcertain. The nearest-neighbor comparison strategy first finds the k existing reports most similar to the current report, and then uses these k reports and a preset threshold to determine the event to which the current report should belong. If the report cannot be assigned to any known event, it is treated as the first report of a new event, and a new event is created for it.
In addition, James Allan et al. mention using the most frequently occurring words in an event as the event description.
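The two-threshold centroid strategy described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the University of Massachusetts implementation: the sparse dictionary vectors, the function names `cosine` and `assign`, the threshold values 0.3 and 0.5, and the centroid update by summing term weights are all hypothetical choices.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def assign(report, events, theta_match=0.3, theta_certain=0.5):
    """Assign a report to its closest event and return that event's index.

    events is a list of dicts {"centroid": {...}, "reports": [...]}.
    Threshold values are illustrative assumptions.
    """
    best, best_sim = None, 0.0
    for i, event in enumerate(events):
        s = cosine(report, event["centroid"])
        if s > best_sim:
            best, best_sim = i, s
    # Not similar enough to any known event: first report of a new event.
    if best is None or best_sim <= theta_match:
        events.append({"centroid": dict(report), "reports": [report]})
        return len(events) - 1
    events[best]["reports"].append(report)
    # Only a confident match is allowed to move the centroid.
    if best_sim > theta_certain:
        centroid = events[best]["centroid"]
        for term, w in report.items():
            centroid[term] = centroid.get(term, 0.0) + w
    return best
```

Keeping the centroid fixed for matches between the two thresholds means that borderline reports join an event without pulling its representation toward themselves.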
(3) The IBM method
A relatively successful event detection system developed by IBM uses a two-layer clustering strategy and compares two reports with a symmetric Okapi formula. In the first pass, the system provisionally assigns reports to different micro-events (microclusters); in the second pass, it takes these micro-events as the objects to be processed and groups them into larger classes, which are the final events (Dharanipragada et al., 2002). Both passes use the single-pass clustering algorithm; they differ only in the objects processed and in the thresholds chosen.
In summary, the steps commonly used for event detection in the prior art are as follows:
1) Read in a report from a data source, including its content, time, and other related information. There may be multiple data sources, and there may be no clear boundaries between reports, in which case preprocessing such as segmentation into individual reports is needed;
2) Using the centroid comparison or nearest-neighbor comparison strategy, compute the similarity between the report and events, or between the report and other reports, and determine the event closest to the current report;
3) If the report is assigned to an event, adjust that event; if the report cannot be assigned to any existing event, treat it as a newly detected event;
4) Output the detected events, using the highest-weighted feature words in an event, or the title of a representative report, as the event description.
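The four prior-art steps above amount to single-pass clustering with a keyword description. A minimal sketch follows, assuming whitespace tokenization, raw term counts as weights, and a single threshold of 0.3; the names `vectorize`, `single_pass`, and `describe` are hypothetical and do not come from any of the cited systems.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def single_pass(reports, threshold=0.3):
    """Process reports in arrival order; each joins its closest event or starts a new one."""
    events = []
    for idx, text in enumerate(reports):
        vec = vectorize(text)
        sims = [cosine(vec, ev["centroid"]) for ev in events]
        if sims and max(sims) >= threshold:
            ev = events[sims.index(max(sims))]
            ev["centroid"].update(vec)      # adjust the event with the new report
            ev["members"].append(idx)
        else:
            events.append({"centroid": Counter(vec), "members": [idx]})
    return events

def describe(event, k=3):
    """Describe an event by its k highest-weighted feature words."""
    return [word for word, _ in event["centroid"].most_common(k)]
```

For example, `single_pass(["earthquake rescue teams arrive", "rescue teams search earthquake rubble", "football final tonight"])` groups the first two reports into one event and puts the third in an event of its own.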
Because existing event detection technology considers only the false alarm rate and miss rate on a fixed, small, closed data set, it has the following defects:
(1) The event ranking problem
People's attention has become a scarce resource, and people often do not have time to look through a large number of news events. The hotter a news event is, the higher it should be ranked; only such a system can better meet people's needs. The prior art does not consider this problem and simply outputs the detected events.
(2) The event similarity problem
News reports that cover different aspects of the same news event may have low similarity to one another, so in the early stage of an event the same news event may be split into several small events. As the situation develops, the similarity between these events may grow larger and larger, which can confuse and inconvenience users when browsing. The prior art does not consider this problem either.
(3) The news report elimination problem
In a real application environment, event detection is a long-running, continuous process. As an event evolves dynamically, the relevance of some news reports in the event to that event gradually decreases. In addition, a long-running event may swell as reports accumulate over time, so that the content of the event as a whole becomes too broad. The prior art addresses the dynamic evolution of events by introducing a time window strategy and dynamically adjusting events, but it does not consider the elimination of reports.
(4) The event description problem
There are currently two ways of describing a news event: using the most important feature words in the event, or choosing the headline of one report in the event. Because natural language processing technology is not yet mature enough, the extracted feature words can hardly describe an event effectively; even the most important feature words in a news event, such as person names, place names, organization names, and times, may fail to be extracted, for example "Eleventh Five-Year Plan" and "Shenzhou VI". If the title of one report in the event is used as the description, then for a wide-ranging event the report may cover only one aspect of the event, and the description of the event is not comprehensive enough.
Summary of the invention
In view of the defects in the prior art, the object of the invention is to exploit the characteristics of news events themselves and, by solving the problems of event ranking, event merging and adjustment, news report elimination, and news event description, to achieve dynamic and efficient event detection on a continuous news stream.
To achieve this object, the technical solution adopted by the invention is a method for automatically detecting news events, comprising the following steps:
1) Read in a report from a data source and preprocess the report;
2) Compute the similarity between the report and detected events, or between the report and other reports, determine the event related to the current report, and assign the report to the related event;
3) If the report is assigned to an existing event, adjust that event; if the report cannot be assigned to any existing event, treat it as a newly detected event;
4) Compare the detected events pairwise, merge related events, and readjust the events and the similarity between reports and events;
5) Eliminate the reports in each event that do not satisfy the restrictive conditions, and adjust the event;
6) Compare the current number of events with the event window size; if the number of events is greater than the event window size, rank the events and eliminate events; otherwise go to step 7);
7) Output the detection results.
Further, to obtain a better effect, in step 1), if the similarity between a new report and a previously processed news report is greater than a preset threshold θd (the duplication threshold), the new report is considered a duplicate and news deduplication is required. The value range of θd is 0 < θd ≤ 1. Deduplication is carried out on the content of the news report using similarity calculation methods from text retrieval and text mining.
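The deduplication check can be sketched as below. The text only requires some content similarity measure from text retrieval or text mining, so the cosine over raw term counts, the whitespace tokenization, and the name `is_duplicate` are assumptions made for illustration; the default θd of 0.9 is the value quoted later in the embodiment.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_duplicate(new_text, seen_vectors, theta_d=0.9):
    """Return True if new_text repeats an earlier report; otherwise remember it.

    seen_vectors is the list of term-count vectors of reports already processed.
    """
    vec = Counter(new_text.lower().split())
    if any(cosine(vec, old) > theta_d for old in seen_vectors):
        return True
    seen_vectors.append(vec)
    return False
```

A linear scan over all earlier reports is used here only for brevity; a long-running system would likely need an index or a bounded history, which the text does not specify.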
In step 1), the news reports are first classified into preset categories using an automatic classification method.
When the news reports are classified automatically in step 1), rule-based classification by source is combined with content-based automatic classification, and the content-based automatic classification uses text classification technology. The text classification technology is a support vector machine algorithm based on the vector space model.
Further, to obtain a better effect, the centroid comparison or nearest-neighbor comparison strategy is used in step 2) when determining the event related to the current report. The similarity calculation can use existing text mining techniques: the document model is based on the vector space model, a probabilistic model, or a language model, and the similarity formula is the cosine of the included angle, the Hellinger distance formula, or the like. The similarity calculation takes into account both the temporal features of the report and the temporal features of the event.
When the similarity is calculated in step 2), the title of a report is given a higher weight, or a report with higher authority is given a higher weight; the authority of a report is taken to be the authority of its news source.
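One way to give the title a higher weight is to count title terms several times when building the report vector. The sketch below assumes whitespace tokenization and a title weight of 3.0; neither value nor the name `weighted_vector` comes from the text, and temporal features and source authority are left out.

```python
import math
from collections import Counter

def weighted_vector(title, body, title_weight=3.0):
    """Term-weight vector in which each title term counts title_weight times."""
    vec = Counter()
    for term in body.lower().split():
        vec[term] += 1.0
    for term in title.lower().split():
        vec[term] += title_weight
    return vec

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

With this weighting, a report that mentions an event's feature word in its title scores higher against that event than a report that mentions it only in the body.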
Further, to obtain a better effect, the similarity between events in step 4) is measured using the cluster similarity calculated in traditional clustering algorithms. If the similarity of two events is greater than the merging threshold θu, the two events are considered related and are merged. The value range of θu is 0 < θu ≤ 1. Event merging can also use other strategies; for example, if several feature words in the internal representations of two events are identical, the two events are considered highly similar and are merged.
Further, to obtain a better effect, the restrictive condition in step 5) can be a similarity threshold or a time limit, or an external limit such as report attention or user click frequency.
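A sketch of report elimination under a similarity threshold combined with a time limit is given below. The default values (0.30 and 30 days) mirror those quoted in the embodiment; the record layout with `sim` and `time` keys and the name `prune_reports` are assumptions, and external limits such as click frequency are not modeled.

```python
from datetime import datetime, timedelta

def prune_reports(reports, now, theta_c=0.30, max_age=timedelta(days=30)):
    """Split an event's reports into (kept, dropped).

    Each report is a dict with "sim" (similarity to its event, recomputed by
    the caller) and "time" (publication datetime).
    """
    kept, dropped = [], []
    for report in reports:
        fresh = now - report["time"] <= max_age
        similar = report["sim"] >= theta_c
        (kept if fresh and similar else dropped).append(report)
    return kept, dropped
```

After pruning, the caller would recompute the event's internal representation from the kept reports.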
Further still, steps 4) and/or 5) are performed again each time a user-specified number of new reports has been processed, each time a user-specified period of running time has elapsed, or each time the number of detected events has increased by a user-specified amount.
Further, to obtain a better effect, the event ranking in step 6) is calculated from the temporal and quantitative characteristics of the news events; for example, the number of reports newly added to an event within a recent time range (for example, 12 hours) is used as the event's score. In addition, several different rankings can be considered at the same time, for example the most recent 12 hours, 1 day, 3 days, 7 days, and 30 days, and an event is eliminated only when it is outside the event window in every ranking. In this way, the multiple rankings can provide the user with reference information at different granularities.
When the event ranking is calculated in step 6), the multiple ranking results of step 6) can be combined to output the particular ranking the user requires, or to output several rankings at once; for example, the user can ask to see the hottest events of the last day and of the last week at the same time.
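The multi-window ranking and elimination rule can be sketched as follows. The score used here is simply the number of reports that fall inside each time range, as in the example given above; the data layout (event name mapped to a list of report times) and the names `score` and `rank_and_eliminate` are assumptions made for illustration.

```python
from datetime import datetime, timedelta

WINDOWS = [timedelta(hours=12), timedelta(days=1), timedelta(days=3),
           timedelta(days=7), timedelta(days=30)]

def score(report_times, now, window):
    """Number of reports added to the event within the given time range."""
    return sum(1 for t in report_times if timedelta(0) <= now - t <= window)

def rank_and_eliminate(events, now, window_size, windows=WINDOWS):
    """Rank events once per time range and keep those in any top window_size.

    events maps an event name to the list of its report datetimes.
    Returns (rankings, survivors); events missing from survivors are outside
    the event window in every ranking and can be eliminated.
    """
    rankings, survivors = {}, set()
    for w in windows:
        ranked = sorted(events, key=lambda e: score(events[e], now, w), reverse=True)
        rankings[w] = ranked
        survivors.update(ranked[:window_size])
    return rankings, survivors
```

The per-range rankings are returned as well, so a caller can show, for example, the 1-day and 7-day lists side by side.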
Further, to obtain a better effect, when the detection results are output in step 7), a description is calculated for each current event. At the same time, the event scores are calculated from the temporal and quantitative characteristics and the events are ranked; the news events with higher scores are selected as highlight events, and the event descriptions and the lists of news reports they contain are output. The event description is generated as follows:
a) Select the user-specified number of feature words with the highest weight within the event;
b) According to the news report selection strategy, choose the title of the most representative news report in the event;
c) Combine a) and b) and output the description of the event.
The representative news report selection strategy in step 7) is a threshold strategy that combines related information such as the authority of the news source, the report's click-through rate, and the report time. The threshold strategy uses a preset event threshold θe, whose value range is 0 < θe ≤ 1; for example, among the news reports in the event whose similarity to the event is greater than the threshold, the title of the most recent news report is selected. Alternatively, the most relevant news reports are output according to a proportion specified by the user.
The effect of the invention is as follows. On the basis of full consideration of the characteristics of news events and of people's cognitive patterns, the invention provides practical solutions for event ranking, event merging and adjustment, news report elimination, and news event description in real applications. Experiments show that the method of the invention markedly improves the detection of news events and thereby greatly enhances its practicality.
The invention achieves this effect because it has the following features:
(1) For event ranking, a mechanism is introduced that calculates an importance score for each event at a given moment. The mechanism takes into account both the temporal and the quantitative characteristics of news events and thus gives each event a more reasonable score at that moment, which is used for ranking.
(2) For event similarity, a mechanism for merging and adjusting events is introduced to overcome the phenomenon of the same news event being mistakenly split into several small events. Each time a fixed number of news reports has been processed, the events are compared pairwise; if two events are judged to be highly similar according to the comparison strategy, they are merged and adjusted.
(3) For news reports, a mechanism for eliminating news reports within an event is introduced to overcome the phenomenon of news event content becoming too broad. Each time a fixed number of news reports has been processed, the news reports in each event are subjected to elimination.
(4) For event description, a method that combines feature words with a news report title is proposed to overcome the shortcomings of each. First, the feature words with the highest weight within the event are selected as one part of the event description; in addition, according to the report selection strategy, the most representative news report in the event is chosen and its title is used as another part of the event description.
Description of drawings
Fig. 1 is a flowchart of the invention;
Fig. 2 is a schematic diagram of the news events detected by the existing method for the period from July 22, 2005 to August 9, 2005;
Fig. 3 is a schematic diagram of the news events detected by the method of the invention for the period from July 22, 2005 to August 9, 2005;
Fig. 4 is a screenshot of the top news on the Sina website on August 9, 2005;
Fig. 5 is a schematic diagram of the news events detected by the existing method for the period from July 22, 2005 to October 9, 2005;
Fig. 6 is a schematic diagram of the news events detected by the method of the invention for the period from July 22, 2005 to October 9, 2005;
Fig. 7 is a screenshot of the top news on the Sina website on October 9, 2005.
Embodiment
The invention is further described below with reference to the drawings and an embodiment:
As shown in Fig. 1, a method for automatically detecting news events comprises the following steps:
1) Read in a report from a data source. Multiple online news data sources (for example the Sina website, www.xinhuanet.com, and People's Net) are monitored continuously, and news reports are crawled from the network automatically. The time, title, body text, and other information of each news report are parsed; if no time is found in the report, the crawl time is used instead;
Because there is considerable duplication among multiple data sources, newly crawled news reports are deduplicated according to their text content. If the degree of duplication between a new report and a previously processed news report is greater than the duplication threshold θd, the new report is considered a duplicate; the duplication threshold θd is set to 0.9 in this embodiment;
Because the range of news reports is too broad, rule-based classification by source is combined with content-based automatic classification to classify the news reports (the categories are preset; for example, following the channels of the Sina website, they can be news, science and technology, finance, sports, and so on). Rule-based classification classifies reports by news source, author, and the like; for example, content from Sina's "home news" channel is assigned to the "home news" category, and content from the "science and technology" channel of www.xinhuanet.com is assigned to the "science and technology" category. Content-based automatic classification uses the vector space model and a support vector machine algorithm to classify news reports automatically according to their content and title. Steps 2) to 7) are then carried out within the category c to which the report belongs;
2) Using the centroid comparison strategy, compare the report with the existing detected news events in its category c, considering temporal and content features together. Calculate the similarity between the report and each event, and record the maximum similarity S_max and the event Es with the maximum similarity, thereby determining the event closest to the current report. An event itself is represented by the feature words with the highest combined weight across all news reports in the event. The similarity between a news report and an event is calculated as the cosine of the angle between the two based on the vector space model, with the title of the news report given a higher weight.
3) According to the maximum similarity S_max and the most similar event Es calculated in step 2), handle the current report as follows:
a) If S_max is less than the novelty threshold θn (0.25 in this embodiment): create a new event in the category to which the report belongs;
b) If S_max is greater than θn and less than the clustering threshold θc (0.30 in this embodiment): do nothing and return to step 1);
c) If S_max is greater than θc and less than the contribution threshold θt (0.35 in this embodiment): assign the report to the event Es;
d) If S_max is greater than θt: assign the report to the event Es and adjust Es;
The values of S_max, θn, θc, and θt are all greater than 0 and less than or equal to 1.
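The four-way decision in step 3) can be written as a small function. The threshold values are those of this embodiment; the function name `decide` and the returned labels are hypothetical. The text uses strict inequalities and does not say what happens when S_max equals a threshold exactly, so the sketch assigns each boundary value to the higher band as an assumption.

```python
THETA_N = 0.25  # novelty threshold
THETA_C = 0.30  # clustering threshold
THETA_T = 0.35  # contribution threshold

def decide(s_max):
    """Map the maximum report-to-event similarity to the action of step 3)."""
    if s_max < THETA_N:
        return "new_event"          # a) create a new event in the category
    if s_max < THETA_C:
        return "ignore"             # b) do nothing, read the next report
    if s_max < THETA_T:
        return "assign"             # c) assign to Es
    return "assign_and_adjust"      # d) assign to Es and adjust Es
```

The band between θn and θc acts as a buffer: such a report is neither novel enough to start an event nor close enough to join one.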
4) Each time a category has processed the user-specified fixed number of new reports (20 in this embodiment), compare the news events in that category pairwise. If the similarity of two events is greater than the merging threshold θu (for example 0.20), merge them. The similarity between events can be calculated with the methods used for the similarity of two clusters in traditional clustering algorithms; for example, based on the vector space model, taking into account the pairwise similarity between all news reports in the two events, the following formula is used:
$$\mathrm{Sim}(E_1, E_2) = \frac{\sum_{d_i \in E_1} \sum_{d_j \in E_2} \mathrm{sim}(d_i, d_j)}{|E_1| \cdot |E_2|}$$
where E_1 and E_2 are two detected news events, d_i and d_j are news reports in E_1 and E_2 respectively, sim(d_i, d_j) is the similarity between the two news reports, and |E_1| and |E_2| are the numbers of news reports contained in the two events;
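The average pairwise similarity above and the merge test against θu can be sketched as follows. The report similarity function is passed in as a parameter; the Jaccard similarity over word sets used in the usage example, the names `event_similarity` and `merge_pass`, and the choice to repeat merging until no pair exceeds the threshold are all assumptions for illustration.

```python
def event_similarity(e1, e2, sim):
    """Average of sim(d_i, d_j) over all report pairs drawn from e1 and e2."""
    if not e1 or not e2:
        return 0.0
    total = sum(sim(di, dj) for di in e1 for dj in e2)
    return total / (len(e1) * len(e2))

def merge_pass(events, sim, theta_u=0.20):
    """Merge event pairs whose similarity exceeds theta_u until none remain.

    events is a list of lists of reports; a new list of merged events is returned.
    """
    events = [list(e) for e in events]
    merged = True
    while merged:
        merged = False
        for i in range(len(events)):
            for j in range(i + 1, len(events)):
                if event_similarity(events[i], events[j], sim) > theta_u:
                    events[i].extend(events.pop(j))
                    merged = True
                    break
            if merged:
                break
    return events

def jaccard(a, b):
    """Illustrative report similarity over word sets (an assumption)."""
    return len(a & b) / len(a | b) if a | b else 0.0
```

Called as `merge_pass([[{"war", "anniversary"}], [{"war", "anniversary", "parade"}], [{"football"}]], jaccard)`, the first two events are merged and the third is left alone.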
5) Each time a category has processed the user-specified fixed number of new reports (20 in this embodiment), eliminate news reports from each event: recalculate the similarity between each news report and its event, and eliminate the news reports whose similarity is below the clustering threshold θc or which do not satisfy the restrictive condition (for example, whether the report falls within the last 30 days). Then recalculate the internal representation of the event and its weights;
6) If the number of events in the current category exceeds the event window size, rank all news events in the category: calculate the score of each news event from its temporal and quantitative characteristics and rank the events accordingly. Several different rankings are considered at the same time when the scores are calculated, namely the most recent 12 hours, 1 day, 3 days, 7 days, 30 days, and so on, and an event is eliminated only when it is outside the event window in every ranking. In this way, the multiple rankings provide the user with reference information at different granularities. The system eliminates the news events that are not in the event window in order to improve processing efficiency;
7) Output the detection results according to the user's requirements: calculate a description for each current event in the category. At the same time, using the temporal characteristics of the events and the number of news reports they contain, select for every category the several highest-scoring news events as the hottest news events of that category, and output the event descriptions and the lists of news reports they contain. The event description is generated as follows:
a) Read the feature words with the highest weight within the event;
b) Among the news reports in the event whose similarity to the event is greater than the event threshold θe (0.6 in this embodiment), select the title of the most recent news report. The event threshold can also be applied as a proportion (20%);
c) Combine a) and b) and output the description of the event.
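The description generation in a) to c) can be sketched as follows. The default θe of 0.6 is the value of this embodiment; the record layout, the default of three keywords, the name `describe_event`, and the fallback to no title when no report exceeds θe are assumptions, and the proportional variant of the threshold is not shown.

```python
from datetime import datetime

def describe_event(feature_weights, reports, k=3, theta_e=0.6):
    """Combine top feature words with the title of a representative report.

    feature_weights maps a feature word to its weight within the event.
    reports is a list of dicts with "title", "sim" (similarity to the event)
    and "time" (publication datetime).
    """
    # a) the k feature words with the highest weight
    ranked = sorted(feature_weights.items(), key=lambda kv: kv[1], reverse=True)
    keywords = [word for word, _ in ranked[:k]]
    # b) the most recent report among those similar enough to the event
    candidates = [r for r in reports if r["sim"] > theta_e]
    title = max(candidates, key=lambda r: r["time"])["title"] if candidates else None
    # c) combine both parts
    return {"keywords": keywords, "title": title}
```

Restricting the title choice to reports above θe keeps a recent but only loosely related report from becoming the face of the event.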
To verify the effectiveness of the invention, we ran a test on 100,000 news reports crawled between 2005-7-22 and 2005-10-9 from some channels (news, science and technology, sports, and so on) of websites such as the Sina website, www.xinhuanet.com, and People's Net. The 100,000 news reports were divided into 3 broad categories: news, science and technology, and sports. The evaluation metric is the detection rate of important news events (because the top news column of the Sina news channel is compiled by human editors, the top news column of the Sina news channel for the same period is taken as the expert result for comparison). We use the "news" category as an example to illustrate the test results, which are shown in Fig. 2 to Fig. 7.
Fig. 2 to Fig. 7 compare the top 10 important news events detected in the news category at the detection cutoff time by the method of the invention and by the traditional method (the number of related news reports detected is given in parentheses) with the list of important news events in the top news column of the Sina news channel at 21:00 on the same day. The detection period for Fig. 2 to Fig. 4 is July 22, 2005 to August 9, 2005, and the detection period for Fig. 5 to Fig. 7 is July 22, 2005 to October 9, 2005. The traditional method is the single-pass clustering algorithm used by Yiming Yang et al.: events are ranked in reverse order of detection (the most recently detected event is listed first), events are eliminated using the event window (every event ranked outside the event window is eliminated), and events are described using the keyword description method proposed by James Allan et al.
As Fig. 2 to Fig. 7 show, the method proposed by the present invention outperforms the traditional method in the following respects:
1. Event ranking is more reasonable. As Fig. 2 to Fig. 7 show, the top ten events of the proposed method reach detection rates of 62.5% and 57%, respectively, against Sina's main topics of the same day;
2. Fewer cases of a single event being wrongly split into several small events. Events 3 to 6 in Fig. 2 all concern the commemoration of the 60th anniversary of victory in the War of Resistance Against Japan; the traditional method splits them into several events, whereas the proposed method merges them into the 4th event in Fig. 3;
3. News events are described more accurately and comprehensively. For the "Shenzhou VI" event, for example, the description of the 3rd event in Fig. 5 is more accurate and comprehensive than keywords alone or a representative headline alone.
In addition, because the proposed method introduces a mechanism for eliminating news reports within a news event, the content of each news event is more focused.
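A rough sketch of such report elimination within one event follows. It assumes two of the restrictive conditions named in the claims, a similarity threshold against the event centroid and a time limit; the names `prune_event`, `min_sim` and `max_age` and their default values are hypothetical.

```python
import math
from collections import Counter
from datetime import datetime, timedelta


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def prune_event(reports, now, min_sim=0.5, max_age=timedelta(days=7)):
    """Drop reports that are off-topic or too old, then rebuild the centroid.

    reports: list of (timestamp, Counter of terms) belonging to one event.
    """
    centroid = Counter()
    for _, vec in reports:
        centroid.update(vec)
    kept = [
        (ts, vec)
        for ts, vec in reports
        if cosine(vec, centroid) >= min_sim and now - ts <= max_age
    ]
    new_centroid = Counter()
    for _, vec in kept:
        new_centroid.update(vec)  # the event is adjusted after elimination
    return kept, new_centroid
```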
The experiments show that the traditional method, which considers only the false alarm rate and miss rate on a fixed, small, closed data set, has many shortcomings in real application environments. The proposed method takes full account of the characteristics of how news events occur and of human cognitive patterns, so that the detection of news events is noticeably improved and its practicality greatly enhanced.
In practice, content-based automatic classification may also use other text classification techniques, for example a KNN algorithm based on a language model; in step 2), a centroid comparison strategy may also be used when determining the event closest to the current report. The method of the present invention is therefore not limited to the examples described in the embodiments; other embodiments derived by those skilled in the art from the technical solution of the present invention likewise fall within the scope of technical innovation of the present invention.
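The difference between the centroid comparison strategy and a nearest-neighbor comparison can be sketched as follows. This is only an illustration under simple assumptions: term-count vectors, cosine similarity (the Hellinger distance is not shown), and hypothetical function names.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def centroid_similarity(report, event_reports):
    """Compare the report with the summed vector of all reports in the event."""
    centroid = Counter()
    for vec in event_reports:
        centroid.update(vec)
    return cosine(report, centroid)


def nearest_neighbor_similarity(report, event_reports):
    """Compare the report with its most similar single report in the event."""
    return max((cosine(report, vec) for vec in event_reports), default=0.0)


def closest_event(report, events, strategy):
    """Return (index, similarity) of the closest event, or (None, 0.0) if none exist."""
    if not events:
        return None, 0.0
    scores = [strategy(report, ev) for ev in events]
    best = max(range(len(events)), key=scores.__getitem__)
    return best, scores[best]
```

Centroid comparison costs one similarity computation per event, while nearest-neighbor comparison costs one per stored report but can follow an event whose wording drifts over time.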

Claims (15)

1. A method for automatically detecting news events, comprising the following steps:
1) reading in a report from a data source and preprocessing the report;
2) calculating the similarity between the report and detected events, or between the report and other reports, determining the event related to the current report, and assigning the report to the related event;
3) if the report is assigned to an existing event, adjusting that event; if the report cannot be assigned to any existing event, treating it as a newly detected event;
4) comparing the detected events pairwise, merging related events, and readjusting the similarities between events and between reports and events;
5) eliminating the reports in each event that do not satisfy the restrictive conditions, and adjusting the event;
6) comparing the current number of events with the event window size; if the number of events is greater than the event window size, ranking the events and eliminating events; otherwise proceeding to step 7);
7) outputting the detection results.
2. the method for a kind of automatic detection media event as claimed in claim 1, it is characterized in that: in the step 1), if new report and treated before news report similarity promptly repeat threshold value greater than pre-set threshold θ d, then think the news report of repetition, need disappear to news report and heavily handle, described θ d span is 0<θ d≤1, and the described heavily processing that disappears is to carry out according to the content employing text retrieval of news report and the similarity calculating method in the text mining.
3. the method for a kind of automatic detection media event as claimed in claim 1 or 2 is characterized in that: in the step 1), adopt the method for classification automatically that news report is classified by pre-set classification earlier.
4. the method for a kind of automatic detection media event as claimed in claim 3 is characterized in that:
Adopting the method for classification automatically that news report is carried out the branch time-like in the step 1), is the method that adopts rule classification and content-based automatic classification based on the source to combine, and content-based automatic classification is the text classification technology that adopts.
5. the method for a kind of automatic detection media event as claimed in claim 4 is characterized in that: described text classification technology is based on the algorithm of support vector machine of vector space model.
6. the method for a kind of automatic detection media event as claimed in claim 1, it is characterized in that: step 2) in employing barycenter comparison or arest neighbors comparison strategy when determining the incident relevant with current report, similarity calculating method is the technology that adopts text mining, and document model is based on vector space model, probability model or language model; The similarity formula is to adopt included angle cosine or Hellinger range formula; Similarity is calculated also and is considered in conjunction with the temporal characteristics of report and the temporal characteristics of incident.
7. the method for a kind of automatic detection media event as claimed in claim 6 is characterized in that:
Step 2) carrying out similarity when calculating in, the title in reporting is with higher weight, perhaps for the higher report of authority with higher weights, the authority of report adopts the authority of news sources.
8. the method for a kind of automatic detection media event as claimed in claim 1 is characterized in that: the measurement of similarity between the incident described in the step 4) is to adopt the cluster similarity value of calculating in traditional clustering algorithm; If the similarity of two incidents is greater than merging threshold value θ u, it is relevant then to be considered as two incidents, and with its merging, described θ u span is 0<θ u≤1; Perhaps, if the certain characteristics speech is identical in the internal representation of two incidents, it is higher then to be considered as similarity, merges this two incidents.
9. the method for a kind of automatic detection media event as claimed in claim 1 is characterized in that: the restrictive condition described in the step 5) is similarity threshold, time restriction or outside limits.
10. The method for automatically detecting news events as claimed in claim 8 or 9, characterized in that: the operations of steps 4) and/or 5) are performed again each time a user-specified number of new reports has been processed, or each time a user-specified period of running time has elapsed, or each time the number of detected events has increased by a user-specified amount.
11. The method for automatically detecting news events as claimed in claim 1, characterized in that: in step 6), a score is calculated for each news event from its temporal and quantitative characteristics and the events are ranked; the system retains only a fixed number of news events, and the lower-ranked news events are eliminated.
12. The method for automatically detecting news events as claimed in claim 11, characterized in that: when event ranking is calculated in step 6), the temporal and quantitative characteristics of the news events are combined; several rankings over different time periods are considered simultaneously, and an event is eliminated only when it falls outside the event window in every ranking.
13. The method for automatically detecting news events as claimed in claim 11 or 12, characterized in that: when event ranking is calculated in step 6), the multiple ranking results of step 6) are combined, and either a particular ranking that meets the user's requirements is output or several rankings are output simultaneously.
14. the method for a kind of automatic detection media event as claimed in claim 13 is characterized in that: during step 7) output testing result,, calculate event description for current all incidents; Simultaneously, the time response of binding events and quantitative characteristics sort to incident, and select the higher media event of score as the highlight incident, the news report tabulation that outgoing event is described and comprised, and wherein, the generative process of event description is as follows:
A) the feature speech of the user institute quantification that the inner weight of selection incident is the highest;
B), choose the title of the most representative one piece of news report in this incident according to the news report selection strategy;
C) comprehensive a) and b), the description of exporting this incident.
15. the method for a kind of automatic detection media event as claimed in claim 14, it is characterized in that: the described news report selection strategy in the step b) is authority, report clicking rate, the threshold strategies of report time in conjunction with source of news, described threshold strategies is predefined event threshold θ e, and described θ e span is 0<θ e≤1.
CNB200610007219XA 2006-02-14 2006-02-14 Method for automatic detecting news event Expired - Fee Related CN100461177C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200610007219XA CN100461177C (en) 2006-02-14 2006-02-14 Method for automatic detecting news event

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200610007219XA CN100461177C (en) 2006-02-14 2006-02-14 Method for automatic detecting news event

Publications (2)

Publication Number Publication Date
CN1822000A true CN1822000A (en) 2006-08-23
CN100461177C CN100461177C (en) 2009-02-11

Family

ID=36923366

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200610007219XA Expired - Fee Related CN100461177C (en) 2006-02-14 2006-02-14 Method for automatic detecting news event

Country Status (1)

Country Link
CN (1) CN100461177C (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101546359A (en) * 2009-04-28 2009-09-30 上海银晨智能识别科技有限公司 Human body biological information sorting system and sorting method
CN101231640B (en) * 2007-01-22 2010-09-22 北大方正集团有限公司 Method and system for automatically computing subject evolution trend in the internet
CN102622378A (en) * 2011-01-30 2012-08-01 北京千橡网景科技发展有限公司 Method and device for detecting events from text flow
CN102945246A (en) * 2012-09-28 2013-02-27 北界创想(北京)软件有限公司 Method and device for processing network information data
CN103020251A (en) * 2012-12-20 2013-04-03 人民搜索网络股份公司 Automatic mining system and method of news events in large-scale data
CN103077190A (en) * 2012-12-20 2013-05-01 人民搜索网络股份公司 Hot event ranking method based on order learning technology
CN103116651A (en) * 2013-03-05 2013-05-22 南京理工大学常熟研究院有限公司 Public sentiment hot topic dynamic detection method
CN103164427A (en) * 2011-12-13 2013-06-19 ***通信集团公司 Method and device of news aggregation
CN104636461A (en) * 2015-02-06 2015-05-20 北京中搜网络技术股份有限公司 Dynamic event clustering and extracting method based on KNN
CN105046497A (en) * 2007-11-14 2015-11-11 潘吉瓦公司 Evaluating public records of supply transactions
CN106021063B (en) * 2016-05-09 2018-05-29 北京蓝海讯通科技股份有限公司 Method, application and the system of polymerization events message
CN109299266A (en) * 2018-10-16 2019-02-01 中国搜索信息科技股份有限公司 A kind of text classification and abstracting method for Chinese news emergency event
CN109376231A (en) * 2018-09-29 2019-02-22 杭州凡闻科技有限公司 A kind of media hotspot tracking and system
US10430846B2 (en) 2007-11-14 2019-10-01 Panjiva, Inc. Transaction facilitating marketplace platform
US10949450B2 (en) 2017-12-04 2021-03-16 Panjiva, Inc. Mtransaction processing improvements
CN112926298A (en) * 2021-03-02 2021-06-08 北京百度网讯科技有限公司 News content identification method, related device and computer program product
US11514096B2 (en) 2015-09-01 2022-11-29 Panjiva, Inc. Natural language processing for entity resolution
US11551244B2 (en) 2017-04-22 2023-01-10 Panjiva, Inc. Nowcasting abstracted census from individual customs transaction records

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5442778A (en) * 1991-11-12 1995-08-15 Xerox Corporation Scatter-gather: a cluster-based method and apparatus for browsing large document collections
EP1324229A3 (en) * 2001-12-27 2006-02-01 Ncr International Inc. Using point-in-time views to provide varying levels of data freshness
CN1710563A (en) * 2005-07-18 2005-12-21 北大方正集团有限公司 Method for detecting and abstracting importent new case

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231640B (en) * 2007-01-22 2010-09-22 北大方正集团有限公司 Method and system for automatically computing subject evolution trend in the internet
CN105046497A (en) * 2007-11-14 2015-11-11 潘吉瓦公司 Evaluating public records of supply transactions
US10885561B2 (en) 2007-11-14 2021-01-05 Panjiva, Inc. Transaction facilitating marketplace platform
US10504167B2 (en) 2007-11-14 2019-12-10 Panjiva Inc. Evaluating public records of supply transactions
US10430846B2 (en) 2007-11-14 2019-10-01 Panjiva, Inc. Transaction facilitating marketplace platform
CN101546359A (en) * 2009-04-28 2009-09-30 上海银晨智能识别科技有限公司 Human body biological information sorting system and sorting method
CN102622378A (en) * 2011-01-30 2012-08-01 北京千橡网景科技发展有限公司 Method and device for detecting events from text flow
CN103164427B (en) * 2011-12-13 2016-03-02 ***通信集团公司 News Aggreagation method and device
CN103164427A (en) * 2011-12-13 2013-06-19 ***通信集团公司 Method and device of news aggregation
CN102945246B (en) * 2012-09-28 2015-12-02 北界创想(北京)软件有限公司 The disposal route of network information data and device
CN102945246A (en) * 2012-09-28 2013-02-27 北界创想(北京)软件有限公司 Method and device for processing network information data
CN103077190A (en) * 2012-12-20 2013-05-01 人民搜索网络股份公司 Hot event ranking method based on order learning technology
CN103020251A (en) * 2012-12-20 2013-04-03 人民搜索网络股份公司 Automatic mining system and method of news events in large-scale data
CN103116651A (en) * 2013-03-05 2013-05-22 南京理工大学常熟研究院有限公司 Public sentiment hot topic dynamic detection method
CN104636461A (en) * 2015-02-06 2015-05-20 北京中搜网络技术股份有限公司 Dynamic event clustering and extracting method based on KNN
US11514096B2 (en) 2015-09-01 2022-11-29 Panjiva, Inc. Natural language processing for entity resolution
CN106021063B (en) * 2016-05-09 2018-05-29 北京蓝海讯通科技股份有限公司 Method, application and the system of polymerization events message
US11551244B2 (en) 2017-04-22 2023-01-10 Panjiva, Inc. Nowcasting abstracted census from individual customs transaction records
US10949450B2 (en) 2017-12-04 2021-03-16 Panjiva, Inc. Mtransaction processing improvements
CN109376231A (en) * 2018-09-29 2019-02-22 杭州凡闻科技有限公司 A kind of media hotspot tracking and system
CN109299266A (en) * 2018-10-16 2019-02-01 中国搜索信息科技股份有限公司 A kind of text classification and abstracting method for Chinese news emergency event
CN109299266B (en) * 2018-10-16 2019-11-12 中国搜索信息科技股份有限公司 A kind of text classification and abstracting method for Chinese news emergency event
CN112926298A (en) * 2021-03-02 2021-06-08 北京百度网讯科技有限公司 News content identification method, related device and computer program product

Also Published As

Publication number Publication date
CN100461177C (en) 2009-02-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220908

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: PEKING University FOUNDER R & D CENTER

Patentee after: Peking University

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PEKING University FOUNDER R & D CENTER

Patentee before: Peking University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090211