CN103455639A - Method and device for recognizing microblog burst hotspot events - Google Patents

Method and device for recognizing microblog burst hotspot events

Info

Publication number
CN103455639A
Authority
CN
China
Prior art keywords
label
topic
degree
probability
sigma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310452806XA
Other languages
Chinese (zh)
Inventor
崔安颀
张敏
刘奕群
马少平
金奕江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201310452806XA
Publication of CN103455639A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for recognizing microblog burst hotspot events, which overcome shortcomings of the prior art such as analysing trend changes in isolation from content, or discovering events entirely on the basis of content. The method includes: extracting the microblog topic labels of all hotspot events and recording the publish time, author information and hotspot degree of each topic label; computing three measurement values for each topic label; and judging whether each hotspot event is a burst event according to the magnitudes of its three measurement values. The hotspot degree is the number of occurrences of the topic label within different time periods. The three measurement values of each topic label are the instability degree, the online topic possibility degree and the label author information entropy.

Description

A method and device for recognizing microblog burst hotspot events
Technical field
The present invention relates to the field of intelligent processing of network information, and in particular to a method and device for recognizing microblog burst hotspot events.
Background art
Research on hotspot event discovery has mainly modelled "burstiness": burst points in an information stream are usually detected by analysing changes in a time series. Such methods consider only the trend of the information and analyse the stream as a whole from a mathematical standpoint. The classic work is Jon Kleinberg's automaton-based hierarchical model of information streams (e-mail, news, etc.). As time passes, a marked increase or decrease in the observed quantity moves the automaton into a different state, and the sequence of state transitions can be built into a tree-shaped hierarchical model, so bursts can be detected by tracking state changes. Building on this, Ahmed et al. construct a graph from the states of different content and use the connections between nodes to carry out state transitions, thereby characterising different transition patterns and finding bursty topics. All of these methods look for burst points in a numerical sequence. Yang et al., by contrast, analyse the patterns of temporal variation in online media, particularly microblogs, and use a clustering algorithm to group similar patterns, which allows different topics to be identified. Unlike the automaton model, this approach examines the temporal patterns of many elementary units (words or phrases) and clusters those patterns, so it can identify multiple bursty topics and does not rely on numerical information alone.
To capture the semantics of burst events, Figueiredo et al. and Zubiaga et al. use word features or metadata (such as the category or time of a message) as measures of information; introducing these content features allows the degree of information diffusion to be measured more accurately. Going further, Tu et al., Pervin et al., Mathioudakis et al. and Pavlyshenko treat words as the building blocks of topics and events, which are then represented in a vector space of words. Under this representation, clustering or topic analysis of the instances yields content-based topic detection results. All of these methods require the text to be segmented into words, which are then used as features to form topics through clustering, topic models or frequent itemset mining. The representation of any event or topic therefore depends entirely on the word features of the text. If the same event is worded differently, the discovered topics become scattered. In particular, when people discuss one event from different angles or in terms of different factors (time, place, people and so on), a word-based representation may identify several distinct topics, splitting one event into many and causing its influence to be underestimated.
In summary, existing work on hotspot event discovery either ignores event content and analyses only numerical trends, or depends entirely on the text (words) of the content and lacks flexibility.
Summary of the invention
(1) Technical problem to be solved
The object of the present invention is to provide a method and device for recognizing microblog burst hotspot events, so as to overcome shortcomings of the prior art such as analysing trend changes in isolation from content, or discovering events by relying entirely on content.
(2) Technical solution
To solve the above technical problem, the present invention provides a method for recognizing microblog burst hotspot events, comprising:
Extracting the microblog topic labels of all hotspot events, and recording the publish time, author information and hotspot degree of each topic label, where the hotspot degree is the number of occurrences in different time periods;
Computing three measurement values for each topic label, where the three measurement values are the instability degree, the online topic possibility degree and the label author information entropy;
Judging, according to the magnitudes of the three measurement values, whether the corresponding hotspot event is a burst event.
Preferably, judging whether the corresponding hotspot event is a burst event according to the magnitudes of the three measurement values comprises:
Judging whether the instability degree is greater than a first threshold, whether the online topic possibility degree is less than a second threshold, and whether the label author information entropy is greater than a third threshold;
If so, determining that the corresponding hotspot event is a burst event;
If not, determining that the corresponding hotspot event is a non-burst event.
Preferably, the instability degree is calculated by the following formula:
$$\mathrm{Inst}(\mathrm{hashtag}) = \frac{1}{n} \sum_{\forall x:\ \tilde{P}(x) < p} \mathrm{Inst}(x),$$
where n is the number of days used for normalisation, i.e. the time period covered by the corpus; $\tilde{P}(x)$ is the occurrence probability of the unstable point x; p is the tolerance probability specified in advance; and Inst(x) is the instability degree of the unstable point x, defined by the following formula:
$$\mathrm{Inst}(x) = \log\left(\frac{1}{\tilde{P}(x) + \varepsilon}\right),$$
where ε > 0 is a small real number used to avoid division by zero.
Preferably, $\tilde{P}(x)$ is calculated by the following formula:
[Figure BDA0000389063710000035: formula image not reproduced in this text]
where 2μ − x is the point symmetric to x about μ along the x-axis, and F(·) denotes the cumulative distribution function, calculated by the following formula:
$$F(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt,$$
where μ is the expectation of the distribution and σ² is its variance.
Preferably, the expectation μ and the variance σ² of the distribution are calculated by the following formulas respectively:
$$\hat{\mu}_i = \bar{X}_i = \frac{1}{7} \sum_{k=i-7}^{i-1} x_k,$$
$$\hat{\sigma}_i^2 = S_i^2 = \frac{1}{7-1} \sum_{k=i-7}^{i-1} \left(x_k - \bar{X}_i\right)^2,$$
where μ is estimated by $\hat{\mu}_i$ and σ² is estimated by $\hat{\sigma}_i^2$.
Preferably, computing the online topic possibility degree of the topic label comprises:
Computing the use-position probability and the character-composition probability of the topic label;
Multiplying the use-position probability by the character-composition probability to obtain the online topic possibility degree of the topic label.
Preferably, the use-position probability of the topic label is calculated by the following formula:
[Figure BDA0000389063710000043: formula image not reproduced in this text]
where $\Pr_{pos}(h)$ is the use-position probability of the topic label h, and |A| denotes the number of elements in the set A.
Preferably, the character-composition probability of the topic label is calculated by the following formula:
$$\Pr_{word}(h) = 1 - \frac{N}{L},$$
where $\Pr_{word}(h)$ is the character-composition probability of the topic label h, L is the length of the label, and N is the number of words that compose it.
Preferably, the label author information entropy is calculated by the following formula:
$$\mathrm{Ent}(\mathrm{hashtag}) = -\sum_{i=1}^{k} \frac{c_i}{n} \cdot \log\left(\frac{c_i}{n}\right),$$
where Ent(hashtag) is the author information entropy of the label hashtag, k is the number of authors, $c_i$ is the number of microblogs published by the i-th author, and $n = \sum c_i$.
To solve the above technical problem, the present invention also adopts another technical solution: a device for recognizing microblog burst hotspot events, comprising:
An extraction unit, configured to extract the microblog topic labels of all hotspot events and to record the publish time, author information and hotspot degree of each topic label, where the hotspot degree is the number of occurrences in different time periods;
A computing unit, configured to compute three measurement values for each topic label, where the three measurement values are the instability degree, the online topic possibility degree and the label author information entropy;
A judging unit, configured to judge, according to the magnitudes of the three measurement values, whether the corresponding hotspot event is a burst event.
Specifically, the judging unit comprises:
A judgment sub-unit, configured to judge whether the instability degree is greater than a first threshold, whether the online topic possibility degree is less than a second threshold, and whether the label author information entropy is greater than a third threshold;
A first determination unit, configured to determine that the corresponding hotspot event is a burst event when the instability degree is greater than the first threshold, the online topic possibility degree is less than the second threshold and the label author information entropy is greater than the third threshold;
A second determination unit, configured to determine that the corresponding hotspot event is a non-burst event when the instability degree is not greater than the first threshold, the online topic possibility degree is not less than the second threshold and the label author information entropy is not greater than the third threshold.
(3) Beneficial effects
Unlike the background art, the present invention proposes a method for recognizing microblog burst hotspot events that takes microblog topic labels as clues. Microblog hotspot events are bursty, closely tied to social events and widely propagated; the invention addresses these characteristics and, using the way topic labels reflect the subject of a text, realises an event discovery process based on microblog topic labels. The invention measures three aspects of a popular label: its instability degree, its online topic possibility degree and the information entropy of its authors. With these three dimensions as a basis, the space of microblog topic labels is partitioned, and hotspot events are discovered by classifying popular labels. Compared with traditional methods, the invention overcomes shortcomings such as analysing trend changes in isolation from content, or relying entirely on content for event discovery. A label reflects content without depending on it, so the invention can use labels as indicators of text content and subject while remaining unconstrained by vocabulary or language, and can capture microblogs that describe the same event from multiple angles.
Brief description of the drawings
Fig. 1 is the first schematic diagram of the method of Embodiment 1 for recognizing microblog burst hotspot events;
Fig. 2 is the second schematic diagram of the method of Embodiment 1 for recognizing microblog burst hotspot events;
Fig. 3 shows the three views of the label space of the present invention and the positions of example labels in that space;
Fig. 4 is the first module diagram of the device of Embodiment 2 for recognizing microblog burst hotspot events;
Fig. 5 is the second module diagram of the device of Embodiment 2 for recognizing microblog burst hotspot events.
Detailed description of the embodiments
The specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the invention and are not intended to limit its scope.
Embodiment 1
Referring to Fig. 1, this embodiment provides a method for recognizing microblog burst hotspot events. In the present invention, an Internet corpus can serve as the data source from which microblog topic labels are extracted. First, in step S10, the microblog topic labels of all hotspot events are extracted, and the publish time, author information and hotspot degree of each topic label are recorded; the hotspot degree is the number of occurrences in different time periods. In some embodiments, a microblog topic label is a word or phrase that begins with the # symbol; in a specific embodiment such as Sina Weibo, the topic label is the text enclosed by a pair of # symbols. In other embodiments, topic labels may be marked by other symbols.
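Step S10 can be sketched roughly as follows. This is only a minimal illustration, assuming Twitter-style labels that start with `#` and a hypothetical record layout of `(publish_date, author_id, text)`; the function name `extract_labels` and the sample posts are made up for the example and do not come from the patent. Labels enclosed by a pair of `#` symbols would need a different pattern.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Assumed label syntax: '#' followed by word characters (Twitter style).
HASHTAG_RE = re.compile(r"#(\w+)")

def extract_labels(posts):
    """Return per-label daily counts (hotspot degree) and per-label author counts."""
    daily_counts = defaultdict(Counter)   # label -> {day: occurrences}
    author_counts = defaultdict(Counter)  # label -> {author: number of posts}
    for day, author, text in posts:
        # A set is used so a label repeated inside one post is counted once.
        for label in {m.lower() for m in HASHTAG_RE.findall(text)}:
            daily_counts[label][day] += 1
            author_counts[label][author] += 1
    return daily_counts, author_counts

# Hypothetical sample records, for illustration only.
posts = [
    (date(2012, 1, 18), "alice", "#SOPA blackout today"),
    (date(2012, 1, 18), "bob", "Stop #sopa now #sopa"),
    (date(2012, 1, 19), "alice", "still talking about #sopa"),
]
daily, authors = extract_labels(posts)
```

The two dictionaries hold exactly what the later measures need: the daily counts feed the instability degree, and the author counts feed the author information entropy.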
A microblog topic label reveals the subject of the microblog text and is sensitive to events to some degree; popular topic labels largely reflect hot news. Step S20 measures how likely it is that the hot news corresponding to a given topic label is a burst.
Specifically, in step S20, three measurement values are computed for each topic label: the instability degree, the online topic possibility degree and the label author information entropy. In the present invention these three values serve the following purposes: the instability degree of a topic label measures the burstiness of a hotspot event; the online topic possibility degree distinguishes whether the event is a popular online topic; and the label author information entropy is used to filter out microblogs such as commercial advertising and marketing posts.
In this embodiment, the instability degree, the online topic possibility degree and the label author information entropy are denoted Inst, TMP and Ent respectively. Taking a microblog topic label hashtag as an example, the calculation of the three measurement values is described in (1), (2) and (3) below.
(1) The instability degree of a microblog topic label measures the burstiness of a hotspot event. In this embodiment, the instability degree of the microblog topic label hashtag is calculated by the following formula:
$$\mathrm{Inst}(\mathrm{hashtag}) = \frac{1}{n} \sum_{\forall x:\ \tilde{P}(x) < p} \mathrm{Inst}(x),$$
where n is the number of days used for normalisation, i.e. the time period covered by the corpus; $\tilde{P}(x)$ is the occurrence probability of the unstable point x; p is the tolerance probability specified in advance; and Inst(x) is the instability degree of the unstable point x. The formula states that the instability degree Inst(hashtag) of a microblog topic label is accumulated from the instability degrees Inst(x) of a series of unstable points x. In this embodiment, Inst(x) is defined by the following formula:
$$\mathrm{Inst}(x) = \log\left(\frac{1}{\tilde{P}(x) + \varepsilon}\right),$$
where ε > 0 is a small real number used to avoid division by zero, and $\tilde{P}(x)$ is the occurrence probability of the unstable point x. In the preferred embodiment above, the probability $\tilde{P}(x)$ of observing point x is calculated by establishing a probability model:
[Figure BDA0000389063710000075: formula image not reproduced in this text]
where 2μ − x is the point symmetric to x about μ, and F(·) denotes the cumulative distribution function, calculated by the following formula:
$$F(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt,$$
where μ is the expectation of the distribution and σ² is its variance.
Specifically, in actual computation the expectation and variance of the distribution on the 8th day are estimated from the observed sample values (x_1 to x_7) in a 7-day window. The expectation μ and the variance σ² are calculated by the following formulas respectively:
$$\hat{\mu}_i = \bar{X}_i = \frac{1}{7} \sum_{k=i-7}^{i-1} x_k,$$
$$\hat{\sigma}_i^2 = S_i^2 = \frac{1}{7-1} \sum_{k=i-7}^{i-1} \left(x_k - \bar{X}_i\right)^2,$$
where μ is estimated by $\hat{\mu}_i$ and σ² is estimated by $\hat{\sigma}_i^2$.
Based on this measure, the higher the instability degree Inst(hashtag), the stronger the abrupt change (sudden rise or sudden drop) of the label hashtag, and the more likely it reflects a burst hotspot event. Conversely, the lower the instability degree, the more likely the label is of a persistent type that reflects a continuing, stable event.
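The instability computation can be sketched as below. The exact formula for the point probability is an image that is missing from this text, so `tail_probability` is an assumption: it takes the two-sided Gaussian tail mass beyond x and its mirror point 2μ − x, which is one plausible reading of the surrounding description. The defaults `p=0.05` and `eps=1e-6`, the handling of a zero-variance window, and all function names are likewise illustrative choices, not values stated in the patent.

```python
import math
from statistics import mean, variance

def gaussian_cdf(x, mu, sigma):
    """F(x; mu, sigma) of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def tail_probability(x, mu, sigma):
    """ASSUMED form of the point probability: mass outside [2*mu - x, x]."""
    lo, hi = min(x, 2 * mu - x), max(x, 2 * mu - x)
    return gaussian_cdf(lo, mu, sigma) + 1.0 - gaussian_cdf(hi, mu, sigma)

def instability(counts, p=0.05, eps=1e-6, window=7):
    """Inst(hashtag): sum of Inst(x) over unstable days, divided by n days."""
    n = len(counts)
    total = 0.0
    for i in range(window, n):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = math.sqrt(variance(history))  # sample variance, divisor 7 - 1
        if sigma == 0.0:
            # Assumed handling of a flat window: any deviation is unstable.
            prob = 1.0 if counts[i] == mu else 0.0
        else:
            prob = tail_probability(counts[i], mu, sigma)
        if prob < p:  # day i is an unstable point
            total += math.log(1.0 / (prob + eps))
    return total / n

stable = [10] * 14                       # hypothetical daily counts
spiky = [10, 11, 9, 10, 12, 10, 9, 500]  # sudden jump on the 8th day
```

With these hypothetical series, the flat series has no unstable points, while the jump on the 8th day of the second series contributes a large Inst(x), so its instability degree is higher.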
(2) The online topic possibility degree is used to distinguish whether an event is a popular online topic.
Popular online topics differ in nature from social events, and their topic labels are used differently. Based on this difference, the online topic possibility degree of a topic label is computed by the following steps:
Computing the use-position probability and the character-composition probability of the topic label;
Multiplying the use-position probability by the character-composition probability to obtain the online topic possibility degree of the topic label. The detailed computation is given in (21), (22) and (23) below.
(21) In this embodiment, the use-position probability of a topic label is measured as the probability that the topic label h appears at the beginning of a sentence, calculated by the following formula:
[Figure BDA0000389063710000085: formula image not reproduced in this text]
where $\Pr_{pos}(h)$ is the use-position probability of the topic label h, and |A| denotes the number of elements in the set A (that is, the number of microblogs).
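Since the formula image is missing here, the following is only a sketch under an assumption: the use-position probability is taken as the share of microblogs containing the label in which the label is the first token. The function name, the whitespace tokenisation and the sample posts are hypothetical.

```python
def position_probability(label, posts):
    """Share of posts containing `label` in which it is the first token (assumed form)."""
    tag = "#" + label.lower()
    # Keep only posts that contain the label as a whitespace-separated token.
    containing = [t.lower().split() for t in posts if tag in t.lower().split()]
    if not containing:
        return 0.0
    leading = [tokens for tokens in containing if tokens[0] == tag]
    return len(leading) / len(containing)

# Hypothetical sample texts, for illustration only.
sample_posts = [
    "#nowplaying some song",
    "listening again #nowplaying",
    "#nowplaying another song",
    "an unrelated post",
]
pr_pos = position_probability("nowplaying", sample_posts)
```

In this sample the label leads two of the three posts that contain it, so the estimate is 2/3. Real text would need more careful tokenisation, for instance to handle punctuation attached to the label.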
(22) The character-composition probability of a topic label is measured as follows: let the length of the microblog topic label be L and the number of words composing it be N; the probability that the label is composed of this number of words represents its character-composition probability. Specifically, the character-composition probability of the topic label is calculated by the following formula:
$$\Pr_{word}(h) = 1 - \frac{N}{L},$$
where $\Pr_{word}(h)$ is the character-composition probability of the topic label h, L is the length of the label, and N is the number of words that compose it. To illustrate the formula: given an English word dictionary, the minimum number N of matched words in a label is computed. For example, if the dictionary contains the four words {art, arthur, thursday, day}, the label #arthursday can be segmented as a/r/thursday or arthur/s/day, so the word count N is 3.
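The minimum word count in the example can be found with a small dynamic program. The sketch below assumes, as the a/r/thursday segmentation suggests, that any single character not covered by a dictionary word counts as one unit; the function names are illustrative, and the label is assumed to be non-empty.

```python
def min_word_count(label, dictionary):
    """Fewest pieces covering `label`; a piece is a dictionary word or one character."""
    n = len(label)
    best = [0] + [n] * n  # best[i]: fewest pieces covering label[:i]
    for i in range(1, n + 1):
        for j in range(i):
            piece = label[j:i]
            if i - j == 1 or piece in dictionary:
                best[i] = min(best[i], best[j] + 1)
    return best[n]

def word_probability(label, dictionary):
    """Pr_word(h) = 1 - N / L, for a non-empty label."""
    return 1.0 - min_word_count(label, dictionary) / len(label)

dictionary = {"art", "arthur", "thursday", "day"}
n_words = min_word_count("arthursday", dictionary)    # 3, as in the text
pr_word = word_probability("arthursday", dictionary)  # 1 - 3/10
```

For #arthursday, L = 10 and N = 3, so the character-composition probability is 0.7. A label made of a few complete words gets a value close to 1, while an abbreviation that matches no dictionary word gets a value of 0.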
(23) Combining the above, the degree TMP to which the label hashtag represents an online topic is:
$$\mathrm{TMP}(\mathrm{hashtag}) = \Pr_{pos}(\mathrm{hashtag}) \cdot \Pr_{word}(\mathrm{hashtag}).$$
The higher the value of TMP, the more likely the theme reflected by the label is an online topic; otherwise it is more likely a microblog hotspot event.
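As a purely illustrative calculation, take the #arthursday example above (L = 10, N = 3) together with a hypothetical use-position probability of 0.5; this second value is an assumption made for the example and does not come from the patent:

```latex
\mathrm{TMP}(\text{\#arthursday})
  = \Pr\nolimits_{pos} \cdot \Pr\nolimits_{word}
  = 0.5 \times \left(1 - \frac{3}{10}\right)
  = 0.5 \times 0.7
  = 0.35
```

Because the two factors are multiplied, a label scores high only when it both tends to open the sentence and is made up of a few complete words.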
(3) The label author information entropy is used to filter out microblogs such as commercial advertising and marketing posts.
To account for how many users a label involves, the present invention uses information entropy to measure how dispersed the users contributing to a label are. Suppose the label is associated with n microblogs in total, published by k authors who posted c_1, c_2, ..., c_k microblogs respectively ($\sum c_i = n$). The author entropy Ent(·) of the label hashtag is calculated by the following formula:
$$\mathrm{Ent}(\mathrm{hashtag}) = -\sum_{i=1}^{k} \frac{c_i}{n} \cdot \log\left(\frac{c_i}{n}\right),$$
where Ent(hashtag) is the author information entropy of the label hashtag, k is the number of authors, $c_i$ is the number of microblogs published by the i-th author, and $n = \sum c_i$. The present invention uses the label author information entropy to measure how dispersed the contributing authors of a label are: the higher the entropy, the more dispersed the users who post the label and the greater its influence; conversely, if few users spread the label, its influence is limited and it does not form an important microblog hotspot event.
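The entropy formula translates directly into code. In this sketch the natural logarithm is assumed, since the text does not state the base, and the function name and sample counts are illustrative.

```python
import math

def author_entropy(post_counts):
    """Ent = -sum((c_i / n) * log(c_i / n)) over authors with c_i > 0."""
    n = sum(post_counts)
    return -sum((c / n) * math.log(c / n) for c in post_counts if c > 0)

single_account = author_entropy([40])            # one account posts everything
four_equal_authors = author_entropy([10, 10, 10, 10])
skewed_authors = author_entropy([37, 1, 1, 1])   # dominated by one account
```

A label posted by a single account has entropy 0, four equally active authors give log 4 (about 1.39), and a label dominated by one account falls in between, which matches the use of low entropy as a signal for marketing accounts.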
The preceding sections propose three measures for three characteristics of microblog hotspot events (high burstiness, correspondence to social events and wide propagation): the instability degree Inst, the online topic possibility degree TMP and the label author information entropy Ent. Because the three measures are mutually independent, the space of microblog topic labels can be represented by these three dimensions; the value of a label on each dimension reflects its characteristics and reveals the type of event it corresponds to. With suitably chosen thresholds, each dimension is divided into a high part and a low part. Labels with a high instability degree, a low online topic possibility degree and a high information entropy represent burst events.
In this embodiment, burst events are judged in step S30: according to the magnitudes of the three measurement values, it is judged whether the corresponding hotspot event is a burst event. Referring to Fig. 2, step S30 specifically comprises the following steps.
S301: judging whether the instability degree is greater than a first threshold, whether the online topic possibility degree is less than a second threshold, and whether the label author information entropy is greater than a third threshold;
S302: if so, determining that the corresponding hotspot event is a burst event;
S303: if not, determining that the corresponding hotspot event is a non-burst event.
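Steps S301 to S303 amount to a three-way threshold test, sketched below. The metric values for #sopa are the ones quoted later in this document (Inst = 2.3, TMP = 0.04, Ent = 4.9); the threshold values, the second label and all names are hypothetical, since the patent does not give concrete thresholds in this text.

```python
from dataclasses import dataclass

@dataclass
class LabelMetrics:
    inst: float  # instability degree
    tmp: float   # online topic possibility degree
    ent: float   # label author information entropy

def is_burst_event(m, inst_min, tmp_max, ent_min):
    """S301-S303: burst event only if all three threshold conditions hold."""
    return m.inst > inst_min and m.tmp < tmp_max and m.ent > ent_min

# Hypothetical thresholds, chosen only for this illustration.
THRESHOLDS = dict(inst_min=1.0, tmp_max=0.2, ent_min=3.0)

sopa = LabelMetrics(inst=2.3, tmp=0.04, ent=4.9)
low_entropy_label = LabelMetrics(inst=2.3, tmp=0.04, ent=0.5)  # made-up values

sopa_is_burst = is_burst_event(sopa, **THRESHOLDS)
low_entropy_is_burst = is_burst_event(low_entropy_label, **THRESHOLDS)
```

Under these assumed thresholds #sopa is classified as a burst event, while a label with the same instability but concentrated authorship is not.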
To verify the validity and reliability of the present invention, event discovery experiments were carried out with Twitter as the application case. The experiments use two microblog data sets that cover different time periods and have different durations. Data set 1 contains part of the Twitter microblogs from June to December 2009. Data set 2 contains Twitter microblogs from December 2011 to February 2012. The raw data sets contain more than 2,000,000 microblogs per day; sampling 10,000 per day yields the experimental data sets. The sample drawn from the 6-month data is denoted Tweets6, and the sample drawn from the 3-month data is denoted Tweets3. Both data sets include microblogs in multiple languages such as English, Japanese, Arabic and Chinese.
The experiment first extracts the popular labels (top 1%) from the data sets. There are nearly 600 such labels, accounting for 0.015% to 0.016% of the total. From these popular labels, 250 are sampled from each data set, 500 in total, for analysis. Because a label may itself be ambiguous, annotators are shown sampled texts that contain the label (non-English text is machine-translated into English), and two annotators label the category of each label at the same time. If their annotations disagree, a third annotator labels it. If there is still no majority result, the label is discarded. In the end, the Tweets6 data set retains 191 labels and the Tweets3 data set retains 200 labels.
The classification uses the centre point of the microblog topic label distribution as the threshold. The probability model for instability is a Gaussian distribution. Existing label classification algorithms take various perspectives, usually studying labels by the content they reflect (music, sports, etc.) or by their function (denoting a subject term or revealing context), and do not focus on hotspot events. The experiment therefore selects a widely cited classification method for Twitter (Kwak et al., 2010) for comparison. In this method, labels are divided into 4 classes according to their popularity pattern along two dimensions. One dimension indicates whether the trend is driven by external causes (exogenous) or internal causes (endogenous); the other indicates whether the growth of the trend is critical or subcritical. "Exogenous + subcritical", "exogenous + critical" and "endogenous + critical" correspond respectively to Twitter conventional idioms, breaking news and ongoing news. Because this method does not use user information, it has no class that separates out advertising and marketing microblogs. The results for this 4-class problem are shown in Table 1 and Table 2 below. The algorithm from the literature is denoted Popularity Pattern, and the subspace-based algorithm of the present invention is denoted Subspace. For each class, precision, recall and F1 are listed in turn. The results in the tables show that most classes are classified with some success. Because microblog topic labels are highly varied, the classification precision of existing work is generally limited: for example, the precision reported in the work of [author name lost; Figure BDA0000389063710000112] et al. is around 0.4, while for the event class the precision and recall are only 0.13 and 0.06 respectively.
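The precision, recall and F1 values discussed here follow the standard per-class definitions. A minimal sketch is given below; the class names and the toy predictions are hypothetical and are not taken from the patent's experiments.

```python
def precision_recall_f1(predicted, actual, target):
    """Per-class precision, recall and F1 for the class `target`."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == target and a == target for p, a in pairs)
    fp = sum(p == target and a != target for p, a in pairs)
    fn = sum(p != target and a == target for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical toy labels, for illustration only.
predicted = ["event", "event", "topic", "marketing"]
actual = ["event", "topic", "topic", "event"]
p_event, r_event, f1_event = precision_recall_f1(predicted, actual, "event")
```

In this toy case the event class has one true positive, one false positive and one false negative, so precision, recall and F1 are all 0.5.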
Table 1 below gives the classification results for microblog topic labels in Tweets6; in the table, P, R and F1 denote precision, recall and F1 in turn.
[Figure BDA0000389063710000111: table image not reproduced in this text]
Table 1
Table 2 below gives the classification results for microblog topic labels in Tweets3; in the table, P, R and F1 denote precision, recall and F1 in turn.
[Table image not reproduced in this text]
Table 2
The experiment above uses "4 classes on two dimensions" as an example to introduce the principle of the present invention. The "8 classes on three dimensions" scheme of the invention (judging burst events by three measurement values) is set out below with concrete data. The present invention is broadly applicable to Internet microblog text corpora. For convenience of description, a Twitter corpus is used as the example below to identify hotspot events and the burst events among them.
The corpus is as described in the experiment above: Twitter data sets of 6 months and 3 months, with 2,028,794 and 778,395 sampled microblogs respectively. Of these, 212,035 (10.5%) and 106,327 (13.7%) contain labels, and the total numbers of labels are 296,895 and 138,758 respectively.
The event recognition algorithm first computes the three measurement values of each label on this corpus. Fig. 3 shows the computed results for some example labels and their positions in the label space. Besides showing the label distribution as a heat map, the figure marks the positions in the label space of some example labels from the two data sets. For example, the label #sopa (numbered A) has instability degree Inst = 2.3, online topic possibility degree TMP = 0.04 and author information entropy Ent = 4.9; it therefore lies in the upper left corner of the TMP-Inst view, in the upper (right) part of the Ent-Inst view and in the lower (right) part of the Ent-TMP view.
Of the three dimensions, the author information entropy Ent responds most strongly to marketing accounts. Topic labels with low entropy have more concentrated contributing authors, are usually promoted by those accounts, and have stronger advertising and marketing characteristics. Their differences on the other two dimensions separate them into specific marketing categories, as described below:
1. Popular labels with a low instability degree Inst are stable in quantity and are used persistently. For example, the label #abbeydawn is contributed by robot accounts that post microblogs automatically, and these microblogs all promote and support the Canadian singer Avril Lavigne. Another label, #bring1dtonyc, is an abbreviation of "bring one day to New York City"; it is posted by robot accounts taking part in a campaign to expand the influence of a U.S. city. These robots are not highly active but run continuously, so the number of microblogs and labels they post is not especially noticeable, yet it is persistent.
2. Popular labels with a higher instability degree may reflect burst events. However, because these labels are still contributed by only a small number of accounts (pushers), the events they represent are still not real social events. As mentioned earlier, #property and #praytweets are posted by marketing accounts. The label #belieber is likewise a marketing label promoting the Canadian pop singer Justin Bieber. Such marketing was originally intended to increase interaction between the singer and his supporters (fans), but the microblogs in the data set show that some marketing accounts post under Justin's name in an attempt to get fans to follow them. This is a characteristic of microblog marketing, and the present method can effectively distinguish such marketing content.
Both classes of labels above have low author information entropy. According to their instability degree, the advertising can be divided into persistent advertisements posted by robots and bursty promotional advertisements.
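The author information entropy used in this comparison can be sketched as follows. This is a minimal illustration of the formula given in claim 9, not the patented implementation; the function name `author_entropy`, the use of the natural logarithm, and the sample author lists are assumptions made for illustration.

```python
import math
from collections import Counter


def author_entropy(authors):
    """Ent = -sum_i (c_i / n) * log(c_i / n) for one topic label.

    `authors` holds one entry per microblog that used the label
    (hypothetical input format). The log base is an assumption.
    """
    counts = Counter(authors)      # c_i: microblogs posted by author i
    n = sum(counts.values())       # n = sum of all c_i
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Illustrative data: one robot account posting 98 of 100 microblogs,
# versus 100 distinct authors posting once each.
concentrated = ["bot"] * 98 + ["u1", "u2"]
dispersed = [f"user{i}" for i in range(100)]
```

A label dominated by a single account yields an entropy close to zero, while a label spread evenly over many authors approaches the logarithm of the number of authors.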
Topic labels with higher author information entropy have more contributing users and reach a larger audience. To distinguish hotspot events from online topics, another dimension must be examined: the online topic possibility degree TMP. For example, #nowplaying (the song currently playing) and #iaintafraidtosay ("I ain't afraid to say ...") are online topics. Note that the instability degree of #nowplaying differs between the two data sets: in the Tweets6 data set the number of occurrences increases suddenly in the last month, whereas in the Tweets3 data set its count is fairly stable. In either case, the label consists of only two fully spelled words and therefore has a high online topic possibility, so it can be identified as more likely an online topic than a social event.
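The online topic possibility degree can be sketched as the product described in claim 6. The character formation probability follows claim 8 (1 - N/L). The use position probability formula is only available as an image in this text, so the version below (the share of microblogs containing the label that begin with it) is an assumption, as are the function names and the sample texts. The word count N must be supplied by the caller because the word segmentation step is not specified here.

```python
def word_probability(label_length, word_count):
    """Pr_word(h) = 1 - N / L, with L the label length and N the word count."""
    if label_length <= 0:
        raise ValueError("label_length must be positive")
    return 1.0 - word_count / label_length


def position_probability(texts, label):
    """ASSUMED form: fraction of microblogs containing `label` that start with it."""
    tokenized = [t.lower().split() for t in texts]
    containing = [tokens for tokens in tokenized if label in tokens]
    if not containing:
        return 0.0
    leading = [tokens for tokens in containing if tokens[0] == label]
    return len(leading) / len(containing)


def topic_possibility(texts, label, word_count):
    """TMP(h) = Pr_pos(h) * Pr_word(h)."""
    length = len(label.lstrip("#"))
    return position_probability(texts, label) * word_probability(length, word_count)


# Illustrative microblogs (invented for this sketch).
texts = [
    "#nowplaying Hey Jude",
    "#nowplaying Imagine",
    "listening again #nowplaying",
    "#nowplaying Yesterday",
]
tmp = topic_possibility(texts, "#nowplaying", word_count=2)
```

For #nowplaying, L = 10 and N = 2, so the character formation probability is 0.8, which matches the observation that a label made of two fully spelled words scores high on this dimension.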
Other labels with a low online topic possibility may still be conventional idioms of the microblog platform, such as #followfriday, which recommends users to follow, or #musicmonday, which recommends music to listen to. Because these words have become conventions, they may be abbreviated or may appear at any position in the microblog text (not only at the beginning). This usage pattern is close to that of event labels, so the two preceding metrics alone may not separate them well. However, their constant popularity implies a low degree of burstiness. The instability degree Inst can therefore distinguish such idioms from microblog hotspot events: only labels with a higher instability degree can correspond to the hotspot events that people follow. These topic labels include the aforementioned #sopa, as well as #hcr (an abbreviation of Health Care Reform, the public debate caused by the U.S. health care reform) and #halamadrid (used to cheer for the Real Madrid football club during matches, particularly on shots at goal).
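The instability degree can be sketched from the formulas in claims 3 to 5: a 7-day sliding window estimates the mean and sample variance, a Gaussian cumulative distribution function gives the occurrence probability of each day's count, and the per-day instabilities log(1 / (P + ε)) of all points with P below the tolerance probability p are summed and divided by the number of days n. The exact formula for the occurrence probability is only available as an image in this text; the two-sided tail probability used below (mass beyond x and beyond its mirror point 2μ - x) is an assumption, as are the default values of `p` and `eps`, the handling of zero variance, and the function names.

```python
import math
from statistics import mean, variance


def normal_cdf(x, mu, sigma):
    """F(x; mu, sigma) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def tail_probability(x, mu, sigma):
    """ASSUMED form: probability mass at least as far from mu as x is."""
    if sigma == 0.0:
        # Degenerate window: any deviation at all is treated as maximally unlikely.
        return 1.0 if x == mu else 0.0
    low, high = min(x, 2 * mu - x), max(x, 2 * mu - x)
    return normal_cdf(low, mu, sigma) + 1.0 - normal_cdf(high, mu, sigma)


def instability(daily_counts, p=0.05, eps=1e-6, window=7):
    """Inst(hashtag) = (1/n) * sum of Inst(x) over points with P(x) < p."""
    n = len(daily_counts)
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(window, n):
        history = daily_counts[i - window:i]        # the previous 7 days
        mu = mean(history)
        sigma = math.sqrt(variance(history))        # sample variance, divisor 7 - 1
        prob = tail_probability(daily_counts[i], mu, sigma)
        if prob < p:                                # x is an unstable point
            total += math.log(1.0 / (prob + eps))
    return total / n


steady = [10] * 30                 # an idiom-like label with constant use
burst = [10] * 29 + [500]          # a sudden spike on the last day
```

In this sketch the first seven days are never scored, because no full window precedes them.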
The examples above show that, on real data sets, the present invention can identify microblog burst hotspot events.
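The partition of the label space illustrated by these examples can be sketched as a small decision function. The category names, the function name `classify_label`, and all threshold values are hypothetical; the text describes the categories qualitatively and does not give numeric thresholds.

```python
def classify_label(inst, tmp, ent, inst_t, tmp_t, ent_t):
    """Assign a popular label to one of the categories discussed above.

    inst, tmp, ent are the three measurement values; inst_t, tmp_t, ent_t
    are caller-chosen thresholds (hypothetical, not taken from the source).
    """
    if ent <= ent_t:
        # Few concentrated authors: advertising or marketing.
        if inst > inst_t:
            return "bursty promotional advertising"
        return "persistent robot advertising"
    if tmp >= tmp_t:
        return "online topic"
    if inst <= inst_t:
        return "conventional idiom"
    return "burst hotspot event"


# Illustrative values only.
print(classify_label(inst=0.9, tmp=0.1, ent=4.0, inst_t=0.3, tmp_t=0.5, ent_t=2.0))
```

Checking entropy first mirrors the order of the discussion: marketing labels are removed before online topics and idioms are separated from events.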
In summary, the present invention proposes a method for identifying microblog burst hotspot events that uses microblog topic labels as clues. Because microblog hotspot events are closely tied to social events, bursty, and widely propagated, the invention exploits the way topic labels reflect the subject of a text to implement event discovery based on topic labels. The invention measures three aspects of a popular label: its instability degree, its online topic possibility degree and its author information entropy. Using these three dimensions, it partitions the space of topic labels and discovers hotspot events by classifying popular labels. Experimental results show that the invention is effective on microblogs in multiple languages. In addition, the invention can identify some advertising and marketing microblogs and marketing accounts, and can distinguish popular online topics from real social events, which traditional methods cannot do. Compared with traditional methods, the invention overcomes shortcomings such as analyzing trend changes without regard to content, or relying entirely on content for event discovery. Because a label reflects content without depending on it, the method is not restricted by vocabulary or language, and it can capture microblogs that describe the same event from multiple angles.
Embodiment 2
Referring to Fig. 4 and Fig. 5, this embodiment provides a device for identifying microblog burst hotspot events, comprising an extraction unit 41, a computing unit 42 and a judging unit 43. In the present invention, an internet corpus can serve as the data source from which microblog topic labels are extracted.
Extraction unit 41 is configured to extract the microblog topic labels of all hotspot events and to record the publish time, author information and hotspot degree of each topic label, where the hotspot degree refers to the number of occurrences within different time periods. In some embodiments, a microblog topic label is a word or phrase beginning with the # symbol; in a specific embodiment such as Sina Weibo, the topic label is delimited by a pair of # symbols (that is, the label is the text enclosed between the two # symbols). In other embodiments, topic labels may be marked by other symbols.
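The extraction step can be sketched as below. This is an illustrative outline only: the tuple layout `(day, author, text)`, the two regular expressions, the choice to count a label at most once per microblog, and the sample posts are all assumptions rather than details taken from the source.

```python
import re
from collections import Counter, defaultdict

LEADING_HASH = re.compile(r"#(\w+)")      # "#label" style
PAIRED_HASH = re.compile(r"#([^#]+)#")    # "#label#" style, paired symbols


def extract_labels(posts, pattern=LEADING_HASH):
    """Collect per-day and per-author occurrence counts for every label.

    posts: iterable of (day, author, text) tuples (hypothetical format).
    Returns (daily_counts, author_counts), each mapping label -> Counter.
    """
    daily_counts = defaultdict(Counter)
    author_counts = defaultdict(Counter)
    for day, author, text in posts:
        # ASSUMPTION: a label repeated inside one microblog is counted once.
        for label in set(pattern.findall(text.lower())):
            daily_counts[label][day] += 1
            author_counts[label][author] += 1
    return daily_counts, author_counts


# Invented sample posts for illustration.
posts = [
    ("2012-01-18", "alice", "Stop #SOPA now"),
    ("2012-01-18", "bob", "#sopa blackout, #sopa protest"),
    ("2012-01-19", "alice", "Still talking about #sopa"),
]
daily, authors = extract_labels(posts)
```

The per-day counts feed the instability degree, and the per-author counts feed the author information entropy.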
Computing unit 42 is configured to compute, for each topic label, three measurement values: the instability degree, the online topic possibility degree and the label author information entropy. In the present invention, these three measurement values serve the following purposes: the instability degree of a topic label measures the burstiness of the hotspot event, the online topic possibility degree distinguishes whether the event is a popular online topic, and the label author information entropy removes microblogs such as advertising and marketing posts.
In this embodiment, computing unit 42 calculates the three measurement values in the same way as in Embodiment 1, which is not repeated here.
Judging unit 43 is configured to judge, according to the magnitudes of the three measurement values, whether the corresponding hotspot event is a burst event.
Specifically, judging unit 43 comprises a judging sub-unit 431, a first determining unit 432 and a second determining unit 433.
Judging sub-unit 431 is configured to judge whether the instability degree is greater than a first threshold, whether the online topic possibility degree is less than a second threshold, and whether the label author information entropy is greater than a third threshold.
First determining unit 432 is configured to determine that the corresponding hotspot event is a burst event when the instability degree is greater than the first threshold, the online topic possibility degree is less than the second threshold, and the label author information entropy is greater than the third threshold.
Second determining unit 433 is configured to determine that the corresponding hotspot event is a non-burst event when the instability degree is not greater than the first threshold, the online topic possibility degree is not less than the second threshold, and the label author information entropy is not greater than the third threshold.
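The judging unit can be sketched as a three-way threshold test. The class and function names and the numeric thresholds are hypothetical. The sketch follows the reading of claim 2, in which any label that fails the combined test is treated as a non-burst event; that is an interpretive choice, since the description of unit 433 lists the three negated conditions jointly.

```python
from dataclasses import dataclass


@dataclass
class Measurements:
    inst: float   # instability degree
    tmp: float    # online topic possibility degree
    ent: float    # label author information entropy


def is_burst_event(m, first_threshold, second_threshold, third_threshold):
    """True when Inst > first, TMP < second and Ent > third; otherwise False."""
    return (
        m.inst > first_threshold
        and m.tmp < second_threshold
        and m.ent > third_threshold
    )


# Illustrative values; the source does not state concrete thresholds.
candidate = Measurements(inst=0.9, tmp=0.1, ent=3.0)
online_topic = Measurements(inst=0.9, tmp=0.8, ent=3.0)
```

Calling `is_burst_event(candidate, 0.5, 0.5, 1.0)` accepts the first label, while the second is rejected because its online topic possibility degree is too high.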
The supporting data and the corresponding technical effects of each unit of this embodiment are provided by Embodiment 1 and are not repeated here.
The above are only embodiments of the present invention and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims (10)

1. A method for identifying microblog burst hotspot events, characterized by comprising:
extracting the microblog topic labels of all hotspot events, and recording the publish time, author information and hotspot degree of each topic label, wherein the hotspot degree refers to the number of occurrences within different time periods;
for each topic label, computing three measurement values of the topic label, wherein the three measurement values are the instability degree, the online topic possibility degree and the label author information entropy;
judging, according to the magnitudes of the three measurement values, whether the corresponding hotspot event is a burst event.
2. The method according to claim 1, characterized in that judging, according to the magnitudes of the three measurement values, whether the corresponding hotspot event is a burst event comprises:
judging whether the instability degree is greater than a first threshold, whether the online topic possibility degree is less than a second threshold, and whether the label author information entropy is greater than a third threshold;
if so, determining that the corresponding hotspot event is a burst event;
if not, determining that the corresponding hotspot event is a non-burst event.
3. The method according to claim 1, characterized in that the instability degree is calculated by the following formula:
$$\mathrm{Inst}(\mathrm{hashtag}) = \frac{1}{n} \sum_{\tilde{P}(x) < p,\ \forall x} \mathrm{Inst}(x),$$
where $n$ is the number of days used for normalization, that is, the time period covered by the corpus; $\tilde{P}(x)$ is the occurrence probability of the unstable point $x$; $p$ is a tolerance probability specified in advance; and $\mathrm{Inst}(x)$ is the instability degree of the unstable point $x$, defined by the following formula:
$$\mathrm{Inst}(x) = \log\left(\frac{1}{\tilde{P}(x) + \varepsilon}\right),$$
where $\varepsilon > 0$ is a small real number used to avoid division by zero.
4. The method according to claim 3, characterized in that $\tilde{P}(x)$ is calculated by the following formula:
Figure FDA0000389063700000021
where $2\mu - x$ is the point symmetric to $x$ about $\mu$ along the x axis, and $F(\cdot)$ denotes the cumulative distribution function, calculated by the following formula:
$$F(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(-\frac{(t-\mu)^2}{2\sigma^2}\right) dt,$$
where $\mu$ is the expectation of the distribution and $\sigma^2$ is its variance.
5. The method according to claim 4, characterized in that the distribution expectation $\mu$ and the variance $\sigma^2$ are calculated by the following formulas, respectively:
$$\hat{\mu}_i = \bar{X}_i = \frac{1}{7} \sum_{k=i-7}^{i-1} x_k,$$
$$\hat{\sigma}_i^2 = S_i^2 = \frac{1}{7-1} \sum_{k=i-7}^{i-1} \left(x_k - \bar{X}_i\right)^2,$$
where $\mu$ is given by $\hat{\mu}_i$ and $\sigma^2$ is given by $\hat{\sigma}_i^2$.
6. The method according to claim 1, characterized in that computing the online topic possibility degree of the topic label comprises:
calculating the use position probability and the character formation probability of the topic label;
multiplying the use position probability by the character formation probability to obtain the online topic possibility degree of the topic label.
7. The method according to claim 6, characterized in that the use position probability of the topic label is calculated by the following formula:
Figure FDA0000389063700000027
where $\Pr_{pos}(h)$ is the use position probability of the topic label $h$, and $|A|$ denotes the number of elements in the set $A$.
8. The method according to claim 6, characterized in that the character formation probability of the topic label is calculated by the following formula:
$$\Pr\nolimits_{word}(h) = 1 - \frac{N}{L},$$
where $\Pr_{word}(h)$ is the character formation probability of the topic label, $L$ is the length of the label, and $N$ is the number of words that form it.
9. The method according to claim 1, characterized in that the label author information entropy is calculated by the following formula:
$$\mathrm{Ent}(\mathrm{hashtag}) = -\sum_{i=1}^{k} \frac{c_i}{n} \cdot \log\left(\frac{c_i}{n}\right),$$
where $\mathrm{Ent}(\mathrm{hashtag})$ denotes the author information entropy of the label hashtag, $k$ denotes the number of authors, $c_i$ denotes the number of microblogs posted by the $i$-th author, and $n = \sum c_i$.
10. A device for identifying microblog burst hotspot events, characterized by comprising:
an extraction unit, configured to extract the microblog topic labels of all hotspot events and to record the publish time, author information and hotspot degree of each topic label, wherein the hotspot degree refers to the number of occurrences within different time periods;
a computing unit, configured to compute, for each topic label, three measurement values of the topic label, wherein the three measurement values are the instability degree, the online topic possibility degree and the label author information entropy;
a judging unit, configured to judge, according to the magnitudes of the three measurement values, whether the corresponding hotspot event is a burst event.
CN201310452806XA 2013-09-27 2013-09-27 Method and device for recognizing microblog burst hotspot events Pending CN103455639A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310452806XA CN103455639A (en) 2013-09-27 2013-09-27 Method and device for recognizing microblog burst hotspot events

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310452806XA CN103455639A (en) 2013-09-27 2013-09-27 Method and device for recognizing microblog burst hotspot events

Publications (1)

Publication Number Publication Date
CN103455639A true CN103455639A (en) 2013-12-18

Family

ID=49738002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310452806XA Pending CN103455639A (en) 2013-09-27 2013-09-27 Method and device for recognizing microblog burst hotspot events

Country Status (1)

Country Link
CN (1) CN103455639A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102214241A (en) * 2011-07-05 2011-10-12 清华大学 Method for detecting burst topic in user generation text stream based on graph clustering
CN102394798A (en) * 2011-11-16 2012-03-28 北京交通大学 Multi-feature based prediction method of propagation behavior of microblog information and system thereof
CN103279479A (en) * 2013-04-19 2013-09-04 中国科学院计算技术研究所 Emergent topic detecting method and system facing text streams of micro-blog platform
CN103294818A (en) * 2013-06-12 2013-09-11 北京航空航天大学 Multi-information fusion microblog hot topic detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANQI CUI,ETC: "Discover Breaking Events with Popular Hashtags in Twitter", 《PROCEEDINGS OF THE 21ST ACM INTERNATIONAL CONFERENCE ON INFORMATION AND KNOWLEDGE MANAGEMENT(CIKM)》, 29 October 2012 (2012-10-29) *
PHUVIPADAWAT S,ETC: "Breaking News Detection and Tracking in Twitter", 《PROCEEDINGS OF THE 2010 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY》, 31 August 2010 (2010-08-31), pages 120 - 123, XP031786364 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106464526B (en) * 2014-05-15 2020-02-14 华为技术有限公司 System and method for detecting anomalies
CN106464526A (en) * 2014-05-15 2017-02-22 华为技术有限公司 System and method for anomaly detection
CN104965930A (en) * 2015-07-30 2015-10-07 成都布林特信息技术有限公司 Big data based emergency evolution analysis method
CN104965930B (en) * 2015-07-30 2019-03-26 成都信息工程大学 A kind of emergency event evolution analysis method based on big data
CN105373528A (en) * 2015-08-18 2016-03-02 新华网股份有限公司 Method and device for analyzing sensitivity of text contents
CN105373528B (en) * 2015-08-18 2019-03-12 新华网股份有限公司 A kind of text content sensitive analysis method and device
CN107273346A (en) * 2016-03-30 2017-10-20 邻客音公司 To the expansible excavation of popular opinion from text
CN107273346B (en) * 2016-03-30 2024-06-11 微软技术许可有限责任公司 Extensible mining of trending insights from text
CN106933949A (en) * 2017-01-20 2017-07-07 浙江大学 The planing method of influence power outburst in a kind of control social networks
CN106933949B (en) * 2017-01-20 2020-09-11 浙江大学 Planning method for controlling influence outbreak in social network
CN107515889A (en) * 2017-07-03 2017-12-26 国家计算机网络与信息安全管理中心 A kind of microblog topic method of real-time and device
US11106747B2 (en) 2019-06-18 2021-08-31 International Business Machines Corporation Online content management
CN113822069A (en) * 2021-09-17 2021-12-21 国家计算机网络与信息安全管理中心 Emergency early warning method and device based on meta-knowledge and electronic device
CN113822069B (en) * 2021-09-17 2024-03-12 国家计算机网络与信息安全管理中心 Sudden event early warning method and device based on meta-knowledge and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131218
