CN104199974A

CN104199974A - Microblog-oriented dynamic topic detection and evolution tracking method

Info

Publication number: CN104199974A
Application number: CN201410488391.6A
Authority: CN
Inventors: 闫碧莹; 邓攀; 余雷; 赵鑫; 袁伟; 万安格
Original assignee: SINOPARADOFT (BEIJING) PARALLEL SOFTWARE Co Ltd
Current assignee: SINOPARADOFT (BEIJING) PARALLEL SOFTWARE Co Ltd
Priority date: 2013-09-22
Filing date: 2014-09-22
Publication date: 2014-12-10

Abstract

The invention provides a microblog-oriented dynamic topic detection and evolution tracking method and belongs to the technical field of intelligent information processing. The method includes the steps of: 1, building a distributed crawler to acquire microblog data; 2, pre-processing the microblog data; 3, performing Chinese word segmentation and removing stop words to obtain a word set VOC; 4, applying LDA (latent Dirichlet allocation) clustering to the microblog data of each time interval so as to extract latent topics; 5, screening out the hot microblog topics in each time interval; 6, applying hierarchical clustering to the hot topics across the global time span to obtain the aggregation and differentiation relations between topics; 7, visualizing the topic evolution process according to those aggregation and differentiation relations. The method mines, with low time complexity, the topic-word distribution of an event at different times and the fine-grained topics of a same topic at different times; it is efficient and robust and has considerable practical value.

Description

A microblog-oriented dynamic topic detection and evolution tracking method

Technical field

The invention belongs to the technical field of intelligent information processing, and specifically relates to a microblog-oriented dynamic topic detection and evolution tracking method.

Background art

With the explosive growth of text on the Internet, it is increasingly difficult for people to obtain topics or event information of interest from massive text in a timely manner. Topic detection and tracking (TDT) technology aims to organize streams of language text by event, and has developed a series of core technologies that meet this need. Topic tracking is one of the subtasks of TDT; it helps people collect and organize scattered information effectively and understand all the information about a topic as a whole. Topic evolution analysis, a newer research direction of TDT, builds on topic tracking to discover the relations between the events within a topic and the course of the topic's development, and to present the whole story of the topic clearly to the user.

The current mainstream approach to topic evolution is the dynamic topic model (DTM). Its main idea is to introduce the posterior distribution of the model parameters at the current time into the model as the condition for the model parameters at the next time. Viewed globally, the whole topic evolution model is still a graphical model, but deriving the model parameters is more difficult. In addition, because the data is processed as a whole, the topic representations at all times are obtained in a single modeling pass, but the model cannot add new text online: newly arrived text requires re-discretization and global re-modeling.

The dynamic mixture model (DMM) makes a stronger assumption about time than DTM. In DTM the texts within each time window are exchangeable, whereas in DMM texts arrive strictly in chronological order, with only one text arriving at each moment. From this point of view DMM is an online topic evolution model. Although DMM can compute the probability distribution of a same topic at different times in one pass, it has the following problems:

1. The three-dimensional matrix occupies too much memory. Because the topic distribution must be solved for every time period, a topic-time-word three-dimensional matrix has to be built, which greatly increases memory usage.

2. Results are not reused for new text: when new text arrives, everything can only be recomputed.

In practical applications, both topic detection and the analysis of topic evolution trends must be carried out in real time. The difficulty is that the volume of documents to be processed is very large and the document types are complex, including text in the form of news, forums, blogs and so on. The above methods all rest on specific assumptions, can only analyze and mine small amounts of experimental data, and cannot meet the demands of practical applications.

Summary of the invention

To address the defect that existing topic detection systems cannot analyze and compute topic evolution trends, the present invention uses hierarchical clustering to compute in real time the similarity relations between topics in different time periods, thereby analyzing how topics evolve over time and making it possible to draw a topic evolution trend graph. Specifically, a microblog-oriented dynamic topic detection and evolution tracking method is proposed. The concrete steps are:

Step 1: Build a distributed crawler and acquire microblog data.

Step 2: Pre-process the microblog data.

Pre-processing comprises denoising and deduplication. Specifically, it removes microblog data whose text length is less than a length threshold L, duplicate data, advertising content, auto-reply data and URL data, where microblog data comprises both the microblog text and its comments.

Step 3: Perform Chinese word segmentation on all microblog data, remove stop words, obtain the segmentation result, and form the word set VOC.

Step 4: Perform LDA clustering separately on the microblog data of each time interval and extract latent topics.

The word set formed by Chinese word segmentation of each piece of microblog data, including its text and comments, is regarded as one document. A document-topic model is built over all documents in each time interval, topics are extracted, and the probability of each document corresponding to each topic and the probability of each topic generating each word are obtained.

Step 5: Screen out the hot microblog topics in each time interval.

Based on the microblog topics obtained in step 4, the topics to which each document belongs are computed for all documents in each time interval; the topics are then sorted by the number of documents assigned to them, and the hot topics are selected.

Step 6: According to the LDA clustering results, perform hierarchical clustering on the hot topics across the global time span, and obtain the aggregation and differentiation relations between the hot topics.

Using the probability distributions over the word set VOC of the topics obtained in step 4, together with the hot topics of all time intervals obtained in step 5, hierarchical clustering is performed on the hot topics across the global time span. The differentiation and aggregation between topics are then computed from the hierarchical clustering result.

Step 7: Visualize the topic evolution process according to the aggregation and differentiation relations of the topics.

The advantages and positive effects of the present invention are:

(1) The microblog-oriented dynamic topic detection and evolution tracking method is efficient and robust, and has considerable practical value.

(2) The method mines the fine-grained topics of a same topic at different periods.

(3) The method mines the topic-word distribution of an event at different periods with low time complexity.

Description of drawings

Fig. 1 is a flowchart of the microblog-oriented dynamic topic detection and evolution tracking method of the present invention.

Fig. 2 is a flowchart of the data pre-processing steps in the present invention.

Fig. 3 is a flowchart of Chinese word segmentation in the present invention.

Fig. 4 is a flowchart of hierarchical clustering in the present invention.

Fig. 5 is a schematic diagram of the selected hot topics in the present invention.

Fig. 6 shows the evolution of topics over time in the present invention.

Embodiment

The present invention is described in further detail below with reference to the drawings and specific embodiments.

Based on a topic model, the present invention clusters the documents of different times separately, and uses hierarchical clustering to cluster the topics of different time intervals, building the aggregation and differentiation relations between topics. Topic popularity and influence are taken as the entry points for studying topic characteristics, with emphasis on analyzing the trends of hot topics and high-influence topics.

Patent 200710062943.7 is similar to the present invention. The difference is that, when it subsequently computes similarity across time points, it only computes similarity between topics at two adjacent time points. If a topic dies out for a period and reappears later, that method cannot capture it, nor can it capture the differentiation and aggregation of topics.

The present invention is a method for automatically computing topic evolution trends on microblogs. It gathers Internet text, collects a corpus from Sina Weibo, and pre-processes it, including extracting the text and filtering stop words, formula symbols and the like. Topic events are detected per time segment: the topic distribution is computed separately at each time point, the topic results of all stages are merged, and hierarchical clustering is performed. By hierarchically clustering the topics of different time slices, the method completes topic tracking and evolution analysis, and a topic evolution graph is visualized from the clustering result.

The concrete steps of the microblog-oriented dynamic topic detection and evolution tracking method are shown in Fig. 1:

Step 1: Build a distributed crawler and acquire microblog data.

Sina Weibo data is obtained through the Sina API (Application Programming Interface). This embodiment uses one month of microblog data, and each piece of microblog data includes the microblog text and its comments.

Step 2: Pre-process the microblog data.

Pre-processing comprises denoising and deduplication. The microblog data is processed by the following steps, as shown in Fig. 2:

Step 201: Remove microblog data whose text length is less than the length threshold L;

Specifically, a length-checking program automatically filters out microblog data whose text length is less than the length threshold L. The value of L is set empirically or according to the specific field; in this embodiment L is preferably 5;

Step 202: Remove duplicate microblog data;

Because microblogs contain a large amount of repeated data, a Bloom filter algorithm or the Simhash algorithm is used to filter out the duplicates.

Step 203: Remove advertising content contained in the microblog data;

An advertising-word matching rule base containing commonly used advertising words is set up, and advertising data in the microblog data is removed. A regular expression is written to match any word in the rule base; the regular expression depends on the concrete template.

Step 204: Remove auto-reply microblog data that is based on specific network reply templates;

Regular expressions matching automatic reply content are set according to the specific network reply templates, and auto-reply microblog data based on those templates is removed.

Step 205: Remove URL data from the microblog data. A regular expression matching URLs is set, and the URL data is removed.

Step 206: Repeat step 201: recompute the text length of the microblog data and remove data that no longer meets the length criterion, as a second cleaning pass.
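Steps 201 to 206 can be sketched as a single filtering pass. The following is a minimal illustration, not the patented implementation: the advertising words, the auto-reply pattern and the URL pattern are hypothetical placeholders, an exact-match set stands in for the Bloom filter or Simhash deduplication named in step 202, and a microblog that matches an advertising word is assumed to be dropped entirely.

```python
import re

# Hypothetical rule bases; the source does not list the actual rules.
AD_WORDS = ["限时抢购", "包邮", "buy now"]
AD_RE = re.compile("|".join(re.escape(w) for w in AD_WORDS), re.IGNORECASE)
AUTO_REPLY_RE = re.compile(r"^(自动回复|Auto-reply)[:：]", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")


def clean_microblogs(texts, min_len=5):
    """Apply steps 201-206 in order and return the surviving texts."""
    seen = set()
    kept = []
    for text in texts:
        text = text.strip()
        if len(text) < min_len:              # step 201: too short
            continue
        if text in seen:                     # step 202: exact duplicate (assumption)
            continue
        seen.add(text)
        if AD_RE.search(text):               # step 203: advertising word found
            continue
        if AUTO_REPLY_RE.match(text):        # step 204: auto-reply template
            continue
        text = URL_RE.sub("", text).strip()  # step 205: strip URLs
        if len(text) < min_len:              # step 206: second length check
            continue
        kept.append(text)
    return kept
```

The second length check matters because a post that consists mostly of a link can pass step 201 and still be nearly empty once the URL has been stripped.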

Step 3: Perform Chinese word segmentation on all microblog data, remove stop words, obtain the segmentation result, and form the word set VOC. The detailed process is shown in Fig. 3:

Step 301: Perform Chinese word segmentation on the microblog data and remove stop words at the same time; a Chinese word segmenter is called to segment the microblog data, and stop words are removed at the same time;

Step 302: Apply morphological normalization to the English words in the microblog data, converting them to a unified form;

The English words in the segmentation result of step 301 are morphologically normalized to a unified form; this includes unifying tense to the simple present and unifying voice to the active voice.

Step 303: Compute the document frequency df and the term frequency tf of each word in the segmentation result obtained in step 302;

Document frequency df: the number of files in which the word occurs divided by the total number of files in the file set;

Term frequency tf: the number of times the word occurs in a file divided by the total number of words in that file.

Step 304: Compute the feature strength ft of each word in the segmentation result obtained in step 302; ft is defined as:

ft = \log (\frac{tf}{idf + 1} + 1)

where idf denotes the inverse document frequency, i.e. the reciprocal of the document frequency df;

Step 305: Extract the words whose feature strength ft is greater than the feature strength threshold T to form the word set VOC.

Using the feature strength ft computed in step 304, all words in the microblog data whose ft is greater than the threshold T are selected to form the word set VOC. The threshold T is chosen according to the concrete application.
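The feature-strength filter of steps 303 to 305 can be illustrated with a short sketch. Two points are assumptions rather than statements of the source: tf is computed here at corpus level (total occurrences of the word divided by the total number of tokens), because the text defines tf per file but does not say how per-file values are combined into one score per word, and the logarithm is taken as the natural logarithm because no base is given. The function name `build_vocabulary` is illustrative.

```python
import math
from collections import Counter


def build_vocabulary(docs, threshold):
    """docs: list of token lists. Returns (VOC, ft) under the stated assumptions."""
    n_docs = len(docs)
    total_tokens = sum(len(d) for d in docs)
    term_count = Counter(tok for d in docs for tok in d)
    doc_count = Counter(tok for d in docs for tok in set(d))

    ft = {}
    for word, count in term_count.items():
        tf = count / total_tokens        # assumption: corpus-level term frequency
        df = doc_count[word] / n_docs    # share of documents containing the word
        idf = 1.0 / df                   # reciprocal of df, as the text defines it
        ft[word] = math.log(tf / (idf + 1.0) + 1.0)
    voc = {w for w, score in ft.items() if score > threshold}
    return voc, ft
```

With this definition a word scores higher when it is both frequent (large tf) and widespread (large df, hence small idf), so the filter keeps common content words rather than rare ones.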

Step 4: Perform LDA clustering separately on the microblog data of each time interval and extract latent topics.

In this embodiment, each piece of microblog data, represented by its words in the word set VOC, is taken as one document d. For any time interval period, all documents d within the interval form the document set D. Suppose each document d contains n words; its word sequence is denoted < w1, w2, ..., wn >, where wi is the i-th word.
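Forming the document set D of each time interval amounts to bucketing documents by timestamp. The sketch below assumes each document carries a `datetime` timestamp and that intervals have a fixed length; the source does not state the interval length, so the one-day value in the usage note is purely illustrative.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_by_interval(posts, start, interval):
    """posts: iterable of (timestamp, tokens). Returns {interval_index: [tokens, ...]}."""
    buckets = defaultdict(list)
    for ts, tokens in posts:
        if ts < start:
            continue                      # ignore data before the observation window
        index = (ts - start) // interval  # timedelta // timedelta yields an int
        buckets[index].append(tokens)
    return dict(buckets)
```

For example, `split_by_interval(posts, datetime(2014, 9, 1), timedelta(days=1))` groups a month of posts into daily document sets, each of which can then be modeled separately.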

A document-topic model is built over all documents in each time interval to obtain the topic set T; topics are extracted, and the probability of each document corresponding to each topic and the probability of each topic generating each word are obtained.

The document-topic model is the LDA topic model based on Gibbs sampling. In each time interval, the document set D of that interval is clustered, and the latent topic set T mined from it is denoted < t1, t2, ..., tk >. This embodiment extracts k topics, and ti denotes the i-th topic.

The two result vectors obtained by LDA clustering are expressed as follows:

For any document d, the probabilities of corresponding to the different topics are θ_d = < P_t1, P_t2, ..., P_tk >, where P_ti is the probability that document d corresponds to the i-th topic.

For any topic, the probabilities of generating the different words form a vector < P_w1, P_w2, ... >, where P_wi is the probability that the topic generates the i-th word.
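To make the two result vectors concrete, here is a compact collapsed Gibbs sampler for LDA written with the standard library only. It is a teaching sketch under stated assumptions, not the implementation used in the source: the hyperparameters `alpha` and `beta`, the iteration count and the function name `lda_gibbs` are illustrative choices, and a production system would more likely rely on an optimized library.

```python
import random


def lda_gibbs(docs, vocab, k, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (theta, phi): theta[d][t] = P(topic t | doc d) and
    phi[t][w] = P(word w | topic t), with w indexing `vocab`.
    """
    rng = random.Random(seed)
    word_id = {w: i for i, w in enumerate(vocab)}
    corpus = [[word_id[w] for w in doc if w in word_id] for doc in docs]
    v = len(vocab)

    n_dt = [[0] * k for _ in corpus]    # topic counts per document
    n_tw = [[0] * v for _ in range(k)]  # word counts per topic
    n_t = [0] * k                       # total tokens per topic
    z = []                              # topic assignment of every token
    for d, doc in enumerate(corpus):
        z_d = []
        for w in doc:
            t = rng.randrange(k)        # random initial assignment
            z_d.append(t)
            n_dt[d][t] += 1
            n_tw[t][w] += 1
            n_t[t] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                t = z[d][i]             # remove the token's current assignment
                n_dt[d][t] -= 1
                n_tw[t][w] -= 1
                n_t[t] -= 1
                weights = [
                    (n_dt[d][j] + alpha) * (n_tw[j][w] + beta) / (n_t[j] + v * beta)
                    for j in range(k)
                ]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t             # record the resampled assignment
                n_dt[d][t] += 1
                n_tw[t][w] += 1
                n_t[t] += 1

    theta = [[(n_dt[d][t] + alpha) / (len(doc) + k * alpha) for t in range(k)]
             for d, doc in enumerate(corpus)]
    phi = [[(n_tw[t][w] + beta) / (n_t[t] + v * beta) for w in range(v)]
           for t in range(k)]
    return theta, phi
```

Each row of `theta` corresponds to the document vector θ_d, and each row of `phi` to the word-probability vector of one topic; both are normalized so that every row sums to 1.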

Step 5: Screen out the hot microblog topics in each time interval;

The concrete computation steps are as follows:

Step 501: For all documents in each time interval, determine the topics to which each document belongs.

A document-topic association strength threshold R is set. Among the microblog topics obtained in step 4, for a document d, if the probability P_ti that it corresponds to the i-th topic exceeds R, then document d belongs to topic ti. A document d may belong to several topics at the same time.

Step 502: Sort all topics by the number of documents d assigned to them, and take the top N topics as hot topics. N is determined according to the distribution of document counts over all topics; in this embodiment N is preferably 20. Fig. 5 is a schematic diagram of the 20 hot topics.
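Steps 501 and 502 reduce to a threshold count followed by a ranking. A minimal sketch, assuming the document-topic probabilities are available as a list of rows (one row per document) and using the illustrative name `select_hot_topics`:

```python
def select_hot_topics(theta, r, n):
    """theta[d][t] = P(topic t | doc d). Returns (top-n topic indices, counts per topic)."""
    k = len(theta[0]) if theta else 0
    counts = [0] * k
    for doc_probs in theta:
        for t, p in enumerate(doc_probs):
            if p > r:          # a document may be counted under several topics
                counts[t] += 1
    ranked = sorted(range(k), key=lambda t: counts[t], reverse=True)
    return ranked[:n], counts
```

Because membership is decided per topic rather than by taking the single most probable topic, lowering R lets one document contribute to several topic counts at once.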

Fig. 6 shows the distribution of the top 20 hot topics over time, where the horizontal axis represents time and the vertical axis represents topic popularity after LDA clustering. Each curve in the figure corresponds to one hot topic. The figure shows the course of each topic from emergence to extinction, including the fluctuations in between.

Step 6: According to the LDA clustering results, perform hierarchical clustering on the hot topics across the global time span, and obtain the aggregation and differentiation relations between the hot topics. The concrete steps are shown in Fig. 4:

Step 601: Form the global set of hot topics over all time intervals.

The hot topics of each time interval obtained in step 5 are merged across all time intervals to form the global set of hot topics.

Step 602: Perform hierarchical clustering on the hot topics to obtain the clustering result.

For each topic obtained in step 4, its probability distribution over words is extracted, giving the probability distribution of each topic over the word set VOC in each time interval. According to these probability distributions, all hot topics across the global time span are hierarchically clustered, yielding the clustering result for all hot topics over all time intervals.
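Step 602 can be illustrated with a small agglomerative clustering routine over the topic-word distributions. The source does not state which distance measure, linkage rule or stopping criterion it uses, so the choices here (Jensen-Shannon divergence, average linkage, and a cut threshold `cut`) are assumptions made only for the sake of a runnable example. All distributions are assumed to be indexed over the same word set VOC.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cluster_topics(distributions, cut):
    """Average-linkage agglomerative clustering; returns one cluster label per topic."""
    clusters = [[i] for i in range(len(distributions))]

    def linkage(a, b):
        total = sum(js_divergence(distributions[i], distributions[j])
                    for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        pairs = [(linkage(clusters[x], clusters[y]), x, y)
                 for x in range(len(clusters))
                 for y in range(x + 1, len(clusters))]
        best, x, y = min(pairs)
        if best > cut:                 # closest pair is too far apart: stop merging
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(distributions)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels
```

Because every hot topic from every interval enters the same clustering, a topic that disappears for a while and later returns can still land in the same cluster as its earlier form, which is the case the description says adjacent-only comparison misses.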

Step 603: From the hierarchical clustering result combined with the time information, obtain the aggregation and differentiation relations of the hot topics.

The aggregation and differentiation relations of hot topics are defined as follows:

If topics t1 and t2 are topics in two consecutive time intervals period1 and period2 respectively, and t1 and t2 belong to the same cluster, then t2 is regarded as having evolved from t1;

If topics t1 and t2 are two topics in time interval period1, topic t3 is a topic in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t3 is regarded as possibly formed by the aggregation of t1 and t2;

If topic t1 is a topic in time interval period1, topics t2 and t3 are two topics in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t2 and t3 are regarded as having differentiated from t1.
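The three rules above can be sketched as a lookup over (cluster, interval) groups. The input layout is an assumption: each topic is given as a `(topic_id, period_index)` pair with consecutive integers for consecutive intervals, and `labels` holds its cluster label. The many-to-many case is not covered by the rules in the text, so this sketch simply skips it.

```python
from collections import defaultdict


def topic_relations(topics, labels):
    """Return (relation, sources, targets) tuples for consecutive intervals.

    relation is "evolution", "aggregation" or "differentiation".
    """
    by_cluster_period = defaultdict(list)
    for (topic_id, period), label in zip(topics, labels):
        by_cluster_period[(label, period)].append(topic_id)

    relations = []
    for (label, period), earlier in sorted(by_cluster_period.items()):
        later = by_cluster_period.get((label, period + 1))
        if not later:
            continue                  # cluster has no topic in the next interval
        if len(earlier) == 1 and len(later) == 1:
            kind = "evolution"
        elif len(earlier) > 1 and len(later) == 1:
            kind = "aggregation"
        elif len(earlier) == 1 and len(later) > 1:
            kind = "differentiation"
        else:
            continue                  # many-to-many: not defined by the three rules
        relations.append((kind, earlier, later))
    return relations
```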

Step 7: Visualize the topic evolution process according to the aggregation and differentiation relations of the topics.

The aggregation and differentiation relations obtained are expressed as a topological network, which presents the evolution of the hot topics being tracked. HTML5 and the d3.js JavaScript data visualization library are used to render topic aggregation and differentiation dynamically, visualizing the evolution tracking of hot topics.
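One way to hand such a topological network to a browser-side renderer is to serialize it as a node-and-link JSON document. The source does not describe its data format, so the `nodes`/`links` layout, the field names and the function name `to_graph_json` below are all assumptions chosen for illustration.

```python
import json


def to_graph_json(topics, relations):
    """topics: list of (topic_id, period_index); relations: (kind, sources, targets) tuples."""
    nodes = [{"id": topic_id, "period": period} for topic_id, period in topics]
    links = [{"source": s, "target": t, "type": kind}
             for kind, sources, targets in relations
             for s in sources
             for t in targets]
    # ensure_ascii=False keeps Chinese topic labels readable in the output
    return json.dumps({"nodes": nodes, "links": links}, ensure_ascii=False)
```

Each aggregation or differentiation relation expands into one link per source-target pair, so a merge of two topics into one appears as two edges pointing at the same node.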

Embodiment: hierarchical clustering is carried out with the following algorithm

Claims

1. A microblog-oriented dynamic topic detection and evolution tracking method, characterized by comprising the following steps:

Step 1: Build a distributed crawler and acquire microblog data;

Step 2: Pre-process the microblog data;

Step 3: Perform Chinese word segmentation on all microblog data, remove stop words, obtain the segmentation result, and form the word set VOC;

Step 4: Perform LDA clustering separately on the microblog data of each time interval and extract latent topics;

Each piece of microblog data, represented by its words in the word set VOC, is regarded as one document; a document-topic model is built over all documents in each time interval, topics are extracted, and the probability of each document corresponding to each topic and the probability of each topic generating each word are obtained;

Step 5: Screen out the hot microblog topics in each time interval;

Step 6: According to the LDA clustering results, perform hierarchical clustering on the hot topics across the global time span, and obtain the aggregation and differentiation relations between the hot topics;

Using the probability distributions over the word set VOC of the topics obtained in step 4, together with the hot topics of all time intervals obtained in step 5, hierarchical clustering is performed on the hot topics across the global time span; the differentiation and aggregation relations between topics are then obtained from the hierarchical clustering result;

2. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that the pre-processing comprises denoising and deduplication, specifically removing microblog data whose text length is less than a length threshold L, duplicate data, advertising content, auto-reply data and URL data, wherein the microblog data comprises both the microblog text and its comments.

3. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that step 3 specifically comprises:

Step 301: Perform Chinese word segmentation on the microblog data and remove stop words at the same time;

Step 303: Compute the document frequency df and the term frequency tf of each word;

Step 304: Compute the feature strength ft of each word, where ft is defined as:

ft = \log (\frac{tf}{idf + 1} + 1)

Step 305: Extract the words whose feature strength ft is greater than the feature strength threshold T to form the word set VOC.

4. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that the concrete steps of step 5 are as follows:

Step 501: For all documents in each time interval, determine the topics to which each document belongs;

A document-topic association strength threshold R is set; among the microblog topics obtained in step 4, for a document d, if the probability P_ti that it corresponds to the i-th topic exceeds R, then document d belongs to topic ti; a document d can belong to several topics at the same time;

Step 502: Sort all topics by the number of documents d assigned to them, and take the top N topics as hot topics.

5. The microblog-oriented dynamic topic detection and evolution tracking method according to claim 1, characterized in that step 6 specifically comprises:

Step 601: Form the global set of hot topics over all time intervals;

The hot topics of each time interval are merged across all time intervals to form the global set of hot topics;

Step 602: Perform hierarchical clustering on the hot topics to obtain the clustering result;

The probability distribution of each topic over words is extracted, giving the probability distribution of each topic over the word set VOC in each time interval; according to these probability distributions, all hot topics across the global time span are hierarchically clustered, yielding the clustering result for all hot topics over all time intervals;

Step 603: From the hierarchical clustering result combined with the time information, obtain the aggregation and differentiation relations of the topics;

If topic t1 is a topic in time interval period1, topics t2 and t3 are two topics in time interval period2, period1 and period2 are consecutive time intervals, and t1, t2 and t3 belong to the same cluster, then t2 and t3 are regarded as having differentiated from t1.